New publications:
_________________________________________________________________________
New Corpora
(1) ARL Arabic Dependency Treebank was developed by the US Army Research
Laboratory (ARL) and was derived from four LDC resources: Arabic Treebank
(ATB) Part 1 v4.1 (LDC2010T13), Part 2 v3.1 (LDC2011T09), Part 3 v3.2
(LDC2010T08) and Broadcast News v1.0 (LDC2012T07).
LDC's ATB series follows the constituency (phrase structure) approach to treebank development, in which clauses are divided into noun phrases and verb phrases and one or more nodes in a sentence's structure may correspond to a single element. Dependency grammar, on the other hand, is based on the idea that the verb is the center of the clause structure and that the other units in the sentence are connected to the verb by directed links, or dependencies. This yields a one-to-one correspondence: for every element in the sentence there is exactly one node in the sentence structure that corresponds to that element. ARL Arabic Dependency Treebank was generated using constituency-to-dependency software written at ARL.
The source data in this release consists of Arabic newswire and broadcast programming collected by LDC from various news and broadcast providers.
The files are in an 11-column tab-separated format with one or more blank lines between sentences. All files are UTF-8 encoded.
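As a rough illustration of the file layout described above, the following Python sketch groups tab-separated rows into sentences, treating any run of blank lines as a sentence boundary. The function name read_sentences and the file name in the usage note are hypothetical, and the sketch makes no assumption about what each of the 11 columns contains, since the column meanings are not given here; consult the corpus documentation for those.

```python
from typing import Iterable, List

# The release description states that each row has 11 tab-separated columns.
EXPECTED_COLUMNS = 11


def read_sentences(lines: Iterable[str]) -> List[List[List[str]]]:
    """Group rows into sentences; one or more blank lines end a sentence.

    Returns a list of sentences, each a list of rows, each row a list of
    11 string fields. Field meanings are deliberately left uninterpreted.
    """
    sentences: List[List[List[str]]] = []
    current: List[List[str]] = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            # Blank line: close the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) != EXPECTED_COLUMNS:
            raise ValueError(
                f"line {number}: expected {EXPECTED_COLUMNS} columns, "
                f"found {len(fields)}"
            )
        current.append(fields)
    # The last sentence may not be followed by a blank line.
    if current:
        sentences.append(current)
    return sentences
```

A file from the release could then be read with something like `with open("example.txt", encoding="utf-8") as f: sentences = read_sentences(f)`, where example.txt stands in for an actual file name from the corpus.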
ARL Arabic Dependency Treebank is distributed via web download.
2016 Subscription Members will automatically receive two
copies of this corpus. 2016 Standard Members may request a copy as part of
their 16 free membership corpora. Non-members may license this data for a fee.
*
(2) BOLT Chinese-English Word Alignment and Tagging -- Discussion Forum
Training was developed by LDC and consists of 448,094 words of Chinese and
English parallel text enhanced with linguistic tags to indicate word
relations.
BOLT Chinese-English Word Alignment and Tagging -- Discussion Forum Training
is distributed via web download.
2016 Subscription Members will automatically receive two
copies of this corpus. 2016 Standard Members may request a copy as part of
their 16 free membership corpora. Non-members may license this data for a fee.
*
(3) IARPA Babel Pashto Language Pack IARPA-babel104b-v0.4bY was developed by
Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel
program. It contains approximately 214 hours of Pashto conversational and
scripted telephone speech collected in 2011 and 2012, along with
corresponding transcripts.
The Babel program focuses on underserved languages and
seeks to develop speech recognition technology that can be rapidly applied to
any human language to support keyword search performance over large amounts of
recorded speech.
The Pashto speech in this release represents the language as spoken in four
dialect regions of Afghanistan and Pakistan. The gender distribution among
speakers is approximately 30% female and 70% male; speakers' ages range from
17 to 70 years. Calls were made using different telephones (e.g., mobile,
landline) from a variety of environments, including the street, a home or
office, a public place, and inside a vehicle.
Transcripts are available in two versions: an extended Arabic script and a
modified Buckwalter transliteration scheme, both encoded in UTF-8.
IARPA Babel Pashto Language Pack IARPA-babel104b-v0.4bY is distributed via
web download.
2016 Subscription Members will receive two copies of this corpus provided
they have submitted a completed copy of the special license agreement. 2016
Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.
*
(4) GALE Phase 4 Arabic Broadcast News Parallel Sentences was developed by
LDC. Along with other corpora, the parallel text in this release comprised
training data for Phase 4 of the DARPA GALE (Global Autonomous Language
Exploitation) Program. This corpus contains Modern Standard Arabic source
sentences and corresponding English translations selected from broadcast
news data collected by LDC in 2007 and 2008 and transcribed and translated
by LDC or under its direction.
GALE Phase 4 Arabic Broadcast News Parallel Sentences is distributed via web
download.
2016 Subscription Members will automatically receive two
copies of this corpus. 2016 Standard Members may request a copy as part of
their 16 free membership corpora. Non-members may license this data for a fee.