New publications:
LDC2012T05
- Chinese Dependency Treebank 1.0 -
LDC2012T06
- GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 -
LDC2012S06
- Turkish Broadcast News Speech and Transcripts -
To date, almost 100 organizations have joined for Membership Year (MY) 2012, our
20th anniversary year. Once again LDC's early renewal discount
program has resulted in significant savings for our members. Organizations that renewed membership or joined early for MY2012 saved almost
US$60,000! MY2011 members are still eligible for a 5% discount when renewing
for MY2012. This discount will apply throughout 2012, regardless of time of
renewal.
Organizations joining LDC can take advantage of membership benefits including free membership year data as well as discounts on older LDC corpora. Please visit our Members FAQ for further information.
New Publications
(1) Chinese Dependency Treebank 1.0
was developed by the Harbin Institute of Technology's Research
Center for Social Computing and Information Retrieval (HIT-SCIR). It
contains 49,996 Chinese sentences (902,191 words) randomly selected from
People's Daily newswire stories published between 1992 and 1996 and annotated
with syntactic dependency structures. Ill-formed or short sentences were
removed from the randomly selected sentences prior to annotation. The data
was segmented and annotated for part of speech (POS), syntactic structures,
verb subclasses and noun compounds. Word segmentation and POS tagging were
accomplished automatically using statistical models trained on a larger,
annotated corpus of People's Daily newswire stories. Humans manually annotated
the syntactic structures and corrected word segmentation errors. POS tags were
not corrected.
The data is provided in CoNLL-X format and encoded in UTF-8. Chinese
Dependency Treebank 1.0 is distributed via web download. 2012 Subscription Members will automatically receive one copy of this data on
disc. 2012 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for US$300.
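As a rough illustration of working with CoNLL-X data, the sketch below reads sentences from tab-separated lines and lists their dependency arcs. It assumes the standard ten-column CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) with blank lines between sentences. The function names and the two-token sample, including its tags and relation labels, are made up for illustration and are not taken from the corpus.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# The ten CoNLL-X columns, in order.
CONLLX_FIELDS = [
    "id", "form", "lemma", "cpostag", "postag",
    "feats", "head", "deprel", "phead", "pdeprel",
]

Token = Dict[str, str]


def read_conllx(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        values = line.split("\t")
        if len(values) != len(CONLLX_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(values)}: {line!r}")
        sentence.append(dict(zip(CONLLX_FIELDS, values)))
    if sentence:
        # Handle a final sentence with no trailing blank line.
        yield sentence


def dependency_arcs(sentence: List[Token]) -> List[Tuple[int, int, str]]:
    """Return (head_id, dependent_id, relation) triples; head 0 is the root."""
    return [(int(tok["head"]), int(tok["id"]), tok["deprel"]) for tok in sentence]


# Hypothetical two-token sentence, for illustration only.
SAMPLE = (
    "1\t我\t_\tr\tr\t_\t2\tSBV\t_\t_\n"
    "2\t来\t_\tv\tv\t_\t0\tHED\t_\t_\n"
    "\n"
)

sentences = list(read_conllx(SAMPLE.splitlines()))
arcs = dependency_arcs(sentences[0])
```

To read a file from the release instead of the inline sample, the lines could come from `open(path, encoding="utf-8")`, where `path` is whichever treebank file is being processed.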
*
(2) GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 was developed by LDC. Along with other corpora, the parallel text in this release comprised machine translation training data for Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Modern Standard Arabic source text and corresponding English translations selected from broadcast conversation (BC) data collected by LDC between 2004 and 2007 and transcribed by LDC or under its direction.
GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 includes 36
source-translation document pairs, comprising 169,109 words of Arabic source
text and its English translation. Data is drawn from thirteen distinct Arabic
programs broadcast between 2004 and 2007 from the following sources: Al Alam
News Channel, Aljazeera, Dubai TV, Oman TV, and Radio Sawa. Broadcast
conversation programming is generally more interactive than traditional news
broadcasts and includes talk shows, interviews, call-in programs and roundtable
discussions. The programs in this release focus on current events topics.
The files in this release were transcribed by LDC staff and/or transcription
vendors under contract to LDC in accordance with Quick Rich Transcription
guidelines developed by LDC. Transcribers indicated sentence boundaries in
addition to transcribing the text. Data was manually selected for translation
according to several criteria, including linguistic features, transcription
features and topic features. The transcribed and segmented files were then
reformatted into a human-readable translation format and assigned to
translation vendors. Translators followed LDC's Arabic to English translation
guidelines, which are included with this release. Bilingual LDC staff performed
quality control procedures on the completed translations.
Source data and translations are distributed in TDF format. All data are encoded in
UTF-8. GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 is distributed via
web download. 2012 Subscription Members will automatically receive one copy of this data on
disc. 2012 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for US$1750.
*
(3) Turkish Broadcast News Speech and Transcripts was developed by Boğaziçi University, Istanbul, Turkey and contains approximately 130 hours of Voice of America (VOA) Turkish radio broadcasts and corresponding transcripts. This is part of a larger corpus of Turkish broadcast news data collected and transcribed with the goal of facilitating research in Turkish automatic speech recognition and its applications, such as speech retrieval.
The VOA material was collected between December 2006 and June 2009 using a PC and
TV/radio card setup. The data collected during the period 2006-2008 was
recorded from analog FM radio; the 2009 broadcasts were recorded from digital
satellite transmissions. A quick manual segmentation and transcription approach
was followed.
The data was recorded at 32 kHz and re-sampled at 16 kHz. After screening for
recording quality, the files were segmented, transcribed, and verified. The
segmentation occurred in two steps: an initial automatic segmentation, followed
by manual correction and annotation, which included information such as
background conditions and speaker boundaries.
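The announcement does not say how the re-sampling from 32 kHz to 16 kHz was carried out. Purely as an illustration of what halving a sampling rate involves, the sketch below averages each pair of neighboring samples, which amounts to a crude two-tap low-pass filter followed by decimation by two. The function name is hypothetical, and production tools generally use a much better anti-aliasing filter.

```python
from typing import List, Sequence


def halve_sample_rate(samples: Sequence[float]) -> List[float]:
    """Average each non-overlapping pair of samples, e.g. 32 kHz -> 16 kHz.

    This is a simple illustration, not the method used for the corpus.
    A trailing unpaired sample, if any, is dropped.
    """
    usable = len(samples) - len(samples) % 2
    return [(samples[i] + samples[i + 1]) / 2.0 for i in range(0, usable, 2)]


# Five input samples yield two output samples; the last input is unpaired.
halved = halve_sample_rate([0, 2, 4, 6, 9])
```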
The transcription guidelines were adapted from the LDC HUB4 and quick transcription
guidelines. An English version of the adapted guidelines is provided with the
data. Manual segmentation and transcripts were created by native Turkish
speakers at Boğaziçi University using Transcriber. The transcriptions are provided in
the ISO-8859-9 (Latin5) character set.
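Because the transcripts use ISO-8859-9 rather than UTF-8, tools that expect UTF-8 may show Turkish letters such as ğ, ş and ı incorrectly. A minimal sketch of a conversion using Python's built-in `iso-8859-9` codec follows; the function name is an assumption for illustration, and the sample bytes are constructed by hand rather than taken from the corpus.

```python
def latin5_to_utf8(data: bytes) -> bytes:
    """Decode ISO-8859-9 (Latin5) bytes and re-encode them as UTF-8."""
    return data.decode("iso-8859-9").encode("utf-8")


# Hand-built sample: in ISO-8859-9, 0xF0 is "ğ" and 0xE7 is "ç".
latin5_bytes = b"Bo\xf0azi\xe7i"
text = latin5_bytes.decode("iso-8859-9")
utf8_bytes = latin5_to_utf8(latin5_bytes)
```

When reading a transcript file directly, passing `encoding="iso-8859-9"` to `open()` gives the same decoding without a separate conversion step.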
Turkish Broadcast News Speech and Transcripts is distributed on four DVDs. 2012
Subscription Members will automatically receive one copy of this data. 2012 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for US$2000.