Linguistic Data Consortium: May 2012

Early Renewing Members Save Again!

New publications:

LDC2012T05
- Chinese Dependency Treebank 1.0 -

LDC2012T06
- GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 -

LDC2012S06
- Turkish Broadcast News Speech and Transcripts -

Early Renewing Members Save Again!

To date, almost 100 organizations have joined for Membership Year (MY) 2012, our 20th anniversary year. Once again, LDC's early renewal discount program has resulted in significant savings for our members. Organizations that renewed membership or joined early for MY2012 saved almost US$60,000! MY2011 members are still eligible for a 5% discount when renewing for MY2012. This discount will apply throughout 2012, regardless of when the renewal takes place.

Organizations joining LDC can take advantage of membership benefits including free membership year data as well as discounts on older LDC corpora. Please visit our Members FAQ for further information.

New Publications

(1) Chinese Dependency Treebank 1.0 was developed by the Harbin Institute of Technology's Research Center for Social Computing and Information Retrieval (HIT-SCIR). It contains 49,996 Chinese sentences (902,191 words) randomly selected from People's Daily newswire stories published between 1992 and 1996 and annotated with syntactic dependency structures. Ill-formed or short sentences were eliminated from the randomly-selected sentences prior to annotation. The data was segmented and annotated for part of speech (POS), syntactic structures, verb subclasses and noun compounds. Word segmentation and POS tagging were accomplished automatically using statistical models trained on a larger, annotated corpus of People's Daily newswire stories. Humans manually annotated the syntactic structures and corrected word segmentation errors. POS tags were not corrected.

The data is provided in CoNLL-X format and encoded in UTF-8. Chinese Dependency Treebank 1.0 is distributed via web download. 2012 Subscription Members will automatically receive one copy of this data on disc. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$300.
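For readers new to the format: CoNLL-X files are plain text with one token per line, ten tab-separated columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), and a blank line between sentences. The following is a minimal reading sketch, not code shipped with the corpus. The function and class names are hypothetical, and the toy sentence in the usage note, including its tags and labels, is invented for illustration rather than taken from the treebank.

```python
from typing import Iterable, Iterator, List, NamedTuple, Optional


class Token(NamedTuple):
    """The first eight CoNLL-X columns (PHEAD and PDEPREL are ignored here)."""
    idx: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: str
    head: Optional[int]   # 0 means the token is the root; None if unannotated
    deprel: str


def read_conllx(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of Token objects per sentence."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        head = None if cols[6] == "_" else int(cols[6])
        sentence.append(Token(int(cols[0]), cols[1], cols[2], cols[3],
                              cols[4], cols[5], head, cols[7]))
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


def read_conllx_file(path: str) -> Iterator[List[Token]]:
    """Open a UTF-8 encoded CoNLL-X file and yield its sentences."""
    with open(path, encoding="utf-8") as handle:
        yield from read_conllx(handle)
```

As a usage note, passing the two invented lines "1<TAB>我<TAB>_<TAB>r<TAB>r<TAB>_<TAB>2<TAB>subj<TAB>_<TAB>_" and "2<TAB>爱<TAB>_<TAB>v<TAB>v<TAB>_<TAB>0<TAB>root<TAB>_<TAB>_" to read_conllx yields a single sentence whose first token has head 2 and whose second token is the root.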

(2) GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 was developed by LDC. Along with other corpora, the parallel text in this release comprised machine translation training data for Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Modern Standard Arabic source text and corresponding English translations selected from broadcast conversation (BC) data collected by LDC between 2004 and 2007 and transcribed by LDC or under its direction.

GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 includes 36 source-translation document pairs, comprising 169,109 words of Arabic source text and its English translation. Data is drawn from thirteen distinct Arabic programs broadcast between 2004 and 2007 from the following sources: Al Alam News Channel, Aljazeera, Dubai TV, Oman TV, and Radio Sawa. Broadcast conversation programming is generally more interactive than traditional news broadcasts and includes talk shows, interviews, call-in programs and roundtable discussions. The programs in this release focus on current events topics.

The files in this release were transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with Quick Rich Transcription guidelines developed by LDC. Transcribers indicated sentence boundaries in addition to transcribing the text. Data was manually selected for translation according to several criteria, including linguistic features, transcription features and topic features. The transcribed and segmented files were then reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Arabic-to-English translation guidelines, which are included with this release. Bilingual LDC staff performed quality control procedures on the completed translations.

Source data and translations are distributed in TDF format. All data are encoded in UTF-8. GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 is distributed via web download. 2012 Subscription Members will automatically receive one copy of this data on disc. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$1750.

(3) Turkish Broadcast News Speech and Transcripts was developed by Boğaziçi University, Istanbul, Turkey and contains approximately 130 hours of Voice of America (VOA) Turkish radio broadcasts and corresponding transcripts. This is part of a larger corpus of Turkish broadcast news data collected and transcribed with the goal of facilitating research in Turkish automatic speech recognition and its applications, such as speech retrieval.

The VOA material was collected between December 2006 and June 2009 using a PC and TV/radio card setup. The data collected during the period 2006-2008 was recorded from analog FM radio; the 2009 broadcasts were recorded from digital satellite transmissions. A quick manual segmentation and transcription approach was followed.

The data was recorded at 32 kHz and resampled to 16 kHz. After screening for recording quality, the files were segmented, transcribed, and verified. The segmentation occurred in two steps: an initial automatic segmentation, followed by manual correction and annotation, which included information such as background conditions and speaker boundaries.
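Going from 32 kHz to 16 kHz halves the sampling rate, which is usually done by low-pass filtering (to suppress content above the new 8 kHz Nyquist limit) and then keeping every second sample. The sketch below illustrates that general technique with a windowed-sinc filter in plain Python. It is not the procedure used by the corpus developers, whose tooling is not described here, and the function names, filter length, and cutoff are assumptions chosen for illustration.

```python
import math
from typing import List, Optional, Sequence


def lowpass_taps(num_taps: int = 63, cutoff: float = 0.25) -> List[float]:
    """Hamming-windowed sinc low-pass filter.

    cutoff is in cycles per input sample: 0.25 corresponds to 8 kHz for
    32 kHz input. A production design would typically sit a little lower
    to leave room for the filter's transition band.
    """
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            ideal = 2 * cutoff
        else:
            ideal = math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]  # normalise to unity gain at 0 Hz


def decimate_by_2(samples: Sequence[float],
                  taps: Optional[Sequence[float]] = None) -> List[float]:
    """Low-pass filter the input, then keep every second sample."""
    if taps is None:
        taps = lowpass_taps()
    half = len(taps) // 2
    out = []
    for i in range(0, len(samples), 2):
        acc = 0.0
        for k, t in enumerate(taps):
            j = i + half - k  # filter centred on input sample i
            if 0 <= j < len(samples):
                acc += t * samples[j]
        out.append(acc)
    return out
```

A signal of 400 input samples comes back as 200 output samples. For real audio work, a tested signal-processing library is a safer choice than a hand-rolled filter like this one.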

The transcription guidelines were adapted from the LDC HUB4 and quick transcription guidelines. An English version of the adapted guidelines is provided with the data. Manual segmentation and transcripts were created by native Turkish speakers at Boğaziçi University using Transcriber. The transcriptions are provided in the ISO-8859-9 (Latin5) character set.
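Because the transcripts use ISO-8859-9 rather than UTF-8, they need to be decoded with the right codec before the Turkish letters ğ, ı, ş, İ, Ğ and Ş display correctly. Python's standard codecs include "iso-8859-9", so a conversion to UTF-8 can be sketched as follows. The function names and file paths are hypothetical and are not part of the release.

```python
def latin5_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-9 (Latin5) bytes as UTF-8."""
    return data.decode("iso-8859-9").encode("utf-8")


def convert_file(src_path: str, dst_path: str) -> None:
    """Read a Latin5 transcript file and write a UTF-8 copy."""
    with open(src_path, "rb") as src:
        converted = latin5_to_utf8(src.read())
    with open(dst_path, "wb") as dst:
        dst.write(converted)


# Byte 0xF0 is "ğ" in ISO-8859-9 but "ð" in ISO-8859-1, so picking the
# wrong codec silently turns "Boğaziçi" into "Boðaziçi".
raw = b"Bo\xf0azi\xe7i"
right = raw.decode("iso-8859-9")
wrong = raw.decode("iso-8859-1")
```

The two character sets differ in only six code points, which is why a wrong guess often goes unnoticed until a word containing one of those letters turns up.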

Turkish Broadcast News Speech and Transcripts is distributed on four DVDs. 2012 Subscription Members will automatically receive one copy of this data. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$2000.

Linguistic Data Consortium

Wednesday, May 16, 2012