Linguistic Data Consortium: October 2016

In this newsletter:

Fall 2016 LDC Data Scholarship recipients

Chilin HK and LDC partner on distribution of parallel patent data

New publications:

IARPA Babel Turkish Language Pack IARPA-babel105b-v0.5

KAFD: Arabic Font Database

Richer Event Description

Fall 2016 LDC Data Scholarship recipients

Congratulations to the recipients of LDC's Fall 2016 data scholarships:

Tiba Zaki Abdulhameed: Western Michigan University (USA); PhD Candidate, Computer Science. Tiba is awarded copies of GALE Phase 2 Arabic Broadcast Conversation Speech and Transcripts for her research in dialectal ASR.

Abhishek Abhishek: Indian Institute of Technology Guwahati (India); PhD Candidate, Computer Science and Engineering. Abhishek is awarded copies of ACE 2004 Multilingual Training Corpus and The New York Times Annotated Corpus for his research in coreference resolution and relation extraction.

Sara Ebrahim: Ain Shams University (Egypt); MSc, Computer Science. Sara is awarded copies of LDC Standard Arabic Morphological Analyzer and NIST OpenMT 2008 Evaluation Selected References and System Translations for her work in machine translation.

Katherine Metcalf: Indiana University (USA); PhD Candidate, Computer Science. Katherine is awarded a copy of Emotional Prosody Speech and Transcripts for her research in acoustic/prosodic approaches to classifying emotional states.

Mousmita Sarma: Gauhati University (India); Post-Masters Research, Electronics and Communications Technology. Mousmita is awarded copies of Switchboard 1-Release 2 and IARPA Babel Assamese Language Pack for her research in Assamese dialect identification.

For program information visit the Data Scholarship page.

Chilin HK and LDC partner on distribution of parallel patent data

Chilin HK Limited (Chilin) and LDC are pleased to announce that the parallel data source developed by Chilin, A Corpus of Chinese-English Parallel Sentences Extracted from Patents, is now available through the LDC Catalog. This is a special release in addition to the LDC scheduled corpora for membership year 2016, available under separate terms.

The Chilin Corpus derives primarily from the training corpus and test sets developed specifically for the Tokyo-based NTCIR 2009 & 2010 competitions on Patent MT (machine translation), which drew more than 30 international teams:

NTCIR-9: http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings9/NTCIR/01-NTCIR9-PATENTMT-GotoI.pdf

NTCIR-10: http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings10/pdf/NTCIR/OVERVIEW/01-NTCIR10-PATENTMT-GotoI.pdf

The training corpus is drawn from a much larger curated corpus of parallel Chinese-English sentences and sentence fragments. That corpus was in turn winnowed from an even larger collection of more than 300k parallel Chinese-English patents in different fields, initially at the Research Centre on Language Information Sciences, City University of Hong Kong (authors: Benjamin Tsou, Bin Lu, and Kapo Chow). This data set is available from LDC under the following reference:

LDC2016T22 A Corpus of Chinese-English Parallel Sentences Extracted from Patents

Not-for-profit organizations may license this data set for US$25.00 under the LDC Not-for-Profit Membership Agreement or under the LDC User Agreement for Non-Members for use in linguistic research, education and non-commercial technology development. For-profit organizations may license this data for US$5000, discounted to US$4000 for LDC for-profit members, under a commercial license.

New Corpora

(1) IARPA Babel Turkish Language Pack IARPA-babel105b-v0.5 was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 213 hours of Turkish conversational and scripted telephone speech collected in 2012 along with corresponding transcripts.

The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

The Turkish speech in this release represents that spoken in seven dialect regions in Turkey. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 70 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

Transcripts are encoded in UTF-8.

IARPA Babel Turkish Language Pack IARPA-babel105b-v0.5 is distributed via web download.

2016 Subscription Members will receive two copies of this corpus provided they have submitted a completed copy of the special license agreement. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) KAFD: Arabic Font Database was developed by King Fahd University of Petroleum & Minerals and Qassim University. It comprises approximately 2.5 million scanned Arabic printed pages in a variety of fonts, sizes and resolutions along with corresponding transcripts. KAFD was designed for research in Arabic text recognition.

The scanned Arabic texts were collected from publications covering various subjects such as religion, medicine, science and history. Texts were printed in 40 different fonts, 10 sizes and four styles. Scans were made at 100, 200, 300 and 600 dpi (dots per inch).
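To give a sense of the design space described above, the following minimal sketch enumerates the combinations of font, size, style and scan resolution. The counts and the four resolutions come from the description; the integer indices are hypothetical placeholders, and whether every combination actually appears in KAFD is an assumption that the description does not confirm.

```python
from itertools import product

# Counts taken from the corpus description above.
NUM_FONTS = 40
NUM_SIZES = 10
NUM_STYLES = 4
RESOLUTIONS_DPI = (100, 200, 300, 600)


def rendering_conditions():
    """Return an iterator over (font, size, style, dpi) tuples.

    Fonts, sizes and styles are placeholder indices, not real KAFD names.
    A full grid of combinations is assumed, which may not match the corpus.
    """
    return product(
        range(NUM_FONTS), range(NUM_SIZES), range(NUM_STYLES), RESOLUTIONS_DPI
    )


conditions = list(rendering_conditions())
print(len(conditions))  # 40 * 10 * 4 * 4 = 6400
```

Under the full-grid assumption this gives 6,400 distinct rendering conditions, which is one way to see why the page count is so large.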

The database is available in two formats: at the page level and at the line level. Images are in TIFF format and transcripts are in plain text format. Individual font folders are compressed into RAR archives.

The data is divided into training, validation and test sets.
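For readers planning to work with the TIFF images and plain-text transcripts, one possible way to pair them after extracting the RAR archives (with an external tool) is sketched below. The directory layout and the naming convention, a `.txt` transcript sharing its stem with the image in the same folder, are hypothetical; consult the corpus documentation for the actual structure.

```python
from pathlib import Path


def pair_images_with_transcripts(root):
    """Return sorted (image_path, transcript_path) pairs found under root.

    Hypothetical layout: each .tif/.tiff image has a .txt transcript with
    the same stem in the same directory. Images without one are skipped.
    """
    pairs = []
    for image in sorted(Path(root).rglob("*")):
        if image.suffix.lower() not in {".tif", ".tiff"}:
            continue
        transcript = image.with_suffix(".txt")
        if transcript.is_file():
            pairs.append((image, transcript))
    return pairs
```

Calling `pair_images_with_transcripts` on an extracted training, validation or test folder would return the image and transcript pairs for that partition, assuming the layout above holds.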

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) Richer Event Description (RED) was developed by the University of Colorado Boulder-CLEAR (Computational Language and Education Research), Carnegie Mellon University and LDC. It consists of coreference, bridging and event-event relation (temporal, causal, subevent and reporting) annotations over 95 English newswire, discussion forum and narrative text documents, covering all events, times and non-eventive entities within each document.

RED annotation is intended to join different annotation layers and to provide a rich representation of event phenomena.

Documents were annotated twice: in a markable pass and in an event annotation phase. Annotation and source documents are divided into three partitions: (1) 20 newswire summarization documents, (2) 20 discussion forum documents and newswire annotations used in the original RED pilot annotations, and (3) 55 documents annotated by a range of DEFT (Deep Exploration and Filtering of Text) annotation formats. Data is presented as UTF-8 encoded XML and plain text.

Linguistic Data Consortium

Wednesday, October 19, 2016

LDC October 2016 Newsletter