Linguistic Data Consortium: January 2017

LDC Membership Discounts for MY2017 Still Available

New publications:

Arabic Speech Recognition Pronunciation Dictionary

IARPA Babel Vietnamese Language Pack IARPA-babel107b-v0.7

MWE-Aware English Dependency Corpus

GALE Phase 3 and 4 Chinese Web Parallel Text

___________________________________________________________________

LDC Membership Discounts for MY2017 Still Available

Join LDC now while membership savings are still available. 2016 members receive a 10% discount when renewing before March 1, 2017, or a 5% discount when renewing any time in 2017. Non-consecutive members and new members receive a 5% discount when renewing before March 1, 2017. Membership remains the most economical way to access LDC releases. This year’s planned publications include the 2010 NIST Speaker Recognition Evaluation data set, Multilanguage Conversational Telephone Speech, Noisy TIMIT, IARPA Babel Language Packs, RATS Keyword Spotting, BOLT parallel and word-aligned data in all languages, and more. Browse the Members pages for details on membership options and benefits.

New Corpora

(1) Arabic Speech Recognition Pronunciation Dictionary was developed by the Qatar Computing Research Institute. It contains approximately two million pronunciation entries for 526,000 Modern Standard Arabic words, for an average of 3.84 pronunciations for each grapheme word. The dictionary was developed from news archive resources, including the Arabic news website Aljazeera.net. The selected words were those that occurred more than once in the news collection.

Arabic Speech Recognition Pronunciation Dictionary is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) IARPA Babel Vietnamese Language Pack IARPA-babel107b-v0.7 was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 201 hours of Vietnamese conversational and scripted telephone speech collected in 2012 along with corresponding transcripts.

The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

The Vietnamese speech in this release represents that spoken in the North, North-Central, Central and Southern dialect regions in Vietnam. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 64 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

Transcripts are encoded in UTF-8.

IARPA Babel Vietnamese Language Pack IARPA-babel107b-v0.7 is distributed via web download.

2017 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) MWE-Aware English Dependency Corpus was developed by the Nara Institute of Science and Technology Computational Linguistics Laboratory and consists of English compound function words annotated in dependency format. The data is derived from the Wall Street Journal portion of OntoNotes Release 5.0 (LDC2013T19).

Compound function words are a type of multiword expression (MWE). MWEs are groups of tokens that can be treated as a single semantic or syntactic unit. Doing so facilitates natural language processing tasks such as constituency and dependency parsing.

MWE-Aware English Dependency Corpus is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(4) GALE Phase 3 and 4 Chinese Web Parallel Text was developed by LDC and contains Chinese source text and corresponding English translations selected from weblog and newsgroup data collected by LDC and translated by LDC or under its direction.

The data includes 88 source-translation document pairs, comprising 67,514 tokens of Chinese source text and its English translation.

GALE Phase 3 and 4 Chinese Web Parallel Text is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Linguistic Data Consortium

Thursday, January 19, 2017

LDC January 2017 Newsletter