Linguistic Data Consortium: April 2017

LDC celebrates 25 years

LDC data and commercial technology development

New publications:

2010 NIST Speaker Recognition Evaluation Test Set

BOLT Egyptian Arabic SMS/Chat and Transliteration

CHiME2 Grid

_________________________________________________________________________

LDC celebrates 25 years

April 2017 marks the beginning of LDC’s 25th year as the leader in language resource development and distribution. Founded in 1992, the Consortium has grown from a data repository to a vibrant data center that creates, shares and archives language resources. The Catalog continues to grow, boasting over 700 titles in more than 90 languages. With the support of members, licensees, sponsors and collaborators, LDC has distributed over 120,000 copies of data to more than 3,500 organizations worldwide. Our heartfelt thanks for your support as we continue our mission to provide large quantities of diverse data, research program support and high-quality member services.

LDC data and commercial technology development

Any organization wishing to use LDC data to develop or test products for commercialization, or to use LDC data in any commercial product or for any commercial purpose, must first license the data as a For-Profit Member. Once the data is licensed under the For-Profit Membership, the organization retains perpetual rights to use the data for commercial technology development. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit our Licensing page for more information.

New publications

(1) 2010 NIST Speaker Recognition Evaluation Test Set was developed by LDC and NIST (National Institute of Standards and Technology). It contains 2,255 hours of American English telephone speech and interview speech recorded over a microphone channel, used as test data in the NIST-sponsored 2010 Speaker Recognition Evaluation (SRE).

The telephone speech segments include two-channel excerpts of approximately 10 seconds and 5 minutes. There are also summed-channel excerpts in the range of 5 minutes. The microphone excerpts are 3-15 minutes in duration. As in prior evaluations, intervals of silence were not removed.

The 2010 evaluation includes not only conversational telephone speech (CTS) recorded over ordinary telephone channels for the core training and test conditions, but also CTS and conversational interview speech recorded over a room microphone channel. Unlike prior evaluations, some of the conversational telephone style speech was collected in a manner intended to produce particularly high, or particularly low, vocal effort on the part of the speaker of interest. In addition to evaluation data, this package also includes answer keys, trial and train files, development data and evaluation documentation.

2010 NIST Speaker Recognition Evaluation Test Set is distributed via hard drive.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) BOLT Egyptian Arabic SMS/Chat and Transliteration was developed by LDC and consists of naturally-occurring Short Message Service (SMS) and Chat (CHT) data collected through data donations and live collection involving native speakers of Egyptian Arabic. The corpus contains 5,691 conversations totaling 1,029,248 words across 262,026 messages. Messages were natively written in either Arabic orthography or romanized Arabizi. A total of 1,856 Arabizi conversations (287,022 words) were transliterated from the original romanized Arabizi script into standard Arabic orthography and then reviewed, corrected and normalized by LDC annotators according to "Conventional Orthography for Dialectal Arabic" (CODA).

The BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.

BOLT Egyptian Arabic SMS/Chat and Transliteration is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) CHiME2 Grid was developed as part of The 2nd CHiME Speech Separation and Recognition Challenge and contains approximately 120 hours of English speech from a noisy living room environment. The CHiME Challenges focus on distant-microphone automatic speech recognition (ASR) in real-world environments.

CHiME2 Grid reflects the small vocabulary track of the CHiME2 Challenge. The target utterances were taken from the Grid corpus and consist of 34 speakers reading simple 6-word sequences. The data is divided into training, development and test sets.

CHiME2 Grid is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Linguistic Data Consortium

Monday, April 17, 2017

LDC April 2017 Newsletter