Linguistic Data Consortium: August 2018

LDC at Interspeech 2018

Fall 2018 LDC Data Scholarship Program

New Publications:

BOLT English SMS/Chat

CIEMPIESS Balance

2011 NIST Language Recognition Evaluation Test Set

_______________________________________________________________________

LDC at Interspeech 2018

LDC will take part in Interspeech 2018, held this year in Hyderabad, India, September 2-6. It is co-organizing the special session The First DIHARD Speech Diarization Challenge on September 3 and is a sponsor of the September 1 pre-conference workshop Young Female Researchers in Speech Science & Technology (YFRSW). Recent work, “Global TIMIT: Acoustic Phonetic Datasets for the World’s Languages,” will be presented during the poster session on September 3.

Fall 2018 LDC Data Scholarship Program

Students can apply for the Fall 2018 Data Scholarship Program now through September 15, 2018. The LDC Data Scholarship program provides students with access to LDC data at no cost. For more information on application requirements and program rules, please visit LDC Data Scholarships.

New Publications:

(1) BOLT English SMS/Chat was developed by LDC and consists of naturally occurring Short Message Service (SMS) and Chat (CHT) data collected through data donations and live collection from native English speakers. The corpus contains 18,429 conversations totaling 3,674,802 words across 375,967 messages.
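
To give a sense of scale, the totals above work out to roughly 10 words per message and about 20 messages per conversation. The short Python sketch below derives those averages from the stated totals only; the averages are simple ratios computed for illustration, not figures published by LDC.

```python
# Corpus totals as stated in the BOLT English SMS/Chat description.
conversations = 18_429
messages = 375_967
words = 3_674_802

# Derived averages: plain ratios of the totals above (illustrative only).
words_per_message = words / messages
messages_per_conversation = messages / conversations
words_per_conversation = words / conversations

print(f"words per message:         {words_per_message:.2f}")
print(f"messages per conversation: {messages_per_conversation:.1f}")
print(f"words per conversation:    {words_per_conversation:.0f}")
```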

The BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging, and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.

BOLT English SMS/Chat is available via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) CIEMPIESS Balance (Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social) was developed by the Development of Speech Technologies program at the School of Engineering at the National Autonomous University of Mexico (UNAM) and consists of approximately 18 hours of Mexican Spanish broadcast speech with associated transcripts. The goal of this work was to create acoustic models for automatic speech recognition. For more information and documentation see the CIEMPIESS-UNAM Project website.

CIEMPIESS Balance is a companion corpus to CIEMPIESS Light, released by LDC as LDC2017S23. It was developed so that the data sets together constitute a gender-balanced corpus. The gender breakdown in CIEMPIESS Light is approximately 75% male and 25% female. In CIEMPIESS Balance, the gender breakdown is approximately 25% male and 75% female.
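
The balancing arithmetic can be sketched in a few lines of Python. The pooled gender split is an hours-weighted average of the two corpora, so it reaches 50/50 only if both contain about the same amount of speech. The size used for CIEMPIESS Light below is an assumption made for illustration (it is not stated here), and the function name is hypothetical.

```python
def combined_female_share(hours_a, female_a, hours_b, female_b):
    """Female share of speech when two corpora are pooled, weighted by hours."""
    total_hours = hours_a + hours_b
    return (hours_a * female_a + hours_b * female_b) / total_hours


BALANCE_HOURS = 18.0  # approximate size of CIEMPIESS Balance, as stated above
LIGHT_HOURS = 18.0    # ASSUMPTION: CIEMPIESS Light taken to be the same size

# Approximate female shares: 25% in Light, 75% in Balance.
share = combined_female_share(LIGHT_HOURS, 0.25, BALANCE_HOURS, 0.75)
print(f"female share of pooled corpus: {share:.0%}")
```

If the two corpora differed in size, the pooled split would shift toward the larger one; for example, 10 hours at 25% female combined with 30 hours at 75% female gives 62.5% female.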

The majority of the speech recordings were collected from Radio-IUS, a UNAM radio station. Other recordings were taken from IUS Canal Multimedia and Centro Universitario de Estudios Jurídicos (CUEJ UNAM). These two channels feature videos with speech around legal issues and topics related to UNAM.

CIEMPIESS Balance is available via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data at no cost.

(3) 2011 NIST Language Recognition Evaluation Test Set contains selected training data and the evaluation test set for the 2011 NIST Language Recognition Evaluation. It consists of approximately 204 hours of conversational telephone speech and broadcast audio collected by LDC between 2009 and 2011 in the following 24 languages and dialects: Arabic (Iraqi), Arabic (Levantine), Arabic (Maghrebi), Arabic (Standard), Bengali, Czech, Dari, English (American), English (Indian), Farsi, Hindi, Lao, Mandarin, Panjabi, Pashto, Polish, Russian, Slovak, Spanish, Tamil, Thai, Turkish, Ukrainian, and Urdu.

The 2011 evaluation emphasized the language pair condition and involved both conversational telephone speech (CTS) and broadcast narrow-band speech (BNBS).

This release includes training data for nine language varieties that had not been represented in prior LRE cycles -- Arabic (Iraqi), Arabic (Levantine), Arabic (Maghrebi), Arabic (Standard), Czech, Lao, Panjabi, Polish and Slovak -- contained in 893 audited segments of roughly 30 seconds each and in 400 full-length CTS recordings. The evaluation test set comprises a total of 29,511 audio files, all manually audited at LDC for language and divided equally among three test conditions according to the nominal amount of speech content per segment.
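
As a quick check on these figures, the Python sketch below works out the number of test files per condition and the approximate amount of audio in the audited training segments. The 30-second segment length is the rough figure given above, so the resulting hours value is only an estimate.

```python
# Figures as stated in the corpus description.
total_test_files = 29_511
test_conditions = 3
training_segments = 893
approx_segment_seconds = 30  # "roughly 30 seconds", so treat results as estimates

# An equal split means the file count divides evenly across conditions.
files_per_condition, remainder = divmod(total_test_files, test_conditions)

# Approximate audio contained in the audited training segments.
approx_training_hours = training_segments * approx_segment_seconds / 3600

print(f"test files per condition: {files_per_condition} (remainder {remainder})")
print(f"audited training segments: about {approx_training_hours:.1f} hours")
```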

LDC released the prior LREs as:

2003 NIST Language Recognition Evaluation (LDC2006S31)
2005 NIST Language Recognition Evaluation (LDC2008S05)
2007 NIST Language Recognition Evaluation Test Set (LDC2009S04)
2007 NIST Language Recognition Evaluation Supplemental Training Set (LDC2009S05)
2009 NIST Language Recognition Evaluation Test Set (LDC2014S06)

2011 NIST Language Recognition Evaluation Test Set is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Linguistic Data Consortium

Thursday, August 16, 2018

LDC 2018 August Newsletter