Fall 2018 LDC Data Scholarship Program
New Publications:
_______________________________________________________________________
LDC at Interspeech 2018
LDC will participate in several ways at Interspeech 2018,
held this year in Hyderabad, India, September 2-6. It is co-organizing the special
session, The First DIHARD Speech Diarization Challenge, on September 3 and is a
sponsor of the September 1 pre-conference workshop, Young Female Researchers in
Speech Science & Technology (YFRSW). Results of recent work will be presented in
the poster “Global TIMIT: Acoustic Phonetic Datasets for the World’s Languages”
during the poster session on September 3.
Fall 2018 LDC Data Scholarship Program
Students can apply for the Fall 2018 Data Scholarship
Program now through September 15, 2018. The LDC Data Scholarship program
provides students with access to LDC data at no cost. For more information on
application requirements and program rules, please visit LDC Data Scholarships.
New publications:
(1) BOLT English SMS/Chat was developed by LDC and consists of
naturally-occurring Short Message Service (SMS) and Chat (CHT) data collected
through data donations and live collection from native English speakers. The corpus contains 18,429
conversations totaling 3,674,802 words across 375,967 messages.
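For a rough sense of scale, the averages implied by these totals can be worked out directly. The short Python sketch below uses only the three figures quoted above; the variable names are illustrative, and the results are simple means that say nothing about how message or conversation lengths are actually distributed in the corpus.

```python
# Totals quoted in the corpus description above.
conversations = 18_429
messages = 375_967
words = 3_674_802

# Simple means derived from the totals (not per-conversation statistics).
messages_per_conversation = messages / conversations
words_per_message = words / messages
words_per_conversation = words / conversations

print(f"messages per conversation: {messages_per_conversation:.1f}")  # 20.4
print(f"words per message:         {words_per_message:.1f}")          # 9.8
print(f"words per conversation:    {words_per_conversation:.1f}")     # 199.4
```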
The BOLT (Broad Operational Language Translation) program developed machine translation
and information retrieval for less formal genres, focusing particularly on
user-generated content. LDC supported the BOLT program by collecting informal
data sources -- discussion forums, text messaging, and chat -- in Chinese,
Egyptian Arabic and English. The collected data was translated and annotated
for various tasks including word alignment, treebanking, propbanking and
co-reference.
BOLT English SMS/Chat is available via web download.
2018 Subscription Members will receive copies of this
corpus. 2018 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for a fee.
*
(2) CIEMPIESS Balance (Corpus de Investigación en Español de México del
Posgrado de Ingeniería Eléctrica y Servicio Social) was developed by the
Development of Speech Technologies program at the School of Engineering at
the National Autonomous University of Mexico (UNAM) and consists of
approximately 18 hours of Mexican Spanish broadcast speech with associated
transcripts. The goal of this work was to create acoustic models for automatic
speech recognition. For more information and documentation see the CIEMPIESS-UNAM Project website.
CIEMPIESS Balance is a companion corpus to CIEMPIESS Light,
released by LDC as LDC2017S23.
It was developed so that the data sets together constitute a gender-balanced
corpus. The gender breakdown in CIEMPIESS Light is approximately 75% male and
25% female. In CIEMPIESS Balance, the gender breakdown is approximately 25% male
and 75% female.
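The balancing arithmetic can be illustrated with a weighted average. The sketch below is hypothetical in two respects: the function name is invented for illustration, and it assumes that both corpora are about the same size (18 hours each) and that the quoted percentages are measured in the same unit as the weights. Only the size of CIEMPIESS Balance is stated above.

```python
def combined_female_share(size_a, female_a, size_b, female_b):
    """Size-weighted female share of two corpora pooled together."""
    return (size_a * female_a + size_b * female_b) / (size_a + size_b)

# ASSUMPTION: CIEMPIESS Light is taken to be the same size as
# CIEMPIESS Balance (about 18 hours); only the latter is stated above.
light_hours, light_female = 18.0, 0.25
balance_hours, balance_female = 18.0, 0.75

share = combined_female_share(light_hours, light_female,
                              balance_hours, balance_female)
print(f"combined female share: {share:.0%}")  # 50%
```

If the two corpora differ in size, the pooled share moves toward the breakdown of the larger one, so the result is only exactly 50% under the equal-size assumption.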
The majority of the speech recordings were collected from Radio-IUS,
a UNAM radio station. Other recordings were taken from IUS Canal Multimedia
and Centro Universitario de Estudios Jurídicos (CUEJ UNAM). These two
channels feature videos with speech about legal issues and topics related to
UNAM.
CIEMPIESS Balance is available via web download.
2018 Subscription Members will receive copies of this
corpus. 2018 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data at no cost.
*
(3) 2011 NIST Language Recognition Evaluation Test Set contains selected
training data and the evaluation test set for the 2011 NIST Language
Recognition Evaluation. It consists of approximately 204 hours of
conversational telephone speech and broadcast audio collected by LDC between
2009 and 2011 in the following 24 languages and dialects: Arabic (Iraqi),
Arabic (Levantine), Arabic (Maghrebi), Arabic (Standard), Bengali, Czech, Dari,
English (American), English (Indian), Farsi, Hindi, Lao, Mandarin, Panjabi,
Pashto, Polish, Russian, Slovak, Spanish, Tamil, Thai, Turkish, Ukrainian, and
Urdu.
The 2011 evaluation emphasized the language pair condition and involved both
conversational telephone speech (CTS) and broadcast narrow-band speech (BNBS).
This release includes training data for nine language
varieties that had not been represented in prior LRE cycles -- Arabic (Iraqi),
Arabic (Levantine), Arabic (Maghrebi), Arabic (Standard), Czech, Lao, Panjabi,
Polish and Slovak -- contained in 893 audited segments of roughly 30 seconds
each and in 400 full-length CTS recordings. The evaluation test set
comprises a total of 29,511 audio files, all manually audited at LDC for
language and divided equally into three different test conditions according to
the nominal amount of speech content per segment.
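As a quick check on the equal split, the per-condition file count follows from integer division. This minimal Python sketch uses only the total quoted above; the variable names are illustrative.

```python
# Total evaluation files and number of test conditions quoted above.
total_files = 29_511
conditions = 3

# divmod returns the per-condition count and any leftover files.
files_per_condition, remainder = divmod(total_files, conditions)

print(files_per_condition, remainder)  # 9837 0
```

A remainder of zero confirms that the total divides evenly across the three conditions.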
LDC released the prior LREs as:
- 2003 NIST Language Recognition Evaluation (LDC2006S31)
- 2005 NIST Language Recognition Evaluation (LDC2008S05)
- 2007 NIST Language Recognition Evaluation Test Set (LDC2009S04)
- 2007 NIST Language Recognition Evaluation Supplemental Training Set (LDC2009S05)
- 2009 NIST Language Recognition Evaluation Test Set (LDC2014S06)
2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.