Linguistic Data Consortium: CIEMPIESS

Monday, May 20, 2019

LDC 2019 May Newsletter

New Publications:

Multi-Language Conversational Telephone Speech 2011 -- English Group

TAC KBP Chinese Regular Slot Filling - Comprehensive Training and Evaluation Data 2014

CIEMPIESS Experimentation

IARPA Babel Guarani Language Pack IARPA-babel305b-v1.0c

__________________________________________________________

New publications:

(1) Multi-Language Conversational Telephone Speech 2011 -- English Group was developed by LDC and comprises approximately 18 hours of telephone speech in two general varieties of English: American and South Asian.

The data were collected primarily to support research and technology evaluation in automatic language identification, and portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation (LRE). Participants were recruited by native speakers who contacted acquaintances in their social network. Those native speakers made one call of up to 15 minutes to each acquaintance. Calls were labeled by human auditors for callee gender, dialect type, and noise.

LDC has also released the following as part of the Multi-Language Conversational Telephone Speech 2011 series:

Slavic Group (LDC2016S11)
Turkish (LDC2017S09)
South Asian (LDC2017S14)
Central Asian (LDC2018S03)
Central European (LDC2018S08)
Spanish (LDC2018S12)
Arabic (LDC2019S02)

Multi-Language Conversational Telephone Speech 2011 -- English Group is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) TAC KBP Chinese Regular Slot Filling - Comprehensive Training and Evaluation Data 2014 was developed by LDC and contains training and evaluation data produced in support of the TAC KBP Chinese Regular Slot Filling evaluation track conducted in 2014. This release includes queries, the 'manual runs' (human-produced responses to the queries), the final rounds of assessment results, and the complete set of Chinese source documents.

The regular Chinese Slot Filling evaluation track involved mining information about entities from text. In completing the task, participating systems and LDC annotators searched a corpus for information on certain attributes (slots) of person and organization entities and attempted to return all valid answers (slot fillers) in the source collection.

TAC KBP Chinese Regular Slot Filling - Comprehensive Training and Evaluation Data 2014 is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) CIEMPIESS Experimentation (Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social) was developed by the Facultad de Ingeniería at the National Autonomous University of Mexico (UNAM) and consists of approximately 22 hours of Mexican Spanish broadcast and read speech with associated transcripts. The goal of this work was to create acoustic models for automatic speech recognition. For more information and documentation see the CIEMPIESS-UNAM Project website.

CIEMPIESS Experimentation comprises three data sets: Complementary, Fem, and Test. Complementary is a phonetically-balanced corpus of isolated Spanish words spoken in Central Mexico. Fem contains broadcast speech from 21 female speakers, collected to balance, by gender, the recordings from male speakers in other CIEMPIESS collections. Test consists of 10 hours of broadcast speech and transcripts and is intended for use as a standard test data set alongside other CIEMPIESS corpora.

Most of the speech recordings in Fem and Test were collected from Radio-IUS, a UNAM radio station. Other recordings were taken from IUS Canal Multimedia and Centro Universitario de Estudios Jurídicos (CUEJ UNAM). Those two channels feature videos with speech around legal issues and topics related to UNAM. The Complementary recordings consist of read speech collected for that corpus.

LDC has released the following data sets in the CIEMPIESS series:

CIEMPIESS (LDC2015S07)
CHM150 (LDC2016S04)
CIEMPIESS Light (LDC2017S23)
CIEMPIESS Balance (LDC2018S11)

CIEMPIESS Experimentation is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data at no cost.

(4) IARPA Babel Guarani Language Pack IARPA-babel305b-v1.0c was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. This corpus contains approximately 198 hours of Guarani conversational and scripted telephone speech collected in 2014 and 2015 along with corresponding transcripts.

The Guarani speech in this release represents that spoken in Paraguay. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 67 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Guarani Language Pack IARPA-babel305b-v1.0c is distributed via web download.

2019 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Thursday, August 16, 2018

LDC 2018 August Newsletter

LDC at Interspeech 2018

Fall 2018 LDC Data Scholarship Program

New Publications:

BOLT English SMS/Chat

CIEMPIESS Balance

2011 NIST Language Recognition Evaluation Test Set

_______________________________________________________________________

LDC at Interspeech 2018

LDC will participate in several ways at Interspeech 2018, held this year in Hyderabad, India, September 2-6. It is co-organizing the special session The First DIHARD Speech Diarization Challenge on September 3 and is a sponsor of the September 1 pre-conference workshop, Young Female Researchers in Speech Science & Technology (YFRSW). Results of recent work will be presented in the poster “Global TIMIT: Acoustic Phonetic Datasets for the World’s Languages” during the poster session on September 3.

Fall 2018 LDC Data Scholarship Program

Students can apply for the Fall 2018 Data Scholarship Program now through September 15, 2018. The LDC Data Scholarship program provides students with access to LDC data at no cost. For more information on application requirements and program rules, please visit LDC Data Scholarships.

New publications:

(1) BOLT English SMS/Chat was developed by LDC and consists of naturally-occurring Short Message Service (SMS) and Chat (CHT) data collected through data donations and live collection from native English speakers. The corpus contains 18,429 conversations totaling 3,674,802 words across 375,967 messages.

The BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging, and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.

BOLT English SMS/Chat is available via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) CIEMPIESS Balance (Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social) was developed by the Development of Speech Technologies program at the School of Engineering at the National Autonomous University of Mexico (UNAM) and consists of approximately 18 hours of Mexican Spanish broadcast speech with associated transcripts. The goal of this work was to create acoustic models for automatic speech recognition. For more information and documentation see the CIEMPIESS-UNAM Project website.

CIEMPIESS Balance is a companion corpus to CIEMPIESS Light, released by LDC as LDC2017S23. It was developed so that the data sets together constitute a gender-balanced corpus. The gender breakdown in CIEMPIESS Light is approximately 75% male and 25% female. In CIEMPIESS Balance, the gender breakdown is approximately 25% male and 75% female.
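As a rough illustration of how the two corpora offset each other, the sketch below computes the combined gender share as an hours-weighted average. The 18-hour figure for CIEMPIESS Balance comes from the description above; the matching 18 hours for CIEMPIESS Light is an assumption made only for this example, and the function name `combined_male_share` is hypothetical.

```python
def combined_male_share(parts):
    """Return the hours-weighted male fraction across several corpora.

    parts: list of (hours, male_fraction) tuples.
    """
    total_hours = sum(hours for hours, _ in parts)
    male_hours = sum(hours * male for hours, male in parts)
    return male_hours / total_hours


# (hours, approximate male fraction)
light = (18.0, 0.75)    # hours are ASSUMED for illustration; 75% male per the text
balance = (18.0, 0.25)  # approx. 18 hours and 25% male per the text

male_share = combined_male_share([light, balance])
print(f"male: {male_share:.0%}, female: {1 - male_share:.0%}")
```

If the two corpora differ in size, the combined split shifts toward the larger one, so the even split holds only under the equal-hours assumption.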

The majority of the speech recordings were collected from Radio-IUS, a UNAM radio station. Other recordings were taken from IUS Canal Multimedia and Centro Universitario de Estudios Jurídicos (CUEJ UNAM). These two channels feature videos with speech around legal issues and topics related to UNAM.

CIEMPIESS Balance is available via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data at no cost.

(3) 2011 NIST Language Recognition Evaluation Test Set contains selected training data and the evaluation test set for the 2011 NIST Language Recognition Evaluation. It consists of approximately 204 hours of conversational telephone speech and broadcast audio collected by LDC between 2009 and 2011 in the following 24 languages and dialects: Arabic (Iraqi), Arabic (Levantine), Arabic (Maghrebi), Arabic (Standard), Bengali, Czech, Dari, English (American), English (Indian), Farsi, Hindi, Lao, Mandarin, Panjabi, Pashto, Polish, Russian, Slovak, Spanish, Tamil, Thai, Turkish, Ukrainian, and Urdu.

The 2011 evaluation emphasized the language pair condition and involved both conversational telephone speech (CTS) and broadcast narrow-band speech (BNBS).

This release includes training data for nine language varieties that had not been represented in prior LRE cycles -- Arabic (Iraqi), Arabic (Levantine), Arabic (Maghrebi), Arabic (Standard), Czech, Lao, Panjabi, Polish and Slovak -- contained in 893 audited segments of roughly 30 seconds duration and in 400 full-length CTS recordings. The evaluation test set comprises a total of 29,511 audio files, all manually audited at LDC for language and divided equally into three different test conditions according to the nominal amount of speech content per segment.
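For a sense of scale, the equal three-way split of the test set can be checked with simple integer division. This is only an arithmetic sketch based on the totals stated above; the variable names are illustrative.

```python
TOTAL_FILES = 29_511      # audio files in the evaluation test set, per the text
NUM_CONDITIONS = 3        # test conditions, per the text

# divmod returns (quotient, remainder); a zero remainder means an exact split.
per_condition, remainder = divmod(TOTAL_FILES, NUM_CONDITIONS)
print(f"{per_condition} files per condition, remainder {remainder}")
```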

LDC released the prior LREs as:

2003 NIST Language Recognition Evaluation (LDC2006S31)
2005 NIST Language Recognition Evaluation (LDC2008S05)
2007 NIST Language Recognition Evaluation Test Set (LDC2009S04)
2007 NIST Language Recognition Evaluation Supplemental Training Set (LDC2009S05)
2009 NIST Language Recognition Evaluation Test Set (LDC2014S06)

2011 NIST Language Recognition Evaluation Test Set is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.