Linguistic Data Consortium: LDC 2019 May Newsletter

New Publications:
Multi-Language Conversational Telephone Speech 2011 -- English Group

TAC KBP Chinese Regular Slot Filling - Comprehensive Training and Evaluation Data 2014

CIEMPIESS Experimentation

IARPA Babel Guarani Language Pack IARPA-babel305b-v1.0c

__________________________________________________________

New publications:

(1) Multi-Language Conversational Telephone Speech 2011 -- English Group was developed by LDC and comprises approximately 18 hours of telephone speech in two general varieties of English: American and South Asian.

The data were collected primarily to support research and technology evaluation in automatic language identification, and portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation (LRE). Participants were recruited by native speakers who contacted acquaintances in their social network. Each native speaker made a single call of up to 15 minutes to each acquaintance. Human auditors labeled the calls for callee gender, dialect type, and noise.

LDC has also released the following as part of the Multi-Language Conversational Telephone Speech 2011 series:

Slavic Group (LDC2016S11)
Turkish (LDC2017S09)
South Asian (LDC2017S14)
Central Asian (LDC2018S03)
Central European (LDC2018S08)
Spanish (LDC2018S12)
Arabic (LDC2019S02)

Multi-Language Conversational Telephone Speech 2011 -- English Group is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) TAC KBP Chinese Regular Slot Filling - Comprehensive Training and Evaluation Data 2014 was developed by LDC and contains training and evaluation data produced in support of the TAC KBP Chinese Regular Slot Filling evaluation track conducted in 2014. This release includes queries, the 'manual runs' (human-produced responses to the queries), the final rounds of assessment results, and the complete set of Chinese source documents.

The Chinese Regular Slot Filling evaluation track involved mining information about entities from text. To complete the task, participating systems and LDC annotators searched a corpus for information on certain attributes (slots) of person and organization entities and attempted to return all valid answers (slot fillers) in the source collection.

TAC KBP Chinese Regular Slot Filling - Comprehensive Training and Evaluation Data 2014 is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) CIEMPIESS Experimentation (Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social) was developed by the Facultad de Ingeniería at the National Autonomous University of Mexico (UNAM) and consists of approximately 22 hours of Mexican Spanish broadcast and read speech with associated transcripts. The goal of this work was to create acoustic models for automatic speech recognition. For more information and documentation, see the CIEMPIESS-UNAM Project website.

CIEMPIESS Experimentation comprises three data sets: Complementary, Fem, and Test. Complementary is a phonetically balanced corpus of isolated Spanish words spoken in Central Mexico. Fem contains broadcast speech from 21 female speakers, collected to balance the gender distribution against the recordings from male speakers in other CIEMPIESS collections. Test consists of 10 hours of broadcast speech and transcripts and is intended for use as a standard test data set alongside other CIEMPIESS corpora.

Most of the speech recordings in Fem and Test were collected from Radio-IUS, a UNAM radio station. Other recordings were taken from IUS Canal Multimedia and Centro Universitario de Estudios Jurídicos (CUEJ UNAM). Those two channels feature videos with speech about legal issues and topics related to UNAM. The Complementary recordings consist of read speech collected specifically for that corpus.

LDC has released the following data sets in the CIEMPIESS series:

CIEMPIESS (LDC2015S07)
CHM150 (LDC2016S04)
CIEMPIESS Light (LDC2017S23)
CIEMPIESS Balance (LDC2018S11)

CIEMPIESS Experimentation is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data at no cost.

(4) IARPA Babel Guarani Language Pack IARPA-babel305b-v1.0c was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. This corpus contains approximately 198 hours of Guarani conversational and scripted telephone speech collected in 2014 and 2015 along with corresponding transcripts.

The Guarani speech in this release represents that spoken in Paraguay. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 67 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Guarani Language Pack IARPA-babel305b-v1.0c is distributed via web download.

2019 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Linguistic Data Consortium

Monday, May 20, 2019
