Linguistic Data Consortium: LRE 2011


Friday, May 15, 2020

LDC 2020 May Newsletter

New Publications:

LORELEI Oromo Incident Language Pack

LORELEI Entity Detection and Linking Knowledge Base

BOLT English Translation Treebank - Chinese Discussion Forum

Multi-Language Conversational Telephone Speech 2011 -- Mandarin Chinese

_______________________________________________________________

New publications:

(1) LORELEI Oromo Incident Language Pack was developed by LDC and is comprised of approximately 3.9 million words of Oromo monolingual text, 25,000 words of English monolingual text, 135,000 words of parallel and comparable Oromo-English text, and 50,000 words of data annotated for Entity Discovery and Linking and Situation Frames. It contains all of the text data, annotations, supplemental resources and related software tools for the Oromo language that were used in the DARPA LORELEI / LoReHLT 2017 Evaluation.

The evaluation protocol was based on a scenario in which an unforeseen event triggered a need for humanitarian and logistical support in a region where the incident language had received little or no attention in NLP research. Evaluation participants provided NLP solutions, including information extraction and machine translation, with limited resources and limited development time.

Data was collected from news, social network, weblog, newsgroup, discussion forum, and reference material. Entity Detection and Linking and Situation Frame annotations identified “entities,” “needs” (such as a need for food) and “issues” (such as civil unrest) to be detected by systems for scoring purposes. Situation frame analysis was designed to extract basic information that would be useful for planning a disaster response effort.

The knowledge base for the entity linking annotation in this corpus is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

LORELEI Oromo Incident Language Pack is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) LORELEI Entity Detection and Linking Knowledge Base was developed by LDC and contains the full LORELEI Entity Detection and Linking (EDL) Knowledge Base (KB) used for all LORELEI Representative Language and Incident Language Pack entity linking annotation. The LORELEI (Low Resource Languages for Emergent Incidents) Program was concerned with building human language technology for low resource languages in the context of emergent situations like natural disasters or disease outbreaks.

The KB in this release supported the EDL task in LORELEI for four entity types -- geo-political entities (GPE), locations (LOC), persons (PER) and organizations (ORG) -- and contains a total of 10,216,832 entities. There are four inputs to the KB, each designated by a unique "origin" code in the KB, as follows: GPE and LOC entities from a snapshot of GeoNames, PER entities from the CIA World Leaders List, ORG entities from Appendix B of the CIA World Factbook, and additional entities manually created by LDC for each of the representative and incident languages in the LORELEI Program.
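The four-way provenance scheme described above can be pictured with a short sketch. This is an illustration only: the record layout, the field names and the origin labels below are hypothetical placeholders, not the codes or the file format used in the actual release.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical origin labels standing in for the KB's real "origin" codes,
# which are not given in this newsletter.
ORIGIN_SOURCES = {
    "geonames": "GPE and LOC entities from a GeoNames snapshot",
    "world_leaders": "PER entities from the CIA World Leaders List",
    "factbook_appendix_b": "ORG entities from Appendix B of the CIA World Factbook",
    "ldc_manual": "entities manually created by LDC",
}


@dataclass(frozen=True)
class KBEntry:
    entity_id: str    # hypothetical identifier
    entity_type: str  # one of GPE, LOC, PER, ORG
    origin: str       # key into ORIGIN_SOURCES


def count_by_origin(entries):
    """Tally entries per origin label, rejecting unknown labels."""
    counts = Counter()
    for entry in entries:
        if entry.origin not in ORIGIN_SOURCES:
            raise ValueError(f"unknown origin: {entry.origin}")
        counts[entry.origin] += 1
    return counts


# Toy records for illustration; not real KB content.
sample = [
    KBEntry("e1", "GPE", "geonames"),
    KBEntry("e2", "LOC", "geonames"),
    KBEntry("e3", "PER", "world_leaders"),
]
print(count_by_origin(sample))
```

Keeping the origin on every record would let a user filter the KB by source, for example to work only with the manually created entries.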

LORELEI Entity Detection and Linking Knowledge Base is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) BOLT English Translation Treebank - Chinese Discussion Forum was developed by LDC and consists of 147,432 tokens of web discussion forum data translated from Chinese to English and annotated for part-of-speech and syntactic structure.

The source data is Chinese discussion forum web text collected by LDC in 2011 and 2012, translated into English and released in BOLT Chinese Discussion Forum Parallel Training Data (LDC2017T05). A subset of the translated text -- 148 files representing 147,432 tokens -- was selected for the treebank and annotated for word-level tokenization, part-of-speech and syntactic structure. Only the translated English text is included in the source data for this release.

Part-of-speech and treebank annotation conformed to Penn Treebank II style, incorporating changes to those guidelines that were developed under the GALE (Global Autonomous Language Exploitation) program. Supplementary guidelines for English treebanks and web text are included with this release.

BOLT English Translation Treebank - Chinese Discussion Forum is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(4) Multi-Language Conversational Telephone Speech 2011 -- Mandarin Chinese was developed by LDC and is comprised of approximately 25 hours of telephone speech in Mandarin Chinese.

The data were collected primarily to support research and technology evaluation in automatic language identification, and portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation (LRE). Participants were recruited by native speakers who contacted acquaintances in their social network. Those native speakers made one call, up to 15 minutes, to each acquaintance. The data were collected using LDC's telephone collection infrastructure, which comprises three computer telephony systems. Human auditors labeled calls for callee gender, dialect type and noise.

LDC has also released the following as part of the Multi-Language Conversational Telephone Speech 2011 series:

Slavic Group (LDC2016S11)
Turkish (LDC2017S09)
South Asian (LDC2017S14)
Central Asian (LDC2018S03)
Central European (LDC2018S08)
Spanish (LDC2018S12)
Arabic (LDC2019S02)
English (LDC2019S06)
East Asian (LDC2019S15)

Multi-Language Conversational Telephone Speech 2011 -- Mandarin Chinese is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Thursday, August 15, 2019

LDC 2019 August Newsletter

Fall 2019 LDC Data Scholarship Program

New Publications:

Corpus of Conversational Persian Transcripts

TAC KBP Evaluation Source Corpora 2016-2017

Multi-Language Conversational Telephone Speech 2011 -- East Asian

IARPA Babel Igbo Language Pack IARPA-babel306b-v2.0c

__________________________________________________________________

Fall 2019 LDC Data Scholarship Program

Students can apply for the Fall 2019 LDC Data Scholarship program now through September 15, 2019. This scholarship program provides eligible students with access to LDC data at no cost. For application requirements and program rules, please visit the LDC Data Scholarship page.

New publications:

(1) Corpus of Conversational Persian Transcripts contains transcripts from approximately 20 hours of naturally occurring informal conversations in the Tehrani dialect of Iranian Persian.

This data set is extracted from 1,201 minutes of conversations among 22 participants (12 male and 10 female) who recorded their daily phone calls and face-to-face interactions in a variety of informal settings. Conversations represent various interaction types (dialogue and group conversation), settings (home, office, car, café and restaurant), types of relationship (family, couple, friend, acquaintance), and various communicative goals (joking, explaining, arguing, and complaining, among others). The corresponding speech is not included in this release.

The transcripts were annotated for gender, age, and recording method and setting.

Corpus of Conversational Persian Transcripts is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) TAC KBP Evaluation Source Corpora 2016-2017 was developed by LDC and contains the 180,003 Chinese, English and Spanish source documents used in support of all TAC KBP evaluation tracks conducted in 2016 and 2017.

The source data consists of Chinese, English and Spanish discussion forum and newswire text collected by LDC. Also provided are a series of lists and tables to aid in the recreation of specific test sets.

Text Analysis Conference (TAC) is a series of workshops organized by the National Institute of Standards and Technology (NIST), developed to encourage research in natural language processing and related applications. The Knowledge Base Population (KBP) track of TAC encourages the development of systems that can match entities mentioned in natural texts with those appearing in a knowledge base and extract novel information about entities from a document collection and add it to a new or existing knowledge base.

TAC KBP Evaluation Source Corpora 2016-2017 is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) Multi-Language Conversational Telephone Speech 2011 -- East Asian was developed by LDC and is comprised of approximately 19 hours of telephone speech in two distinct languages of East Asia: Thai and Lao.

The data were collected primarily to support research and technology evaluation in automatic language identification, and portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation (LRE). Participants were recruited by native speakers who contacted acquaintances in their social network. Those native speakers made one call, up to 15 minutes, to each acquaintance. Calls are labeled by human auditors for callee gender, dialect type, and noise.

LDC has also released the following as part of the Multi-Language Conversational Telephone Speech 2011 series:

Slavic Group (LDC2016S11)
Turkish (LDC2017S09)
South Asian (LDC2017S14)
Central Asian (LDC2018S03)
Central European (LDC2018S08)
Spanish (LDC2018S12)
Arabic (LDC2019S02)
English (LDC2019S06)

Multi-Language Conversational Telephone Speech 2011 -- East Asian is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(4) IARPA Babel Igbo Language Pack IARPA-babel306b-v2.0c was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 207 hours of Igbo conversational and scripted telephone speech collected in 2014 and 2015 along with corresponding transcripts.

The Igbo speech in this release represents the Owerri, Onitsha, and Ngwa dialects spoken in Nigeria. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 67 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Igbo Language Pack IARPA-babel306b-v2.0c is distributed via web download.

2019 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Thursday, August 16, 2018

LDC 2018 August Newsletter

LDC at Interspeech 2018

Fall 2018 LDC Data Scholarship Program

New Publications:

BOLT English SMS/Chat

CIEMPIESS Balance

2011 NIST Language Recognition Evaluation Test Set

_______________________________________________________________________

LDC at Interspeech 2018

LDC will participate in several ways at Interspeech 2018, held this year in Hyderabad, India, September 2-6. It is co-organizing the special session, The First DIHARD Speech Diarization Challenge, on September 3 and is a sponsor of the September 1 pre-conference workshop, Young Female Researchers in Speech Science & Technology (YFRSW). Results of recent work, “Global TIMIT: Acoustic Phonetic Datasets for the World’s Languages,” will be presented during the poster session on September 3.

Fall 2018 LDC Data Scholarship Program

Students can apply for the Fall 2018 Data Scholarship Program now through September 15, 2018. The LDC Data Scholarship program provides students with access to LDC data at no cost. For more information on application requirements and program rules, please visit LDC Data Scholarships.

New publications:

(1) BOLT English SMS/Chat was developed by LDC and consists of naturally-occurring Short Message Service (SMS) and Chat (CHT) data collected through data donations and live collection from native English speakers. The corpus contains 18,429 conversations totaling 3,674,802 words across 375,967 messages.

The BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging, and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.

BOLT English SMS/Chat is available via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) CIEMPIESS Balance (Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social) was developed by the Development of Speech Technologies program at the School of Engineering at the National Autonomous University of Mexico (UNAM) and consists of approximately 18 hours of Mexican Spanish broadcast speech with associated transcripts. The goal of this work was to create acoustic models for automatic speech recognition. For more information and documentation see the CIEMPIESS-UNAM Project website.

CIEMPIESS Balance is a companion corpus to CIEMPIESS Light, released by LDC as LDC2017S23. It was developed so that the data sets together constitute a gender-balanced corpus. The gender breakdown in CIEMPIESS Light is approximately 75% male and 25% female. In CIEMPIESS Balance, the gender breakdown is approximately 25% male and 75% female.

The majority of the speech recordings were collected from Radio-IUS, a UNAM radio station. Other recordings were taken from IUS Canal Multimedia and Centro Universitario de Estudios Jurídicos (CUEJ UNAM). These two channels feature videos with speech around legal issues and topics related to UNAM.

CIEMPIESS Balance is available via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data at no cost.

(3) 2011 NIST Language Recognition Evaluation Test Set contains selected training data and the evaluation test set for the 2011 NIST Language Recognition Evaluation. It consists of approximately 204 hours of conversational telephone speech and broadcast audio collected by LDC between 2009 and 2011 in the following 24 languages and dialects: Arabic (Iraqi), Arabic (Levantine), Arabic (Maghrebi), Arabic (Standard), Bengali, Czech, Dari, English (American), English (Indian), Farsi, Hindi, Lao, Mandarin, Panjabi, Pashto, Polish, Russian, Slovak, Spanish, Tamil, Thai, Turkish, Ukrainian, and Urdu.

The 2011 evaluation emphasized the language pair condition and involved both conversational telephone speech (CTS) and broadcast narrow-band speech (BNBS).

This release includes training data for nine language varieties that had not been represented in prior LRE cycles -- Arabic (Iraqi), Arabic (Levantine), Arabic (Maghrebi), Arabic (Standard), Czech, Lao, Panjabi, Polish and Slovak -- contained in 893 audited segments of roughly 30 seconds duration and in 400 full-length CTS recordings. The evaluation test set comprises a total of 29,511 audio files, all manually audited at LDC for language and divided equally into three different test conditions according to the nominal amount of speech content per segment.
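As a quick check on the figures above, and assuming that "divided equally" means the same number of files in each of the three test conditions, the per-condition count follows from simple division. The names in this sketch are illustrative and are not taken from the release documentation.

```python
# Illustrative arithmetic only; names are hypothetical, not from the corpus documentation.
TOTAL_TEST_FILES = 29_511  # total audio files in the evaluation test set (from the text above)
NUM_TEST_CONDITIONS = 3    # test conditions by nominal speech content per segment


def files_per_condition(total: int, conditions: int) -> int:
    """Return the per-condition file count, assuming an exactly equal split."""
    per_condition, remainder = divmod(total, conditions)
    if remainder:
        raise ValueError("total does not divide equally across conditions")
    return per_condition


print(files_per_condition(TOTAL_TEST_FILES, NUM_TEST_CONDITIONS))  # 9837
```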

LDC released the prior LREs as:

2003 NIST Language Recognition Evaluation (LDC2006S31)
2005 NIST Language Recognition Evaluation (LDC2008S05)
2007 NIST Language Recognition Evaluation Test Set (LDC2009S04)
2007 NIST Language Recognition Evaluation Supplemental Training Set (LDC2009S05)
2009 NIST Language Recognition Evaluation Test Set (LDC2014S06)

2011 NIST Language Recognition Evaluation Test Set is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Monday, May 15, 2017

LDC May 2017 Newsletter

In this newsletter:

Recent Collaborations

New publications:

IARPA Babel Lao Language Pack IARPA-babel203b-v3.1a

Multi-Language Conversational Telephone Speech 2011 -- Turkish

Phrase Detectives Corpus

The EventStatus Corpus

______________________________________________________________

Recent Collaborations

Collaborations play an important role in many LDC activities. Over the past twenty-five years, LDC has partnered, consulted, and otherwise “collaborated” with a variety of organizations to advance research community goals. Recently, LDC partnered with Oxford Wave Research to integrate its latest speech technology into data collection and annotation processes. LDC also supports the Hearables Challenge sponsored by the National Science Foundation by creating and distributing training and test corpora. Finally, LDC Executive Director Chris Cieri is working with international colleagues to plan LREC2018 as a member of the Conference Programme Committee.

LDC welcomes new collaborations. Let us know what interests you and how we can work together. Contact LDC to begin the conversation.

New publications:

(1) IARPA Babel Lao Language Pack IARPA-babel203b-v3.1a was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 207 hours of Lao conversational and scripted telephone speech collected in 2013 along with corresponding transcripts.

The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

The Lao speech in this release represents that spoken in the Vientiane dialect region in Laos. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 60 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Lao Language Pack IARPA-babel203b-v3.1a is distributed via web download.

2017 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) Multi-Language Conversational Telephone Speech 2011 -- Turkish was developed by LDC and is comprised of approximately 18 hours of telephone speech in Turkish. The data was collected primarily to support research and technology evaluation in automatic language identification, and portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation (LRE).

Participants were recruited by native speakers who contacted acquaintances in their social network. Those native speakers made one call, up to 15 minutes, to each acquaintance. The data was collected using LDC's telephone collection infrastructure, comprised of three computer telephony systems. Human auditors labeled calls for callee gender, dialect type and noise. Demographic information about the participants was not collected.

LDC has also released Multi-Language Conversational Telephone Speech 2011 -- Slavic Group (LDC2016S11).

Multi-Language Conversational Telephone Speech 2011 -- Turkish is distributed via web download.

2017 Subscription Members will automatically receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) Phrase Detectives Corpus was developed by the School of Computer Science and Electronic Engineering at the University of Essex and consists of approximately 19,012 words across 40 documents anaphorically annotated by the Phrase Detectives game, an online interactive "game-with-a-purpose" designed to collect data about English anaphoric coreference.

The documents in the corpus are taken from Wikipedia articles and from narrative text in Project Gutenberg. The annotations comprise a gold standard version created by multiple experts, as well as a set created by a large non-expert crowd (via the Phrase Detectives game).

The data was annotated according to a prevalent linguistically-oriented approach for anaphora used in several tasks, including OntoNotes Release 5.0 (LDC2013T19), SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages (LDC2011T01) and The ARRAU Corpus of Anaphoric Information (LDC2013T22).

Phrase Detectives Corpus is distributed via web download.

(4) The EventStatus Corpus was developed by researchers at Texas A&M University, Stanford University and The University of Utah. It consists of approximately 3,000 English and 1,500 Spanish news articles about civil unrest events annotated with temporal tags.

This corpus was designed to support the study of the temporal and aspectual properties of major events, that is, whether an event has already happened, is currently happening or may happen in the future. Since it focuses on a single domain (civil unrest events), it may be appropriate for tasks such as event extraction and temporal question answering.

The relevant news articles were sourced from English Gigaword Fifth Edition (LDC2011T07) and Spanish Gigaword Third Edition (LDC2011T12). The civil unrest events include protests, demonstrations, marches and strikes.

The EventStatus Corpus is distributed via web download.