Thursday, August 15, 2024

LDC August 2024 Newsletter

Fall 2024 LDC Data Scholarship Program  

New publications:

LORELEI Uyghur Incident Language Pack

Ravnursson Faroese Speech and Transcripts

_______________________________________________________________

Fall 2024 LDC Data Scholarship Program  

Student applications for the Fall 2024 LDC Data Scholarship program are being accepted now through September 15, 2024. This program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and letter of support from their advisor. For application requirements and program rules, visit the LDC Data Scholarships page.

New publications: 

LORELEI Uyghur Incident Language Pack was developed by LDC and is comprised of 28 million words of Uyghur monolingual text, 500,000 words of English monolingual text, 3.3 million words of parallel and comparable Uyghur-English text, and 200,000 words annotated for simple named entities and situation frames. It constitutes all of the text data, annotations, supplemental resources and related software tools for the Uyghur language that were used in the DARPA LORELEI / LoReHLT 2016 Evaluation.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. In the evaluation scenario, an unforeseen event triggered a need for humanitarian and logistical support in a region where the incident language had received little or no attention in NLP research. Evaluation participants provided NLP solutions, including information extraction and machine translation, with limited resources and limited development time.

Data was collected from news, social network, weblog, newsgroup, discussion forum, and reference material. Named entity annotation identified entities to be detected by systems for scoring purposes. Situation frame analysis was designed to extract basic information about needs and relevant issues for planning a disaster response effort.

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

*

Ravnursson Faroese Speech and Transcripts contains 109 hours of Faroese prompted speech from 433 speakers (249 female, 184 male), corresponding transcripts and speaker metadata. It is an extract from the Basic Language Resource Kit 1.0 (BLARK 1.0) developed by the Faroe Islands' Ravnur Project.
 
Speech data was collected in 2022. Speakers from all major dialect areas in the Faroe Islands in three age groups -- 15-35, 36-60, and 61+ years -- read texts that included a word list, a phrase list, closed vocabulary readings, and short texts. Recordings also contain spontaneous speech. Orthographic transcripts are included. 

2024 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data at no cost.
 

 

Monday, July 15, 2024

LDC July 2024 Newsletter

LDC at IC2S2

Fall 2024 LDC Data Scholarship Program 

New publications:

MATERIAL Bulgarian-English Language Pack

Dialogs Re-Enacted Across Languages

____________________________________________________________________

LDC at IC2S2 
LDC is delighted to be a bronze sponsor for the 10th International Conference on Computational Social Science (IC2S2) held this year on Penn’s campus July 17-20. The conference will feature research from around the world across a broad range of relevant fields to advance the many frontiers of computational social science. Be sure to visit LDC’s table during the poster sessions July 18 and 19 from 1:30-2:30 pm. 

Fall 2024 LDC Data Scholarship Program 
Student applications for the Fall 2024 LDC Data Scholarship program are being accepted now through September 15, 2024. This program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and letter of support from their advisor. For application requirements and program rules, visit the LDC Data Scholarships page.

New publications:

MATERIAL Bulgarian-English Language Pack was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) MATERIAL (Machine Translation for English Retrieval of Information in Any Language) program. It contains 80 hours of Bulgarian conversational telephone speech, transcripts, English translations, annotations and queries.

Calls were made using different telephones (e.g., mobile, landline) from a variety of environments. Transcripts cover approximately 40% of the speech files, and approximately 10% of the speech files were translated into English. This release also includes domain annotations, English queries and their relevance annotations. 

The MATERIAL program focused on underserved languages, with the ultimate goal of building cross-language information retrieval systems that find speech and text content using English search queries.

2024 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.

*

Dialogs Re-Enacted Across Languages was developed at the University of Texas at El Paso. It contains 17 hours of conversational speech in English and Spanish by 129 unique bilingual speakers: short fragments extracted from spontaneous conversations, paired with close re-enactments in the other language by the original speakers, for a total of 3,816 pairs of matching utterances. Data was collected in 2022-2023. Participants were recruited from among students at the University of Texas at El Paso; all were bilingual speakers of General American English and of Mexico-Texas Border Spanish.

Each speaker pair had a 10-minute conversation in one language. Various fragments from these conversations were chosen for re-enactment, and the original speakers produced equivalents in the other language. Each re-enactment was vetted for fidelity to the original and naturalness in the target language. Also included is metadata about conversations, participants, re-enactments and utterances.

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

 

 

Monday, June 17, 2024

LDC June 2024 Newsletter

LDC data and commercial technology development

New publications:

Diaspora Tibetan Speech

AIDA Scenario 2 Practice Topic Annotation

_________________________________________________________________

LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:
 
Diaspora Tibetan Speech was developed at Yale University. It contains 28 hours of Tibetan elicited speech by 73 speakers from the diaspora Tibetan community in Kathmandu, Nepal, along with transcripts, elicitation materials and speaker metadata.

Recordings were collected in 2016. All speakers were adults and varied in age as well as age of diaspora. A substantial number of speakers were born in Nepal. Each speaker contributed one recording comprising a series of elicitation tasks: some demographic information; a word list and numbers; some sentences in isolation; a scripted story; and free speech based on "frog story" type illustrations. Annotation and metadata formats include PDF and Word (some transcripts), Excel (some transcripts, speaker metadata) and Praat TextGrids (word and number lists).

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee. 

*

AIDA Scenario 2 Practice Topic Annotation was developed by LDC and is comprised of annotations for 29 English, Russian and Spanish documents (text, image and video) from AIDA Scenario 2 Practice Topic Source Data (LDC2024T04), specifically, the set of practice documents designated for annotation in Phase 2.

Annotations are presented as tab-separated files in the following categories for each topic:

  • Mentions: single references in source data to a real-world entity or filler, event, or relation. 
  • Slots: pre-defined roles in an event or relation filled by an argument (entity mention).
  • Linking: entity mentions linked to entries in the knowledge base as a method of indicating the real-world entity to which an entity referred.
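
As a rough illustration of working with annotation tables of this kind, the following Python sketch reads two tab-separated tables and groups mention text by the knowledge base entry each mention is linked to. The column names (mention_id, doc_id, type, text, kb_id) and the sample rows are hypothetical and are not taken from the corpus; the actual file layout is described in the corpus documentation.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data. The real column names and identifiers are
# defined by the corpus documentation and will differ from these.
SAMPLE_MENTIONS = (
    "mention_id\tdoc_id\ttype\ttext\n"
    "m1\tdoc001\tentity\tExample City\n"
    "m2\tdoc001\tevent\tprotest\n"
    "m3\tdoc002\tentity\tExample City\n"
)
SAMPLE_LINKING = (
    "mention_id\tkb_id\n"
    "m1\tKB42\n"
    "m3\tKB42\n"
)


def read_tsv(handle):
    """Return the rows of a tab-separated file as a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def mentions_by_kb_entry(mentions, linking):
    """Group mention text by the knowledge base entry it is linked to."""
    text_by_id = {row["mention_id"]: row["text"] for row in mentions}
    grouped = defaultdict(list)
    for row in linking:
        grouped[row["kb_id"]].append(text_by_id[row["mention_id"]])
    return dict(grouped)


# In-memory strings stand in for files opened with open(path, newline="").
mentions = read_tsv(io.StringIO(SAMPLE_MENTIONS))
linking = read_tsv(io.StringIO(SAMPLE_LINKING))
linked = mentions_by_kb_entry(mentions, linking)
```

With real data, the in-memory strings would be replaced by the corpus files, and the column names adjusted to match the headers documented in the release.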

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

 

 

Thursday, May 16, 2024

LDC May 2024 Newsletter

LDC at LREC-COLING 2024

New publications:
Call My Net 1
Automatic Content Extraction for Portuguese
______________________________________________________________________

LDC at LREC-COLING 2024
LDC will be exhibiting at LREC-COLING 2024 hosted by the European Language Resources Association (ELRA) and the International Committee on Computational Linguistics (ICCL) May 20-25 in Turin, Italy. Stop by our table to learn more about recent developments at the Consortium and the latest publications. 

LDC staff members will also be presenting current work on topics including Spanless Event Annotation for Corpus-Wide Complex Event Understanding, Schema Learning Corpus: Data and Annotation Focused on Complex Events, and KoFREN: Comprehensive Korean Word Frequency Norms Derived from Large Scale Free Speech Corpora. 

LDC will post conference updates via social media. We look forward to seeing you in Italy!

New publications: 

Call My Net 1 was developed by LDC and contains 364 hours of conversational telephone speech in four languages (Tagalog, Cebuano, Cantonese and Mandarin) collected in 2015 from 221 native speakers located in the Philippines and China along with metadata and speaker demographic information. Recordings and data from this collection were used to support the NIST 2016 Speaker Recognition Evaluation.

Speakers made 10 telephone calls each to people within their existing social networks, using different handsets and under a variety of noise conditions. Speakers were connected through a robot operator to carry on casual conversations on topics of their choice. All recordings were manually audited to confirm language and speaker requirements. The documentation for this release includes metadata about phone type, noise conditions and call quality. Speaker demographic information on year of birth, sex and native language is also included.

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

*

Automatic Content Extraction for Portuguese was developed at INESC TEC - Instituto de Engenharia de Sistemas e Computadores, Tecnologia e Ciência and consists of automatic Brazilian Portuguese and European Portuguese translations of the English text and annotations in ACE 2005 Multilingual Training Corpus (LDC2006T06).

ACE 2005 Multilingual Training Corpus was developed by LDC to support the Automatic Content Extraction (ACE) program, specifically, by providing training data for the 2005 technology evaluation. It contains 1,800 files of mixed genre text in Arabic, English and Chinese annotated for entities, relations and events. The objective of the ACE program was to develop automatic content extraction technology to support automatic processing of human language in text form. Text genres included newswire, broadcast news, broadcast conversation, weblog, discussion forums, and conversational telephone speech.

For this translation, the English data was partitioned into training, development and test sets. The documents were split into sentences and each event mention was assigned to its sentence. Source sentences and their annotations were translated into Brazilian Portuguese using Google Translate and into European Portuguese using DeepL Translate. An alignment algorithm and a parallel corpus word aligner were used to handle mismatches between translated annotations and their translated sentences.
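
The sentence-assignment step described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the corpus developers' method: the punctuation-based sentence splitter, the function names and the sample text are all hypothetical, and event mentions are assumed to be identified by their starting character offset.

```python
import bisect
import re


def sentence_spans(text):
    """Return (start, end) character spans from a naive punctuation split."""
    spans = []
    for match in re.finditer(r"[^.!?]+[.!?]*", text):
        segment = match.group()
        stripped = segment.strip()
        if not stripped:
            continue
        # Skip leading whitespace so the span starts at the first real character.
        start = match.start() + (len(segment) - len(segment.lstrip()))
        spans.append((start, start + len(stripped)))
    return spans


def assign_to_sentence(spans, mention_start):
    """Return the index of the sentence containing the offset, or None."""
    starts = [start for start, _ in spans]
    index = bisect.bisect_right(starts, mention_start) - 1
    if index < 0 or mention_start >= spans[index][1]:
        return None
    return index


# Hypothetical sample document with two event mentions.
text = "Troops attacked the town. The mayor resigned on Monday."
spans = sentence_spans(text)
attack_sentence = assign_to_sentence(spans, text.index("attacked"))
resign_sentence = assign_to_sentence(spans, text.index("resigned"))
```

A production pipeline would normally use a trained sentence segmenter rather than a punctuation rule, since abbreviations and quoted speech break this simple split.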
 
2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
 
 

Monday, April 15, 2024

LDC April 2024 Newsletter

New publications:

LoReHLT Hausa Representative Language Pack

AIDA Scenario 2 Practice Topic Source Data
_______________________________________________________________

New publications:

LoReHLT Hausa Representative Language Pack was developed by LDC and is comprised of approximately 4.4 million words of Hausa monolingual text, 86,000 Hausa words translated from English data, and 30 minutes of Hausa audio recordings. Approximately 96,000 words were annotated for named entities and over 13,000 words were annotated for full entity including nominals and pronouns. Noun-phrase chunking was applied to more than 7,400 words. Over 9,600 words were labeled with simple semantic annotation. Topic annotation was applied to the audio recordings. Data was collected from discussion forum, news, reference, social network, amateur web audio recordings, and weblogs.

LoReHLT was a companion project of the DARPA LORELEI program. The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee. 

*

AIDA Scenario 2 Practice Topic Source Data was developed by LDC and is comprised of 1500 root documents (text, image, and video) from English, Russian, and Spanish web sources. Each phase of the AIDA program centered on a specific scenario, or broad topic area, with related subtopics designated as either practice subtopics or evaluation subtopics. The Phase 2 scenario focused on the socioeconomic and political crisis in Venezuela since 2010. This corpus constitutes the full set of topic-focused documents for Phase 2 practice subtopics.

The DARPA AIDA (Active Interpretation of Disparate Alternatives) program aimed to develop a multi-hypothesis semantic engine to generate explicit alternative interpretations of events, situations and trends from a variety of unstructured sources. LDC supported AIDA by collecting, creating and annotating multimodal linguistic resources in multiple languages.

The knowledge base for entity detection and linking annotation for all AIDA Scenario 1 and 2 corpora is available separately as AIDA Scenario 1 and 2 Reference Knowledge Base (LDC2023T10).

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Sunday, March 17, 2024

LDC March 2024 Newsletter

LDC data and commercial technology development 

New publications:

RATS Low Speech Density

BabyEars Affective Vocalizations

___________________________________________________________________

LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:
RATS Low Speech Density was developed by LDC and is comprised of 87 hours of English, Levantine Arabic, Farsi, Pashto and Urdu speech and non-speech samples. The recordings were assembled by concatenating a randomized selection of speech, communications systems sounds, and silence. This corpus was created to measure false alarm performance in RATS speech activity detection systems.
 
The source audio was extracted from RATS development and progress sets and consists of conversational telephone speech recordings collected by LDC. Non-speech samples were selected from communications systems sounds, including telephone network special information tones, radio selective calling signals, HF/VHF/UHF digital mode radio traffic, radio network control channel signals, two-way radio traffic containing roger beeps, and short duration shift-key modulated handset data transmissions.
 
The goal of the RATS (Robust Automatic Transcription of Speech) program was to develop human language technology systems capable of performing speech detection, language identification, speaker identification and keyword spotting on the severely degraded audio signals that are typical of various radio communication channels, especially those employing various types of handheld portable transceiver systems.

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

*

BabyEars Affective Vocalizations contains 22 minutes of spontaneous English speech by 12 adults interacting with their infant children, for a total of 509 infant-directed utterances and 185 adult-directed or neutral utterances. Speech data was collected in a quiet room during a one-hour session where each parent was asked to play and otherwise interact normally with their infant (aged 10-18 months). A trained research assistant then extracted discrete utterances and classified them in three categories: approval, attention and prohibition.

2024 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.

Thursday, February 15, 2024

LDC February 2024 Newsletter

LDC membership discounts expire March 1 

Spring 2024 data scholarship recipients

Four corpora withdrawn from the LDC Catalog

New publications:

Second Language University Speech Intelligibility Corpus

AIDA Scenario 1 Practice Topic Annotation

_________________________________________________________________

LDC membership discounts expire March 1 

Time is running out to save on 2024 membership fees. Renew your LDC membership, rejoin the Consortium, or become a new member by March 1 to receive a discount of up to 10%. For more information on membership benefits and options, visit Join LDC.

Spring 2024 data scholarship recipients 

Congratulations to the recipients of LDC’s Spring 2024 data scholarships:

Jordan Chandler: Université Rennes 2 (France): Master’s student, English Studies. Jordan is awarded a copy of Penn Parsed Corpora of Historical English LDC2020T16 to continue his research on the historical development of adjective, quantifier and article indefiniteness in the English language.

Nikhil Raghav: TCG Crest (India): PhD candidate, Institute for Advancing Intelligence. Nikhil is awarded copies of Third DIHARD Challenge Development LDC2022S12 and Third DIHARD Challenge Evaluation LDC2022S14 for his work in speaker diarization. 

Abraham Sanders: Rensselaer Polytechnic Institute (USA): PhD candidate, Cognitive Science. Abraham is awarded copies of Fisher English Training Speech Part 1 Speech LDC2004S13, Fisher English Training Speech Part 1 Transcripts LDC2004T19, Fisher English Training Part 2 Speech LDC2005S13 and Fisher English Training Part 2 Transcripts LDC2005T19, for his work in spoken dialogue systems.

The next round of applications will be accepted in September 2024. For information about the program, visit the Data Scholarships page.

Four corpora withdrawn from the LDC Catalog 

We regret to announce that The New York Times Annotated Corpus LDC2008T19 has been withdrawn from the LDC Catalog by the data provider. Because they contain data from LDC2008T19, the following three corpora are also withdrawn from the Catalog: Benchmarks for Open Relation Extraction LDC2014T27, Concretely Annotated New York Times LDC2018T12, and News Sub-domain Named Entity Recognition LDC2023T12. Organizations and individuals who have previously licensed any of these data sets can continue to use them under the terms of their respective special license agreements.

New publications:
 
Second Language University Speech Intelligibility Corpus was developed by Northern Arizona University, The Pennsylvania State University, and The University of Texas at Dallas. It contains 10.5 hours of English speech collected from 66 international faculty and university students representing 15 language backgrounds at 10 North American universities. This release also includes orthographic transcriptions for all recordings, intelligibility scores for 73% of the files, speaker metadata, and aligned Praat TextGrids.
 
The speech data is comprised of presentations, descriptions, reflections, and microteaching tasks. Speakers were recruited from courses at intensive English programs and oral skills courses for international graduate students seeking to become international teaching assistants. 
 
2024 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.

*

AIDA Scenario 1 Practice Topic Annotation was developed by LDC and is comprised of annotations for 212 English, Russian and Ukrainian web documents (text, image and video) from AIDA Scenario 1 Practice Topic Source Data (LDC2023T11), specifically, the set of practice documents designated for annotation in Phase 1.

Annotations are presented as tab-separated files in the following categories for each topic:

  • Mentions: single references in source data to a real-world entity or filler, event, or relation. 
  • Slots: pre-defined roles in an event or relation filled by an argument (entity mention). 
  • Linking: entity mentions linked to entries in the knowledge base as a method of indicating the real-world entity to which an entity referred.

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Tuesday, January 16, 2024

LDC January 2024 Newsletter

Renew your LDC membership today 

New publications:
 
KASET - Kurmanji and Sorani Kurdish Speech and Transcripts

LORELEI Farsi Representative Language Pack

_______________________________________________________________________

Renew your LDC membership today 

The importance of curated resources for language-related education, research and technology development drives LDC’s mission to create them, to accept data contributions from researchers across the globe and to broadly share such resources through the LDC Catalog. 

LDC members enjoy no-cost access to new corpora released annually, as well as the ability to license legacy data sets from among our 950+ holdings at reduced fees. Ensure that your data needs continue to be met by renewing your LDC membership or by joining the Consortium today.

Now through March 1, 2024, 2023 members receive a 10% discount on 2024 membership, and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for more details on membership options and benefits. 

New publications:

KASET - Kurmanji and Sorani Kurdish Speech and Transcripts consists of 147 hours of telephone conversations (289 recordings) and broadcast news (410 recordings) in two Kurdish dialects, Kurmanji Kurdish and Sorani Kurdish, along with transcripts covering 60 hours of those recordings. Kurdish is spoken primarily in Turkey, Iran, Iraq, and Syria. Sorani and Kurmanji are the two most widely spoken dialects of the Kurdish language.

The telephone speech was generated from calls by native Kurdish speakers in the United States to North American acquaintances in their social network. The broadcast news audio was collected from multiple streaming radio and television broadcast programs (narrowband and wideband audio), many of which contained a mix of Kurmanji and Sorani Kurdish. Native speaker auditors identified a 5-10 minute span from each broadcast recording for transcription. 

Full telephone recordings that passed the native speaker audit were transcribed. This release includes speaker information, such as gender, year of birth, and language.

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

*

LORELEI Farsi Representative Language Pack was developed by LDC and is comprised of approximately 250 million words of Farsi monolingual text, 120,000 Farsi words translated from English data, and 751,000 words of found Farsi-English parallel text. Approximately 75,000 words were annotated for named entities and up to 22,000 words were annotated for entity discovery and linking and situation frames (identifying entities, needs and issues). Data was collected from discussion forum, news, reference, social network, and weblogs.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.