Thursday, February 15, 2024

LDC February 2024 Newsletter

LDC membership discounts expire March 1 

Spring 2024 data scholarship recipients

Four corpora withdrawn from the LDC Catalog

New publications:

Second Language University Speech Intelligibility Corpus

AIDA Scenario 1 Practice Topic Annotation

_________________________________________________________________

LDC membership discounts expire March 1 

Time is running out to save on 2024 membership fees. Renew your LDC membership, rejoin the Consortium, or become a new member by March 1 to receive a discount of up to 10%. For more information on membership benefits and options, visit Join LDC.

Spring 2024 data scholarship recipients 

Congratulations to the recipients of LDC’s Spring 2024 data scholarships:

Jordan Chandler: Université Rennes 2 (France): Master’s student, English Studies. Jordan is awarded a copy of Penn Parsed Corpora of Historical English LDC2020T16 to continue his research on the historical development of adjective, quantifier and article indefiniteness in the English language.

Nikhil Raghav: TCG Crest (India): PhD candidate, Institute for Advancing Intelligence. Nikhil is awarded copies of Third DIHARD Challenge Development LDC2022S12 and Third DIHARD Challenge Evaluation LDC2022S14 for his work in speaker diarization. 

Abraham Sanders: Rensselaer Polytechnic Institute (USA): PhD candidate, Cognitive Science. Abraham is awarded copies of Fisher English Training Speech Part 1 Speech LDC2004S13, Fisher English Training Speech Part 1 Transcripts LDC2004T19, Fisher English Training Part 2 Speech LDC2005S13 and Fisher English Training Part 2 Transcripts LDC2005T19 for his work in spoken dialogue systems.

The next round of applications will be accepted in September 2024. For information about the program, visit the Data Scholarships page.

Four corpora withdrawn from the LDC Catalog 

We regret to announce that The New York Times Annotated Corpus LDC2008T19 has been withdrawn from the LDC Catalog by the data provider. Because they contain data from LDC2008T19, the following three corpora are also withdrawn from the Catalog: Benchmarks for Open Relation Extraction LDC2014T27, Concretely Annotated New York Times LDC2018T12, and News Sub-domain Named Entity Recognition LDC2023T12. Organizations and individuals who have previously licensed any of these data sets can continue to use them under the terms of their respective special license agreements.

New publications:
 
Second Language University Speech Intelligibility Corpus was developed by Northern Arizona University, The Pennsylvania State University, and The University of Texas at Dallas. It contains 10.5 hours of English speech collected from 66 international faculty and university students representing 15 language backgrounds at 10 North American universities. This release also includes orthographic transcriptions for all recordings, intelligibility scores for 73% of the files, speaker metadata, and aligned Praat TextGrids.
 
The speech data consists of presentations, descriptions, reflections, and microteaching tasks. Speakers were recruited from courses at intensive English programs and oral skills courses for international graduate students seeking to become international teaching assistants.
 
2024 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.

*

AIDA Scenario 1 Practice Topic Annotation was developed by LDC and consists of annotations for 212 English, Russian and Ukrainian web documents (text, image and video) from AIDA Scenario 1 Practice Topic Source Data (LDC2023T11), specifically the set of practice documents designated for annotation in Phase 1.

Annotations are presented as tab-separated files in the following categories for each topic:

  • Mentions: single references in source data to a real-world entity or filler, event, or relation. 
  • Slots: pre-defined roles in an event or relation filled by an argument (entity mention). 
  • Linking: entity mentions linked to entries in the knowledge base as a method of indicating the real-world entity to which the mention refers.

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
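
For readers who want to explore tab-separated annotation files such as the mentions, slots, and linking tables described above, the following is a minimal Python sketch. The column names (mention_id, type, text) and the sample rows are hypothetical and chosen only for illustration; the actual file layout is defined in the corpus documentation.

```python
import csv
import io


def read_tab_separated(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    reader = csv.DictReader(
        io.StringIO(text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
    )
    return list(reader)


# Hypothetical sample data; the real columns are described in the corpus docs.
SAMPLE = (
    "mention_id\ttype\ttext\n"
    "m1\tentity\tKyiv\n"
    "m2\tevent\tprotest\n"
)

rows = read_tab_separated(SAMPLE)
for row in rows:
    print(row["mention_id"], row["type"], row["text"])
```

To read a file from disk instead of a string, the same reader can be given a file object opened with newline="" and an explicit encoding such as UTF-8.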

Tuesday, January 16, 2024

LDC January 2024 Newsletter

Renew your LDC membership today 

New publications:
 
KASET – Kurmanji and Sorani Kurdish Speech and Transcripts

LORELEI Farsi Representative Language Pack

_______________________________________________________________________

Renew your LDC membership today 

The importance of curated resources for language-related education, research and technology development drives LDC’s mission to create them, to accept data contributions from researchers across the globe and to broadly share such resources through the LDC Catalog. 

LDC members enjoy no-cost access to new corpora released annually, as well as the ability to license legacy data sets from among our 950+ holdings at reduced fees. Ensure that your data needs continue to be met by renewing your LDC membership or by joining the Consortium today.

Now through March 1, 2024, organizations that were members in 2023 receive a 10% discount on 2024 membership, and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for more details on membership options and benefits.

New publications:

KASET – Kurmanji and Sorani Kurdish Speech and Transcripts consists of 147 hours of telephone conversations (289 recordings) and broadcast news (410 recordings) in two Kurdish dialects, Kurmanji Kurdish and Sorani Kurdish, along with transcripts covering 60 hours of those recordings. Kurdish is spoken primarily in Turkey, Iran, Iraq, and Syria. Sorani and Kurmanji are two widely spoken dialects of the Kurdish language.

The telephone speech was generated from calls by native Kurdish speakers in the United States to North American acquaintances in their social network. The broadcast news audio was collected from multiple streaming radio and television broadcast programs (narrowband and wideband audio), many of which contained a mix of Kurmanji and Sorani Kurdish. Native speaker auditors identified a 5-10 minute span from each broadcast recording for transcription. 

Full telephone recordings that passed the native speaker audit were transcribed. This release includes speaker information, such as gender, year of birth, and language.

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

*

LORELEI Farsi Representative Language Pack was developed by LDC and consists of approximately 250 million words of Farsi monolingual text, 120,000 Farsi words translated from English data, and 751,000 words of found Farsi-English parallel text. Approximately 75,000 words were annotated for named entities and up to 22,000 words were annotated for entity discovery and linking and situation frames (identifying entities, needs and issues). Data was collected from discussion forums, news, reference, social network, and weblog sources.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.