Monday, December 15, 2025

LDC December 2025 Newsletter

LDC 2026 membership discounts now available

LDC’s 1,000th corpus

Approaching deadline for Spring 2026 data scholarship applications 

LDC closed for Winter Break December 25 – January 2  

New publications:

2021 NIST Speaker Recognition Evaluation Development and Test Set

LORELEI Sinhala Incident Language Pack

_______________________________________________________________________

LDC 2026 membership discounts now available 
Now through March 2, 2026, any organization that joins the Consortium or renews its membership will receive a 10% discount on the 2026 membership fee. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for details on membership options and benefits.

LDC’s 1,000th corpus
LDC is delighted to announce the release of the 1,000th corpus into the Catalog! This milestone represents the commitment we made over thirty years ago to provide large quantities of diverse data, robust research program support and exceptional member services. We are grateful for the continued support and collaboration of our members, friends and the community.  

Approaching deadline for Spring 2026 data scholarship applications
Attention students: don’t miss out on the chance to receive no-cost access to LDC data for your research. Applications for Spring 2026 data scholarships are due January 15, 2026. For more information on requirements and program rules, see LDC Data Scholarships.

LDC closed for Winter Break December 25 – January 2
LDC will be closed from Thursday, December 25, 2025, through Friday, January 2, 2026, in accordance with the University of Pennsylvania Winter Break Policy. Our offices will reopen on Monday, January 5, 2026. Requests received by the Membership Office during Winter Break will be processed when the office reopens.
 
New publications:
2021 NIST Speaker Recognition Evaluation Development and Test Set was developed by LDC and NIST (National Institute of Standards and Technology). It contains approximately 447 hours of Cantonese, Mandarin, and English conversational telephone speech, audio from video, and selfie image data for development and test, along with answer keys, enrollment, trial files and documentation from the NIST-sponsored 2021 Speaker Recognition Evaluation (SRE).

The SRE task is speaker detection, that is, to determine whether a specified target speaker was speaking during a segment of speech. SRE21 focused on telephone speech and audio from video and included close-up images of participants. The evaluation also featured cross-lingual trials, that is, enrollment and test segments spoken in different languages.

The data was drawn from the WeCanTalk corpus collected by LDC, in which speakers called friends or relatives who agreed to record their telephone conversations, each lasting 8-10 minutes. Subjects contributed multiple conversational telephone speech recordings and audio recordings in which they were talking, plus a single selfie image. Recordings were manually audited to verify speaker, language, and quality.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

*

LORELEI Sinhala Incident Language Pack was developed by LDC and comprises 8.1 million words of Sinhala monolingual text, 700,000 words of English monolingual text, 6.4 million words of parallel Sinhala-English text, and 50,000 words annotated for entity discovery and linking and situation frames. It constitutes all of the text data, annotations, supplemental resources and related software tools for the Sinhala language used in the DARPA LORELEI / LoReHLT 2018 Evaluation.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. In the evaluation scenario, an unforeseen event triggered a need for humanitarian and logistical support in a region where the incident language had received little or no attention in NLP research. Evaluation participants provided NLP solutions, including information extraction and machine translation, with limited resources and limited development time.

Data was collected from news, social network, weblog, newsgroup, discussion forum, and reference material. Entity discovery and linking annotation identified entities to be detected by systems for scoring purposes. Situation frame analysis was designed to extract basic information about needs and relevant issues for planning a disaster response effort.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.