Linguistic Data Consortium: November 2025

Join LDC for membership year 2026

Spring 2026 data scholarship application deadline

New publications:

AnnoDIFP CTS Audio and Transcripts

LORELEI Ilocano Incident Language Pack
_______________________________________________________________

Join LDC for membership year 2026

It’s time to renew your LDC membership for 2026. Any organization that joins the Consortium or renews its membership before March 2, 2026, will receive a 10% discount on the membership fee.

In addition to accessing new publications, current LDC members can license older data from our Catalog of close to 1,000 holdings at reduced fees. Current-year for-profit members may use most data for commercial applications.

Plans for next year’s publications are in progress. Among the expected releases are:

2012 NIST Speaker Recognition Evaluation Test Set: 10,000+ hours of English conversational telephone speech following the Mixer collection protocol, used in NIST’s 2012 speaker recognition evaluation

KAIROS schema learning corpus background data and Phase 1 evaluation datasets: multimodal English and Spanish source data and annotations for reasoning about complex real-world events

CALL MY NET 2: 800+ hours of Tunisian-Arabic conversational telephone speech from over 400 speakers to support text-independent speaker recognition, used in the 2018 NIST Speaker Recognition Evaluation

Multi-language conversational telephone speech: multiple releases, hundreds of hours of speech from speakers of confusable linguistic varieties (Arabic, Chinese, English, French, Slavic, Spanish) to support language identification

CALLHOME omnibus releases: combined speech and transcript datasets with updated directory structure, file formats and documentation, and lexicons (Chinese, English, German, Japanese, Spanish)

IARPA MATERIAL language packs: conversational telephone speech, transcripts, English translations, annotations and queries in multiple languages (e.g., Lithuanian, Pashto, Swahili, Tagalog)

For full descriptions of all LDC data sets, browse our Catalog. Visit Join LDC for details on membership, user accounts and payment.

Spring 2026 data scholarship application deadline
Applications for the Spring 2026 LDC data scholarship program, which provides university students with no-cost access to LDC data, are being accepted through January 15, 2026. Consult the LDC Data Scholarships page for more information about program rules and submission requirements.

New publications:
AnnoDIFP (Annotated Data for the Investigation of Facets of Personality) CTS (Conversational Telephone Speech) Audio and Transcripts was developed by LDC, the Florida Institute of Technology and the University of New Haven to support algorithm development for predicting personality traits. It contains 242.52 hours of English telephone audio recordings and transcripts from 1,179 telephone calls involving 327 participants, paired with scores from two self-reported personality assessments: the HEXACO Personality Inventory (Revised) (HEXACO-PI-R) and the Short Dark Triad (SD3).

This corpus contains audio and transcripts for 277 participants and transcripts only for 50 participants. Telephone calls were collected using LDC's robot-operator platform. The operator called participants every 24 hours during their indicated availability and paired them with another participant to speak on a prompted topic for 10 minutes. Transcripts were produced automatically using the Rev.ai speech-to-text service.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

LORELEI Ilocano Incident Language Pack was developed by LDC and comprises 8.9 million words of Ilocano monolingual text, 3.3 million words of English monolingual text, 3.2 million words of parallel Ilocano-English text, and 3 million words annotated for entity discovery and linking and situation frames. It constitutes all of the text data, annotations, supplemental resources and related software tools for the Ilocano language used in the DARPA LORELEI / LoReHLT 2019 Evaluation.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low-resource languages in the context of emergent situations. In the evaluation scenario, an unforeseen event triggered a need for humanitarian and logistical support in a region where the incident language had received little or no attention in NLP research. Evaluation participants provided NLP solutions, including information extraction and machine translation, with limited resources and limited development time.

Data was collected from news, social network, weblog, newsgroup, discussion forum, and reference material. Entity discovery and linking annotation identified entities to be detected by systems for scoring purposes. Situation frame analysis was designed to extract basic information about needs and relevant issues for planning a disaster response effort.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Linguistic Data Consortium

Monday, November 17, 2025

LDC November 2025 Newsletter