Linguistic Data Consortium: 2026

Monday, February 16, 2026

LDC February 2026 Newsletter

LDC membership discounts expire March 2

Spring 2026 data scholarship recipient

New publications:

2022 NIST Language Recognition Evaluation Test and Development Sets

KAIROS Schema Learning Background Source Data

LORELEI Russian Representative Language Pack

_________________________________________________________________

LDC membership discounts expire March 2

Time is running out to save on 2026 membership fees. Renew your LDC membership, rejoin the Consortium, or become a new member by March 2 to receive a 10% discount. For more information on membership benefits and options, visit Join LDC.

Spring 2026 data scholarship recipient

Congratulations to the recipient of LDC’s Spring 2026 data scholarship:

Doma Akshitha Reddy: Chaitanya Bharathi Institute of Technology (India): Bachelor of Engineering, Information Technology. Doma is awarded copies of TIMIT Acoustic-Phonetic Continuous Speech Corpus and The CMU Kids Corpus for their work in child speech.

Since 2010, LDC has awarded scholarships to successful student applicants twice each year. To date more than 242 corpora have been distributed to 162 students across 38 countries. We proudly celebrate their achievements and the contributions their research has made to the broader community.

The next round of applications will be accepted in September 2026. For information about the program, visit the Data Scholarships page.

New publications:

2022 NIST Language Recognition Evaluation Test and Development Sets was developed by LDC and NIST and contains the test and development data, metadata, answer keys, and documentation for the 2022 NIST Language Recognition Evaluation (LRE22). The source data comprises 222 hours of conversational telephone speech (CTS) and broadcast narrowband speech (BNBS) in 14 languages: Afrikaans, Tunisian Arabic, Algerian Arabic, Libyan Arabic, South African English, Indian-accented South African English, North African French, Ndebele, Oromo, Tigrinya, Tsonga, Venda, Xhosa and Zulu.

For the CTS collections, a small number of native speakers made single calls to multiple individuals in their social network. Calls lasted 8-15 minutes; speakers were free to discuss any topic. The BNBS data was collected from streaming radio programming, focused on broadcasts that included narrowband speech (e.g., call-ins to a talk show). Portions of the CTS callee call sides and portions of each broadcast recording were manually audited by native speakers to verify language and quality.

LRE22 emphasized language recognition for African languages, including low resource languages, and expanded the range of test segment durations. Further information about the 2022 evaluation can be found in the 2022 NIST Language Recognition Evaluation Plan.

2026 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

KAIROS Schema Learning Background Source Data was developed by LDC and includes 14,000 English and Spanish documents representing text, audio, video, image, and multimedia resources collected during the DARPA KAIROS program as supplemental background source data for the KAIROS Schema Learning Corpus (SLC). The purpose of the supplemental collection was to increase the amount of English and Spanish data with multimedia components for schema learning and to add domains not well represented in existing Spanish data. The supplemental data in this release includes material from the business and logistics domains, instructional documents and multimedia news.

The complete set of SLC background source data (including the data in this publication) totaled 16.2 million English, Russian and Spanish documents and more than 125,000 audio, video, image, or multimedia resources. A large portion of that data was drawn from pre-existing LDC datasets.

The SLC and KAIROS Schema Learning Complex Event Annotation (LDC2025T07), containing English and Spanish text, audio, video, and image material labeled for 93 real-world complex events, constitute the data used by KAIROS system developers for schema learning.

KAIROS systems utilized formal event representations in the form of schema libraries that specified the steps, preconditions and constraints for an open set of complex events; schemas were then used in combination with event extraction to characterize and make predictions about real-world events in a large multilingual, multimedia corpus.

2026 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

LORELEI Russian Representative Language Pack contains over 1.26 billion words of Russian monolingual text, 360,000 words of which were translated into English, 3 million words of found Russian-English parallel text, and 87,000 Russian words translated from English data. Approximately 83,000 words were annotated for simple named entities, around 26,000 words were annotated for full entity (including nominals and pronouns), entity linking and situation frames (identifying entities, needs and issues) and nearly 9,000 words were covered by noun phrase chunking annotation. Data was collected from discussion forums, news, reference material, social networks, and weblogs.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

2026 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Thursday, January 15, 2026

LDC January 2026 Newsletter

Renew your LDC membership today

New publications:

CALLHOME Japanese Second Edition

CALLHOME Japanese Lexicon Second Edition

MATERIAL Swahili-English Language Pack

_____________________________________________________________

Renew your LDC membership today

The importance of curated resources for language-related education, research and technology development drives LDC’s mission to create them, to accept data contributions from researchers across the globe and to broadly share such resources through the LDC Catalog. LDC members enjoy no-cost access to new corpora released annually, as well as the ability to license legacy data sets from among our 1000 holdings at reduced fees. Ensure that your data needs continue to be met by renewing your LDC membership or by joining the Consortium today.

Now through March 2, 2026, any organization that joins the Consortium or renews its membership will receive a 10% discount on the 2026 membership fee. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for more details on membership options and benefits.

New publications:

CALLHOME Japanese Second Edition was developed by LDC and contains 49 hours of speech from 120 telephone conversations between native Japanese speakers. This publication is a re-release of the original CALLHOME Japanese collection, combining CALLHOME Japanese Speech (LDC96S37) and CALLHOME Japanese Transcripts (LDC96T18) with additional transcription and updated directory structure, file formats, and documentation.

This corpus contains the 120 calls from CALLHOME Japanese Speech which represented training and development data and a subset of evaluation data. Participants spoke on topics of their choice in a single telephone call lasting up to 30 minutes. Calls were manually audited for language, recording quality, channel characteristics, dialect, and region. For this second edition, all audio was converted from SPHERE files to FLAC format, and the original training/development/test partitioning was removed.

This release also features revised transcripts conforming to updated LDC transcription guidelines that addressed normalization of annotation formats, standardization of speaker-produced and background noises, application of foreign-language marking, whitespace cleanup, and corrections and consistency fixes.

The CALLHOME series consists of telephone conversations and transcripts developed by LDC and Rutgers, The State University of New Jersey, in support of research in speaker identification, language identification and related technologies. Languages in the series include American English, Egyptian Arabic, German, Japanese, Mandarin Chinese, and Spanish.

2026 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

CALLHOME Japanese Lexicon Second Edition was developed by LDC and contains 80,688 Japanese words with morphological, phonological and stress information. This second edition updates file formats, directory structure and documentation. The first edition is available as CALLHOME Japanese Lexicon (LDC96L17). The words in the lexicon were derived from 80 transcripts representing telephone conversations between native Japanese speakers contained in CALLHOME Japanese Second Edition (LDC2026S02).

The lexicon contains seven tab-separated information fields: (1) headword: orthographic form in kanji or katakana or hiragana (if only written in hiragana); (2) hiragana: orthographic form in hiragana; (3) romanization: orthographic form in romaji; (4) pron: pronunciation of the headword; (5) morph: morphological analysis of the headword; (6) train freq: frequency of the headword in the transcripts; and (7) gloss: glosses of the headword. This release also includes a pronunciation dictionary derived from the lexicon in CMUdict format and the grapheme-to-phoneme (G2P) tools used to automatically generate pronunciations for the original lexicon.
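For readers planning to load the lexicon programmatically, the seven-field layout described above could be read along the following lines. This is only a minimal sketch under stated assumptions: that each entry occupies one line with no header row, that the train freq field is an integer, and that the file is read as text. The class name, function name, and sample line are hypothetical and are not taken from the release; consult the corpus documentation for the actual file details.

```python
import io
from typing import Iterable, Iterator, NamedTuple


class LexiconEntry(NamedTuple):
    """One lexicon line; field order follows the newsletter description."""
    headword: str      # (1) orthographic form in kanji, katakana, or hiragana
    hiragana: str      # (2) orthographic form in hiragana
    romanization: str  # (3) orthographic form in romaji
    pron: str          # (4) pronunciation of the headword
    morph: str         # (5) morphological analysis of the headword
    train_freq: int    # (6) frequency in the transcripts (assumed to be an integer)
    gloss: str         # (7) glosses of the headword


def read_lexicon(lines: Iterable[str]) -> Iterator[LexiconEntry]:
    """Yield entries from tab-separated lines, skipping blank lines."""
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) != 7:
            raise ValueError(
                f"line {line_no}: expected 7 fields, found {len(fields)}"
            )
        headword, hiragana, romanization, pron, morph, freq, gloss = fields
        yield LexiconEntry(
            headword, hiragana, romanization, pron, morph, int(freq), gloss
        )


# Made-up sample line for illustration only; it is NOT taken from the corpus,
# and the real pron and morph notations may look different.
SAMPLE = "猫\tねこ\tneko\tn e k o\tnoun\t3\tcat\n"

entries = list(read_lexicon(io.StringIO(SAMPLE)))
print(entries[0].headword, entries[0].train_freq, entries[0].gloss)
```

To read from disk instead, pass an open file handle, for example `open(path, encoding="utf-8")`, where both the path and the encoding are assumptions to check against the release documentation.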

2026 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.

MATERIAL Swahili-English Language Pack was developed by Appen for the IARPA MATERIAL program and contains 112 hours of Swahili conversational telephone speech, transcripts, English translations, annotations and queries. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments. Transcripts cover approximately 30% of the speech files, 3% of which were translated into English. This release also includes domain annotations, English queries and their relevance annotations.

The MATERIAL program focused on underserved languages, with the ultimate goal of building cross-language information retrieval systems that find speech and text content using English search queries.

2026 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.