Spring 2026 data scholarship recipient
New publications:
2022 NIST Language Recognition Evaluation Test and Development Sets
KAIROS Schema Learning Background Source Data
LORELEI Russian Representative Language Pack
_________________________________________________________________
Spring 2026 data scholarship recipient
Congratulations to the recipient of LDC’s Spring 2026 data scholarship:
Doma Akshitha Reddy: Chaitanya Bharathi Institute of Technology (India): Bachelor of Engineering, Information Technology. Doma is awarded copies of TIMIT Acoustic-Phonetic Continuous Speech Corpus and The CMU Kids Corpus for their work in child speech.
Since 2010, LDC has awarded scholarships to successful student applicants twice each year. To date, more than 242 corpora have been distributed to 162 students across 38 countries. We proudly celebrate their achievements and the contributions their research has made to the broader community.
The next round of applications will be accepted in September 2026. For information about the program, visit the Data Scholarships page.
2022 NIST Language Recognition Evaluation Test and Development Sets was developed by LDC and NIST and contains the test and development data, metadata, answer keys, and documentation for the 2022 NIST Language Recognition Evaluation (LRE22). The source data comprises 222 hours of conversational telephone speech (CTS) and broadcast narrowband speech (BNBS) in 14 languages: Afrikaans, Tunisian Arabic, Algerian Arabic, Libyan Arabic, South African English, Indian-accented South African English, North African French, Ndebele, Oromo, Tigrinya, Tsonga, Venda, Xhosa and Zulu.
For the CTS collections, a small number of native speakers made single calls to multiple individuals in their social network. Calls lasted 8-15 minutes; speakers were free to discuss any topic. The BNBS data was collected from streaming radio programming, focused on broadcasts that included narrowband speech (e.g., call-ins to a talk show). Portions of the CTS callee call sides and portions of each broadcast recording were manually audited by native speakers to verify language and quality.
LRE22 emphasized language recognition for African languages, including low resource languages, and expanded the range of test segment durations. Further information about the 2022 evaluation can be found in the 2022 NIST Language Recognition Evaluation Plan.
2026 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
KAIROS Schema Learning Background Source Data was developed by LDC and includes 14,000 English and Spanish documents representing text, audio, video, image, and multimedia resources collected during the DARPA KAIROS program as supplemental background source data for the KAIROS Schema Learning Corpus (SLC). The purpose of the supplemental collection was to increase the amount of English and Spanish data with multimedia components for schema learning and to add domains not well represented in existing Spanish data. The supplemental data in this release includes material from the business and logistics domains, instructional documents and multimedia news.
The complete set of SLC background source data (including the data in this publication) totaled 16.2 million English, Russian and Spanish documents and more than 125,000 audio, video, image, or multimedia resources. A large portion of that data was drawn from pre-existing LDC datasets.
The SLC and KAIROS Schema Learning Complex Event Annotation (LDC2025T07), containing English and Spanish text, audio, video, and image material labeled for 93 real-world complex events, constitute the data used by KAIROS system developers for schema learning.
KAIROS systems utilized formal event representations in the form of schema libraries that specified the steps, preconditions and constraints for an open set of complex events; schemas were then used in combination with event extraction to characterize and make predictions about real-world events in a large multilingual, multimedia corpus.
2026 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
LORELEI Russian Representative Language Pack contains over 1.26 billion words of Russian monolingual text, 360,000 words of which were translated into English, 3 million words of found Russian-English parallel text, and 87,000 Russian words translated from English data. Approximately 83,000 words were annotated for simple named entities; around 26,000 words were annotated for full entity (including nominals and pronouns), entity linking and situation frames (identifying entities, needs and issues); and nearly 9,000 words were covered by noun phrase chunking annotation. Data was collected from discussion forums, news, reference, social network, and weblog sources.
The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.
The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).
2026 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.