Linguistic Data Consortium: October 2025

Membership year 2026 publication preview

Fall 2025 data scholarship recipients

New publications:

KAIROS Phase 2 Quizlet

BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Audio

BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Transcripts and Translations

_____________________________________________________________

Membership year 2026 publication preview

The 2026 membership year is approaching and plans for next year’s publications are in progress. Among the expected releases are:

2012 NIST Speaker Recognition Evaluation Test Set: 10,000+ hours of English conversational telephone speech following the Mixer collection protocol, used in NIST’s 2012 speaker recognition evaluation

KAIROS schema learning corpus background data and Phase 1 evaluation datasets: multimodal English and Spanish source data and annotations for reasoning about complex real-world events

CALL MY NET 2: 800+ hours of Tunisian Arabic conversational telephone speech from over 400 speakers to support text-independent speaker recognition, used in the 2018 NIST Speaker Recognition Evaluation

Multi-language conversational telephone speech: multiple releases, hundreds of hours of speech from speakers of confusable linguistic varieties (Arabic, Chinese, English, French, Slavic, Spanish) to support language identification

CALLHOME Omnibus releases: combined speech and transcript datasets with updated directory structure, file formats and documentation, and lexicons (Chinese, English, German, Japanese, Spanish)

IARPA MATERIAL language packs: conversational telephone speech, transcripts, English translations, annotations and queries in multiple languages (e.g., Lithuanian, Pashto, Swahili, Tagalog)

Check your inbox for more information about membership renewal.

Fall 2025 data scholarship recipients

Congratulations to the recipients of LDC's Fall 2025 data scholarships:

Lasidu Dilshan: University of Moratuwa (Sri Lanka): BSc, Electronic and Telecommunication Engineering. Lasidu is awarded a copy of Asian Elephant Vocalizations LDC2010S05 for his work in elephant voice enhancement and classification.

Máté Gedeon: Budapest University of Technology and Economics (Hungary): PhD candidate, Department of Telecommunications and Artificial Intelligence. Máté is awarded a copy of Switchboard-1 Release 2 LDC97S62 for his work in simulated conversation generation.

Ping He: Northeastern University (USA): Student, Khoury College of Computer Sciences. Ping is awarded a copy of ETS Corpus of Non-Native Written English LDC2014T06 for their work in native language identification.

Thiyazen Iskander: Maulana Azad College of Arts, Science & Commerce (India), affiliated with Babasaheb Ambedkar Technological University (India): PhD candidate, Linguistics, Department of English. Thiyazen is awarded copies of Arabic Morphological Analyzer (SAMA) Version 3.1 LDC2010L01 and Arabic Treebank Part 1 v. 4.1 LDC2010T13 for his work in morphosyntactic analysis of short passives in Standard Arabic.

Michael Mooney: University of Glasgow (United Kingdom): PhD candidate, School of Computing Sciences. Michael is awarded copies of Treebank-2 LDC95T7 and BLLIP 1987-89 WSJ Corpus Release LDC2000T43 for their work in eye-tracking for text-centered modeling.

Abraham Sanders: Rensselaer Polytechnic Institute (USA): PhD candidate, Cognitive Science. Abraham is awarded a copy of Switchboard-1 Release 2 LDC97S62 for his work in spoken dialogue systems.

New publications:

KAIROS Phase 2 Quizlet was developed by LDC and contains English and Spanish text, video and image data and annotations used for pre-evaluation research and system development during Phase 2 of the DARPA KAIROS program. KAIROS Quizlets were a series of narrowly defined tasks designed to explore specific evaluation objectives, enabling KAIROS system developers to exercise individual system components on a small data set prior to the full program evaluation. This corpus contains the complete set of Quizlet data used in Phase 2, which focused on five real-world complex events within the Disease Outbreak scenario.

Source data was collected from the web; 66 root web pages were collected and processed, yielding 65 text data files, 890 image files and 10 video files. Annotation steps included labeling scenario-relevant events and relations for each document to develop a structured representation of temporally ordered events, relations and arguments; generating a reference knowledge graph; and linking labeled entries to a knowledge base derived from a Wikidata-based ontology.

The DARPA KAIROS (Knowledge-directed Artificial Intelligence Reasoning Over Schemas) program aimed to build technology capable of understanding and reasoning about complex real-world events in order to provide actionable insights to end users. KAIROS systems utilized formal event representations in the form of schema libraries that specified the steps, preconditions and constraints for an open set of complex events; schemas were then used in combination with event extraction to characterize and make predictions about real-world events in a large multilingual, multimedia corpus.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Audio was developed by LDC and consists of 116 hours of speech from 274 unscripted telephone conversations between native speakers of the Arabic dialect spoken in Egypt. The calls were collected by LDC in the CALLFRIEND and CALLHOME series, in which participants called family members or close friends and spoke on topics of their choice. Around 33% of the recordings (92 calls) are publicly released for the first time. The remaining 182 recordings were previously published by LDC in various CALLFRIEND, CALLHOME and HUB5 Arabic datasets.

The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, conversational telephone speech, text messaging and chat -- in Chinese, Egyptian Arabic and English. The material in this release represents the unannotated Egyptian Arabic source conversational telephone speech. The telephone data was transcribed, translated, and annotated for various tasks in the BOLT program including word alignment, treebanking, and co-reference.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Transcripts and Translations contains transcripts and corresponding English translations for the conversational telephone speech in BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Audio and was developed by LDC to support the DARPA BOLT program.

Transcribers were required to produce a verbatim transcript of all speech within a file using the CODA orthographic approach; diacritics were not included. Some transcripts contain redactions for potential personally identifying information. All speech data was transcribed and is divided into training, development, and evaluation partitions.

The goal of the BOLT translation task was to translate the Arabic transcripts into fluent English while preserving the meaning present in the original Arabic text. Transcripts in the development and evaluation partitions received first pass and gold standard translations. Overall, 99% of the transcripts were translated into English.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Linguistic Data Consortium

Wednesday, October 15, 2025
