Linguistic Data Consortium

Thursday, January 15, 2026

LDC January 2026 Newsletter

Renew your LDC membership today

New publications:

CALLHOME Japanese Second Edition

CALLHOME Japanese Lexicon Second Edition

MATERIAL Swahili-English Language Pack
_____________________________________________________________

Renew your LDC membership today
The importance of curated resources for language-related education, research and technology development drives LDC’s mission to create them, to accept data contributions from researchers across the globe and to broadly share such resources through the LDC Catalog. LDC members enjoy no-cost access to new corpora released annually, as well as the ability to license legacy data sets from among our 1000 holdings at reduced fees. Ensure that your data needs continue to be met by renewing your LDC membership or by joining the Consortium today.

Now through March 2, 2026, any organization that joins the Consortium or renews their membership will receive a 10% discount off the 2026 membership fee. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for more details on membership options and benefits.

New publications:

CALLHOME Japanese Second Edition was developed by LDC and contains 49 hours of speech from 120 telephone conversations between native Japanese speakers. This publication is a re-release of the original CALLHOME Japanese collection, combining CALLHOME Japanese Speech (LDC96S37) and CALLHOME Japanese Transcripts (LDC96T18) with additional transcription and updated directory structure, file formats, and documentation.

This corpus contains the 120 calls from CALLHOME Japanese Speech which represented training and development data and a subset of evaluation data. Participants spoke on topics of their choice in a single telephone call lasting up to 30 minutes. Calls were manually audited for language, recording quality, channel characteristics, dialect, and region. For this second edition, all audio was converted from SPHERE files to FLAC format, and the original training/development/test partitioning was removed.

This release also features revised transcripts conforming to updated LDC transcription guidelines that addressed normalization of annotation formats, standardization of speaker-produced and background noises, application of foreign-language marking, whitespace cleanup, and corrections and consistency fixes.

The CALLHOME series consists of telephone conversations and transcripts developed by LDC and Rutgers, The State University of New Jersey, in support of research in speaker identification, language identification and related technologies. Languages in the series include American English, Egyptian Arabic, German, Japanese, Mandarin Chinese, and Spanish.

2026 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

CALLHOME Japanese Lexicon Second Edition was developed by LDC and contains 80,688 Japanese words with morphological, phonological and stress information. This second edition updates file formats, directory structure and documentation. The first edition is available as CALLHOME Japanese Lexicon (LDC96L17). The words in the lexicon were derived from 80 transcripts representing telephone conversations between native Japanese speakers contained in CALLHOME Japanese Second Edition (LDC2026S02).

The lexicon contains seven tab-separated information fields: (1) headword: orthographic form in kanji or katakana or hiragana (if only written in hiragana); (2) hiragana: orthographic form in hiragana; (3) romanization: orthographic form in romaji; (4) pron: pronunciation of the headword; (5) morph: morphological analysis of the headword; (6) train freq: frequency of the headword in the transcripts; and (7) gloss: glosses of the headword. This release also includes a pronunciation dictionary derived from the lexicon in CMUdict format and the grapheme-to-phoneme (G2P) tools used to automatically generate pronunciations for the original lexicon.
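As a rough illustration of the seven-field layout described above, the following Python sketch reads tab-separated lines into named records. It is not part of the release: the `LexiconEntry` and `parse_lexicon` names are hypothetical, the sample line is invented, and the sketch assumes that the frequency field is a plain integer and that the file has no header row. The release documentation is the authority on the actual format.

```python
import io
from typing import Iterable, Iterator, NamedTuple


class LexiconEntry(NamedTuple):
    """One lexicon line; field names follow the newsletter's list of fields."""
    headword: str
    hiragana: str
    romanization: str
    pron: str
    morph: str
    train_freq: int  # assumption: stored as a plain integer
    gloss: str


def parse_lexicon(lines: Iterable[str]) -> Iterator[LexiconEntry]:
    """Yield one LexiconEntry per non-empty tab-separated line."""
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 7:
            raise ValueError(f"line {number}: expected 7 fields, got {len(fields)}")
        headword, hiragana, romanization, pron, morph, freq, gloss = fields
        yield LexiconEntry(headword, hiragana, romanization, pron, morph,
                           int(freq), gloss)


# Invented sample line for illustration only; not taken from the corpus.
SAMPLE = "猫\tねこ\tneko\tneko\tN\t3\tcat\n"
entries = list(parse_lexicon(io.StringIO(SAMPLE)))
print(entries[0].headword, entries[0].train_freq)
```

For a real file, an open file handle could be passed in place of the `io.StringIO` object. The text encoding is not stated in the newsletter and would need to be checked against the documentation.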

2026 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.

MATERIAL Swahili-English Language Pack was developed by Appen for the IARPA MATERIAL program and contains 112 hours of Swahili conversational telephone speech, transcripts, English translations, annotations and queries. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments. Transcripts cover approximately 30% of the speech files, 3% of which were translated into English. This release also includes domain annotations, English queries and their relevance annotations.

The MATERIAL program focused on underserved languages, with the ultimate goal of building cross-language information retrieval systems that find speech and text content using English search queries.

2026 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.

Monday, December 15, 2025

LDC December 2025 Newsletter

LDC 2026 membership discounts now available

LDC’s 1,000th corpus

Approaching deadline for Spring 2026 data scholarship applications

LDC closed for Winter Break December 25 – January 2

New publications:

2021 NIST Speaker Recognition Evaluation Development and Test Set

LORELEI Sinhala Incident Language Pack

_______________________________________________________________________

LDC 2026 membership discounts now available
Now through March 2, 2026, any organization that joins the Consortium or renews their membership will receive a 10% discount off the 2026 membership fee. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for details on membership options and benefits.

LDC’s 1,000th corpus
LDC is delighted to announce the release of the 1,000th corpus into the Catalog! This milestone represents the commitment we made over thirty years ago to provide large quantities of diverse data, robust research program support and exceptional member services. We are grateful for the continued support and collaboration of our members, friends and the community.

Approaching deadline for Spring 2026 data scholarship applications
Attention students: don’t miss out on the chance to receive no-cost access to LDC data for your research. Applications for Spring 2026 data scholarships are due January 15, 2026. For more information on requirements and program rules, see LDC Data Scholarships.

LDC closed for Winter Break December 25-January 2
LDC will be closed from Thursday, December 25, 2025 through Friday, January 2, 2026 in accordance with the University of Pennsylvania Winter Break Policy. Our offices will reopen on Monday, January 5, 2026. Requests received by the Membership Office during Winter Break will be processed when the office reopens.

New publications:
2021 NIST Speaker Recognition Evaluation Development and Test Set was developed by LDC and NIST (National Institute of Standards and Technology). It contains approximately 447 hours of Cantonese, Mandarin, and English conversational telephone speech, audio from video, and selfie image data for development and test, along with answer keys, enrollment, trial files and documentation from the NIST-sponsored 2021 Speaker Recognition Evaluation (SRE).

The SRE task is speaker detection, that is, to determine whether a specified target speaker was speaking during a segment of speech. SRE21 focused on telephone speech and audio from video and included close-up images of participants. The evaluation also featured cross-lingual trials, that is, enrollment and test segments spoken in different languages.

The data was drawn from the WeCanTalk corpus collected by LDC, in which speakers called friends or relatives who agreed to record their telephone conversations lasting 8-10 minutes. Subjects contributed multiple conversational telephone speech recordings and audio recordings in which they were talking, plus a single selfie image. Recordings were manually audited to verify speaker, language, and quality.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

LORELEI Sinhala Incident Language Pack was developed by LDC and is comprised of 8.1 million words of Sinhala monolingual text, 700,00 words of English monolingual text, 6.4 million words of parallel Sinhala-English text, and 50,000 words annotated for entity discovery and linking and situation frames. It constitutes all of the text data, annotations, supplemental resources and related software tools for the Sinhala language used in the DARPA LORELEI / LoReHLT 2018 Evaluation.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. In the evaluation scenario, an unforeseen event triggered a need for humanitarian and logistical support in a region where the incident language had received little or no attention in NLP research. Evaluation participants provided NLP solutions, including information extraction and machine translation, with limited resources and limited development time.

Data was collected from news, social network, weblog, newsgroup, discussion forum, and reference material. Entity discovery and linking annotation identified entities to be detected by systems for scoring purposes. Situation frame analysis was designed to extract basic information about needs and relevant issues for planning a disaster response effort.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Monday, November 17, 2025

LDC November 2025 Newsletter

Join LDC for membership year 2026

Spring 2026 data scholarship application deadline

New publications:

AnnoDIFP CTS Audio and Transcripts

LORELEI Ilocano Incident Language Pack
_______________________________________________________________

Join LDC for membership year 2026

It’s time to renew your LDC membership for 2026. Any organization that joins the Consortium or renews their membership before March 2, 2026, will receive a 10% discount off the membership fee.

In addition to accessing new publications, current LDC members can license older data from our Catalog of close to 1000 holdings at reduced fees. Current-year for-profit members may use most data for commercial applications.

Plans for next year’s publications are in progress. Among the expected releases are:

2012 NIST Speaker Recognition Evaluation Test Set: 10,000+ hours of English conversational telephone speech following the Mixer collection protocol, used in NIST’s 2012 speaker recognition evaluation

KAIROS schema learning corpus background data and Phase 1 evaluation datasets: multimodal English and Spanish source data and annotations for reasoning about complex real-world events

CALL MY NET 2: 800+ hours of Tunisian-Arabic conversational telephone speech from over 400 speakers to support text-independent speaker recognition, used in the 2018 NIST Speaker Recognition Evaluation

Multi-language conversational telephone speech: multiple releases, hundreds of hours of speech from speakers of confusable linguistic varieties (Arabic, Chinese, English, French, Slavic, Spanish) to support language identification

CALLHOME omnibus releases: combined speech and transcript datasets with updated directory structure, file formats and documentation, and lexicons (Chinese, English, German, Japanese, Spanish)

IARPA MATERIAL language packs: conversational telephone speech, transcripts, English translations, annotations and queries in multiple languages (e.g., Lithuanian, Pashto, Swahili, Tagalog)

For full descriptions of all LDC data sets, browse our Catalog. Visit Join LDC for details on membership, user accounts and payment.

Spring 2026 data scholarship application deadline
Applications are now being accepted through January 15, 2026 for the Spring 2026 LDC data scholarship program which provides university students with no-cost access to LDC data. Consult the LDC Data Scholarships page for more information about program rules and submission requirements.

New publications:
AnnoDIFP (Annotated Data for the Investigation of Facets of Personality) CTS (Conversational Telephone Speech) Audio and Transcripts was developed by LDC, the Florida Institute of Technology and the University of New Haven to support algorithm development for predicting personality traits. It contains 242.52 hours of English telephone audio recordings and transcripts from 1,179 telephone calls involving 327 participants paired with scores from two self-reported personality assessments, HEXACO Personality Inventory (Revised) (HEXACO-PI-R) and Short Dark Triad (SD3).

This corpus contains audio and transcripts for 277 participants and transcripts only for 50 participants. Telephone calls were collected using LDC's robot-operator platform. The operator called participants every 24 hours during their indicated availability and paired them with another participant to speak on a prompted topic for 10 minutes. Transcripts were produced automatically using the Rev.ai speech-to-text service.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

LORELEI Ilocano Incident Language Pack was developed by LDC and is comprised of 8.9 million words of Ilocano monolingual text, 3.3 million words of English monolingual text, 3.2 million words of parallel Ilocano-English text, and 3 million words annotated for entity discovery and linking and situation frames. It constitutes all of the text data, annotations, supplemental resources and related software tools for the Ilocano language used in the DARPA LORELEI / LoReHLT 2019 Evaluation.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Wednesday, October 15, 2025

LDC October 2025 Newsletter

Membership year 2026 publication preview

Fall 2025 data scholarship recipients

New publications:

KAIROS Phase 2 Quizlet

BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Audio

BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Transcripts and Translations

_____________________________________________________________

Membership year 2026 publication preview

The 2026 membership year is approaching and plans for next year’s publications are in progress. Among the expected releases are:

2012 NIST Speaker Recognition Evaluation Test Set: 10,000+ hours of English conversational telephone speech following the Mixer collection protocol, used in NIST’s 2012 speaker recognition evaluation

KAIROS schema learning corpus background data and Phase 1 evaluation datasets: multimodal English and Spanish source data and annotations for reasoning about complex real-world events

CALL MY NET 2: 800+ hours of Tunisian-Arabic conversational telephone speech from over 400 speakers to support text-independent speaker recognition, used in the 2018 NIST Speaker Recognition Evaluation

Multi-language conversational telephone speech: multiple releases, hundreds of hours of speech from speakers of confusable linguistic varieties (Arabic, Chinese, English, French, Slavic, Spanish) to support language identification

CALLHOME Omnibus releases: combined speech and transcript datasets with updated directory structure, file formats and documentation, and lexicons (Chinese, English, German, Japanese, Spanish)

IARPA MATERIAL language packs: conversational telephone speech, transcripts, English translations, annotations and queries in multiple languages (e.g., Lithuanian, Pashto, Swahili, Tagalog)

Check your inbox for more information about membership renewal.

Fall 2025 data scholarship recipients

Congratulations to the recipients of LDC's Fall 2025 data scholarships:

Lasidu Dilshan: University of Moratuwa (Sri Lanka): BSc, Electronic and Telecommunication Engineering. Lasidu is awarded a copy of Asian Elephant Vocalizations LDC2010S05 for his work in elephant voice enhancement and classification.

Máté Gedeon: Budapest University of Technology and Economics (Hungary): PhD candidate, Department of Telecommunications and Artificial Intelligence. Máté is awarded a copy of Switchboard-1 Release 2 LDC97S62 for his work in simulated conversation generation.

Ping He: Northeastern University (USA): Student, Khoury College of Computer Sciences. Ping is awarded a copy of ETS Corpus of Non-Native Written English LDC2014T06 for their work in native language identification.

Thiyazen Iskander: Maulana Azad College of Arts, Science & Commerce (India), affiliated with Babasaheb Ambedkar Technological University (India): PhD candidate, Linguistics, Department of English. Thiyazen is awarded copies of Arabic Morphological Analyzer (SAMA) Version 3.1 LDC2010L01 and Arabic Treebank Part 1 v. 4.1 LDC2010T13 for his work in morphosyntactic analysis of short passives in Standard Arabic.

Michael Mooney: University of Glasgow (United Kingdom): PhD candidate, School of Computing Sciences. Michael is awarded copies of Treebank-2 LDC95T7 and BLLIP 1987-89 WSJ Corpus Release LDC2000T43 for their work in eye-tracking for text-centered modeling.

Abraham Sanders: Rensselaer Polytechnic Institute (USA): PhD candidate, Cognitive Science. Abraham is awarded a copy of Switchboard-1 Release 2 LDC97S62 for his work in spoken dialogue systems.

New publications:

KAIROS Phase 2 Quizlet was developed by LDC and contains English and Spanish text, video and image data and annotations used for pre-evaluation research and system development during Phase 2 of the DARPA KAIROS program. KAIROS Quizlets were a series of narrowly defined tasks designed to explore specific evaluation objectives enabling KAIROS system developers to exercise individual system components on a small data set prior to the full program evaluation. This corpus contains the complete set of Quizlet data used in Phase 2 which focused on five real-world complex events within the Disease Outbreak scenario.

Source data was collected from the web; 66 root web pages were collected and processed, yielding 65 text data files, 890 image files and 10 video files. Annotation steps included labeling scenario-relevant events and relations for each document to develop a structured representation of temporally ordered events, relations and arguments; generating a reference knowledge graph; and linking labeled entries to a knowledge base derived from a Wikidata-based ontology.

The DARPA KAIROS (Knowledge-directed Artificial Intelligence Reasoning Over Schemas) program aimed to build technology capable of understanding and reasoning about complex real-world events in order to provide actionable insights to end users. KAIROS systems utilized formal event representations in the form of schema libraries that specified the steps, preconditions and constraints for an open set of complex events; schemas were then used in combination with event extraction to characterize and make predictions about real-world events in a large multilingual, multimedia corpus.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Audio was developed by LDC and consists of 116 hours of speech from 274 unscripted telephone conversations between native speakers of the Arabic dialect spoken in Egypt. The calls were collected by LDC in the CALLFRIEND and CALLHOME series where participants called family members or close friends and spoke on topics of their choice. Around 33% of the recordings (92 calls) are publicly released for the first time. The remaining 182 recordings were previously published by LDC in various CALLFRIEND, CALLHOME and HUB5 Arabic datasets.

The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, conversational telephone speech, text messaging and chat -- in Chinese, Egyptian Arabic and English. The material in this release represents the unannotated Egyptian Arabic source conversational telephone speech. The telephone data was transcribed, translated, and annotated for various tasks in the BOLT program including word alignment, treebanking, and co-reference.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Transcripts and Translations contains transcripts and corresponding English translations for the conversational telephone speech in BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Audio and was developed by LDC to support the DARPA BOLT program.

Transcribers were required to produce a verbatim transcript of all speech within a file using the CODA orthographic approach; diacritics were not included. Some transcripts contain redactions for potential personally identifying information. All speech data was transcribed and is divided into training, development, and evaluation partitions.

The goal of the BOLT translation task was to translate the Arabic transcripts into fluent English while preserving the meaning present in the original Arabic text. Transcripts in the development and evaluation partitions received first pass and gold standard translations. 99% of the transcripts were translated into English.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Monday, September 15, 2025

LDC September 2025 Newsletter

LDC data and commercial technology development

New publications:

Mixer 7 English Speech

AIDA Scenario 1 Evaluation Topic Source Data, Annotation and Assessment

LORELEI Hindi Representative Language Pack

_____________________________________________________________

LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

Mixer 7 English Speech was developed by LDC and contains 12,321 hours of audio recordings of interviews, transcript readings and conversational telephone speech involving 222 distinct English speakers. This material was collected by LDC in 2010-2011 as part of the Mixer project, and the recordings were used in the 2012 NIST SRE test set.

Recruited speakers were connected through a robot operator to carry on casual conversations on a pre-set topic lasting up to 10 minutes. Participants also visited LDC’s human subjects collection lab equipped with a 14-microphone array where they participated in interviews and transcript readings and conducted telephone calls under varying conditions. Selected speaker metadata was also collected.

2025 members can access this corpus through their LDC accounts. This corpus is a members-only release and is not available for non-member licensing. Contact ldc@ldc.upenn.edu for information about membership.

AIDA Scenario 1 Evaluation Topic Source Data, Annotation and Assessment was developed by LDC and is comprised of English, Russian and Ukrainian web documents (text, video, image), annotations and assessments used in the AIDA Phase 1 pilot and final evaluations. The Phase 1 scenario focused on political relations between Russia and Ukraine in the 2010s. The material in this corpus covers the following events: Suspicious Deaths and Murders in Ukraine (January-April 2015); Odessa Tragedy (May 2, 2014); and Siege of Sloviansk and Battle of Kramatorsk (April-July 2014).

The corpus contains 10,522 documents, annotations for 386 of those documents, and assessment results covering 77,965 responses in 1,525 of those documents. Annotations were performed in three steps: (1) within-document labels for scenario-related entities, relations and events; (2) coreference annotation across documents by linking information elements to a knowledge base; and (3) indications of any relationship between labeled events/relations and hypotheses about the scenario. In the assessment phase, LDC annotators reviewed and judged system response files to provide evaluation organizers with a means for scoring submissions. Assessment tasks included zero-hop assessment, class-based assessment, graph assessment and hypothesis assessment.

The DARPA AIDA (Active Interpretation of Disparate Alternatives) program aimed to develop a multi-hypothesis semantic engine to generate explicit alternative interpretations of events, situations and trends from a variety of unstructured sources. LDC supported AIDA by collecting, creating and annotating multimodal linguistic resources in multiple languages.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

LORELEI Hindi Representative Language Pack contains over 26 million words of Hindi monolingual text, 363,00 words of which were translated into English, 1.07 million words of found Hindi-English parallel text, and 118,000 Hindi words translated from English data. Approximately 103,000 words were annotated for simple named entities and over 25,000 words were annotated for full entity (including nominals and pronouns), entity linking and situation frames (identifying entities, needs and issues). Data was collected from discussion forum, news, reference, social network, and weblogs.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Friday, August 15, 2025

LDC August 2025 Newsletter

LDC at Interspeech 2025

Fall 2025 LDC data scholarship program

New publications:

Mixer 6 – CHiME 8 Transcribed Calls and Interviews

Abstract Meaning Representation 2.0 – Machine Translations

KAIROS Phase 1 Quizlet

________________________________________________________

LDC at Interspeech 2025
LDC will be exhibiting at Interspeech 2025, held this year August 17-21 in Rotterdam, the Netherlands. Stop by our booth to say hello and learn about the latest developments at the Consortium. Also be on the lookout for the following presentations, posters and special sessions featuring LDC work:

Comparative Evaluation of Acoustic Feature Extraction Tools for Clinical Speech Analysis
Monday, August 18, 11:00-13:00 - Area5-Oral1 - Speech Analysis, Detection and Classification 1

Reasoning-Based Approach with Chain-of-Thought for Alzheimer’s Detection Using Speech and Large Language Models
Tuesday, August 19, 13:30-15:30 - Area1-Poster2B - Databases and Progress in Methodology

Special Session: Challenges in Speech Collection, Curation and Annotation
Wednesday, August 20, 13:30-15:30 - Area14-SS7 – Part 1
Wednesday, August 20, 16:00-18:00 - Area14-SS8 – Part 2

TELVID: A Multilingual Multi-modal Corpus for Speaker Recognition
Thursday, August 21, 13:30-15:30 - Area4-Oral8 – Speaker Recognition

LDC also supported the Interspeech 2025 URGENT Challenge which aims to bring more attention to constructing Universal, Robust and Generalizable speech EnhancemeNT models.

LDC will post conference updates via our social media platforms. We look forward to seeing you in Rotterdam!

Fall 2025 LDC data scholarship program
Student applications for the Fall 2025 LDC data scholarship program are being accepted now through September 15, 2025. This program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and letter of support from their advisor. For application requirements and program rules, visit the LDC Data Scholarships page.

New publications:

Mixer 6 - CHiME 8 Transcribed Calls and Interviews was developed for the 7th and 8th CHiME (Computational Hearing in Multisource Environments) challenges. It contains 80 hours of English interviews and telephone speech from Mixer 6 Speech (LDC2013S03) with transcripts developed for the CHiME challenges divided into training, development and test sets. This data was used in CHiME 7 Task 1 and CHiME 8 Task 1 both of which focused on transcription and segmentation across varied recording conditions such as interviews, meetings, and dinner parties, with an emphasis on generalization across recording device types and array topologies.

The data includes audio from Mixer 6 Speech recorded on 13 microphones for a total of 1,063 hours (corresponding to 80 hours of speech). The development and test sets are speaker-disjoint from the training data and consist of fully transcribed, multi-microphone interviews. Each transcript segment was labeled with the speaker, the uttered text, and the start and end times in seconds for that segment.
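To make the segment labeling described above concrete, here is a minimal Python sketch of one way such a segment could be represented and summed per speaker. The `Segment` class, the `speech_seconds_by_speaker` helper and the sample values are all hypothetical and invented for illustration. The newsletter does not specify the on-disk transcript format, so the actual field names and file layout should be taken from the corpus documentation.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Segment:
    """A transcript segment: speaker, uttered text, start/end in seconds."""
    speaker: str
    text: str
    start: float
    end: float


def speech_seconds_by_speaker(segments: Iterable[Segment]) -> Dict[str, float]:
    """Sum segment durations (end - start) for each speaker."""
    totals: Dict[str, float] = defaultdict(float)
    for seg in segments:
        if seg.end < seg.start:
            raise ValueError(f"segment ends before it starts: {seg!r}")
        totals[seg.speaker] += seg.end - seg.start
    return dict(totals)


# Invented segments for illustration only; not taken from the corpus.
SEGMENTS = [
    Segment("spk1", "hello there", 0.0, 1.5),
    Segment("spk2", "hi", 1.5, 2.0),
    Segment("spk1", "how are you", 2.0, 4.0),
]
totals = speech_seconds_by_speaker(SEGMENTS)
print(totals)
```

Keeping times as seconds in floating point mirrors the description in the text. Overlapping speech, if present, would be counted once per speaker under this simple summation.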

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Abstract Meaning Representation 2.0 - Machine Translations was developed at the University of Edinburgh, School of Informatics and the University of Zurich, Department of Computational Linguistics. It consists of Spanish, German, Italian and Mandarin Chinese automatic translations of the source English and professionally-translated Spanish, German, Italian and Mandarin Chinese sentences in Abstract Meaning Representation 2.0 - Four Translations (LDC2020T07). The translations were collected through Google Translate between May 2018 and March 2024.

The source English sentences are a subset (1,371 sentences) of the sentences contained in Abstract Meaning Representation (AMR) Annotation Release 2.0 (LDC2017T10), a semantic treebank of over 39,000 English natural language sentences from broadcast conversations, newswire and web text.

Translations were from each of the five languages (English, Spanish, German, Italian and Mandarin Chinese) to the other four languages (Spanish, German, Italian and Mandarin Chinese) covering 20 language pairs. The dataset contains 1,371 source sentences in each language, each with a professionally translated source sentence and multiple dated translations by Google Translate.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

KAIROS Phase 1 Quizlet was developed by LDC and contains English and Spanish text, video and image data and annotations used for pre-evaluation research and system development during Phase 1 of the DARPA KAIROS program. KAIROS Quizlets were a series of narrowly defined tasks designed to explore specific evaluation objectives enabling KAIROS system developers to exercise individual system components on a small data set prior to the full program evaluation. This corpus contains the complete set of Quizlet data used in Phase 1 which focused on two real-world complex events (CEs) within the Improvised Explosive Device bombing scenario: CE1001 (2018 Caracas drone attack) and CE1002 (Utah High School backpack bombing).

Source data was collected from the web; 30 root web pages were collected and processed, yielding 29 text data files, 216 image files and 5 video files. Annotation steps included labeling scenario-relevant events and relations for each document to develop a structured representation of temporally ordered events, relations and arguments and generating a reference knowledge graph.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Tuesday, July 15, 2025

LDC July 2025 Newsletter

Fall 2025 LDC data scholarship program

New publications:

AnnoDIFP Session Audio and Transcripts
Penn Parsed Corpora of Historical English Second Release
LoReHLT Uzbek Representative Language Pack
_________________________________________________________________________

Fall 2025 LDC data scholarship program
Student applications for the Fall 2025 LDC data scholarship program are being accepted now through September 15, 2025. This program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and letter of support from their advisor. For application requirements and program rules, visit the LDC Data Scholarships page.

New publications:
AnnoDIFP (Annotated Data for the Investigation of Facets of Personality) Session Audio and Transcripts was developed by LDC, the Florida Institute of Technology (FIT), and the University of New Haven (UNH) to support algorithm development for predicting personality traits. It contains 438.34 hours of English audio and transcripts from in-person interviews of 366 participants paired with scores from two self-reported personality assessments, HEXACO Personality Inventory (Revised) (HEXACO-PI-R) and Short Dark Triad (SD3).

In-person interviews were recorded at LDC, FIT and UNH. In each session, the participant and interviewer were in separate sound-isolated rooms with communication between them supplied by audio/video hardware. Sessions consisted of the following tasks: rapport building, a YouTube task, a map task, and a business task. Further details on collection methodology and session tasks are contained in the documentation accompanying this release.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Penn Parsed Corpora of Historical English Second Release was developed at the University of Pennsylvania and consists of running texts and text samples of British English prose from the earliest Middle English documents (1100 CE) up to the period of the First World War (1914 CE). This second release corrects errors and inconsistencies in Penn Parsed Corpora of Historical English (LDC2020T16), further streamlines annotation, simplifies the directory structure, and includes updated documentation.

This data set contains three corpora covering traditionally recognized periods of English:

The Penn-Helsinki Parsed Corpus of Middle English, second edition

The Penn-Helsinki Parsed Corpus of Early Modern English

The Penn Parsed Corpus of Modern British English, second edition

The texts are in two forms: part-of-speech tagged text and syntactically annotated text. Annotations were manually reviewed for accuracy and consistency. Included in this release are updated annotation guidelines, philological information for each corpus and the CorpusSearch 2 program, which allows users to search the data for words, word sequences and syntactic structure.

2025 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.

LoReHLT Uzbek Representative Language Pack was developed by LDC and comprises approximately 47 million words of Uzbek monolingual text, 563,000 words of found Uzbek-English parallel text, 100,000 Uzbek words translated from English data, and 6.4 hours of Uzbek broadcast news and amateur web audio recordings. Approximately 151,000 words were annotated for named entities and over 28,000 words were annotated for full entity including nominals and pronouns. Noun-phrase chunking was applied to more than 13,000 words. Over 20,890 words were labeled with simple semantic annotation. Topic annotation was applied to the audio recordings. Data was collected from discussion forums, news, reference, social network, broadcast news, web audio recordings, and weblogs.

LoReHLT was a companion project of the DARPA LORELEI program. The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.