Linguistic Data Consortium: 2025

Monday, November 17, 2025

LDC November 2025 Newsletter

Join LDC for membership year 2026

Spring 2026 data scholarship application deadline

New publications:

AnnoDIFP CTS Audio and Transcripts

LORELEI Ilocano Incident Language Pack
_______________________________________________________________

Join LDC for membership year 2026

It’s time to renew your LDC membership for 2026. Any organization that joins the Consortium or renews its membership before March 2, 2026, will receive a 10% discount on the membership fee.

In addition to accessing new publications, current LDC members can license older data from our Catalog of close to 1,000 holdings at reduced fees. Current-year for-profit members may use most data for commercial applications.

Plans for next year’s publications are in progress. Among the expected releases are:

2012 NIST Speaker Recognition Evaluation Test Set: 10,000+ hours of English conversational telephone speech following the Mixer collection protocol, used in NIST’s 2012 speaker recognition evaluation

KAIROS schema learning corpus background data and Phase 1 evaluation datasets: multimodal English and Spanish source data and annotations for reasoning about complex real-world events

CALL MY NET 2: 800+ hours of Tunisian Arabic conversational telephone speech from over 400 speakers to support text-independent speaker recognition, used in the 2018 NIST Speaker Recognition Evaluation

Multi-language conversational telephone speech: multiple releases, hundreds of hours of speech from speakers of confusable linguistic varieties (Arabic, Chinese, English, French, Slavic, Spanish) to support language identification

CALLHOME omnibus releases: combined speech and transcript datasets with updated directory structure, file formats and documentation, and lexicons (Chinese, English, German, Japanese, Spanish)

IARPA MATERIAL language packs: conversational telephone speech, transcripts, English translations, annotations and queries in multiple languages (e.g., Lithuanian, Pashto, Swahili, Tagalog)

For full descriptions of all LDC data sets, browse our Catalog. Visit Join LDC for details on membership, user accounts and payment.

Spring 2026 data scholarship application deadline
Applications are now being accepted through January 15, 2026, for the Spring 2026 LDC data scholarship program, which provides university students with no-cost access to LDC data. Consult the LDC Data Scholarships page for more information about program rules and submission requirements.

New publications:
AnnoDIFP (Annotated Data for the Investigation of Facets of Personality) CTS (Conversational Telephone Speech) Audio and Transcripts was developed by LDC, the Florida Institute of Technology and the University of New Haven to support algorithm development for predicting personality traits. It contains 242.52 hours of English telephone audio recordings and transcripts from 1,179 telephone calls involving 327 participants paired with scores from two self-reported personality assessments, HEXACO Personality Inventory (Revised) (HEXACO-PI-R) and Short Dark Triad (SD3).

This corpus contains audio and transcripts for 277 participants and transcripts only for 50 participants. Telephone calls were collected using LDC's robot-operator platform. The operator called participants every 24 hours during their indicated availability and paired them with another participant to speak on a prompted topic for 10 minutes. Transcripts were produced automatically using the Rev.ai speech-to-text service.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

LORELEI Ilocano Incident Language Pack was developed by LDC and consists of 8.9 million words of Ilocano monolingual text, 3.3 million words of English monolingual text, 3.2 million words of parallel Ilocano-English text, and 3 million words annotated for entity discovery and linking and situation frames. It constitutes all of the text data, annotations, supplemental resources and related software tools for the Ilocano language used in the DARPA LORELEI / LoReHLT 2019 Evaluation.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. In the evaluation scenario, an unforeseen event triggered a need for humanitarian and logistical support in a region where the incident language had received little or no attention in NLP research. Evaluation participants provided NLP solutions, including information extraction and machine translation, with limited resources and limited development time.

Data was collected from news, social network, weblog, newsgroup, discussion forum, and reference material. Entity discovery and linking annotation identified entities to be detected by systems for scoring purposes. Situation frame analysis was designed to extract basic information about needs and relevant issues for planning a disaster response effort.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Wednesday, October 15, 2025

LDC October 2025 Newsletter

Membership year 2026 publication preview

Fall 2025 data scholarship recipients

New publications:

KAIROS Phase 2 Quizlet

BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Audio

BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Transcripts and Translations

_____________________________________________________________

Membership year 2026 publication preview

The 2026 membership year is approaching and plans for next year’s publications are in progress. Among the expected releases are:

2012 NIST Speaker Recognition Evaluation Test Set: 10,000+ hours of English conversational telephone speech following the Mixer collection protocol, used in NIST’s 2012 speaker recognition evaluation

KAIROS schema learning corpus background data and Phase 1 evaluation datasets: multimodal English and Spanish source data and annotations for reasoning about complex real-world events

CALL MY NET 2: 800+ hours of Tunisian Arabic conversational telephone speech from over 400 speakers to support text-independent speaker recognition, used in the 2018 NIST Speaker Recognition Evaluation

Multi-language conversational telephone speech: multiple releases, hundreds of hours of speech from speakers of confusable linguistic varieties (Arabic, Chinese, English, French, Slavic, Spanish) to support language identification

CALLHOME Omnibus releases: combined speech and transcript datasets with updated directory structure, file formats and documentation, and lexicons (Chinese, English, German, Japanese, Spanish)

IARPA MATERIAL language packs: conversational telephone speech, transcripts, English translations, annotations and queries in multiple languages (e.g., Lithuanian, Pashto, Swahili, Tagalog)

Check your inbox for more information about membership renewal.

Fall 2025 data scholarship recipients

Congratulations to the recipients of LDC's Fall 2025 data scholarships:

Lasidu Dilshan: University of Moratuwa (Sri Lanka): BSc, Electronic and Telecommunication Engineering. Lasidu is awarded a copy of Asian Elephant Vocalizations LDC2010S05 for his work in elephant voice enhancement and classification.

Máté Gedeon: Budapest University of Technology and Economics (Hungary): PhD candidate, Department of Telecommunications and Artificial Intelligence. Máté is awarded a copy of Switchboard-1 Release 2 LDC97S62 for his work in simulated conversation generation.

Ping He: Northeastern University (USA): Student, Khoury College of Computer Sciences. Ping is awarded a copy of ETS Corpus of Non-Native Written English LDC2014T06 for their work in native language identification.

Thiyazen Iskander: Maulana Azad College of Arts, Science & Commerce (India), affiliated with Babasaheb Ambedkar Technological University (India): PhD candidate, Linguistics, Department of English. Thiyazen is awarded copies of Arabic Morphological Analyzer (SAMA) Version 3.1 LDC2010L01 and Arabic Treebank Part 1 v. 4.1 LDC2010T13 for his work in morphosyntactic analysis of short passives in Standard Arabic.

Michael Mooney: University of Glasgow (United Kingdom): PhD candidate, School of Computing Sciences. Michael is awarded copies of Treebank-2 LDC95T7 and BLLIP 1987-89 WSJ Corpus Release LDC2000T43 for their work in eye-tracking for text-centered modeling.

Abraham Sanders: Rensselaer Polytechnic Institute (USA): PhD candidate, Cognitive Science. Abraham is awarded a copy of Switchboard-1 Release 2 LDC97S62 for his work in spoken dialogue systems.

New publications:

KAIROS Phase 2 Quizlet was developed by LDC and contains English and Spanish text, video and image data and annotations used for pre-evaluation research and system development during Phase 2 of the DARPA KAIROS program. KAIROS Quizlets were a series of narrowly defined tasks designed to explore specific evaluation objectives enabling KAIROS system developers to exercise individual system components on a small data set prior to the full program evaluation. This corpus contains the complete set of Quizlet data used in Phase 2 which focused on five real-world complex events within the Disease Outbreak scenario.

Source data was collected from the web; 66 root web pages were collected and processed, yielding 65 text data files, 890 image files and 10 video files. Annotation steps included labeling scenario-relevant events and relations for each document to develop a structured representation of temporally ordered events, relations and arguments; generating a reference knowledge graph; and linking labeled entries to a knowledge base derived from a Wikidata-based ontology.

The DARPA KAIROS (Knowledge-directed Artificial Intelligence Reasoning Over Schemas) program aimed to build technology capable of understanding and reasoning about complex real-world events in order to provide actionable insights to end users. KAIROS systems utilized formal event representations in the form of schema libraries that specified the steps, preconditions and constraints for an open set of complex events; schemas were then used in combination with event extraction to characterize and make predictions about real-world events in a large multilingual, multimedia corpus.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Audio was developed by LDC and consists of 116 hours of speech from 274 unscripted telephone conversations between native speakers of the Arabic dialect spoken in Egypt. The calls were collected by LDC in the CALLFRIEND and CALLHOME series where participants called family members or close friends and spoke on topics of their choice. Around 33% of the recordings (92 calls) are publicly released for the first time. The remaining 182 recordings were previously published by LDC in various CALLFRIEND, CALLHOME and HUB5 Arabic datasets.

The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, conversational telephone speech, text messaging and chat -- in Chinese, Egyptian Arabic and English. The material in this release represents the unannotated Egyptian Arabic source conversational telephone speech. The telephone data was transcribed, translated, and annotated for various tasks in the BOLT program including word alignment, treebanking, and co-reference.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Transcripts and Translations contains transcripts and corresponding English translations for the conversational telephone speech in BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Audio and was developed by LDC to support the DARPA BOLT program.

Transcribers were required to produce a verbatim transcript of all speech within a file using the CODA orthographic approach; diacritics were not included. Some transcripts contain redactions for potential personally identifying information. All speech data was transcribed and is divided into training, development, and evaluation partitions.

The goal of the BOLT translation task was to translate the Arabic transcripts into fluent English while preserving the meaning present in the original Arabic text. Transcripts in the development and evaluation partitions received first pass and gold standard translations. 99% of the transcripts were translated into English.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Monday, September 15, 2025

LDC September 2025 Newsletter

LDC data and commercial technology development

New publications:

Mixer 7 English Speech

AIDA Scenario 1 Evaluation Topic Source Data, Annotation and Assessment

LORELEI Hindi Representative Language Pack

_____________________________________________________________

LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

Mixer 7 English Speech was developed by LDC and contains 12,321 hours of audio recordings of interviews, transcript readings and conversational telephone speech involving 222 distinct English speakers. This material was collected by LDC in 2010-2011 as part of the Mixer project, and the recordings were used in the 2012 NIST SRE test set.

Recruited speakers were connected through a robot operator to carry on casual conversations on a pre-set topic lasting up to 10 minutes. Participants also visited LDC’s human subjects collection lab equipped with a 14-microphone array where they participated in interviews and transcript readings and conducted telephone calls under varying conditions. Selected speaker metadata was also collected.

2025 members can access this corpus through their LDC accounts. This corpus is a members-only release and is not available for non-member licensing. Contact ldc@ldc.upenn.edu for information about membership.

AIDA Scenario 1 Evaluation Topic Source Data, Annotation and Assessment was developed by LDC and consists of English, Russian and Ukrainian web documents (text, video, image), annotations and assessments used in the AIDA Phase 1 pilot and final evaluations. The Phase 1 scenario focused on political relations between Russia and Ukraine in the 2010s. The material in this corpus covers the following events: Suspicious Deaths and Murders in Ukraine (January-April 2015); Odessa Tragedy (May 2, 2014); and Siege of Sloviansk and Battle of Kramatorsk (April-July 2014).

The corpus contains 10,522 documents, annotations for 386 of those documents, and assessment results covering 77,965 responses in 1,525 of those documents. Annotations were performed in three steps: (1) within-document labels for scenario-related entities, relations and events; (2) coreference annotation across documents by linking information elements to a knowledge base; and (3) indications of any relationship between labeled events/relations and hypotheses about the scenario. In the assessment phase, LDC annotators reviewed and judged system response files to provide evaluation organizers with a means for scoring submissions. Assessment tasks included zero-hop assessment, class-based assessment, graph assessment and hypothesis assessment.

The DARPA AIDA (Active Interpretation of Disparate Alternatives) program aimed to develop a multi-hypothesis semantic engine to generate explicit alternative interpretations of events, situations and trends from a variety of unstructured sources. LDC supported AIDA by collecting, creating and annotating multimodal linguistic resources in multiple languages.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

LORELEI Hindi Representative Language Pack contains over 26 million words of Hindi monolingual text, 363,000 words of which were translated into English, 1.07 million words of found Hindi-English parallel text, and 118,000 Hindi words translated from English data. Approximately 103,000 words were annotated for simple named entities and over 25,000 words were annotated for full entity (including nominals and pronouns), entity linking and situation frames (identifying entities, needs and issues). Data was collected from discussion forum, news, reference, social network, and weblogs.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Friday, August 15, 2025

LDC August 2025 Newsletter

LDC at Interspeech 2025

Fall 2025 LDC data scholarship program

New publications:

Mixer 6 - CHiME 8 Transcribed Calls and Interviews

Abstract Meaning Representation 2.0 - Machine Translations

KAIROS Phase 1 Quizlet

________________________________________________________

LDC at Interspeech 2025
LDC will be exhibiting at Interspeech 2025, held this year August 17-21 in Rotterdam, the Netherlands. Stop by our booth to say hello and learn about the latest developments at the Consortium. Also be on the lookout for the following presentations, posters and special sessions featuring LDC work:

Comparative Evaluation of Acoustic Feature Extraction Tools for Clinical Speech Analysis
Monday, August 18, 11:00-13:00 - Area5-Oral1 - Speech Analysis, Detection and Classification 1

Reasoning-Based Approach with Chain-of-Thought for Alzheimer’s Detection Using Speech and Large Language Models
Tuesday, August 19, 13:30-15:30 - Area1-Poster2B - Databases and Progress in Methodology

Special Session: Challenges in Speech Collection, Curation and Annotation
Wednesday, August 20, 13:30-15:30 - Area14-SS7 – Part 1
Wednesday, August 20, 16:00-18:00 - Area14-SS8 – Part 2

TELVID: A Multilingual Multi-modal Corpus for Speaker Recognition
Thursday, August 21, 13:30-15:30 - Area4-Oral8 – Speaker Recognition

LDC also supported the Interspeech 2025 URGENT Challenge, which aims to bring more attention to constructing Universal, Robust and Generalizable speech EnhancemeNT models.

LDC will post conference updates via our social media platforms. We look forward to seeing you in Rotterdam!

Fall 2025 LDC data scholarship program
Student applications for the Fall 2025 LDC data scholarship program are being accepted now through September 15, 2025. This program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and letter of support from their advisor. For application requirements and program rules, visit the LDC Data Scholarships page.

New publications:

Mixer 6 - CHiME 8 Transcribed Calls and Interviews was developed for the 7th and 8th CHiME (Computational Hearing in Multisource Environments) challenges. It contains 80 hours of English interviews and telephone speech from Mixer 6 Speech (LDC2013S03) with transcripts developed for the CHiME challenges divided into training, development and test sets. This data was used in CHiME 7 Task 1 and CHiME 8 Task 1, both of which focused on transcription and segmentation across varied recording conditions such as interviews, meetings, and dinner parties, with an emphasis on generalization across recording device types and array topologies.

The data includes audio from Mixer 6 Speech recorded on 13 microphones for a total of 1063 hours (corresponding to 80 hours of speech). The development and test sets are speaker-disjoint from the training data and consist of fully transcribed, multi-microphone interviews. Each transcript segment was labeled with the speaker, the uttered text, and the start and end times in seconds for that segment.
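
As a rough illustration of the segment labels described above, the following Python sketch models one transcript segment and sums segment durations. The class name, field names, helper function and sample values are assumptions made for illustration only; they do not reflect the actual file format of the release, which is described in the corpus documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranscriptSegment:
    """Hypothetical container for one labeled transcript segment."""
    speaker: str       # speaker label
    text: str          # uttered text
    start_time: float  # segment start, in seconds
    end_time: float    # segment end, in seconds

    def duration(self) -> float:
        """Return the segment length in seconds."""
        return self.end_time - self.start_time


def total_speech_hours(segments: list[TranscriptSegment]) -> float:
    """Sum the segment durations and convert seconds to hours."""
    return sum(s.duration() for s in segments) / 3600.0


# Invented sample values, not taken from the corpus.
segments = [
    TranscriptSegment("spk_A", "hello there", 0.0, 1.5),
    TranscriptSegment("spk_B", "hi how are you", 1.5, 3.5),
]
```

Keeping start and end times as seconds, as the release does, makes it straightforward to total the transcribed speech or to align a segment against any of the microphone channels.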

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Abstract Meaning Representation 2.0 - Machine Translations was developed at the University of Edinburgh, School of Informatics and the University of Zurich, Department of Computational Linguistics. It consists of Spanish, German, Italian and Mandarin Chinese automatic translations of the source English and professionally-translated Spanish, German, Italian and Mandarin Chinese sentences in Abstract Meaning Representation 2.0 - Four Translations (LDC2020T07). The translations were collected through Google Translate between May 2018 and March 2024.

The source English sentences are a subset (1,371 sentences) of the sentences contained in Abstract Meaning Representation (AMR) Annotation Release 2.0 (LDC2017T10), a semantic treebank of over 39,000 English natural language sentences from broadcast conversations, newswire and web text.

Translations were from each of the five languages (English, Spanish, German, Italian and Mandarin Chinese) to the other four languages (Spanish, German, Italian and Mandarin Chinese) covering 20 language pairs. The dataset contains 1,371 source sentences in each language, each with a professionally translated source sentence and multiple dated translations by Google Translate.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

KAIROS Phase 1 Quizlet was developed by LDC and contains English and Spanish text, video and image data and annotations used for pre-evaluation research and system development during Phase 1 of the DARPA KAIROS program. KAIROS Quizlets were a series of narrowly defined tasks designed to explore specific evaluation objectives enabling KAIROS system developers to exercise individual system components on a small data set prior to the full program evaluation. This corpus contains the complete set of Quizlet data used in Phase 1 which focused on two real-world complex events (CEs) within the Improvised Explosive Device bombing scenario: CE1001 (2018 Caracas drone attack) and CE1002 (Utah High School backpack bombing).

Source data was collected from the web; 30 root web pages were collected and processed, yielding 29 text data files, 216 image files and 5 video files. Annotation steps included labeling scenario-relevant events and relations for each document to develop a structured representation of temporally ordered events, relations and arguments and generating a reference knowledge graph.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Tuesday, July 15, 2025

LDC July 2025 Newsletter

Fall 2025 LDC data scholarship program

New publications:

AnnoDIFP Session Audio and Transcripts
Penn Parsed Corpora of Historical English Second Release
LoReHLT Uzbek Representative Language Pack
_________________________________________________________________________

Fall 2025 LDC data scholarship program
Student applications for the Fall 2025 LDC data scholarship program are being accepted now through September 15, 2025. This program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and letter of support from their advisor. For application requirements and program rules, visit the LDC Data Scholarships page.

New publications:
AnnoDIFP (Annotated Data for the Investigation of Facets of Personality) Session Audio and Transcripts was developed by LDC, the Florida Institute of Technology (FIT), and the University of New Haven (UNH) to support algorithm development for predicting personality traits. It contains 438.34 hours of English audio and transcripts from in-person interviews of 366 participants paired with scores from two self-reported personality assessments, HEXACO Personality Inventory (Revised) (HEXACO-PI-R) and Short Dark Triad (SD3).

In-person interviews were recorded at LDC, FIT and UNH. In each session, the participant and interviewer were in separate sound-isolated rooms with communication between them supplied by audio/video hardware. Sessions consisted of the following tasks: rapport building, a YouTube task, a map task, and a business task. Further details on collection methodology and session tasks are contained in the documentation accompanying this release.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Penn Parsed Corpora of Historical English Second Release was developed at the University of Pennsylvania and consists of running texts and text samples of British English prose from the earliest Middle English documents (1100 CE) up to the period of the First World War (1914 CE). This second release corrects errors and inconsistencies in Penn Parsed Corpora of Historical English (LDC2020T16), further streamlines annotation, simplifies the directory structure, and includes updated documentation.

This data set contains three corpora covering traditionally recognized periods of English:

The Penn-Helsinki Parsed Corpus of Middle English, second edition

The Penn-Helsinki Parsed Corpus of Early Modern English

The Penn Parsed Corpus of Modern British English, second edition

The texts are in two forms: part-of-speech tagged text and syntactically annotated text. Annotations were manually reviewed for accuracy and consistency. Included in this release are updated annotation guidelines, philological information for each corpus and the CorpusSearch 2 program, which allows users to search the data for words, word sequences and syntactic structure.

2025 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.

LoReHLT Uzbek Representative Language Pack was developed by LDC and consists of approximately 47 million words of Uzbek monolingual text, 563,000 words of found Uzbek-English parallel text, 100,000 Uzbek words translated from English data, and 6.4 hours of Uzbek broadcast news and amateur web audio recordings. Approximately 151,000 words were annotated for named entities and over 28,000 words were annotated for full entity including nominals and pronouns. Noun-phrase chunking was applied to more than 13,000 words. Over 20,890 words were labeled with simple semantic annotation. Topic annotation was applied to the audio recordings. Data was collected from discussion forum, news, reference, social network, broadcast news, web audio recordings, and weblogs.

LoReHLT was a companion project of the DARPA LORELEI program. The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Monday, June 16, 2025

LDC June 2025 Newsletter

LDC data and commercial technology development

New publications:

Chinese Sentence Pattern Structure Treebank
IWSLT 2022-2023 Shared Task Training, Development and Test Set
KAIROS Schema Learning Complex Event Annotation

_______________________________________________________________________

New publications:

Chinese Sentence Pattern Structure Treebank was developed at Beijing Normal University and Peking University. It contains 5,016 sentences and 119,627 tokens syntactically annotated following the concept of sentence constituent analysis, which emphasizes sentence pattern structure. The source data consists of 27 chapters extracted from modern Mandarin and ancient Chinese works. There are three annotation layers: lexical sense and structural mode for dynamic words; syntactic structure for clauses; and inter-clause relations within complex sentences and sentence clusters. These structures can be visualized using the Jbw-viewer tool, which is included in the release.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

IWSLT 2022-2023 Shared Task Training, Development and Test Set was developed by LDC and contains 210 hours of Tunisian Arabic conversational telephone speech, transcripts, English translations, speaker metadata, and documentation. This material constitutes the training, development and test data used in the International Conference on Spoken Language Translation (IWSLT) Dialectal Speech Translation task (2022) and the Dialectal and Low-resource track (2023).

The telephone speech was collected by LDC in 2016-2017 from native speakers of Tunisian Arabic in Tunis. Speakers were recruited to make telephone calls to people in their social networks from a variety of noise conditions and handsets. Transcripts are orthographic following Buckwalter transliteration and cover 175 hours of the collected speech. IPA transcripts were added to a subset of the data.

All transcribed segments were translated into English.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

KAIROS Schema Learning Complex Event Annotation was developed by LDC to support the DARPA KAIROS program. It contains English and Spanish text, audio, video and image data labeled for 93 real-world complex events with event, relation and argument annotations linking to document provenance. Source data was collected from the web; 3,431 root web pages were collected and processed, yielding 1,919 text data files, 24,019 image files, 1,472 video files and 16 audio files.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Thursday, May 15, 2025

LDC May 2025 Newsletter

New publications:

BOLT CTS CALLFRIEND CALLHOME Mandarin Chinese Audio

BOLT CTS CALLFRIEND CALLHOME Mandarin Chinese Transcripts and Translations

___________________________________________________________

New publications:

BOLT CTS CALLFRIEND CALLHOME Mainland Mandarin Chinese Audio was developed by LDC and consists of 93 hours of speech from 236 unscripted telephone conversations between native speakers of the Mandarin Chinese dialect spoken in mainland China. The calls were collected by LDC in the CALLFRIEND and CALLHOME series where participants called family members or close friends and spoke on topics of their choice. Around 60% of the recordings (141 calls) are publicly released for the first time. The remaining 95 recordings were previously published by LDC in various CALLFRIEND, CALLHOME and HUB5 Mandarin datasets. The data is divided into training, development, and evaluation partitions.

The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, conversational telephone speech, text messaging and chat -- in Chinese, Egyptian Arabic and English. The material in this release represents the unannotated Chinese source conversational telephone speech. The telephone data was transcribed, translated, and annotated for various tasks in the BOLT program including word alignment, treebanking, and co-reference.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

BOLT CTS CALLFRIEND CALLHOME Mandarin Chinese Transcripts and Translations contains transcripts and corresponding English translations for the conversational telephone speech in BOLT CTS CALLFRIEND CALLHOME Mandarin Chinese Audio and was developed by LDC to support the DARPA BOLT program.

Transcribers were required to produce a verbatim transcript of all speech within a file using simplified Chinese orthography and to add minimal markup to capture salient features of the speech. Some transcripts include redactions for potential personally identifying information. All speech data was transcribed and is divided into training, development, and evaluation partitions.

The goal of the BOLT translation task was to translate the Chinese transcripts into fluent English while preserving the meaning present in the original Chinese text. Transcripts in the development and evaluation partitions received first-pass and gold-standard translations. Overall, 89% of the transcripts were translated into English.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Tuesday, April 15, 2025

LDC April 2025 Newsletter

LDC launches upgraded, mobile-friendly website

Connect with LDC on Bluesky

New publications:

DEFT Spanish Light and Rich ERE Annotation

MATERIAL Kazakh-English Language Pack

____________________________________________________________

LDC launches upgraded, mobile-friendly website
We are pleased to announce the launch of the newly upgraded LDC main website: https://www.ldc.upenn.edu/. Designed with a modern layout, the site now offers an improved experience across all devices. While the LDC Catalog, LDC user accounts, and LDC Submissions are not affected by this upgrade, they are now more accessible than ever from any page on the site. We invite you to explore the website and enjoy a smoother, more intuitive LDC web experience.

Connect with LDC on Bluesky
In addition to Facebook, X and LinkedIn, you can now connect with LDC on the microblogging platform Bluesky. Follow us today for the latest news, announcements and corpus releases from the Consortium.

New publications:

DEFT Spanish Light and Rich ERE Annotation was developed by LDC and consists of 158 Spanish discussion forum and newswire documents annotated for entities, relations, and events (ERE). Light ERE annotation labels mentions of a target set of entity types, along with the relations and events between and among those entities, including coreference. Rich ERE annotation expands types and tagging in the entities, relations, and events annotation tasks and replaces strict event coreference with a more loosely defined event hopper annotation. The source data consists of Spanish newswire text and Latin American discussion forum data from DEFT Spanish Treebank LDC2018T01. A total of 128 documents were annotated following Light ERE annotation guidelines, and 154 documents were labeled with Rich ERE annotation; 124 documents received both, which accounts for all 158 documents in the release.

DARPA's Deep Exploration and Filtering of Text (DEFT) program aimed to address remaining capability gaps in state-of-the-art natural language processing technologies related to inference, causal relationships and anomaly detection. LDC supported the DEFT program by collecting, creating and annotating a variety of data sources.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

MATERIAL Kazakh-English Language Pack was developed by Appen for the IARPA MATERIAL program and contains 57 hours of Kazakh conversational telephone speech, transcripts, English translations, annotations, and queries. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments. Transcripts cover approximately 17% of the speech files, and all transcripts were translated into English. This release also includes English queries and their relevance annotations.

The MATERIAL program focused on underserved languages, with the ultimate goal of building cross-language information retrieval systems that find speech and text content using English search queries.

2025 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.