Tuesday, July 15, 2025

LDC July 2025 Newsletter

Fall 2025 LDC data scholarship program 

New publications:
AnnoDIFP Session Audio and Transcripts 
Penn Parsed Corpora of Historical English Second Release
LoReHLT Uzbek Representative Language Pack
_________________________________________________________________________
 
Fall 2025 LDC data scholarship program 
Student applications for the Fall 2025 LDC data scholarship program are being accepted now through September 15, 2025. This program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and a letter of support from their advisor. For application requirements and program rules, visit the LDC Data Scholarships page.

New publications:
AnnoDIFP (Annotated Data for the Investigation of Facets of Personality) Session Audio and Transcripts was developed by LDC, the Florida Institute of Technology (FIT), and the University of New Haven (UNH) to support algorithm development for predicting personality traits. It contains 438.34 hours of English audio and transcripts from in-person interviews of 366 participants paired with scores from two self-reported personality assessments, HEXACO Personality Inventory (Revised) (HEXACO-PI-R) and Short Dark Triad (SD3).

In-person interviews were recorded at LDC, FIT and UNH. In each session, the participant and interviewer were in separate sound-isolated rooms with communication between them supplied by audio/video hardware. Sessions consisted of the following tasks: rapport building, a YouTube task, a map task, and a business task. Further details on collection methodology and session tasks are contained in the documentation accompanying this release.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

*

Penn Parsed Corpora of Historical English Second Release was developed at the University of Pennsylvania and consists of running texts and text samples of British English prose from the earliest Middle English documents (1100 CE) up to the period of the First World War (1914 CE). This second release corrects errors and inconsistencies in Penn Parsed Corpora of Historical English (LDC2020T16), further streamlines annotation, simplifies the directory structure, and includes updated documentation.

This data set contains three corpora covering traditionally recognized periods of English:

  • The Penn-Helsinki Parsed Corpus of Middle English, second edition
  • The Penn-Helsinki Parsed Corpus of Early Modern English
  • The Penn Parsed Corpus of Modern British English, second edition

The texts are in two forms: part-of-speech tagged text and syntactically annotated text. Annotations were manually reviewed for accuracy and consistency. Included in this release are updated annotation guidelines, philological information for each corpus and the CorpusSearch 2 program, which allows users to search the data for words, word sequences and syntactic structure. 

2025 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.

*

LoReHLT Uzbek Representative Language Pack was developed by LDC and comprises approximately 47 million words of Uzbek monolingual text, 563,000 words of found Uzbek-English parallel text, 100,000 Uzbek words translated from English data, and 6.4 hours of Uzbek broadcast news and amateur web audio recordings. Approximately 151,000 words were annotated for named entities and over 28,000 words were annotated for full entities, including nominals and pronouns. Noun-phrase chunking was applied to more than 13,000 words. Over 20,890 words were labeled with simple semantic annotation. Topic annotation was applied to the audio recordings. Data was collected from discussion forums, news, reference sources, social networks, broadcast news, web audio recordings, and weblogs.

LoReHLT was a companion project of the DARPA LORELEI program. The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.



Monday, June 16, 2025

LDC June 2025 Newsletter

LDC data and commercial technology development


New publications:

Chinese Sentence Pattern Structure Treebank
IWSLT 2022-2023 Shared Task Training, Development and Test Set
KAIROS Schema Learning Complex Event Annotation

_______________________________________________________________________


LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

Chinese Sentence Pattern Structure Treebank was developed at Beijing Normal University and Peking University. It contains 5,016 sentences and 119,627 tokens syntactically annotated following the concept of sentence constituent analysis, which emphasizes sentence pattern structure. The source data consists of 27 chapters extracted from modern Mandarin and ancient Chinese works. There are three annotation layers: lexical sense and structural mode for dynamic words; syntactic structure for clauses; and inter-clause relations within complex sentences and sentence clusters. These structures can be visualized using the Jbw-viewer tool, which is included in the release.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

*

IWSLT 2022-2023 Shared Task Training, Development and Test Set was developed by LDC and contains 210 hours of Tunisian Arabic conversational telephone speech, transcripts, English translations, speaker metadata, and documentation. This material constitutes the training, development and test data used in the International Conference on Spoken Language Translation (IWSLT) Dialectal Speech Translation task (2022) and the Dialectal and Low-resource track (2023).

The telephone speech was collected by LDC in 2016-2017 from native speakers of Tunisian Arabic in Tunis. Speakers were recruited to make telephone calls to people in their social networks from a variety of noise conditions and handsets. Transcripts are orthographic following Buckwalter transliteration and cover 175 hours of the collected speech. IPA transcripts were added to a subset of the data. 

All transcribed segments were translated into English.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

*

KAIROS Schema Learning Complex Event Annotation was developed by LDC to support the DARPA KAIROS program. It contains English and Spanish text, audio, video and image data labeled for 93 real-world complex events with event, relation and argument annotations linking to document provenance. Source data was collected from the web; 3,431 root web pages were collected and processed, yielding 1,919 text data files, 24,019 image files, 1,472 video files and 16 audio files.

The DARPA KAIROS (Knowledge-directed Artificial Intelligence Reasoning Over Schemas) program aimed to build technology capable of understanding and reasoning about complex real-world events in order to provide actionable insights to end users. KAIROS systems utilized formal event representations in the form of schema libraries that specified the steps, preconditions and constraints for an open set of complex events; schemas were then used in combination with event extraction to characterize and make predictions about real-world events in a large multilingual, multimedia corpus.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.


Thursday, May 15, 2025

LDC May 2025 Newsletter

New publications:

BOLT CTS CALLFRIEND CALLHOME Mandarin Chinese Audio

BOLT CTS CALLFRIEND CALLHOME Mandarin Chinese Transcripts and Translations

___________________________________________________________

New publications:

BOLT CTS CALLFRIEND CALLHOME Mainland Mandarin Chinese Audio was developed by LDC and consists of 93 hours of speech from 236 unscripted telephone conversations between native speakers of the Mandarin Chinese dialect spoken in mainland China. The calls were collected by LDC in the CALLFRIEND and CALLHOME series where participants called family members or close friends and spoke on topics of their choice. Around 60% of the recordings (141 calls) are publicly released for the first time. The remaining 95 recordings were previously published by LDC in various CALLFRIEND, CALLHOME and HUB5 Mandarin datasets. The data is divided into training, development, and evaluation partitions.

The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, conversational telephone speech, text messaging and chat -- in Chinese, Egyptian Arabic and English. The material in this release represents the unannotated Chinese source conversational telephone speech. The telephone data was transcribed, translated, and annotated for various tasks in the BOLT program including word alignment, treebanking, and co-reference.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

*

BOLT CTS CALLFRIEND CALLHOME Mainland Mandarin Chinese Transcripts and Translations contains transcripts and corresponding English translations for the conversational telephone speech in BOLT CTS CALLFRIEND CALLHOME Mandarin Chinese Audio and was developed by LDC to support the DARPA BOLT program. 

Transcribers were required to produce a verbatim transcript of all speech within a file using simplified Chinese orthography and to add minimal markup to capture salient features of the speech. Some transcripts include redactions for potential personally identifying information. All speech data was transcribed and is divided into training, development, and evaluation partitions.

The goal of the BOLT translation task was to translate the Chinese transcripts into fluent English while preserving the meaning present in the original Chinese text. Transcripts in the development and evaluation partitions received first pass and gold standard translations. 89% of the transcripts were translated into English.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee. 

Tuesday, April 15, 2025

LDC April 2025 Newsletter

LDC launches upgraded, mobile-friendly website

Connect with LDC on Bluesky


New publications:
DEFT Spanish Light and Rich ERE Annotation

MATERIAL Kazakh-English Language Pack

____________________________________________________________


LDC launches upgraded, mobile-friendly website
We are pleased to announce the launch of the newly upgraded LDC main website: https://www.ldc.upenn.edu/. Designed with a modern layout, the site now offers an improved experience across all devices. While the LDC Catalog, LDC user accounts, and LDC Submissions are not affected by this upgrade, they are now more accessible than ever from any page on the site. We invite you to explore the website and enjoy a smoother, more intuitive LDC web experience. 

Connect with LDC on Bluesky
In addition to Facebook, X and LinkedIn, you can now connect with LDC on the microblogging platform Bluesky. Follow us today for the latest news, announcements and corpus releases from the Consortium.


New publications:

DEFT Spanish Light and Rich ERE Annotation was developed by LDC and consists of 158 Spanish discussion forum and newswire documents annotated for entities, relations, and events (ERE). Light ERE annotation labels mentions of a target set of entity types, along with the relations and events between and among those entities, including coreference. Rich ERE annotation expands types and tagging in the entities, relations, and events annotation tasks and replaces strict event coreference with a more loosely defined event hopper annotation. The source data consists of Spanish newswire text and Latin American discussion forum data from DEFT Spanish Treebank (LDC2018T01). A total of 128 documents were annotated following Light ERE annotation guidelines, and 154 documents were labeled with Rich ERE annotation, 124 of which were also labeled with Light ERE annotation.
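
The Light and Rich ERE counts above overlap, so they do not simply add up to the 158 documents in the release. The following is a minimal sketch of the arithmetic, assuming that "documents" and "files" in the description count the same units; the function name `ere_breakdown` is hypothetical and is not part of the corpus or its tools.

```python
def ere_breakdown(light: int, rich: int, both: int) -> dict:
    """Split overlapping annotation counts into disjoint groups.

    Hypothetical helper for illustration only; it assumes that every
    item counted in `both` is also counted in `light` and in `rich`.
    """
    light_only = light - both          # Light ERE annotation only
    rich_only = rich - both            # Rich ERE annotation only
    total = light_only + rich_only + both  # inclusion-exclusion total
    return {
        "light_only": light_only,
        "rich_only": rich_only,
        "both": both,
        "total": total,
    }


# Counts as stated in the corpus description.
counts = ere_breakdown(light=128, rich=154, both=124)
print(counts)
```

Under that assumption, 4 documents carry Light ERE annotation only, 30 carry Rich ERE annotation only, and 124 carry both, which accounts for all 158 documents.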

DARPA's Deep Exploration and Filtering of Text (DEFT) program aimed to address remaining capability gaps in state-of-the-art natural language processing technologies related to inference, causal relationships and anomaly detection. LDC supported the DEFT program by collecting, creating and annotating a variety of data sources.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

*

MATERIAL Kazakh-English Language Pack was developed by Appen for the IARPA MATERIAL program and contains 57 hours of Kazakh conversational telephone speech, transcripts, English translations, annotations, and queries. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments. Transcripts cover approximately 17% of the speech files, and all transcripts were translated into English. This release also includes English queries and their relevance annotations.


The MATERIAL program focused on underserved languages with the ultimate goal of building cross-language information retrieval systems to find speech and text content using English search queries.

2025 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.