Linguistic Data Consortium: historical English text

Tuesday, July 15, 2025

LDC July 2025 Newsletter

Fall 2025 LDC data scholarship program

New publications:

AnnoDIFP Session Audio and Transcripts
Penn Parsed Corpora of Historical English Second Release
LoReHLT Uzbek Representative Language Pack
_________________________________________________________________________

Fall 2025 LDC data scholarship program
Student applications for the Fall 2025 LDC data scholarship program are being accepted now through September 15, 2025. This program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and letter of support from their advisor. For application requirements and program rules, visit the LDC Data Scholarships page.

New publications:
AnnoDIFP (Annotated Data for the Investigation of Facets of Personality) Session Audio and Transcripts was developed by LDC, the Florida Institute of Technology (FIT), and the University of New Haven (UNH) to support algorithm development for predicting personality traits. It contains 438.34 hours of English audio and transcripts from in-person interviews of 366 participants paired with scores from two self-reported personality assessments, HEXACO Personality Inventory (Revised) (HEXACO-PI-R) and Short Dark Triad (SD3).

In-person interviews were recorded at LDC, FIT and UNH. In each session, the participant and interviewer were in separate sound-isolated rooms with communication between them supplied by audio/video hardware. Sessions consisted of the following tasks: rapport building, a YouTube task, a map task, and a business task. Further details on collection methodology and session tasks are contained in the documentation accompanying this release.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Penn Parsed Corpora of Historical English Second Release was developed at the University of Pennsylvania and consists of running texts and text samples of British English prose from the earliest Middle English documents (1100 CE) up to the period of the First World War (1914 CE). This second release corrects errors and inconsistencies in Penn Parsed Corpora of Historical English (LDC2020T16), further streamlines annotation, simplifies the directory structure, and includes updated documentation.

This data set contains three corpora covering traditionally recognized periods of English:

The Penn-Helsinki Parsed Corpus of Middle English, second edition

The Penn-Helsinki Parsed Corpus of Early Modern English

The Penn Parsed Corpus of Modern British English, second edition

The texts are in two forms: part-of-speech tagged text and syntactically annotated text. Annotations were manually reviewed for accuracy and consistency. Included in this release are updated annotation guidelines, philological information for each corpus and the CorpusSearch 2 program, which allows users to search the data for words, word sequences and syntactic structure.

2025 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.

LoReHLT Uzbek Representative Language Pack was developed by LDC and comprises approximately 47 million words of Uzbek monolingual text, 563,000 words of found Uzbek-English parallel text, 100,000 Uzbek words translated from English data, and 6.4 hours of Uzbek broadcast news and amateur web audio recordings. Approximately 151,000 words were annotated for named entities and over 28,000 words were annotated for full entity, including nominals and pronouns. Noun-phrase chunking was applied to more than 13,000 words. Over 20,890 words were labeled with simple semantic annotation. Topic annotation was applied to the audio recordings. Data was collected from discussion forums, news, reference, social network, broadcast news, web audio recordings, and weblogs.

LoReHLT was a companion project of the DARPA LORELEI program. The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Wednesday, July 15, 2020

LDC 2020 July Newsletter

Penn Parsed Corpora of Historical English Now Available From LDC
Fall 2020 LDC Data Scholarship Program

New Publications:
Speech Sentiment Annotations
Penn Parsed Corpora of Historical English
IARPA Babel Javanese Language Pack IARPA-babel402b-v1.0b
BOLT Chinese-English Word Alignment and Tagging -- Conversational Telephone Speech Training
____________________________________________________________

Penn Parsed Corpora of Historical English Now Available From LDC

LDC is pleased to announce that the Penn Parsed Corpora of Historical English (LDC2020T16) – an important community resource for 20 years – is now available for licensing in the LDC Catalog. Developed by University of Pennsylvania researchers in the Linguistics Department under the direction of Professor Anthony Kroch, this data set consists of syntactic annotation of English prose texts from the earliest Middle English documents (1100 CE) up to the period of the First World War (1914 CE) represented in three corpora:

The Penn-Helsinki Parsed Corpus of Middle English, second edition

The Penn-Helsinki Parsed Corpus of Early Modern English

The Penn Parsed Corpus of Modern British English, second edition

This release also includes annotation guidelines and philological information for each corpus as well as the CorpusSearch 2 program which allows users to search the data for words, word sequences and syntactic structure.

In addition to being of value to students and scholars of the history of English, this data set is useful to computational linguists for domain adaptation. More information about this project is available from the Penn Parsed Corpora of Historical English homepage.

Current licensees should contact LDC’s membership office with any questions regarding access to this data set.

Fall 2020 LDC Data Scholarship Program

Student applications for the Fall 2020 LDC Data Scholarship program are being accepted now through September 15, 2020. This scholarship program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and letter of support from their advisor.

For application requirements and program rules, please visit the LDC Data Scholarship page.
____________________________________________________________

New publications:

(1) Speech Sentiment Annotations was developed by Google Inc. and consists of sentiment labels (positive, negative, neutral) for approximately 49,500 utterances covering 140 hours of audio from Switchboard-1 Release 2 (LDC97S62).

Switchboard speech files were segmented based on the start and end time of transcript turns. Annotators listened to the audio corresponding to each segment (utterance) and classified each into positive, negative or neutral categories based on the emotion and attitude of the speaker. Annotators provided a justification for positive and negative classifications using a flow chart. Further information about the methodology and annotation process is contained in the documentation accompanying this release.
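As a rough illustration of segmenting by turn times, the Python sketch below cuts a sequence of audio samples into utterances using the start and end time of each transcript turn. The Turn class, the function names, and the 8 kHz default sample rate are assumptions made for this example; they are not taken from the release documentation.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Turn:
    """One transcript turn (hypothetical structure for this sketch)."""
    start: float  # turn start time in seconds
    end: float    # turn end time in seconds
    text: str     # transcript text of the turn


def turn_to_sample_range(turn: Turn, sample_rate: int = 8000) -> Tuple[int, int]:
    """Convert a turn's start/end times into sample indices.

    The 8000 Hz default is an assumption; use the actual rate of the audio.
    """
    first = int(round(turn.start * sample_rate))
    last = int(round(turn.end * sample_rate))
    return first, last


def segment_by_turns(samples: Sequence, turns: List[Turn],
                     sample_rate: int = 8000) -> List[Sequence]:
    """Return one slice of `samples` per transcript turn."""
    segments = []
    for turn in turns:
        first, last = turn_to_sample_range(turn, sample_rate)
        segments.append(samples[first:last])
    return segments


# Two seconds of placeholder "audio" and two turns.
audio = list(range(16000))
turns = [Turn(0.0, 0.5, "hi"), Turn(0.5, 1.25, "yes")]
utterances = segment_by_turns(audio, turns)
```

Each resulting slice corresponds to one utterance that an annotator would listen to and label.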

Switchboard-1 Release 2 (LDC97S62) consists of 260 hours of telephone speech from 543 speakers across the United States (302 male speakers, 241 female speakers). A computer-driven telephone collection platform paired two subjects for each conversation and provided a discussion topic, ensuring that no two speakers conversed together more than once and no one speaker talked more than once on a given topic.

Speech Sentiment Annotations is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) Penn Parsed Corpora of Historical English was developed at the University of Pennsylvania and consists of running texts and text samples of British English prose from the earliest Middle English documents (1100 CE) up to the period of the First World War (1914 CE). This data set contains three corpora covering traditionally recognized periods of English:

The Penn-Helsinki Parsed Corpus of Middle English, second edition
The Penn-Helsinki Parsed Corpus of Early Modern English
The Penn Parsed Corpus of Modern British English, second edition

The texts are in three forms: plain text, part-of-speech tagged text, and syntactically annotated text. This release also includes annotation guidelines, philological information for each corpus and the CorpusSearch 2 program, which allows users to search the data for words, word sequences and syntactic structure.

The Penn Parsed Corpora of Historical English were designed for students and scholars of the history of English, especially the historical syntax of the language. They have also been used by computational linguists for domain adaptation. See the Penn Parsed Corpora of Historical English homepage for more information about this project.

Penn Parsed Corpora of Historical English is distributed via web download.

2020 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) IARPA Babel Javanese Language Pack IARPA-babel402b-v1.0b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 204 hours of Javanese conversational and scripted telephone speech collected in 2014 and 2015 along with corresponding transcripts.

The Javanese speech in this release represents the Central, Western, and Eastern Javanese dialect regions of Indonesia. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 65 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Javanese Language Pack IARPA-babel402b-v1.0b is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(4) BOLT Chinese-English Word Alignment and Tagging -- Conversational Telephone Speech Training was developed by LDC and consists of 158,651 words of Chinese and English parallel text enhanced with linguistic tags to indicate word relations.

The source data in this release consists of transcripts of Chinese conversational telephone speech (CTS) from LDC's CALLHOME and CALLFRIEND collections (LDC96S34, LDC96T16, LDC96S55) that were translated into English by professional translation agencies and annotated for the word alignment task.

The BOLT word alignment task was built on treebank annotation. LDC automatically extracted Chinese source tokens, including empty categories/traces, from word-segmented files provided by the BOLT Chinese Treebank annotation team at Brandeis University. The word-segmented tokens were then used to automatically generate ctb (Chinese Treebank) alignment and were also tokenized for character alignment by inserting white spaces to separate characters.
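The character tokenization step can be pictured with a short Python sketch that takes a word-segmented line and inserts white space between the characters of each word. The function name is hypothetical, and keeping tokens that begin with "*" whole (as a stand-in for empty categories/traces) is an assumption of this example; the newsletter does not say how such tokens were handled.

```python
def to_character_tokens(segmented_line: str) -> str:
    """Split each word-segmented token into single characters.

    Tokens starting with "*" are kept whole. That convention is an
    assumption of this sketch, not a documented rule of the corpus.
    """
    pieces = []
    for token in segmented_line.split():
        if token.startswith("*"):
            pieces.append(token)   # keep marker-like tokens intact
        else:
            pieces.extend(token)   # one entry per character
    return " ".join(pieces)


word_segmented = "我们 喜欢 *pro* 语言学"
char_tokenized = to_character_tokens(word_segmented)
```

The word-segmented form and the character-tokenized form can then be aligned to the English side separately.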

BOLT Chinese-English Word Alignment and Tagging -- Conversational Telephone Speech Training is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.