Linguistic Data Consortium: July 2020

Penn Parsed Corpora of Historical English Now Available From LDC
Fall 2020 LDC Data Scholarship Program

New Publications:
Speech Sentiment Annotations
Penn Parsed Corpora of Historical English
IARPA Babel Javanese Language Pack IARPA-babel402b-v1.0b
BOLT Chinese-English Word Alignment and Tagging -- Conversational Telephone Speech Training
____________________________________________________________

Penn Parsed Corpora of Historical English Now Available From LDC

LDC is pleased to announce that the Penn Parsed Corpora of Historical English (LDC2020T16) – an important community resource for 20 years – is now available for licensing in the LDC Catalog. Developed by University of Pennsylvania researchers in the Linguistics Department under the direction of Professor Anthony Kroch, this data set consists of syntactic annotation of English prose texts from the earliest Middle English documents (1100 CE) up to the period of the First World War (1914 CE), represented in three corpora:

The Penn-Helsinki Parsed Corpus of Middle English, second edition

The Penn-Helsinki Parsed Corpus of Early Modern English

The Penn Parsed Corpus of Modern British English, second edition

This release also includes annotation guidelines and philological information for each corpus, as well as the CorpusSearch 2 program, which allows users to search the data for words, word sequences and syntactic structure.

In addition to being of value to students and scholars of the history of English, this data set is useful to computational linguists for domain adaptation. More information about this project is available from the Penn Parsed Corpora of Historical English homepage.

Current licensees should contact LDC’s membership office with any questions regarding access to this data set.

Fall 2020 LDC Data Scholarship Program

Student applications for the Fall 2020 LDC Data Scholarship program are being accepted now through September 15, 2020. This scholarship program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and letter of support from their advisor.

For application requirements and program rules, please visit the LDC Data Scholarship page.
____________________________________________________________

New Publications:

(1) Speech Sentiment Annotations was developed by Google Inc. and consists of sentiment labels (positive, negative, neutral) for approximately 49,500 utterances covering 140 hours of audio from Switchboard-1 Release 2 (LDC97S62).

Switchboard speech files were segmented based on the start and end time of transcript turns. Annotators listened to the audio corresponding to each segment (utterance) and classified each into positive, negative or neutral categories based on the emotion and attitude of the speaker. Annotators provided a justification for positive and negative classifications using a flow chart. Further information about the methodology and annotation process is contained in the documentation accompanying this release.
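The turn-based segmentation described above could be sketched roughly as follows. This is a minimal illustration, not the procedure documented in the release: the `Turn` record, the `segment_samples` function and the in-memory list of samples are all hypothetical names chosen for the example, and the rounding of turn times to sample indices is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One transcript turn (hypothetical record layout)."""
    start: float  # turn start time in seconds
    end: float    # turn end time in seconds
    text: str     # transcript text for the turn


def segment_samples(samples, sample_rate, turns):
    """Cut one utterance-level segment per transcript turn.

    `samples` is a sequence of audio samples for one channel and
    `sample_rate` is in samples per second. Turn boundaries are
    rounded to the nearest sample and clamped to the audio length
    (both choices are assumptions made for this sketch).
    """
    segments = []
    for turn in turns:
        lo = max(0, int(round(turn.start * sample_rate)))
        hi = min(len(samples), int(round(turn.end * sample_rate)))
        segments.append(samples[lo:hi])
    return segments


# Toy usage: 4 seconds of fake audio at 10 samples per second.
audio = list(range(40))
turns = [Turn(0.0, 1.5, "hello"), Turn(1.5, 4.0, "hi there")]
utterances = segment_samples(audio, 10, turns)
```

Each resulting segment would then be the unit an annotator listens to and labels as positive, negative or neutral.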

Switchboard-1 Release 2 (LDC97S62) consists of 260 hours of telephone speech from 543 speakers across the United States (302 male speakers, 241 female speakers). A computer-driven telephone collection platform paired two subjects for each conversation and provided a discussion topic, ensuring that no two speakers conversed together more than once and no one speaker talked more than once on a given topic.

Speech Sentiment Annotations is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) Penn Parsed Corpora of Historical English was developed at the University of Pennsylvania and consists of running texts and text samples of British English prose from the earliest Middle English documents (1100 CE) up to the period of the First World War (1914 CE). This data set contains three corpora covering traditionally recognized periods of English:

The Penn-Helsinki Parsed Corpus of Middle English, second edition
The Penn-Helsinki Parsed Corpus of Early Modern English
The Penn Parsed Corpus of Modern British English, second edition

The texts are in three forms: plain text, part-of-speech tagged text, and syntactically annotated text. This release also includes annotation guidelines, philological information for each corpus and the CorpusSearch 2 program, which allows users to search the data for words, word sequences and syntactic structure.

The Penn Parsed Corpora of Historical English were designed for students and scholars of the history of English, especially the historical syntax of the language. They have also been used by computational linguists for domain adaptation. See the Penn Parsed Corpora of Historical English homepage for more information about this project.

Penn Parsed Corpora of Historical English is distributed via web download.

2020 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) IARPA Babel Javanese Language Pack IARPA-babel402b-v1.0b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 204 hours of Javanese conversational and scripted telephone speech collected in 2014 and 2015 along with corresponding transcripts.

The Javanese speech in this release represents the Central, Western, and Eastern Javanese dialect regions of Indonesia. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 65 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Javanese Language Pack IARPA-babel402b-v1.0b is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(4) BOLT Chinese-English Word Alignment and Tagging -- Conversational Telephone Speech Training was developed by LDC and consists of 158,651 words of Chinese and English parallel text enhanced with linguistic tags to indicate word relations.

The source data in this release consists of transcripts of Chinese conversational telephone speech (CTS) from LDC's CALLHOME and CALLFRIEND collections (LDC96S34, LDC96T16, LDC96S55) that were translated into English by professional translation agencies and annotated for the word alignment task.

The BOLT word alignment task was built on treebank annotation. LDC automatically extracted Chinese source tokens, including empty categories/traces, from word-segmented files provided by the BOLT Chinese Treebank annotation team at Brandeis University. The word-segmented tokens were then used to automatically generate ctb (Chinese Treebank) alignment and were also tokenized for character alignment by inserting white spaces to separate characters.
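The character tokenization step mentioned above, inserting white space between the characters of word-segmented tokens, might look like the sketch below. It is only an illustration under stated assumptions: the function name `char_tokenize` is hypothetical, and the rule that tokens beginning with `*` are empty categories/traces to be kept whole is an assumption about the markup, not something stated in this announcement.

```python
def char_tokenize(tokens):
    """Turn word-segmented tokens into a space-separated character string.

    Tokens that start with "*" are treated as empty categories/traces
    and kept intact (an assumed convention for this sketch); every
    other token is split into its individual characters.
    """
    pieces = []
    for token in tokens:
        if token.startswith("*"):
            pieces.append(token)   # keep the trace marker as one unit
        else:
            pieces.extend(token)   # one entry per character
    return " ".join(pieces)


# Toy usage with a word-segmented sentence containing one trace.
words = ["我们", "*pro*", "喜欢", "北京"]
print(char_tokenize(words))  # 我 们 *pro* 喜 欢 北 京
```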

BOLT Chinese-English Word Alignment and Tagging -- Conversational Telephone Speech Training is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Linguistic Data Consortium

Wednesday, July 15, 2020
