Linguistic Data Consortium: June 2021

Tuesday, June 15, 2021

LDC June 2021 Newsletter

LDC data and commercial technology development

New publications:
MyST Children’s Conversational Speech
BOLT Egyptian Arabic Treebank – Conversational Telephone Speech

LDC data and commercial technology development
For-profit organizations are reminded that LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:
(1) MyST Children’s Conversational Speech was developed by Boulder Learning Inc. It contains 470 hours of English speech from 1,371 students in grades 3-5 conversing with a virtual science tutor in eight areas of science instruction, along with transcripts and a pronunciation dictionary. Data was collected in two phases between 2008 and 2017. Spoken dialogs with the virtual tutor were aligned to classroom instruction using the Full Option Science System, a research-based science curriculum for grades K-8. Students conversed with the virtual science tutor for 15-20 minutes. The tutor asked open-ended questions about media presented on-screen, and students produced spoken answers.

Data was collected in 10,496 sessions for a total of 227,567 utterances. Approximately 45% of those utterances (102,433) were transcribed. Data is divided into development, test, and training partitions for use with ASR systems.

MyST Children’s Conversational Speech is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) BOLT Egyptian Arabic Treebank – Conversational Telephone Speech was developed by LDC and consists of Egyptian Arabic conversational telephone speech data with part-of-speech annotation, morphology, gloss, and syntactic tree annotation.

This release contains 153,171 tokens before clitics were split and 182,965 tree tokens after clitics were split for treebank annotation. The source data was selected from conversational telephone speech that LDC collected for the CALLHOME project and that was transcribed and segmented into sentence units.

Annotations follow Penn Arabic Treebank guidelines, which consist of (a) part-of-speech tagging, which divides the text into lexical tokens and gives relevant information about each token such as lexical category, inflectional features, and a gloss; and (b) Arabic treebanking, which characterizes the constituent structures of word sequences, provides categories for each non-terminal node, and identifies null elements, co-reference, traces, and so on.

The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging, and chat -- in Chinese, Egyptian Arabic, and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking, and co-reference.

BOLT Egyptian Arabic Treebank – Conversational Telephone Speech is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.