Wednesday, September 15, 2021

LDC September 2021 Newsletter

New Publications:

RATS Speaker Identification

Classical Arabic Dictionary

DiscAlign for Penn and RST Discourse Treebanks

_________________________________________________________________

 

New publications:

(1) RATS Speaker Identification was developed by LDC and is comprised of approximately 1,900 hours of Levantine Arabic, Farsi, Dari, Pashto and Urdu conversational telephone speech with annotations of speech segments. The audio was retransmitted over eight channels, for 17,000 hours of total speech. The corpus was created to provide training and development sets for the speaker identification task in the DARPA RATS (Robust Automatic Transcription of Speech) program.   


The source audio consists of conversational telephone speech recordings collected by LDC specifically for the RATS program from Levantine Arabic, Pashto, Urdu, Farsi and Dari native speakers. Annotations on the audio files include start time, end time, speech activity detection (SAD) label, SAD provenance, speaker ID, speaker ID provenance, language ID, and language ID provenance. 


RATS Speaker Identification is distributed via web download.


2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.


*

(2) Classical Arabic Dictionary consists of approximately one hundred million words of Arabic collected from texts dating between 431 and 1104 CE, principally books and essays, along with word occurrences, source documents and related metadata.


The dictionary is presented in three formats: plain text in UTF-8 encoding, plain text in CP1256 encoding, and as an SQL database file. Source documents are presented in UTF-8 and CP1256 encodings.
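
For readers who need only one of the two encodings, converting between them is straightforward. The following is a minimal Python sketch, not part of the release: the helper name `cp1256_to_utf8` and the sample word are illustrative assumptions, and it works on in-memory bytes rather than on any actual corpus file.

```python
# Hypothetical helper (not part of the corpus): re-encode CP1256 bytes as UTF-8.
def cp1256_to_utf8(raw: bytes) -> bytes:
    """Decode Windows-1256 (CP1256) bytes and re-encode them as UTF-8."""
    return raw.decode("cp1256").encode("utf-8")


# Illustrative four-letter Arabic word, written with escapes to avoid
# bidirectional-display problems in editors.
sample_word = "\u0643\u062a\u0627\u0628"

cp1256_bytes = sample_word.encode("cp1256")   # one byte per letter
utf8_bytes = cp1256_to_utf8(cp1256_bytes)     # two bytes per letter

print(len(cp1256_bytes), len(utf8_bytes))
```

CP1256 stores each Arabic letter in a single byte, while UTF-8 uses two bytes for characters in the basic Arabic block, so the same text roughly doubles in size when converted to UTF-8.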


Classical Arabic Dictionary is distributed via web download.


2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.


*

(3) DiscAlign for Penn and RST Discourse Treebanks was developed by Saarland University. It consists of alignment information for the discourse annotations contained in Penn Discourse Treebank Version 2.0 (LDC2008T05) (PDTB 2.0) and RST Discourse Treebank (LDC2002T07) (RST-DT). PDTB 2.0 and RST-DT annotations overlap for 385 newspaper articles in sections 6, 11, 13, 19 and 23 of the Wall Street Journal corpus contained in Treebank-2 (LDC95T7). DiscAlign for Penn and RST Discourse Treebanks contains approximately 6,700 alignments between PDTB 2.0 and RST-DT relations. 


DiscAlign for Penn and RST Discourse Treebanks is available at no cost to all licensees of PDTB 2.0 and RST-DT and appears in their download queues associated with these corpora as DiscAlign_Penn_RST_DTB_LDC2021T16.zip.

Monday, August 16, 2021

LDC August 2021 Newsletter

LDC at Interspeech 2021  

Fall 2021 LDC Data Scholarship Program 

New Publications:
Wikipedia Spanish Speech and Transcripts 
BOLT Egyptian Arabic SMS/Chat Parallel Training Data


LDC at Interspeech 2021  

LDC will be exhibiting at Interspeech 2021, held this year August 30 - September 3 in a hybrid in-person and virtual format. Stop by our digital booth for a look at a selection of documents and videos describing recent developments at the Consortium and new publications. You can also contact us through the conference platform to schedule a chat session.

We’ll be hosting a live virtual video event highlighting LDC’s recent speech publications during the conference. Stay tuned for scheduling information to come!

LDC work will be featured in the following conference sessions: 

Fearless Steps Challenge Phase-3 (FSC P3): Advancing SLT for Unseen Channel and Mission Data Across NASA Apollo Audio
Tuesday, August 31, 20:00 
Session: In-person Oral: ASR Technologies and systems 19:00-21:00

Using Games to Augment Corpora for Language Recognition and Confusability
Wednesday, September 1, 16:20-16:40 
Session: In-person Oral: Speaker, Language, and Privacy 16:00-18:00

The Third DIHARD Diarization Challenge
Thursday, September 2, 16:00 
Session: Virtual: Speaker Diarization II 16:00-18:00

LDC will post conference links and updates via our Twitter feed and Facebook page. We hope to “see” you at Interspeech 2021!

Fall 2021 LDC Data Scholarship Program 
Student applications for the Fall 2021 LDC Data Scholarship program are being accepted now through September 15, 2021. This program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and letter of support from their advisor.  

For application requirements and program rules, visit the LDC Data Scholarship page.


New publications:

(1) Wikipedia Spanish Speech and Transcripts consists of approximately 25 hours of Spanish read speech from Wikipedia Grabada, the Spanish version of WikiProject Spoken Wikipedia, and corresponding transcripts. Speakers (150 male, 43 female) read Wikipedia articles; the audio files were segmented and transcribed by native Spanish speakers. Speaker metadata is included in this release.

Wikipedia Spanish Speech and Transcripts is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) BOLT Egyptian Arabic SMS/Chat Parallel Training Data was developed by LDC and consists of approximately 723,000 tokens of Egyptian Arabic SMS/Chat data collected for the DARPA BOLT program along with their corresponding English translations.

The source data was manually reviewed to exclude any messages/conversations that were not in the target language or that had sensitive content, such as personal identifying information.

Data was manually selected for translation. Messages/conversations were arranged in chronological order, segmented into sentence units and assigned to translation vendors. Translators followed LDC's BOLT translation guidelines.

BOLT Egyptian Arabic SMS/Chat Parallel Training Data is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

 

Thursday, July 15, 2021

LDC July 2021 Newsletter

LDC Submissions: a new platform for sharing data through LDC 

Fall 2021 LDC Data Scholarship Program 

New Publications:
Ethnobotanical Research and Language Documentation of Nahuatl
Chinese Abstract Meaning Representation 2.0
BOLT Egyptian Arabic Co-reference – Discussion Forum, SMS/Chat, and Conversational Telephone Speech


LDC Submissions: a new platform for sharing data through LDC 
LDC is pleased to announce the launch of LDC Submissions, a platform that provides infrastructure and resources for sharing data through the Catalog. After registering for a user account, corpus submitters can create a submission, upload files, and communicate with LDC’s publications team during the review process. After all reviews are complete, the final, release-ready version of your data set is uploaded to the platform and enters the publications queue. 

Sharing your corpus through LDC ensures access to the global research community and the permanent preservation of your data according to best practices for archiving digital language resources. Get started and register for an LDC Submissions user account today.

Fall 2021 LDC Data Scholarship Program 
Student applications for the Fall 2021 LDC Data Scholarship program are being accepted now through September 15, 2021. This program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and letter of support from their advisor.  

For application requirements and program rules, visit the LDC Data Scholarship page.


New publications:
(1) Ethnobotanical Research and Language Documentation of Nahuatl consists of approximately 190 hours of field recordings collected in the Sierra Nororiental and Sierra Norte regions of Puebla, Mexico. The corpus contains audio and video recordings of native Nahuatl speakers during the collection of particular plants; partial transcripts (Nahuatl and Spanish); a Highland Puebla Nahuat dictionary; botanical and ethnobotanical data; and speaker metadata.

Nahuatl is one of the most widely spoken indigenous languages in the Americas with approximately 1.5 million speakers in Mexico. Many distinct and sometimes mutually unintelligible varieties have been recognized. The recordings in this release were collected between 2008 and 2019 in two different municipalities: Cuetzalan del Progreso and Tepetzintla. Speech from Cuetzalan represents Highland Puebla Nahuat, and speech from Tepetzintla represents Zacatlán-Ahuacatlán-Tepetzintla Nahuatl.

The recordings consist of a speaker talking about a plant's nomenclature, classification, and use. Transcripts are included for the Cuetzalan recordings; these transcripts have been partially translated into Spanish. A Highland Puebla Nahuat dictionary is included in both text and Toolbox XML formats. Botanical and ethnobotanical information is presented as a collection of PDF files, and images as JPEG files.

Ethnobotanical Research and Language Documentation of Nahuatl is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2)  Chinese Abstract Meaning Representation 2.0 was developed by Brandeis University and Nanjing Normal University and is comprised of semantic representations of a set of approximately 20,000 Chinese sentences from Chinese Treebank (CTB) 8.0 (LDC2013T21). CAMR 2.0 includes the content of Chinese Abstract Meaning Representation 1.0 (LDC2019T07) (CTB 8.0 weblog and discussion forum sentences), plus an additional 9,933 sentences from the newswire portion of CTB 8.0.

Abstract Meaning Representation (AMR) captures "who is doing what to whom" in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree structure. Chinese AMR is constructed following the basic principles developed for English: a compact, readable, whole-sentence semantic representation, while making adaptations where necessary to handle Chinese-specific phenomena.

The corpus contains 20,078 sentences from the weblog, discussion forum, and newswire portions of CTB 8.0. Three sets of files are included: the original Chinese AMR data with concept-to-word and relation-to-word alignments, a converted English AMR format, and a Chinese syntactic dependency tree format. Each set is divided into training, development and test sets.

Chinese Abstract Meaning Representation 2.0 is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) BOLT Egyptian Arabic Co-reference – Discussion Forum, SMS/Chat, and Conversational Telephone Speech was developed by Raytheon BBN Technologies. Co-reference annotation aims to fill in the connections between specific mentions in the text that refer to the same entities and events in the discourse context. BOLT co-reference annotation was performed on BOLT treebank annotation. It covers noun phrases (including proper nouns, nominals, pronouns and null arguments), possessives, proper noun pre-modifiers, and verbs.

The source discussion forum data and SMS/Chat data was collected by LDC for the DARPA BOLT program. The telephone data was taken from LDC's Egyptian Arabic CALLHOME and CALLFRIEND telephone collections.

BOLT Egyptian Arabic Co-reference – Discussion Forum, SMS/Chat, and Conversational Telephone Speech is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.


Tuesday, June 15, 2021

LDC June 2021 Newsletter

LDC data and commercial technology development 

New Publications:
MyST Children’s Conversational Speech
BOLT Egyptian Arabic Treebank – Conversational Telephone Speech


LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.


New publications:
(1) MyST Children’s Conversational Speech was developed by Boulder Learning Inc. It contains 470 hours of English speech from 1371 students in grades 3-5 conversing with a virtual science tutor in eight areas of science instruction, along with transcripts and a pronunciation dictionary. Data was collected in two phases between 2008 and 2017. Spoken dialogs with the virtual tutor were aligned to classroom instruction using the Full Option Science System, a research-based science curriculum for grades K-8. Students conversed with the virtual science tutor for 15-20 minutes. The tutor asked open-ended questions about media presented on-screen, and students produced spoken answers. 

Data was collected in 10,496 sessions for a total of 227,567 utterances. Approximately 45% of those utterances (102,433) were transcribed. Data is divided into development, test, and train partitions for use with ASR systems.

MyST Children’s Conversational Speech is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) BOLT Egyptian Arabic Treebank – Conversational Telephone Speech was developed by LDC and consists of Egyptian Arabic conversational telephone speech data with part-of-speech annotation, morphology, gloss, and syntactic tree annotation. 

This release contains 153,171 tokens before clitics were split and 182,965 tree tokens after clitics were split for treebank annotation. The source data was selected from conversational telephone speech collected by LDC for the CALLHOME project that was transcribed and segmented into sentence units.

Annotations follow Penn Arabic Treebank guidelines which consist of: (a) part-of-speech tagging that divides the text into lexical tokens and gives relevant information about each token such as lexical category, inflectional features, and a gloss; and (b) Arabic treebanking, which characterizes the constituent structures of word sequences, provides categories for each non-terminal node, and identifies null elements, co-reference, traces, and so on.

The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging, and chat -- in Chinese, Egyptian Arabic, and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking, and co-reference.

BOLT Egyptian Arabic Treebank – Conversational Telephone Speech is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Monday, May 17, 2021

LDC May 2021 Newsletter

LDC at ICASSP 2021 

New Publications:
The SSNCE Database of Tamil Dysarthric Speech
ESPADA
BOLT Chinese SMS/Chat Parallel Training Data


LDC at ICASSP 2021
LDC will be exhibiting at ICASSP 2021, held virtually this year June 6-11. Stop by our digital booth June 8-10 to learn more about recent developments at the Consortium and new publications.

Also, check out the following poster featuring LDC work:

Probing Acoustic Representations for Phonetic Properties
Wednesday, June 9, 14:00 - 14:45
Session: AUD-11: Auditory Modeling and Hearing Instruments

LDC will post conference links and updates via our Twitter feed and Facebook page. We hope to “see” you there!


New publications:


(1) The SSNCE Database of Tamil Dysarthric Speech was developed by the Speech Lab, SSN College of Engineering, India, in collaboration with the Indian National Institute of Empowerment of Persons with Multiple Disabilities (NIEPMD) and contains approximately eight hours of Tamil speech data, time-aligned transcripts and metadata collected from 30 speakers (20 dysarthric speakers and 10 non-dysarthric speakers).

The speech data was collected between 2015 and 2017 in two sessions at NIEPMD. Each speaker recorded 365 utterances consisting of single words and of sentences that included a combination of common and uncommon Tamil phrases. The non-dysarthric speakers were five female and five male subjects. The dysarthric speakers (7 female, 13 male) reported a diagnosis of cerebral palsy and ranged in age from 12 years old to 37 years old. 

Dysarthria is a speech disorder caused by muscle weakness which can result in slowed and slurred speech that is difficult to understand. Common causes of dysarthria include nervous system disorders and conditions that cause facial paralysis or tongue or throat muscle weakness.

The SSNCE Database of Tamil Dysarthric Speech is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.


*

(2) ESPADA (Extended Syntactic Phrase Alignment DAtaset) consists of annotated parse trees and alignment on English sentential paraphrases from NIST’s OpenMT evaluation corpora. It extends SPADE (LDC2018T09) by adding new annotated data for training/testing phrasal paraphrase detection and phrase representation models to SPADE's development and test sets. Gold standard annotations of HPSG (head-driven phrase structure grammar) trees and phrase alignments were performed, resulting in 251,972 phrase alignments identified in 1,916 sentential paraphrases.

ESPADA is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*


(3) BOLT Chinese SMS/Chat Parallel Training Data was developed by LDC and consists of approximately 1.8 million tokens of Chinese SMS/Chat data and their corresponding English translations.

The source data was donated or collected by LDC via live platforms. Data was manually selected for translation. Messages/conversations were arranged in chronological order, segmented into sentence units (all or portions of message threads depending on their length), and assigned to translation vendors. Translators followed LDC's BOLT translation guidelines.

The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking, and co-reference.

BOLT Chinese SMS/Chat Parallel Training Data is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Thursday, April 15, 2021

LDC April 2021 Newsletter

New Publications:

X-SRL: Parallel Cross-lingual Semantic Role Labeling
TAC KBP English Sentiment Slot Filling – Comprehensive Training and Evaluation Data 2013-2014
_____________________________________________________________________________

New Publications:

(1) X-SRL: Parallel Cross-lingual Semantic Role Labeling was developed by Heidelberg University, Department of Computational Linguistics and the Leibniz Institute for the German Language (IDS). It consists of approximately three million words of German, French and Spanish annotated for semantic role labeling. The texts are translations of the English portion of 2009 CoNLL Shared Task Part 2 (LDC2012T04). All sentences have annotations for verbal predicates and share the original English Propbank label set across the four languages.

The 2009 CoNLL Shared Task developed syntactic dependency annotations, including the semantic dependency model roles of both verbal and nominal predicates. The following English data was used in the shared task:

For X-SRL, the English source data was automatically translated using DeepL. Automatic tokenization, lemmatization, part-of-speech tagging and syntactic parsing were then applied to the text. The data was divided into train, development and test partitions. Semantic labels were transferred for the train and development sections, and the test sentences were validated for translation quality, alignment, label transfer, and filtering.

X-SRL: Parallel Cross-lingual Semantic Role Labeling is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) TAC KBP English Sentiment Slot Filling – Comprehensive Training and Evaluation Data 2013-2014 was developed by LDC and contains training and evaluation data produced in support of the 2013 and 2014 TAC KBP Sentiment Slot Filling tracks. The data in this release includes queries, manual runs (human-produced query responses), and assessment results for human- and system-produced query responses. Source data was English news and web text.

The regular English Slot Filling track involved mining information about entities from text using a specified set of "slots", or attributes. The goal of the Sentiment Slot Filling task was to evaluate the quality of detectors for positive and negative sentiment.

TAC KBP English Sentiment Slot Filling – Comprehensive Training and Evaluation Data 2013-2014 is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Monday, March 15, 2021

LDC March 2021 Newsletter

LDC data and commercial technology development 

New Publications:
Columbia Games Corpus
Global TIMIT Mandarin Chinese
BOLT Chinese Co-reference – Discussion Forum, SMS/Chat, and Conversational Telephone Speech

_________________________________________________________________________


LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.



New publications:

(1) Columbia Games Corpus was developed by the Spoken Language Group, Columbia University and the Department of Linguistics, Northwestern University. It consists of approximately 10 hours of spontaneous English conversation from 13 subjects playing a series of computer games that required verbal communication to achieve joint goals of identifying and moving images on the screen to reach a combined number of points. This publication also includes corresponding manually time-aligned orthographic transcripts and annotation marking discourse and turn-taking.

Columbia Games Corpus is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee. 


*

(2) Global TIMIT Mandarin Chinese was developed by LDC and Shanghai Jiao Tong University and consists of five hours of read speech from Chinese Gigaword Fifth Edition (LDC2011T13) with corresponding transcripts. Each of the fifty speakers read 120 sentences; specifically, 20 sentences were read by all speakers, 40 sentences were read by 10 speakers, and 60 sentences were read by one speaker, for a total of 3,220 sentence types.
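
The sentence-type total can be checked with a short calculation. This is a sketch under one stated assumption that the announcement implies but does not spell out: each speaker's 120 sentences split into 20 read by everyone, 40 drawn from a pool in which every sentence is read by 10 speakers, and 60 read by that speaker alone. The variable names are illustrative.

```python
# Assumed design: every speaker reads 20 + 40 + 60 = 120 sentences.
speakers = 50
common_types = 20            # sentences read by all speakers
shared_per_speaker = 40      # per speaker; each such sentence has 10 readers
readers_per_shared = 10
unique_per_speaker = 60      # per speaker; each such sentence has 1 reader

# Each shared sentence type is counted once, although 10 speakers read it.
shared_types = speakers * shared_per_speaker // readers_per_shared
unique_types = speakers * unique_per_speaker

total_types = common_types + shared_types + unique_types
print(total_types)  # 3220
```

Under this reading, 20 common, 200 shared, and 3,000 single-speaker sentences add up to the stated 3,220 sentence types.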

The corpus was recorded at Shanghai Jiao Tong University, China. Speakers (25 female, 25 male) were students at the university and had achieved Class 2 Level 1 or better on Putonghua Shuiping Ceshi (the national standard Mandarin proficiency test).

The Global TIMIT project aimed to create a series of corpora in a variety of languages with a similar set of key features as in the original TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1), which was designed for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems.

Global TIMIT Mandarin Chinese is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.


*


(3) BOLT Chinese Co-reference – Discussion Forum, SMS/Chat, and Conversational Telephone Speech was developed by Raytheon BBN Technologies and consists of co-reference annotation on Chinese informal text. 

Co-reference annotation aims to fill in connections between specific mentions in the text that refer to the same entities and events in the discourse context. BOLT co-reference annotation was performed on BOLT treebank annotation (i.e., Chinese Treebank 9.0 (LDC2016T13)) and covers noun phrases (including proper nouns, nominals, pronouns, and null arguments), possessives, proper noun pre-modifiers, and verbs.

Discussion forum data was collected from the web using a combination of manual and automatic processes. SMS/Chat material was donated or collected via live platforms. Telephone speech data was taken from LDC's Chinese CALLHOME and CALLFRIEND telephone collections.

The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.

BOLT Chinese Co-reference – Discussion Forum, SMS/Chat, and Conversational Telephone Speech is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.