Linguistic Data Consortium: May 2021

LDC at ICASSP 2021

New Publications:
The SSNCE Database of Tamil Dysarthric Speech
ESPADA
BOLT Chinese SMS/Chat Parallel Training Data

LDC at ICASSP 2021
LDC will be exhibiting at ICASSP 2021, held virtually this year June 6-11. Stop by our digital booth June 8-10 to learn more about recent developments at the Consortium and new publications.

Also, check out the following poster featuring LDC work:

Probing Acoustic Representations for Phonetic Properties
Wednesday, June 9, 14:00 - 14:45
Session: AUD-11: Auditory Modeling and Hearing Instruments

LDC will post conference links and updates via our Twitter feed and Facebook page. We hope to “see” you there!

New Publications:

(1) The SSNCE Database of Tamil Dysarthric Speech was developed by the Speech Lab, SSN College of Engineering, India, in collaboration with the Indian National Institute of Empowerment of Persons with Multiple Disabilities (NIEPMD). It contains approximately eight hours of Tamil speech data, time-aligned transcripts and metadata collected from 30 speakers (20 dysarthric speakers and 10 non-dysarthric speakers).

The speech data was collected between 2015 and 2017 in two sessions at NIEPMD. Each speaker recorded 365 utterances consisting of single words and sentences that combined common and uncommon Tamil phrases. The non-dysarthric speakers were five female and five male subjects. The dysarthric speakers (7 female, 13 male) reported a diagnosis of cerebral palsy and ranged in age from 12 to 37 years old.

Dysarthria is a speech disorder caused by muscle weakness, and it can result in slowed and slurred speech that is difficult to understand. Common causes of dysarthria include nervous system disorders and conditions that cause facial paralysis or tongue or throat muscle weakness.

The SSNCE Database of Tamil Dysarthric Speech is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) ESPADA (Extended Syntactic Phrase Alignment DAtaset) consists of annotated parse trees and phrase alignments for English sentential paraphrases from NIST’s OpenMT evaluation corpora. It extends SPADE (LDC2018T09), adding to SPADE's development and test sets new annotated data for training and testing phrasal paraphrase detection and phrase representation models. HPSG (head-driven phrase structure grammar) trees and phrase alignments were annotated to a gold standard, resulting in 251,972 phrase alignments identified in 1,916 sentential paraphrases.

ESPADA is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) BOLT Chinese SMS/Chat Parallel Training Data was developed by LDC and consists of approximately 1.8 million tokens of Chinese SMS/Chat data and their corresponding English translations.

The source data was donated or collected by LDC via live platforms. Data was manually selected for translation. Messages/conversations were arranged in chronological order, segmented into sentence units (all or portions of message threads depending on their length), and assigned to translation vendors. Translators followed LDC's BOLT translation guidelines.

The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources (discussion forums, text messaging and chat) in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking, and co-reference.

BOLT Chinese SMS/Chat Parallel Training Data is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Linguistic Data Consortium

Monday, May 17, 2021
