Linguistic Data Consortium: RATS


Sunday, March 17, 2024

LDC March 2024 Newsletter

LDC data and commercial technology development

New publications:

___________________________________________________________________

LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:
RATS Low Speech Density was developed by LDC and consists of 87 hours of English, Levantine Arabic, Farsi, Pashto and Urdu speech and non-speech samples. The recordings were assembled by concatenating a randomized selection of speech, communications systems sounds, and silence. This corpus was created to measure false alarm performance in RATS speech activity detection systems.

The source audio was extracted from RATS development and progress sets and consists of conversational telephone speech recordings collected by LDC. Non-speech samples were selected from communications systems sounds, including telephone network special information tones, radio selective calling signals, HF/VHF/UHF digital mode radio traffic, radio network control channel signals, two-way radio traffic containing roger beeps, and short duration shift-key modulated handset data transmissions.
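As a rough illustration of the assembly described above, the following Python sketch concatenates randomly chosen segments from three pools (speech, communications systems sounds, silence) and records where each one falls on the timeline. It is only a sketch under stated assumptions: the function names, the pool labels, the uniform random choice and the use of bare durations instead of audio are all hypothetical, and the procedure LDC actually used is not described in this newsletter.

```python
import random


def assemble_recording(pools, target_seconds, seed=0):
    """Concatenate randomly chosen segments until target_seconds is reached.

    pools maps a label (e.g. "speech", "system_sound", "silence") to a list
    of segment durations in seconds. Labels and durations are illustrative.
    Returns a list of (label, start, end) tuples covering the recording.
    """
    rng = random.Random(seed)          # seeded so the selection is repeatable
    labels = sorted(pools)             # stable order for reproducibility
    timeline = []
    cursor = 0.0
    while cursor < target_seconds:
        label = rng.choice(labels)     # assumption: labels drawn uniformly
        duration = rng.choice(pools[label])
        timeline.append((label, cursor, cursor + duration))
        cursor += duration
    return timeline


def speech_density(timeline):
    """Fraction of the recording's duration that is labeled as speech."""
    if not timeline:
        return 0.0
    total = timeline[-1][2] - timeline[0][1]
    speech = sum(end - start for label, start, end in timeline
                 if label == "speech")
    return speech / total


if __name__ == "__main__":
    pools = {"speech": [2.0, 4.0], "system_sound": [1.0, 1.5], "silence": [3.0]}
    timeline = assemble_recording(pools, target_seconds=30.0, seed=7)
    print(f"{len(timeline)} segments, speech density {speech_density(timeline):.2f}")
```

Because the ground-truth timeline is known by construction, any span a detector marks as speech outside the "speech" entries can be counted as a false alarm.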

The goal of the RATS (Robust Automatic Transcription of Speech) program was to develop human language technology systems capable of performing speech detection, language identification, speaker identification and keyword spotting on the severely degraded audio signals that are typical of various radio communication channels, especially those employing various types of handheld portable transceiver systems.

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

BabyEars Affective Vocalizations contains 22 minutes of spontaneous English speech by 12 adults interacting with their infant children, for a total of 509 infant-directed utterances and 185 adult-directed or neutral utterances. Speech data was collected in a quiet room during a one-hour session where each parent was asked to play and otherwise interact normally with their infant (aged 10-18 months). A trained research assistant then extracted discrete utterances and classified them into three categories: approval, attention and prohibition.

2024 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.

Monday, July 16, 2018

LDC 2018 July Newsletter

Fall 2018 Data Scholarship Program

New Publications:

CALLFRIEND Mandarin Chinese-Mainland Dialect Second Edition

RATS Language Identification

TRAD Chinese-French Parallel Text – Broadcast News

_________________________________________________________________________

Fall 2018 LDC Data Scholarship Program

Student applications for the Fall 2018 LDC Data Scholarship program are being accepted now through September 15, 2018. This scholarship program provides university students with access to LDC data at no cost. Students must complete an application which consists of a data use proposal and letter of support from their advisor.

For application requirements and program rules, please visit the LDC Data Scholarship page.

New publications:

(1) CALLFRIEND Mandarin Chinese-Mainland Dialect Second Edition was developed by LDC and consists of approximately 24 hours of unscripted telephone conversations between native speakers of the Mandarin Chinese dialect spoken in mainland China. This second edition updates the audio files to wav format, simplifies the directory structure and adds documentation and metadata. The first edition is available as CALLFRIEND Mandarin Chinese-Mainland Dialect (LDC96S55).

All data was collected before July 1997. Participants could speak with a person of their choice on any topic; most called family members and friends. All calls originated in North America. The recorded conversations last up to 30 minutes.

CALLFRIEND Mandarin Chinese-Mainland Dialect Second Edition is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) RATS Language Identification was developed by LDC and consists of approximately 5,400 hours of Levantine Arabic, Farsi, Dari, Pashto and Urdu conversational telephone speech with annotation of speech segments. The corpus was created to provide training, development and initial test sets for the Language Identification (LID) task in the DARPA RATS (Robust Automatic Transcription of Speech) program.

The source audio consists of: (1) conversational telephone speech (CTS) recordings, taken either from previous LDC CTS corpora or from CTS data collected specifically for the RATS program from Levantine Arabic, Pashto, Urdu, Farsi and Dari native speakers; and (2) portions of VOA broadcast news recordings, taken from data used in the 2009 NIST Language Recognition Evaluation. The 2009 LRE Test Set is available from LDC as LDC2014S06.

CTS recordings were audited by annotators who listened to short segments and determined whether the audio was in the target language. Annotations on the audio files include start time, end time, speech activity detection (SAD) label, SAD provenance, language ID and LID provenance.
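For readers who want to work with these annotations programmatically, the six fields listed above could be modeled along the following lines. This is a minimal sketch, not the corpus's documented format: the tab-separated layout, the field order, the class and function names, and the sample label values are all assumptions made for illustration. The documentation accompanying the release is the authority on the real file layout.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One annotated span of audio, using the fields named in the newsletter."""
    start: float          # start time in seconds
    end: float            # end time in seconds
    sad_label: str        # speech activity detection label
    sad_provenance: str   # how the SAD label was produced
    language_id: str      # language label for the segment
    lid_provenance: str   # how the language label was produced


def parse_segment(line: str) -> Segment:
    """Parse one line of a hypothetical six-column, tab-separated layout."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    start, end = float(fields[0]), float(fields[1])
    if end < start:
        raise ValueError("end time precedes start time")
    return Segment(start, end, *fields[2:])


if __name__ == "__main__":
    # The label values below are made up for the example.
    print(parse_segment("0.00\t3.25\tS\tmanual\tfas\tmanual"))
```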

RATS Language Identification is distributed via hard drive.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) TRAD Chinese-French Parallel Text – Broadcast News was developed by ELDA as part of the PEA-TRAD project. It contains French translations of a subset of approximately 30,000 Chinese characters from GALE Phase 1 Chinese Broadcast News Parallel Text - Part 3 (LDC2008T18). The purpose of the PEA-TRAD project (Translation as a Support for Document Analysis) was to develop speech-to-speech translation technology for multiple languages (e.g., Arabic, Chinese, Pashto) from a variety of domains.

This release consists of 977 segments (translation units) from 139 documents. The Chinese source file contains 33,571 characters and the French reference translation contains 22,424 words. The source data is Chinese broadcast news collected and translated into English by LDC for the DARPA GALE (Global Autonomous Language Exploitation) program. Information about the ELDA translation team, translation guidelines and validation results is contained in the documentation accompanying this release.

TRAD Chinese-French Parallel Text – Broadcast News is distributed via web download.

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Wednesday, October 18, 2017

LDC October 2017 Newsletter

LDC Awards Fall Data Scholarships

Membership Year 2018 Publication Preview

New Publications:

RATS Keyword Spotting

English Web Treebank Propbank

Ancient Chinese Corpus

MWE-Aware English Dependency Corpus Version 2.0

_________________________________________________________________________

LDC Awards Fall Data Scholarships

LDC is pleased to award fifteen data scholarships to students this fall. Recipients are from eight countries and a variety of academic disciplines. Twenty unique data sets are awarded to the students for their work in diverse applications including machine translation, abstractive text summarization using recurrent neural networks, speech recognition for multiple languages, semantic role labeling for social data, text summarization, speaker recognition for forensic applications, and more. Please look to LDC’s social media pages for upcoming announcements highlighting each recipient and their intended research. Congratulations to all of our recipients!

Membership Year 2018 Publication Preview

The 2018 Membership Year is just around the corner and plans for next year’s publications are in progress. Among the expected releases are:

Multilanguage conversational telephone speech: developed to support language identification research in related languages (Central Asian, Central European language groups)
DIRHA (Distant-speech Interaction for Robust Home Applications): Wall Street Journal read speech with noise and reverberation, suitable for various multi-microphone signal processing and distant speech recognition tasks
TRAD corpora: Chinese-French and Arabic-French parallel text (newswire, web data)
IARPA Babel Language Packs (telephone speech and transcripts): languages include Cebuano, Guarani, Kazakh, Lithuanian, Telugu, Tok Pisin
BOLT: discussion forum, SMS, word-aligned, and tagged data in all languages (Egyptian Arabic, English, Chinese)
DEFT: Spanish Treebank (newswire, web data)
RATS Language Identification data set (Dari, Farsi, Levantine Arabic, Pashto, Urdu; degraded audio signals)
TAC KBP: comprehensive English source and entity linked data (broadcast, telephone speech, newswire, web data)
German children’s handwriting (longitudinal study of weekly writing in classroom setting with enhanced output for specific spelling patterns)

Check your inbox in the coming weeks for more information about membership renewal.

New publications:

(1) RATS Keyword Spotting was developed by LDC and consists of approximately 3,100 hours of Levantine Arabic and Farsi conversational telephone speech with automatic and manual annotation of speech segments, transcripts, and keywords generated from transcript content. The corpus was created to provide training, development, and initial test sets for the keyword spotting (KWS) task in the DARPA RATS (Robust Automatic Transcription of Speech) program.

The source audio consists of conversational telephone speech recordings collected by LDC: (1) data collected for the RATS program from Levantine Arabic and Farsi speakers; and (2) material from Levantine Arabic QT Training Data Set 5, Speech (LDC2006S29) and CALLFRIEND Farsi Second Edition Speech (LDC2014S01). Transcripts of calls were either produced or available from the source corpora. Potential target keywords were selected from the transcripts based on word frequencies to fall within a range of target-word likelihood per hour of speech. The selected words were manually reviewed to confirm that each was a regular word or multi-word expression of more than three syllables.
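The frequency-based selection step can be pictured with a short Python sketch. It assumes whitespace-tokenized transcripts and a simple occurrences-per-hour band. The function name, the thresholds and the tokenization are hypothetical, and the manual review for word type and syllable count mentioned above is not automated here.

```python
from collections import Counter


def candidate_keywords(transcripts, hours, min_per_hour, max_per_hour):
    """Return words whose rate of occurrence per hour falls within a band.

    transcripts: iterable of transcript strings (whitespace-tokenized here,
                 which is an assumption made for the example).
    hours:       total duration of the transcribed speech, in hours.
    """
    if hours <= 0:
        raise ValueError("hours must be positive")
    counts = Counter(word for text in transcripts for word in text.split())
    return sorted(
        word for word, count in counts.items()
        if min_per_hour <= count / hours <= max_per_hour
    )


if __name__ == "__main__":
    # Toy data: placeholder tokens standing in for transcript words.
    transcripts = ["w1 w2 w1 w3", "w1 w2 w4"]
    print(candidate_keywords(transcripts, hours=1.0,
                             min_per_hour=2, max_per_hour=2))
```

The output of a filter like this would only be a shortlist. Each candidate would still need the manual check before being used as a target keyword.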

RATS Keyword Spotting is distributed via hard drive.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) English Web Treebank Propbank was developed by University of Colorado Boulder - CLEAR (Computational Language and Education Research) and provides predicate-argument structure annotation for English Web Treebank (LDC2012T13).

The goal of Propbank (or proposition bank) annotation is to develop annotations with information about basic semantic propositions. English Web Treebank Propbank provides semantic role annotation and predicate sense disambiguation for roughly 50,000 predicates, corresponding to all verbs, all adjectives in equational clauses, and all nouns considered to be predicative. Mark-up is in the "unified" propbank annotation format, which combines representations in nouns, verbs, and adjectives. The source data consists of weblogs, newsgroups, email, reviews, and questions-answers.

English Web Treebank Propbank is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) Ancient Chinese Corpus was developed at Nanjing Normal University. It contains word-segmented and part-of-speech tagged text from Zuozhuan, an ancient Chinese work believed to date from the Warring States Period (475-221 BC). This release is part of a continuing project to develop a large, part-of-speech tagged ancient Chinese corpus. It consists of 180,000 Chinese characters and 195,000 segment units (including words and punctuation). The part-of-speech tag set was developed by Nanjing Normal University and contains 17 tags. The data is presented as UTF-8 plain text files using traditional Chinese script.

Ancient Chinese Corpus is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(4) MWE-Aware English Dependency Corpus Version 2.0 was developed by the Nara Institute of Science and Technology Computational Linguistics Laboratory and consists of English compound function words annotated in dependency format. The data is derived from OntoNotes Release 5.0 (LDC2013T19).

Version 2.0 adds annotations of named entities (persons, locations, organizations) into dependency trees that are aware of compound function words. Version 1.0 is available from LDC as MWE-Aware English Dependency Corpus (LDC2017T01).

MWEs (multiword expressions) were identified in OntoNotes' phrase structure trees and each MWE was established as a single subtree. Those phrase structure subtrees were then converted to a dependency structure (the Stanford dependencies) in CoNLL format. The data is distributed as 1,728 phrase structure trees in *.parse files and a single 14-column, tab-separated dependency file (*.conll). Both file types are encoded as UTF-8.
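A file of this shape could be read with a few lines of Python, sketched below. The sketch assumes the usual CoNLL convention that blank lines separate sentences, and it makes no claim about what each of the 14 columns holds, since the newsletter does not list them. The function name is hypothetical.

```python
def read_conll(lines, expected_columns=14):
    """Group tab-separated token rows into sentences.

    lines: an iterable of text lines, e.g. an open file handle.
    Blank lines are treated as sentence boundaries (an assumption based on
    the common CoNLL convention, not on the corpus documentation).
    Returns a list of sentences, each a list of column lists.
    """
    sentences, current = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            if current:                      # close the sentence in progress
                sentences.append(current)
                current = []
            continue
        columns = line.split("\t")
        if len(columns) != expected_columns:
            raise ValueError(
                f"expected {expected_columns} columns, got {len(columns)}")
        current.append(columns)
    if current:                              # file may not end with a blank line
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    row = "\t".join(f"c{i}" for i in range(14))
    demo = [row + "\n", row + "\n", "\n", row + "\n"]
    print([len(sentence) for sentence in read_conll(demo)])
```

With a real file, the call would look like `read_conll(open(path, encoding="utf-8"))`, where `path` points at the *.conll file.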

MWE-Aware English Dependency Corpus Version 2.0 is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.