Linguistic Data Consortium

Friday, September 15, 2023

LDC September 2023 Newsletter

LDC data and commercial technology development

New publications:

CALLFRIEND Russian Speech

CALLFRIEND Russian Text
________________________________________________________________

LDC data and commercial technology development

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

CALLFRIEND Russian Speech was developed by LDC and consists of 48 hours of telephone conversations (100 recordings) between native speakers of Russian. The calls were recorded in 1999 as part of the CALLFRIEND collection, a project designed primarily to support research in automatic language identification. One hundred native Russian speakers living in the continental United States each made a single phone call, lasting up to 30 minutes, to a family member or friend living in the United States.

All recordings were domestic calls routed through LDC’s automated telephone collection platform and stored as 2-channel (4-wire), 8-kHz mu-law samples taken directly from a public telephone network via a T-1 circuit. Each audio file is a FLAC-compressed MS-WAV (RIFF) file containing 2-channel, 8-kHz, 16-bit PCM sample data.
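For readers unfamiliar with mu-law audio, the sketch below shows the standard G.711 mu-law expansion from 8-bit codes to signed linear samples that fit in 16 bits. It is an illustration only: the newsletter does not say how the released 16-bit PCM files were produced, and the function names `ulaw_to_linear` and `decode_ulaw` are hypothetical.

```python
from typing import List

BIAS = 0x84  # G.711 mu-law bias (132)


def ulaw_to_linear(code: int) -> int:
    """Expand one 8-bit mu-law code to a signed linear sample."""
    code = ~code & 0xFF            # mu-law codes are stored bit-inverted
    sign = code & 0x80             # top bit carries the sign
    exponent = (code >> 4) & 0x07  # 3-bit segment number
    mantissa = code & 0x0F         # 4-bit step within the segment
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude


def decode_ulaw(data: bytes) -> List[int]:
    """Expand a byte string of mu-law codes to a list of linear samples."""
    return [ulaw_to_linear(b) for b in data]


# 0xFF and 0x7F encode silence; 0x00 and 0x80 are the two extremes.
samples = decode_ulaw(bytes([0xFF, 0x00, 0x80, 0x7F]))
```

Every decoded value lies within the signed 16-bit range, which is why mu-law telephone audio is commonly delivered as 16-bit PCM.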

This release includes call metadata, such as speaker gender, the number of speakers on each channel, and call duration.

Corresponding transcripts and a lexicon are available in CALLFRIEND Russian Text (LDC2023T09).

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

CALLFRIEND Russian Text contains the corresponding transcripts and a lexicon for CALLFRIEND Russian Speech, that is, 48 hours of telephone conversations (100 recordings) between native Russian speakers.

Each line of a transcript has four main fields (begin_offset, end_offset, speaker_label, transcript_text) separated by tabs. Each transcript file contains a list of time-stamped segments ordered by their begin_offset values, with no blank lines.
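A minimal sketch of reading this layout in Python follows. It assumes that the offsets are decimal numbers (the unit is not stated here) and that the files are read as text; the sample lines, the speaker labels "A" and "B", and the names `Segment`, `parse_transcript`, and `is_ordered` are invented for illustration and are not taken from the corpus documentation.

```python
from typing import Iterable, List, NamedTuple


class Segment(NamedTuple):
    begin_offset: float
    end_offset: float
    speaker_label: str
    transcript_text: str


def parse_transcript(lines: Iterable[str]) -> List[Segment]:
    """Parse tab-separated transcript lines into Segment records."""
    segments = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            raise ValueError(f"line {number}: expected at least 4 fields")
        # Only the four main fields are used; any extra fields are ignored.
        begin, end, speaker, text = fields[:4]
        segments.append(Segment(float(begin), float(end), speaker, text))
    return segments


def is_ordered(segments: List[Segment]) -> bool:
    """Check that segments appear in order of begin_offset."""
    return all(a.begin_offset <= b.begin_offset
               for a, b in zip(segments, segments[1:]))


# Made-up sample lines, not real corpus content.
sample = ["0.00\t2.35\tA\tалло\n", "2.10\t4.80\tB\tпривет\n"]
segments = parse_transcript(sample)
```

Checking `is_ordered(segments)` after loading is a cheap way to confirm that a file follows the ordering described above.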

The lexicon covers the word forms in the 97 transcript files. The main lexicon table contains three columns per row: Cyrillic orthography, phonetic transliteration and numeric representation of syllabic stress.
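The three-column table could be loaded along the lines of the sketch below. The column separator is not specified in this announcement, so a tab is assumed, and the stress column is kept as a string because its encoding is not described. The sample rows and the names `LexEntry` and `load_lexicon` are hypothetical.

```python
from typing import Dict, Iterable, List, NamedTuple


class LexEntry(NamedTuple):
    orthography: str      # Cyrillic word form
    transliteration: str  # phonetic transliteration
    stress: str           # numeric stress representation, kept as text


def load_lexicon(lines: Iterable[str], sep: str = "\t") -> Dict[str, List[LexEntry]]:
    """Group lexicon rows by Cyrillic orthography (separator is assumed)."""
    lexicon: Dict[str, List[LexEntry]] = {}
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue
        columns = line.split(sep)
        if len(columns) != 3:
            raise ValueError(f"row {number}: expected 3 columns")
        entry = LexEntry(*columns)
        # A word form may have more than one entry, so values are lists.
        lexicon.setdefault(entry.orthography, []).append(entry)
    return lexicon


# Made-up rows; the stress values are placeholders, not the corpus encoding.
rows = ["привет\tprivet\t2\n", "алло\tallo\t2\n"]
lexicon = load_lexicon(rows)
```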

Corresponding speech data is available as CALLFRIEND Russian Speech (LDC2023S08).

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Tuesday, August 15, 2023

LDC August 2023 Newsletter

LDC at Interspeech 2023

LDC releases speech activity detector

Fall 2023 LDC Data Scholarship Program

New publications:

2019 OpenSAT Public Safety Communications Simulation

Samrómur Queries Icelandic Speech 1.0

__________________________________________________________________________

LDC at Interspeech 2023

LDC is happy to be back in person as an exhibitor and longtime supporter of Interspeech, taking place this year August 20-24 in Dublin, Ireland. Stop by Stand A2 to say hello and learn about the latest developments at the Consortium. LDC is also delighted to once again be a silver sponsor for the Young Female Researchers in Speech Workshop and to provide data in support of the CHiME-7 challenge satellite workshop and the MERLIon CCS Challenge. LDC will post conference updates via our social media platforms. We look forward to seeing you in Dublin!

LDC releases speech activity detector

LDC announces the release of the LDC Broad Phonetic Class Speech Activity Detector. Based on the broad phonetic class recognizer implemented in the HTK Speech Recognition Toolkit, LDC’s speech activity detector model runs the speech signal through a GMM-HMM recognizer to identify five broad phonetic classes: vowel, stop/affricate, fricative, nasal, and glide/liquid. The LDC Broad Phonetic Class Speech Activity Detector is available at no cost on GitHub under a GPL v3 license.

Fall 2023 LDC Data Scholarship Program

Student applications for the Fall 2023 LDC Data Scholarship program are being accepted now through September 15, 2023. This program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and letter of support from their advisor. For application requirements and program rules, visit the LDC Data Scholarships page.

New publications:

2019 OpenSAT Public Safety Communications Simulation contains 141 hours of English speech recordings and transcripts used in the NIST Open Speech Analytic Technologies (OpenSAT) 2019 evaluation's automatic speech recognition, speech activity detection, and keyword search tasks. The data is part of the SAFE-T (Speech Analysis For Emergency Response Technology) corpus created by LDC, which features speakers engaged in a collaborative problem-solving activity representative of public safety communications in terms of speech content, noise types, and noise levels.

US English speakers played the board game Flash Point Fire Rescue. Background noise was played through a participant's headset during the recording session. Each recording session consisted of two 30-minute games. The corpus is divided into training, development, and evaluation data.

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Samrómur Queries Icelandic Speech 1.0 was developed by the Language and Voice Lab, Reykjavik University in cooperation with Almannarómur, Center for Language Technology. The corpus contains 20 hours of Icelandic prompted queries (17,475 utterances) from 3,809 speakers.

Speech data was collected between October 2019 and December 2021 using the Samrómur website, which displayed prompts to participants. The prompts were drawn mainly from The Icelandic Gigaword Corpus, which includes text from novels, news, and plays, and from a list of location names in Iceland. Additional prompts were taken from the Icelandic Web of Science, and others were created by combining a name with a following question or demand. Prompts and speaker metadata are included in the corpus.

2023 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.