Showing posts with label American English speech. Show all posts

Tuesday, August 15, 2023

LDC August 2023 Newsletter

LDC at Interspeech 2023

LDC releases speech activity detector

Fall 2023 LDC Data Scholarship Program

__________________________________________________________________________

LDC at Interspeech 2023
LDC is happy to be back in person as an exhibitor and longtime supporter of Interspeech, taking place this year August 20-24 in Dublin, Ireland. Stop by Stand A2 to say hello and learn about the latest developments at the Consortium. LDC is also delighted to once again be a silver sponsor for the Young Female Researchers in Speech Workshop and to provide data in support of the CHiME-7 challenge satellite workshop and the MERLIon CCS Challenge. LDC will post conference updates via our social media platforms. We look forward to seeing you in Dublin!

LDC releases speech activity detector
LDC announces the release of the LDC Broad Phonetic Class Speech Activity Detector. Based on the broad phonetic class recognizer implemented in the HTK Speech Recognition Toolkit, LDC’s speech activity detector model runs the speech signal through a GMM-HMM recognizer to identify five broad phonetic classes: vowel, stop/affricate, fricative, nasal, and glide/liquid. The LDC Broad Phonetic Class Speech Activity Detector is available at no cost on GitHub under a GPL v3 license.
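
As an illustration of how broad phonetic class output can be turned into speech activity regions, here is a minimal Python sketch. It is not LDC's implementation: the `(start, end, label)` segment format, the `nonspeech` label, and the `speech_regions` helper are all assumptions made for this example, and the detector's real output format may differ. Only the five class names come from the description above.

```python
from typing import List, Tuple

# The five broad phonetic classes named in the announcement.
SPEECH_CLASSES = {"vowel", "stop/affricate", "fricative", "nasal", "glide/liquid"}

# Hypothetical segment format: (start_sec, end_sec, label).
Segment = Tuple[float, float, str]


def speech_regions(segments: List[Segment], max_gap: float = 0.0) -> List[Tuple[float, float]]:
    """Merge speech-class segments into (start, end) speech regions.

    Segments whose label is not one of the five classes are treated as
    non-speech. Two speech segments are merged when the gap between them
    is at most max_gap seconds.
    """
    regions: List[Tuple[float, float]] = []
    for start, end, label in sorted(segments):
        if label not in SPEECH_CLASSES:
            continue
        if regions and start - regions[-1][1] <= max_gap:
            # Extend the previous region instead of opening a new one.
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions


# Made-up example input, for illustration only.
example = [
    (0.0, 0.3, "nonspeech"),
    (0.3, 0.4, "fricative"),
    (0.4, 0.6, "vowel"),
    (0.6, 1.2, "nonspeech"),
    (1.2, 1.3, "nasal"),
    (1.3, 1.5, "vowel"),
]
print(speech_regions(example))               # [(0.3, 0.6), (1.2, 1.5)]
print(speech_regions(example, max_gap=1.0))  # [(0.3, 1.5)]
```

Raising `max_gap` bridges short pauses between speech segments, which is a common smoothing choice in speech activity detection; the value to use depends on the application.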
Fall 2023 LDC Data Scholarship Program 
Student applications for the Fall 2023 LDC Data Scholarship program are being accepted now through September 15, 2023. This program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and letter of support from their advisor. For application requirements and program rules, visit the LDC Data Scholarships page.

New publications:

2019 OpenSAT Public Safety Communications Simulation contains 141 hours of English speech recordings and transcripts used in the NIST Open Speech Analytic Technologies (OpenSAT) 2019 evaluation's automatic speech recognition, speech activity detection, and keyword search tasks. The data is part of the SAFE-T (Speech Analysis For Emergency Response Technology) corpus created by LDC, in which speakers engaged in a collaborative problem-solving activity representative of public safety communications in terms of speech content, noise types, and noise levels.

US English speakers played the board game Flash Point Fire Rescue. Background noise was played through a participant's headset during the recording session. Recording sessions consisted of two 30-minute games. The corpus is divided into training, development, and evaluation data.

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

*

Samrómur Queries Icelandic Speech 1.0 was developed by the Language and Voice Lab, Reykjavik University in cooperation with Almannarómur, Center for Language Technology. The corpus contains 20 hours of Icelandic prompted queries from 3,809 speakers representing 17,475 utterances.

Speech data was collected between October 2019 and December 2021 using the Samrómur website which displayed prompts to participants. The prompts were mainly from The Icelandic Gigaword Corpus, which includes text from novels, news, plays, and from a list of location names in Iceland. Additional prompts were taken from the Icelandic Web of Science and others were created by combining a name followed by a question or a demand. Prompts and speaker metadata are included in the corpus.

2023 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.


Monday, May 16, 2022

LDC May 2022 Newsletter

30th Anniversary Highlight: Penn Treebank 

New publications:
Samrómur Icelandic Speech 1.0
_______________________________________________________________

30th Anniversary Highlight: Penn Treebank 
LDC’s Catalog features classic corpora responsible for critical advances in human language technology that continue to influence researchers. Among them are the Penn Treebank releases, Treebank-2 (LDC96T7) and Treebank-3 (LDC99T42).

The Penn Treebank project (1989-1996) produced seven million words tagged for part-of-speech, three million words of parsed text, over two million words annotated for predicate-argument structure and 1.6 million words of transcribed speech annotated for speech disfluencies (Taylor et al., 2003). Source material represents a diverse range of data, including Wall Street Journal (WSJ) articles, the Brown Corpus and Switchboard telephone conversations. 

Penn Treebanks are used for a wide range of purposes, including the creation and training of parsers and taggers, work on machine translation and speech recognition, and research concerning joint syntactic and semantic role labeling. Their ongoing influence is evidenced by the popularity of Treebank-3 (LDC99T42), which continues to be one of LDC’s top ten most distributed corpora in the Catalog. In addition, the WSJ section has served as a model for treebanks across many languages (Nivre, 2008).

The Penn Treebank has inspired related annotation schemes, such as Proposition Bank, the Penn Discourse Treebank project, and word alignment annotation. In addition, LDC has developed revised English treebank guidelines resulting in the re-issue of the WSJ section (English News Text Treebank: Penn Treebank Revised (LDC2015T13)) and treebanked web text (e.g., English Web Treebank (LDC2012T13) and BOLT English Translation Treebank – Chinese Discussion Forum (LDC2020T09)).   

Penn Treebank corpora and their related releases are available for licensing to LDC members and nonmembers. For more information about licensing LDC data, visit Obtaining Data.

New publications:
(1) NUBUC (NyU-BU contextually controlled stories Corpus) was developed by New York University, Max Planck Institute for Empirical Aesthetics and Boston University. It contains approximately three hours of English read speech from eight stories focused on linguistic keywords that were created specifically for this corpus, along with transcripts, syntactic annotations and corpus metadata.

Stories are centered on a protagonist and bear a similarity to a modern fairy tale. Each story consists of approximately 2,000 words organized around critical keywords matched along multiple linguistic dimensions. The story texts comprise a total of 1,024 sentences and 16,472 words. Each story was read by two different voice actors, one male and one female, in a neutral American English accent.

Recordings are 11-12 minutes in duration, for a total of about 90 minutes of continuous speech per speaker.

NUBUC is distributed via web download.  

2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data at no cost.

*
(2) Samrómur Icelandic Speech 1.0 was developed by the Language and Voice Lab, Reykjavik University in cooperation with Almannarómur, Center for Language Technology. The corpus contains 145 hours of Icelandic prompted speech from 8,392 speakers representing 100,000 utterances.

Speech data was collected between October 2019 and May 2021 using the Samrómur website which displayed prompts to participants. The prompts were mainly from The Icelandic Gigaword Corpus, which includes text from novels, news, plays, and from a list of location names in Iceland. Additional prompts were taken from the Icelandic Web of Science and others were created by combining a name followed by a question or a demand. Prompts and speaker metadata are included in the corpus.

Samrómur Icelandic Speech 1.0 is distributed via web download.  

2022 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.