Linguistic Data Consortium: Switchboard


Thursday, September 15, 2022

LDC September 2022 Newsletter

Upcoming Policy Change to LDC’s Open Memberships

LDC at Interspeech 2022

LanguageARC: Citizen Science for Language

30th Anniversary Highlight: Switchboard

New publications:
Xi’an Guanzhong Object Naming
MASRI Synthetic
_____________________________________________________________

Upcoming Policy Change to LDC’s Open Memberships

LDC is changing its open membership year policy beginning January 1, 2023. Only one membership year will be open for joining – the current membership year. The 2022 membership year will close for joining on December 31, 2022. We expect this change to have a minimal impact on members, while allowing us to streamline our processes to serve members better. LDC’s many membership benefits will remain the same and organizations choosing to join membership years in advance will still be able to do so. If you have any questions about this change, please don’t hesitate to contact our membership office.

LDC at Interspeech 2022

LDC is proud to sponsor the Workshop for Young Female Researchers in Speech (YFRSW), to be held in person as an Interspeech 2022 pre-conference satellite event on September 17. Also, be sure to check out the collaborative work of LDC’s Mark Liberman, “The mapping between syntactic and prosodic phrasing in English and Mandarin”, presented during the On-Site Oral Session: Phonetics and Phonology on Wednesday, September 21, 13:30-15:30 KST.

LanguageARC: Citizen Science for Language

LanguageARC is a citizen science web portal for language research developed by LDC with the support of the National Science Foundation (grant #1730377).

LanguageARC brings together researchers and participants from the general public interested in language to form a community dedicated to supporting and advancing language-related research and development. Contributors to this online community can participate in a variety of language-related tasks and activities such as reading text, answering questions, describing images or video, creating or evaluating transcriptions for audio clips or developing translations into their native languages. LanguageARC includes projects in languages other than English, such as French, Sesotho and Swedish. Xi’an Guanzhong Object Naming LDC2022S09, released this month in LDC’s Catalog and described below, is an example of a data set developed using LanguageARC. New projects will be added on an ongoing basis.

Sign up for a LanguageARC account today to start making real contributions to language knowledge and research. Please share this information with colleagues, students and anyone who might be interested in participating in the language activities on this website. If you are a researcher interested in creating a project on LanguageARC, please reach out on the site’s Contact page.

Find LanguageARC on Facebook at: https://www.facebook.com/languagearc

30th Anniversary Highlight: Switchboard

Switchboard-1 Release 2 (LDC97S62) is considered the first large collection of spontaneous conversational telephone speech (Graff & Bird, 2000). It consists of approximately 260 hours of recordings collected by Texas Instruments in 1990-1991 (Godfrey et al., 1992). The first release of the corpus (later superseded) was published by NIST and distributed by LDC in 1993.

Participants were 543 speakers (302 male, 241 female) from across the United States who accounted for around 2,400 two-sided telephone conversations. A robot operator handled the calls, giving the caller appropriate recorded prompts, selecting and dialing another person (the callee) to take part in a conversation, introducing a topic for discussion and recording the speech from the two subjects into separate channels until the conversation was finished. Roughly 70 topics were provided, of which about 50 were used frequently. Selection of topics and callees was constrained so that: (1) no two speakers would converse together more than once and (2) no one spoke more than once on a given topic.
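The two selection constraints described above can be illustrated with a minimal Python sketch. It is not the original Texas Instruments collection software; the class name `PairingConstraints`, its methods, and the speaker and topic labels are hypothetical and exist only to show how such a check might work.

```python
class PairingConstraints:
    """Tracks past conversations so the two selection rules can be checked.

    Hypothetical illustration only; not the original collection platform.
    """

    def __init__(self):
        self.pairs_seen = set()        # unordered speaker pairs already used
        self.topics_by_speaker = {}    # speaker -> set of topics already discussed

    def is_allowed(self, caller, callee, topic):
        """Return True if the proposed conversation breaks neither rule."""
        if caller == callee:
            return False
        # Rule 1: no two speakers converse together more than once.
        if frozenset((caller, callee)) in self.pairs_seen:
            return False
        # Rule 2: no one speaks more than once on a given topic.
        for speaker in (caller, callee):
            if topic in self.topics_by_speaker.get(speaker, set()):
                return False
        return True

    def record(self, caller, callee, topic):
        """Register a completed conversation."""
        self.pairs_seen.add(frozenset((caller, callee)))
        for speaker in (caller, callee):
            self.topics_by_speaker.setdefault(speaker, set()).add(topic)
```

In this sketch, a scheduler would call `is_allowed` before dialing a callee and `record` once the conversation has finished. Storing each pair as a `frozenset` makes the first rule independent of who placed the call.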

This gold standard data set has been used for many HLT applications, including speaker identification, speaker authentication, and speech recognition. It is considered one of the most important benchmarks for recognition tasks involving large vocabulary conversational speech (Deshmukh et al., 1998) as well as a key resource for studying the phonetic properties of spontaneous speech (Greenberg et al., 1996). Annotation tasks based on Switchboard include discourse tags/speech acts, part-of-speech tagging and parsing, and sentiment analysis.

The Switchboard series includes Switchboard Credit Card, Phase II, Phase III, the Switchboard Cellular collection, and new recordings from 18 Switchboard participants in the 2013 Greybeard corpus.

All Switchboard corpora are available in the Catalog for licensing by Consortium members and non-members. Visit Obtaining Data for more information.

New publications:

Xi’an Guanzhong Object Naming consists of 15 hours of audio recordings from speakers of the Guanzhong dialect of Mandarin Chinese living in or near Xi’an in Shaanxi Province (China) naming objects that appeared in colored line drawings. The corpus was developed to support traditional and computer-aided language documentation.

Data was collected from a closed volunteer community between February and May 2021 using LanguageARC, a citizen science portal developed by LDC. Speakers were presented with images selected from the MultiPic dataset and were asked to record themselves naming the objects in the images.

Xi’an Guanzhong Object Naming is distributed via web download.

2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

MASRI (Maltese Automatic Speech Recognition I) Synthetic was developed by the MASRI team at the University of Malta and contains 99 hours of synthesized Maltese speech.

Source sentences were extracted from the Maltese Language Resource Server (MLRS) corpus, which consists of written or transcribed Maltese covering various genres, including parliamentary debates, news, law, opinion, sports, culture, academic, literature and religious texts. Text was processed through the CrimsonWing text-to-speech system to generate speech files. Synthesized speech was created with 210 voices (105 female, 105 male).

MASRI Synthetic is distributed via web download.

2022 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Wednesday, July 15, 2020

LDC 2020 July Newsletter

Penn Parsed Corpora of Historical English Now Available From LDC
Fall 2020 LDC Data Scholarship Program

New Publications:
Speech Sentiment Annotations
Penn Parsed Corpora of Historical English
IARPA Babel Javanese Language Pack IARPA-babel402b-v1.0b
BOLT Chinese-English Word Alignment and Tagging -- Conversational Telephone Speech Training
____________________________________________________________

Penn Parsed Corpora of Historical English Now Available From LDC

LDC is pleased to announce that the Penn Parsed Corpora of Historical English (LDC2020T16) – an important community resource for 20 years – is now available for licensing in the LDC Catalog. Developed by University of Pennsylvania researchers in the Linguistics Department under the direction of Professor Anthony Kroch, this data set consists of syntactic annotation of English prose texts from the earliest Middle English documents (1100 CE) up to the period of the First World War (1914 CE) represented in three corpora:

The Penn-Helsinki Parsed Corpus of Middle English, second edition

The Penn-Helsinki Parsed Corpus of Early Modern English

The Penn Parsed Corpus of Modern British English, second edition

This release also includes annotation guidelines and philological information for each corpus as well as the CorpusSearch 2 program which allows users to search the data for words, word sequences and syntactic structure.

In addition to being of value to students and scholars of the history of English, this data set is useful to computational linguists for domain adaptation. More information about this project is available from the Penn Parsed Corpora of Historical English homepage.

Current licensees should contact LDC’s membership office with any questions regarding access to this data set.

Fall 2020 LDC Data Scholarship Program

Student applications for the Fall 2020 LDC Data Scholarship program are being accepted now through September 15, 2020. This scholarship program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and letter of support from their advisor.

For application requirements and program rules, please visit the LDC Data Scholarship page.
____________________________________________________________

New publications:

(1) Speech Sentiment Annotations was developed by Google Inc. and consists of sentiment labels (positive, negative, neutral) for approximately 49,500 utterances covering 140 hours of audio from Switchboard-1 Release 2 (LDC97S62).

Switchboard speech files were segmented based on the start and end time of transcript turns. Annotators listened to the audio corresponding to each segment (utterance) and classified each into positive, negative or neutral categories based on the emotion and attitude of the speaker. Annotators provided a justification for positive and negative classifications using a flow chart. Further information about the methodology and annotation process is contained in the documentation accompanying this release.

Switchboard-1 Release 2 (LDC97S62) consists of 260 hours of telephone speech from 543 speakers across the United States (302 male speakers, 241 female speakers). A computer-driven telephone collection platform paired two subjects for each conversation and provided a discussion topic, ensuring that no two speakers conversed together more than once and no one speaker talked more than once on a given topic.

Speech Sentiment Annotations is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) Penn Parsed Corpora of Historical English was developed at the University of Pennsylvania and consists of running texts and text samples of British English prose from the earliest Middle English documents (1100 CE) up to the period of the First World War (1914 CE). This data set contains three corpora covering traditionally recognized periods of English:

The Penn-Helsinki Parsed Corpus of Middle English, second edition
The Penn-Helsinki Parsed Corpus of Early Modern English
The Penn Parsed Corpus of Modern British English, second edition

The texts are in three forms: plain text, part-of-speech tagged text, and syntactically annotated text. This release also includes annotation guidelines, philological information for each corpus and the CorpusSearch 2 program, which allows users to search the data for words, word sequences and syntactic structure.

The Penn Parsed Corpora of Historical English were designed for students and scholars of the history of English, especially the historical syntax of the language. They have also been used by computational linguists for domain adaptation. See the Penn Parsed Corpora of Historical English homepage for more information about this project.

Penn Parsed Corpora of Historical English is distributed via web download.

2020 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) IARPA Babel Javanese Language Pack IARPA-babel402b-v1.0b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 204 hours of Javanese conversational and scripted telephone speech collected in 2014 and 2015 along with corresponding transcripts.

The Javanese speech in this release represents the Central, Western, and Eastern Javanese dialect regions of Indonesia. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 65 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Javanese Language Pack IARPA-babel402b-v1.0b is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(4) BOLT Chinese-English Word Alignment and Tagging -- Conversational Telephone Speech Training was developed by LDC and consists of 158,651 words of Chinese and English parallel text enhanced with linguistic tags to indicate word relations.

The source data in this release consists of transcripts of Chinese conversational telephone speech (CTS) from LDC's CALLHOME and CALLFRIEND collections (LDC96S34, LDC96T16, LDC96S55) that were translated into English by professional translation agencies and annotated for the word alignment task.

The BOLT word alignment task was built on treebank annotation. LDC automatically extracted Chinese source tokens, including empty categories/traces, from word-segmented files provided by the BOLT Chinese Treebank annotation team at Brandeis University. The word-segmented tokens were then used to automatically generate ctb (Chinese Treebank) alignment and were also tokenized for character alignment by inserting white spaces to separate characters.
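The character-level tokenization step mentioned above, inserting white space between characters, can be sketched in a few lines of Python. This is an illustration only and not LDC's tooling: the function name `split_into_characters` and the sample tokens are hypothetical, and the sketch makes no attempt to treat empty categories or traces specially, since the text does not say how those were handled.

```python
def split_into_characters(tokens):
    """Join word-segmented tokens into a string with one space between characters.

    Hypothetical helper for illustration; every character of every token
    is separated, with no special handling of trace markers.
    """
    return " ".join(ch for token in tokens for ch in token)


# Word-segmented input (sample tokens chosen for illustration).
words = ["我们", "喜欢", "音乐"]
print(split_into_characters(words))  # prints: 我 们 喜 欢 音 乐
```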

BOLT Chinese-English Word Alignment and Tagging -- Conversational Telephone Speech Training is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.