Linguistic Data Consortium: DIHARD Challenge

Wednesday, June 15, 2022

LDC June 2022 Newsletter

LDC at LREC 2022

LDC data and commercial technology development

30th Anniversary Highlight: TIMIT

New publication:

Second DIHARD Challenge Evaluation - Eleven Sources

LDC at LREC 2022
LDC will attend the 13th Language Resources and Evaluation Conference (LREC2022), hosted by ELRA, the European Language Resources Association, in Marseille, France, June 20-25, 2022. Several LDC staff members will present current work, including WeCanTalk: A New Multi-language, Multi-modal Resource for Speaker Recognition; Reflections on 30 Years of Language Resource Development and Sharing; A Study in Contradiction: Data and Annotation for AIDA Focusing on Informational Conflict in Russia-Ukraine Relations; Data Protection, Privacy and US Regulation; BeSt: The Belief and Sentiment Corpus; and more.

Stay tuned for specific announcements on LDC’s social media pages regarding presentation times and locations. Following the conference, LDC’s presented papers and posters will be available on the Papers Page.

LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

30th Anniversary Highlight: TIMIT
The TIMIT Acoustic-Phonetic Continuous Speech Corpus is another of the classic releases in LDC’s Catalog. Designed for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems, it contains recordings of 630 American English speakers each reading 10 phonetically rich sentences, for a total of 6300 utterances comprising 2342 distinct sentences. Data collection and annotation were a joint effort by Texas Instruments, the Massachusetts Institute of Technology and SRI International, and the data release was prepared by NIST (National Institute of Standards and Technology).

TIMIT was among the first publications that appeared with the launch of LDC’s catalog in 1993. It remains one of the Consortium’s top ten distributed corpora and may be the single most widely used speech database. Despite its age and small size relative to modern data sets, TIMIT’s wide range of phonetically representative inputs, its time-aligned lexical and phonemic transcripts, and its easy availability through the LDC Catalog have contributed to its widespread use and continued popularity. Thousands of researchers remember its famous first sentence: “she had your dark suit in greasy wash water all year”.

LDC continues the TIMIT series with its Global TIMIT project, which aims to create a series of corpora in a variety of languages with TIMIT-like features (Chanchaochai et al., 2018). Data sets published from that project include: Global TIMIT Learner Treebank English, Global TIMIT Learner Simple English, Global TIMIT Mandarin Chinese – Guanzhong Dialect, and Global TIMIT Mandarin Chinese.

The LDC Catalog features over 900 holdings in more than 90 languages and more data is added each year. All TIMIT corpora are available for licensing by Consortium members and non-members. Visit Obtaining Data for more information.

New publication:

Second DIHARD Challenge Evaluation - Eleven Sources was developed by LDC and contains approximately 20 hours of English and Chinese speech data along with corresponding annotations used in support of the Second DIHARD Challenge.

The second DIHARD development and evaluation sets were drawn from diverse sources including monologues, map task dialogues, broadcast interviews, sociolinguistic interviews, meeting speech, speech in restaurants, clinical recordings, extended child language acquisition recordings, and web videos. Annotations include diarization and segmentation.

Second DIHARD Challenge Evaluation - Eleven Sources is distributed via web download.

2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Monday, November 15, 2021

LDC November 2021 Newsletter

Join LDC for Membership Year 2022

Spring 2022 Data Scholarship Application Deadline

New publications:

BOLT Egyptian Arabic PropBank and Sense – Discussion Forum, SMS/Chat, and Conversational Telephone Speech

Second DIHARD Challenge Development - Eleven Sources

Second DIHARD Challenge Development - SEEDLingS

________________________________________________________________

Join LDC for Membership Year 2022

Membership Year 2022 (MY2022) is open, and discounts are available for those who keep their membership current and join early. Current MY2021 members who renew their LDC membership before March 1, 2022, will receive a 10% discount on the membership fee. New or returning organizations will receive a 5% discount when joining by March 1.

In addition to receiving new publications, current LDC members also enjoy the benefit of licensing older data from our Catalog of 900 holdings at reduced fees. Current-year for-profit members may use most data for commercial applications.

Plans for MY2022 publications are in progress. Among the expected releases are:

2017 NIST OpenSAT Pilot – SSSF: real world operational English speech data, transcripts and annotation files used in the speech activity detection, automatic speech recognition, and keyword search tasks of the 2017 OpenSAT Pilot evaluation

AttImam: 2000 attribution relations applied to Arabic newswire text from Arabic Treebank: Part 1 v 4.1 LDC2010T13

Samrómur Icelandic Speech: 145 hours of Icelandic prompted speech from 8000 speakers covering text from novels, news, plays, and location names

MASRI Synthetic: 99 hours of synthesized Maltese speech from various genres with transcripts

HAVIC MED Novel Tests: thousands of hours of event and background user-generated videos with annotation and metadata used for the 2015 Multimedia Event Detection task

DIHARD Challenges: development and evaluation data from the second and third DIHARD evaluations, a set of shared tasks focusing on speech diarization for challenging data

LORELEI: representative and incident language packs containing monolingual text, bi-text, translations, annotations, supplemental resources and related tools (Kinyarwanda, Wolof)

It’s not too late to join LDC for MY2020 (through December 31, 2021) and MY2021 (through December 31, 2022). Data sets from those years include 2018 NIST Speaker Recognition Evaluation Test Set, Mixer 4 and 5 Speech, AMR Annotation Release 3.0, Penn Parsed Corpora of Historical English, RATS Speaker Identification, BOLT Egyptian Arabic and Chinese resources (treebanks, propbanks, co-reference), Global TIMIT Mandarin Chinese, and MyST Children’s Conversational Speech.

For full descriptions of all LDC data sets, browse our Catalog.

Visit Join LDC for details on membership, user accounts and payment.

Spring 2022 Data Scholarship Application Deadline

Applications are now being accepted through January 15, 2022, for the Spring 2022 LDC Data Scholarship program, which provides university students with no-cost access to LDC data. Consult the LDC Data Scholarship page for more information about program rules and submission requirements.

New publications:

(1) BOLT Egyptian Arabic PropBank and Sense – Discussion Forum, SMS/Chat, and Conversational Telephone Speech was developed by the University of Colorado Boulder - CLEAR (Computational Language and Education Research) for the DARPA BOLT program and consists of propbank annotation on Egyptian Arabic informal text and telephone speech.

Propbank annotation provides a layer of semantic annotation over a treebank. In this release, it was applied to BOLT phrase structure treebank annotation and was carried out in two phases: (1) a frame file for each predicate was created, and (2) the predicate argument structure was annotated using the frame file as a reference.
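As a rough illustration of the two phases, the sketch below models a frame entry and one predicate-argument annotation as plain Python dictionaries, then checks the annotation against the frame. The roleset id, role descriptions, English example sentence, and the helper name `undefined_roles` are all hypothetical; the corpus has its own file formats, which are not reproduced here.

```python
# Phase 1 (hypothetical): a frame entry lists the numbered roles
# that one sense (roleset) of a predicate can take.
frames = {
    "give.01": {"ARG0": "giver", "ARG1": "thing given", "ARG2": "recipient"},
}

# Phase 2 (hypothetical): an annotation labels the arguments of one
# predicate instance, using the frame entry as the reference.
annotation = {
    "roleset": "give.01",
    "args": {"ARG0": "the teacher", "ARG1": "a book", "ARG2": "the student"},
}


def undefined_roles(annotation, frames):
    """Return the argument labels in the annotation that the frame does not define."""
    allowed = frames[annotation["roleset"]]
    return sorted(label for label in annotation["args"] if label not in allowed)


print(undefined_roles(annotation, frames))
```

An empty result means every label in the annotation is licensed by the frame. This toy check covers numbered arguments only and ignores modifier labels.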

BOLT Egyptian Arabic PropBank and Sense – Discussion Forum, SMS/Chat, and Conversational Telephone Speech is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) Second DIHARD Challenge Development - Eleven Sources was developed by LDC and contains approximately 22 hours of English and Chinese speech data along with corresponding annotations used in support of the Second DIHARD Challenge.

The DIHARD Challenges are a set of shared tasks on diarization focusing on "hard" diarization; that is, speech diarization for challenging corpora where there was an expectation that existing state-of-the-art systems would fare poorly. As with the first challenge, the second development and evaluation sets were drawn from a diverse sampling of sources including monologues, map task dialogues, broadcast interviews, sociolinguistic interviews, meeting speech, speech in restaurants, clinical recordings, extended child language acquisition recordings, and amateur web videos.
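Diarization is often summarized as deciding "who spoke when". The sketch below assumes a hypothetical representation of diarization output as (speaker, start, end) tuples, with invented values that do not come from the corpus. It computes the speaking time per speaker and the amount of overlapped speech, one of the things that tends to make such data hard. It assumes that segments from the same speaker do not overlap each other.

```python
from collections import defaultdict

# Hypothetical diarization output: (speaker_label, start_sec, end_sec).
segments = [
    ("spk_A", 0.0, 4.0),
    ("spk_B", 3.0, 6.0),
    ("spk_A", 7.0, 9.0),
]


def speaking_time(segs):
    """Total seconds attributed to each speaker label."""
    totals = defaultdict(float)
    for speaker, start, end in segs:
        totals[speaker] += end - start
    return dict(totals)


def overlapped_time(segs):
    """Seconds during which two or more segments are active at once."""
    # Sweep over segment boundaries, counting active segments.
    # Ends sort before starts at the same timestamp, so segments
    # that merely touch are not counted as overlapping.
    events = []
    for _, start, end in segs:
        events.append((start, 1))
        events.append((end, -1))
    events.sort()

    active, last_t, overlap = 0, 0.0, 0.0
    for t, delta in events:
        if active >= 2:
            overlap += t - last_t
        active += delta
        last_t = t
    return overlap


print(speaking_time(segments))
print(overlapped_time(segments))
```

Scoring a real system also requires comparing its output against reference annotations, which this sketch does not attempt.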

Second DIHARD Challenge Development - Eleven Sources is distributed via web download.

(3) Second DIHARD Challenge Development - SEEDLingS was developed by Duke University and LDC and contains approximately two hours of English child language recordings along with corresponding annotations used in support of the Second DIHARD Challenge. The DIHARD Challenges are a set of shared tasks on diarization focusing on "hard" diarization; that is, speech diarization for challenging corpora where there was an expectation that existing state-of-the-art systems would fare poorly.

Source data is from the SEEDLingS (The Study of Environmental Effects on Developing Linguistic Skills) corpus, designed to investigate how infants' early linguistic and environmental input plays a role in their learning. Recordings were generated in the home environment of infants in the Rochester, New York area. A subset of that data was annotated by LDC for use in the first and second DIHARD Challenges.

The data in this release consists of files provided in the Second DIHARD Challenge as well as subsequently updated annotated files not provided to second challenge participants.

Second DIHARD Challenge Development - SEEDLingS is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.