New Publications:
(1) DEFT Spanish Committed Belief Annotation was developed by LDC and consists of approximately 67,000 tokens of Spanish discussion forum text annotated for "committed belief," which marks the level of commitment displayed by the author to the truth of the propositions expressed in the text.
DARPA's Deep Exploration and Filtering of Text (DEFT) program aimed to address remaining capability gaps in state-of-the-art natural language processing technologies related to inference, causal relationships and anomaly detection. LDC supported the DEFT program by collecting, creating and annotating a variety of data sources.
DEFT Spanish Committed Belief Annotation is distributed via web download.
2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(2) USC-SFI MALACH Interviews and Transcripts English – Speech Recognition Edition was developed by IBM as part of the MALACH (Multilingual Access to Large Spoken ArCHives) Project and contains approximately 168 hours of interviews from 682 Holocaust witnesses along with transcripts, a lexicon and other documentation. This release augments USC-SFI MALACH Interviews and Transcripts English (LDC2012S05) by modifying and updating a subset of the original corpus for use with speech recognition systems, such as the Kaldi toolkit.
Specifically, the audio data has been converted from unsegmented MPEG files to segmented files in FLAC compressed format. The speaker-turn, time-stamped transcripts have been updated to an utterance-by-utterance format. A lexicon mapping words to phonemes is provided, and the data is divided into development and training sets.
The goal of the MALACH project was to develop methods for improved access to large multinational spoken archives in order to advance the state of the art of automatic speech recognition and information retrieval. The characteristics of the USC-SFI collection -- unconstrained, natural speech filled with disfluencies, heavy accents, age-related coarticulations, un-cued speaker and language switching and emotional speech -- were considered well-suited for that task.
USC-SFI MALACH Interviews and Transcripts English – Speech Recognition Edition is distributed via web download.
2019 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data at no cost.
*
(3) First DIHARD Challenge Development - Eight Sources was developed by LDC and contains approximately 17 hours of English and Chinese speech data along with corresponding annotations used in support of the First DIHARD Challenge. This release, when combined with First DIHARD Challenge Development - SEEDLingS (LDC2019S10), contains the development set audio data and annotation (diarization, segmentation) as well as the official scoring tool.
The First DIHARD Challenge was an attempt to reinvigorate work on diarization through a shared task focusing on "hard" diarization; that is, speech diarization for challenging corpora where there was an expectation that existing state-of-the-art systems would fare poorly. As such, it included speech from a wide sampling of domains representing diversity in number of speakers, speaker demographics, interaction style, recording quality, and environmental conditions as follows (all sources are in English unless otherwise indicated):
- Autism Diagnostic Observation Schedule (ADOS) interviews
- DCIEM/HCRC map task (LDC96S38)
- Audiobook recordings from LibriVox
- Meeting speech from 2004 Spring NIST Rich Transcription (RT-04S) Development (LDC2007S11) and Evaluation (LDC2007S12) releases
- 2001 U.S. Supreme Court oral arguments
- Sociolinguistic interviews from SLX Corpus of Classic Sociolinguistic Interviews (LDC2003T15)
- Chinese video collected by LDC as part of the Video Annotation for Speech Technologies (VAST) project
- YouthPoint radio interviews
First DIHARD Challenge Development - Eight Sources is distributed via web download.
2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(4) First DIHARD Challenge Development - SEEDLingS was developed by Duke University and LDC and contains approximately two hours of English child language recordings along with corresponding annotations used in support of the First DIHARD Challenge. This release, when combined with First DIHARD Challenge Development - Eight Sources (LDC2019S09), contains the development set audio data and annotation (diarization, segmentation) as well as the official scoring tool.
The source data was drawn from the SEEDLingS (The Study of Environmental Effects on Developing Linguistic Skills) corpus, designed to investigate how infants' early linguistic and environmental input plays a role in their learning. Recordings for SEEDLingS were generated in the home environment of 44 infants from 6-18 months of age in the Rochester, New York area. A subset of that data was annotated by LDC for use in the First DIHARD Challenge.
The First DIHARD Challenge was an attempt to reinvigorate work on diarization through a shared task focusing on "hard" diarization; that is, speech diarization for challenging corpora where there was an expectation that existing state-of-the-art systems would fare poorly. As such, it included speech from a wide sampling of domains representing diversity in number of speakers, speaker demographics, interaction style, recording quality, and environmental conditions.
First DIHARD Challenge Development - SEEDLingS is distributed via web download.
2019 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*