Mixer 6 now available
Fall 2013 LDC Data Scholarship Program - deadline approaching!
LDC at Interspeech 2013, Lyon France
New publications:
GALE Phase 2 Chinese Broadcast Conversation Parallel Text Part 2
MADCAT Phase 3 Training Set
Mixer 6 Speech
Mixer 6 now available!
The release of Mixer 6 Speech this month marks the first time in close to a decade that LDC has made available a large-scale speech training data collection. Representing more than 15,000 hours of speech from over 500 speakers, Mixer 6 follows in the footsteps of the Switchboard and Fisher studies by providing a large database of rich telephone conversations with the addition of subject interviews and transcript readings. Participants were native American English speakers local to the Philadelphia area, providing further scope for a variety of research tasks. Mixer 6 Speech is a members-only release and a great reason to join the consortium. In addition to this substantial resource, members enjoy rights to other data released in 2013 and can license older publications at reduced fees.
Please see the full
description of Mixer 6 Speech.
The deadline for the Fall 2013 LDC Data
Scholarship Program is one month away! Student applications are
being accepted now through
September
16, 2013, 11:59PM EST. The LDC Data
Scholarship program provides university students with access to
LDC data at no cost. This program is open to students pursuing
both undergraduate and graduate studies in an accredited college
or university. LDC Data Scholarships are not restricted to any
particular field of study; however, students must demonstrate a
well-developed research agenda and a bona fide inability to pay.
Students will need to complete an application consisting of a data use proposal and a letter of support from their adviser. For further information on application materials and program rules, please visit the LDC Data Scholarship page.
Students can email their applications to the LDC Data
Scholarship program. Decisions will be sent by email from
the same address.
LDC will once
again be exhibiting at Interspeech held this year August 25-29 in Lyon.
Please stop by LDC’s booth to learn about recent developments
at the Consortium, including new publications.
Also, be on
the lookout for the following presentations:
- Speech Activity Detection on YouTube Using Deep Neural Networks
- Neville Ryant, Mark Liberman, Jiahong Yuan (all LDC)
- Monday 26 August, Poster 6, 16.00 – 18.00
- Room: Forum 6
- The Spectral Dynamics of Vowels in Mandarin Chinese
- Jiahong Yuan (LDC)
- Tuesday 27 August, Oral 17, 14.00 – 16.00
- Room: Gratte-Ciel 3
- Automatic Phonetic Segmentation using Boundary Models
- Jiahong Yuan (LDC), Neville Ryant (LDC), Mark Liberman (LDC), Andreas Stolcke, Vikramjit Mitra, Wen Wang
- Wednesday 28 August, Oral 32, 14.00 – 16.00
- Room: Gratte-Ciel 3
New publications:
(1) GALE Phase 2 Chinese Broadcast Conversation Parallel Text Part 2 was
developed by LDC. Along with other corpora, the parallel text in
this release comprised training data for Phase 2 of the DARPA GALE
(Global Autonomous Language Exploitation) Program. This corpus
contains Chinese source text and corresponding English
translations selected from broadcast conversation (BC) data
collected by LDC in 2005-2007 and transcribed by LDC or under its
direction.
This release includes 20 source-translation
document pairs, comprising 152,894 characters of Chinese source
text and its English translation. Data is drawn from six distinct
Chinese programs broadcast in 2005-2007 from Phoenix TV, a Hong
Kong-based satellite television station. Broadcast conversation
programming is generally more interactive than traditional news
broadcasts and includes talk shows, interviews, call-in programs
and roundtable discussions. The programs in this release focus on
current events topics.
The data was transcribed by LDC staff and/or
transcription vendors under contract to LDC in accordance with
Quick Rich Transcription guidelines developed by LDC. Transcribers
indicated sentence boundaries in addition to transcribing the
text. Data was manually selected for translation according to
several criteria, including linguistic features, transcription
features and topic features. The transcribed and segmented files
were then reformatted into a human-readable translation format and
assigned to translation vendors. Translators followed LDC's
Chinese to English translation guidelines. Bilingual LDC staff
performed quality control procedures on the completed
translations.
GALE Phase 2 Chinese Broadcast Conversation
Parallel Text Part 2 is distributed via web download. 2013 Subscription Members will automatically
receive two copies of this data on disc. 2013 Standard Members may
request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(2) MADCAT (Multilingual Automatic Document Classification Analysis and Translation) Phase 3 Training Set contains all training data created by
LDC to support Phase 3 of the DARPA MADCAT Program. The data in
this release consists of handwritten Arabic documents, scanned at
high resolution and annotated for the physical coordinates of each
line and token. Digital transcripts and English translations of
each document are also provided, with the various content and
annotation layers integrated in a single MADCAT XML output.
The goal of the MADCAT program is to
automatically convert foreign text images into English
transcripts. MADCAT Phase 3 data was collected from Arabic source
documents in three genres: newswire, weblog and newsgroup text.
Arabic speaking scribes copied documents by hand, following
specific instructions on writing style (fast, normal, careful),
writing implement (pen, pencil) and paper (lined, unlined). Prior
to assignment, source documents were processed to optimize their
appearance for the handwriting task, which resulted in some
original source documents being broken into multiple pages for
handwriting. Each resulting handwritten page was assigned to up to
five independent scribes, using different writing conditions.
The handwritten, transcribed documents were
next checked for quality and completeness, then each page was
scanned at a high resolution (600 dpi, greyscale) to create a
digital version of the handwritten document. The scanned images
were then annotated to indicate the physical coordinates of each
line and token. Explicit reading order was also labeled, along
with any errors produced by the scribes when copying the text.
The final step was to produce a unified data
format that takes multiple data streams and generates a single
MADCAT XML output file which contains all required information.
The resulting madcat.xml file contains distinct components: a text
layer that consists of the source text, tokenization and sentence
segmentation; an image layer that consists of bounding boxes; a
scribe demographic layer that consists of scribe ID and partition
(train/test); and a document metadata layer. This release includes 4,540 annotation files in
both GEDI XML and MADCAT XML formats (gedi.xml and madcat.xml)
along with their corresponding scanned image files in TIFF format.
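For those planning to work with the annotation files, the short Python sketch below is one way to take a quick inventory of a madcat.xml file's element structure using only the standard library. It assumes nothing about the MADCAT schema beyond well-formed XML, and the file name it uses is hypothetical.

    # Minimal sketch: inventory the element structure of a MADCAT XML file.
    # Assumes only well-formed XML; the path below is hypothetical.
    from collections import Counter
    import xml.etree.ElementTree as ET

    def summarize_madcat_xml(path):
        tree = ET.parse(path)
        root = tree.getroot()
        # Count how often each element tag occurs anywhere in the document.
        tag_counts = Counter(elem.tag for elem in root.iter())
        print(f"root element: {root.tag}")
        for tag, count in tag_counts.most_common():
            print(f"{count:6d}  {tag}")

    summarize_madcat_xml("example.madcat.xml")  # hypothetical file name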
MADCAT Phase 3 Training Set is distributed on one DVD-ROM. 2013 Subscription Members will automatically receive two copies of this data. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(3) Mixer 6 Speech was developed by LDC and comprises 15,863 hours
of telephone speech, interviews and transcript readings from 594
distinct native English speakers. This material was collected by
LDC in 2009 and 2010 as part of the Mixer project, specifically
phase 6, the focus of which was on native American English
speakers local to the Philadelphia area.
The speech data in this release was collected
by LDC at its Human Subjects Collection facilities in
Philadelphia. The telephone collection protocol was similar to
other LDC telephone studies (e.g., Switchboard-2 Phase III Audio -
LDC2002S06):
recruited
speakers were connected through a robot operator to carry on
casual conversations lasting up to 10 minutes, usually about a
daily topic announced by the robot operator at the start of the
call. The raw digital audio content for each call side was
captured as a separate channel, and each full conversation was
presented as a 2-channel interleaved audio file, with 8000
samples/second and u-law sample encoding. Each speaker was asked
to complete 15 calls.
The multi-microphone portion of the collection
utilized 14 distinct microphones installed identically in two
multi-channel audio recording rooms at LDC. Each session was
guided by collection staff using prompting and recording software
to conduct the following activities: (1) repeat questions (less
than one minute), (2) informal conversation (typically 15
minutes), (3) transcript reading (approximately 15 minutes) and
(4) telephone call (generally 10 minutes). Speakers recorded up to
three 45-minute sessions on distinct days. The 14 channels were
recorded synchronously into separate single-channel files, using
16-bit PCM sample encoding at 16000 samples/second.
The recordings in this corpus were used in NIST
Speaker Recognition Evaluation (SRE) test sets for 2010 and 2012.
Researchers interested in applying those benchmark test sets
should consult the respective NIST Evaluation Plans
for guidelines on allowable training data for those tests. The collection contains 4,410 recordings made
via the public telephone network and 1,425 sessions of multiple
microphone recordings in office-room settings. The telephone
recordings are presented as 8-kHz 2-channel NIST SPHERE files, and
the microphone recordings are 16-kHz 1-channel flac/ms-wav files.
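As a starting point for loading the audio, the sketch below reads one telephone conversation and one microphone channel with the python-soundfile package (a libsndfile wrapper that can parse FLAC as well as uncompressed NIST SPHERE). The file names are hypothetical, and any shorten-compressed SPHERE files would need to be decompressed first.

    # Minimal sketch: load Mixer 6 audio with python-soundfile (libsndfile).
    # Assumes uncompressed NIST SPHERE; file names below are hypothetical.
    import soundfile as sf

    # Telephone call: 8 kHz, 2-channel interleaved (one channel per call side).
    phone, phone_rate = sf.read("example_call.sph")
    side_a, side_b = phone[:, 0], phone[:, 1]
    print(f"telephone: {phone_rate} Hz, {phone.shape[1]} channels, "
          f"{len(phone) / phone_rate:.1f} s")

    # Microphone session channel: 16 kHz, single-channel FLAC.
    mic, mic_rate = sf.read("example_session_ch01.flac")
    print(f"microphone: {mic_rate} Hz, {len(mic) / mic_rate:.1f} s")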