Linguistic Data Consortium: August 2013

Mixer 6 now available

Fall 2013 LDC Data Scholarship Program - deadline approaching!

LDC at Interspeech 2013, Lyon France

New publications:

GALE Phase 2 Chinese Broadcast Conversation Parallel Text Part 2
MADCAT Phase 3 Training Set
Mixer 6 Speech

Mixer 6 now available!

The release of Mixer 6 Speech this month marks the first time in close to a decade that LDC has made available a large-scale speech training data collection. Representing more than 15,000 hours of speech from over 500 speakers, Mixer 6 follows in the footsteps of the Switchboard and Fisher studies by providing a large database of rich telephone conversations with the addition of subject interviews and transcript readings. Participants were native American English speakers local to the Philadelphia area, providing further scope for a variety of research tasks. Mixer 6 Speech is a members-only release and a great reason to join the consortium. In addition to this substantial resource, members enjoy rights to other data released in 2013 and can license older publications at reduced fees.

Please see the full description of Mixer 6 Speech.

Fall 2013 LDC Data Scholarship Program - deadline approaching!

The deadline for the Fall 2013 LDC Data Scholarship Program is one month away! Student applications are being accepted now through September 16, 2013, 11:59PM EST. The LDC Data Scholarship program provides university students with access to LDC data at no cost. This program is open to students pursuing both undergraduate and graduate studies in an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay.

Students will need to complete an application consisting of a data use proposal and a letter of support from their adviser. For further information on application materials and program rules, please visit the LDC Data Scholarship page.

Students can email their applications to the LDC Data Scholarship program. Decisions will be sent by email from the same address.

LDC at Interspeech 2013, Lyon France

LDC will once again be exhibiting at Interspeech, held this year August 25-29 in Lyon. Please stop by LDC’s booth to learn about recent developments at the Consortium, including new publications.

Also, be on the lookout for the following presentations:

Speech Activity Detection on YouTube Using Deep Neural Networks

Neville Ryant, Mark Liberman, Jiahong Yuan (all LDC)
Monday 26 August, Poster 6, 16.00 – 18.00
Room: Forum 6

The Spectral Dynamics of Vowels in Mandarin Chinese

Jiahong Yuan (LDC)
Tuesday 27 August, Oral 17, 14.00 – 16.00
Room: Gratte-Ciel 3

Automatic Phonetic Segmentation using Boundary Models

Jiahong Yuan (LDC), Neville Ryant (LDC), Mark Liberman (LDC), Andreas Stolcke, Vikramjit Mitra, Wen Wang
Wednesday 28 August, Oral 32, 14.00 – 16.00
Room: Gratte-Ciel 3

LDC will continue to post conference updates via our Facebook page. We hope to see you there!

New publications:

(1) GALE Phase 2 Chinese Broadcast Conversation Parallel Text Part 2 was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Chinese source text and corresponding English translations selected from broadcast conversation (BC) data collected by LDC in 2005-2007 and transcribed by LDC or under its direction.

This release includes 20 source-translation document pairs, comprising 152,894 characters of Chinese source text and its English translation. Data is drawn from six distinct Chinese programs broadcast in 2005-2007 from Phoenix TV, a Hong Kong-based satellite television station. Broadcast conversation programming is generally more interactive than traditional news broadcasts and includes talk shows, interviews, call-in programs and roundtable discussions. The programs in this release focus on current events topics.

The data was transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with Quick Rich Transcription guidelines developed by LDC. Transcribers indicated sentence boundaries in addition to transcribing the text. Data was manually selected for translation according to several criteria, including linguistic features, transcription features and topic features. The transcribed and segmented files were then reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Chinese to English translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations.

GALE Phase 2 Chinese Broadcast Conversation Parallel Text Part 2 is distributed via web download. 2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) MADCAT (Multilingual Automatic Document Classification Analysis and Translation) Phase 3 Training Set contains all training data created by LDC to support Phase 3 of the DARPA MADCAT Program. The data in this release consists of handwritten Arabic documents, scanned at high resolution and annotated for the physical coordinates of each line and token. Digital transcripts and English translations of each document are also provided, with the various content and annotation layers integrated in a single MADCAT XML output.

The goal of the MADCAT program is to automatically convert foreign text images into English transcripts. MADCAT Phase 3 data was collected from Arabic source documents in three genres: newswire, weblog and newsgroup text. Arabic speaking scribes copied documents by hand, following specific instructions on writing style (fast, normal, careful), writing implement (pen, pencil) and paper (lined, unlined). Prior to assignment, source documents were processed to optimize their appearance for the handwriting task, which resulted in some original source documents being broken into multiple pages for handwriting. Each resulting handwritten page was assigned to up to five independent scribes, using different writing conditions.

The handwritten, transcribed documents were next checked for quality and completeness, then each page was scanned at a high resolution (600 dpi, greyscale) to create a digital version of the handwritten document. The scanned images were then annotated to indicate the physical coordinates of each line and token. Explicit reading order was also labeled, along with any errors produced by the scribes when copying the text.

The final step was to produce a unified data format that takes multiple data streams and generates a single MADCAT XML output file which contains all required information. The resulting madcat.xml file contains distinct components: a text layer that consists of the source text, tokenization and sentence segmentation; an image layer that consists of bounding boxes; a scribe demographic layer that consists of scribe ID and partition (train/test); and a document metadata layer. This release includes 4,540 annotation files in both GEDI XML and MADCAT XML formats (gedi.xml and madcat.xml) along with their corresponding scanned image files in TIFF format.
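As a rough illustration of the layered structure described above, the four components could be modeled in memory along the following lines. This is a sketch only: the class and field names (MadcatDocument, TokenBox and so on) are hypothetical and are not taken from the actual MADCAT XML schema, which is defined in the corpus documentation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TextLayer:
    """Source text with tokenization and sentence segmentation (hypothetical fields)."""
    source_text: str
    tokens: List[str]
    sentences: List[List[int]]  # each sentence is a list of token indices


@dataclass
class TokenBox:
    """Image-layer bounding box for one token (hypothetical fields)."""
    token_index: int
    x: int
    y: int
    width: int
    height: int


@dataclass
class ScribeInfo:
    """Scribe demographic layer: scribe ID and data partition."""
    scribe_id: str
    partition: str  # "train" or "test"


@dataclass
class MadcatDocument:
    """One handwritten page with all four layers attached (hypothetical container)."""
    doc_id: str
    text: TextLayer
    boxes: List[TokenBox]
    scribe: ScribeInfo
    metadata: Dict[str, str] = field(default_factory=dict)

    def tokens_with_boxes(self) -> List[Tuple[str, TokenBox]]:
        """Pair each bounding box with the token text it covers."""
        return [(self.text.tokens[b.token_index], b) for b in self.boxes]
```

Keeping the text and image layers separate but linked by token index mirrors the idea of integrating several annotation streams in one file; a real loader would populate these objects from the madcat.xml elements.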

MADCAT Phase 3 Training Set is distributed on one DVD-ROM. 2013 Subscription Members will automatically receive two copies of this data. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) Mixer 6 Speech was developed by LDC and comprises 15,863 hours of telephone speech, interviews and transcript readings from 594 distinct native English speakers. This material was collected by LDC in 2009 and 2010 as part of the Mixer project, specifically phase 6, which focused on native American English speakers local to the Philadelphia area.

The speech data in this release was collected by LDC at its Human Subjects Collection facilities in Philadelphia. The telephone collection protocol was similar to other LDC telephone studies (e.g., Switchboard-2 Phase III Audio - LDC2002S06): recruited speakers were connected through a robot operator to carry on casual conversations lasting up to 10 minutes, usually about a daily topic announced by the robot operator at the start of the call. The raw digital audio content for each call side was captured as a separate channel, and each full conversation was presented as a 2-channel interleaved audio file, with 8000 samples/second and u-law sample encoding. Each speaker was asked to complete 15 calls.
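To make the telephone audio layout concrete, the sketch below splits raw 2-channel interleaved u-law bytes into two call sides and expands each byte to a 16-bit linear sample using the standard G.711 mu-law decoding rule. It assumes the file header has already been stripped and that samples alternate side A, side B; the function names are illustrative and not part of any LDC tool.

```python
from typing import List, Tuple

BIAS = 0x84  # bias used by the G.711 mu-law companding rule


def ulaw_to_linear(byte: int) -> int:
    """Decode one 8-bit mu-law value to a signed 16-bit linear sample."""
    u = ~byte & 0xFF                 # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude


def split_call_sides(raw: bytes) -> Tuple[List[int], List[int]]:
    """De-interleave 2-channel mu-law audio into two lists of linear samples.

    Assumes one byte per sample, ordered A, B, A, B, ...
    """
    side_a = [ulaw_to_linear(b) for b in raw[0::2]]
    side_b = [ulaw_to_linear(b) for b in raw[1::2]]
    return side_a, side_b
```

At 8000 samples/second per side, one second of a call occupies 16,000 bytes of interleaved data, so `split_call_sides` returns two lists of 8,000 samples each for that span.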

The multi-microphone portion of the collection utilized 14 distinct microphones installed identically in two multi-channel audio recording rooms at LDC. Each session was guided by collection staff using prompting and recording software to conduct the following activities: (1) repeat questions (less than one minute), (2) informal conversation (typically 15 minutes), (3) transcript reading (approximately 15 minutes) and (4) telephone call (generally 10 minutes). Speakers recorded up to three 45-minute sessions on distinct days. The 14 channels were recorded synchronously into separate single-channel files, using 16-bit PCM sample encoding at 16000 samples/second.
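The recording parameters above imply a sizeable raw data rate. The following back-of-the-envelope sketch computes the uncompressed size of one nominal 45-minute, 14-channel session and of one 10-minute telephone call. It assumes full-length recordings and no compression, so files as distributed (for example in flac) will be smaller; the helper name is illustrative.

```python
def uncompressed_bytes(channels: int, sample_rate_hz: int,
                       bits_per_sample: int, seconds: int) -> int:
    """Raw PCM size: channels x samples/second x bytes/sample x duration."""
    return channels * sample_rate_hz * (bits_per_sample // 8) * seconds


# One nominal 45-minute session: 14 channels, 16 kHz, 16-bit PCM.
session_bytes = uncompressed_bytes(14, 16000, 16, 45 * 60)

# One 10-minute call: 2 channels, 8 kHz, 8-bit u-law.
call_bytes = uncompressed_bytes(2, 8000, 8, 10 * 60)

print(f"session: {session_bytes / 1e9:.2f} GB")  # session: 1.21 GB
print(f"call:    {call_bytes / 1e6:.1f} MB")     # call:    9.6 MB
```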

The recordings in this corpus were used in NIST Speaker Recognition Evaluation (SRE) test sets for 2010 and 2012. Researchers interested in applying those benchmark test sets should consult the respective NIST Evaluation Plans for guidelines on allowable training data for those tests. The collection contains 4,410 recordings made via the public telephone network and 1,425 sessions of multiple microphone recordings in office-room settings. The telephone recordings are presented as 8-kHz 2-channel NIST SPHERE files, and the microphone recordings are 16-kHz 1-channel flac/ms-wav files.
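For readers unfamiliar with NIST SPHERE, each file begins with a plain-text header: the line NIST_1A, a line giving the header size in bytes, then one "name type value" field per line, terminated by end_head. The sketch below reads those fields into a dictionary. It is a minimal illustration of that layout rather than a full reader, and the function name is an assumption; established tools such as sph2pipe or SoX are the better choice for converting real corpus files.

```python
from typing import Dict, Union

FieldValue = Union[int, float, str]


def parse_sphere_header(data: bytes) -> Dict[str, FieldValue]:
    """Parse the text header at the start of a NIST SPHERE file.

    `data` must hold at least the full header (commonly 1024 bytes).
    """
    first = data.split(b"\n", 2)
    if len(first) < 3 or first[0].strip() != b"NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    header_size = int(first[1].strip())
    header = data[:header_size].decode("ascii", errors="replace")

    fields: Dict[str, FieldValue] = {}
    for line in header.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, ftype, value = parts
        if ftype == "-i":        # integer field
            fields[name] = int(value)
        elif ftype == "-r":      # real-valued field
            fields[name] = float(value)
        else:                    # "-sN": string of N characters
            fields[name] = value
    return fields
```

For a telephone recording as described above, one would expect fields such as `channel_count` equal to 2 and `sample_rate` equal to 8000, with the audio samples starting at the byte offset given by the header size.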

Mixer 6 Speech is distributed on one hard drive. 2013 Subscription Members will automatically receive one copy of this data on hard drive. 2013 Standard Members may request a copy as part of their 16 free membership corpora. As a Members-Only release, Mixer 6 Speech is not available for non-member licensing.

Linguistic Data Consortium

Monday, August 19, 2013

LDC August 2013 Newsletter