Linguistic Data Consortium: August 2016

Fall 2016 Data Scholarship Program

LDC at Interspeech 2016

New Publications:

IARPA Babel Bengali Language Pack

IARPA Babel Assamese Language Pack

GALE Phase 3 Arabic Broadcast News Speech Part 1

GALE Phase 3 Arabic Broadcast News Transcripts Part 1

_______________________________________________________________________

Fall 2016 LDC Data Scholarship program - September 15 deadline approaching

Student applications for the Fall 2016 LDC Data Scholarship program are being accepted now through Thursday, September 15, 2016, 11:59PM EST. The LDC Data Scholarship program provides university students with access to LDC data at no cost. Students must complete an application consisting of a data use proposal and a letter of support from their advisor.

For more information on application requirements and program rules, please visit the LDC Data Scholarship page.

Applicants can email their materials to the LDC Data Scholarship program.

LDC at Interspeech 2016

LDC will once again be exhibiting at Interspeech, held this year September 9-12 in San Francisco, California. Stop by booth 17 to learn more about recent developments at the Consortium and new publications.

Also, be on the lookout for the following presentations featuring LDC work:

Automatic Analysis of Phonetic Speech Style Dimensions: Neville Ryant and Mark Liberman (both LDC)
Friday 9 September, Oral Session, Bayview A, 11:00am

The Rhythmic Constraint on Prosodic Boundaries in Mandarin Chinese Based on Corpora of Silent Reading and Speech Perception: Wei Lai (UPenn), Jiahong Yuan (LDC), Ya Li (Chinese Academy of Sciences), Xiaoying Xu (Beijing Normal University) and Mark Liberman (LDC)
Friday 9 September, Oral Session, Bayview A, 11:00am

Pitch-range Perception: the Dynamic Interaction Between Voice Quality and Fundamental Frequency: Jianjing Kuang (UPenn) and Mark Liberman (LDC)
Saturday 10 September, Poster Session A, 10:00am

Phoneme, Phone Boundary, and Tone in Automatic Scoring of Mandarin Proficiency: Jiahong Yuan and Mark Liberman (both LDC)
Sunday 11 September, Poster Session A, 10:00am

LDC will post conference updates via our Twitter feed and Facebook page. We hope to see you there!

New Publications

(1) IARPA Babel Bengali Language Pack IARPA-babel103b-v0.4b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 215 hours of Bengali conversational and scripted telephone speech collected in 2011 and 2012 along with corresponding transcripts.

The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

The Bengali speech in this release represents that spoken in India by native speakers of Bengali born in India. The gender distribution among speakers is approximately even; speakers' ages range from 16 years to 65 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments.

All audio data is presented as 8 kHz, 8-bit A-law encoded audio in SPHERE format. Transcripts are available in two versions: the Bengali script and a romanization scheme developed by Appen Butler Hill, both encoded in UTF-8.
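
For readers unfamiliar with A-law, the following is a minimal sketch of expanding 8-bit A-law bytes to 16-bit linear PCM values, following the standard G.711 expansion. It assumes the raw sample bytes have already been extracted from the SPHERE file; header parsing is not shown, and the function names are illustrative rather than part of any LDC tool.

```python
def alaw_to_linear(code: int) -> int:
    """Expand one 8-bit A-law code to a signed 16-bit linear PCM value (G.711)."""
    code ^= 0x55                      # undo the even-bit inversion applied by the encoder
    value = (code & 0x0F) << 4        # 4-bit mantissa
    segment = (code & 0x70) >> 4      # 3-bit segment (exponent)
    if segment == 0:
        value += 8
    else:
        value += 0x108
        value <<= segment - 1
    # In A-law, a set high bit marks a positive sample.
    return value if (code & 0x80) else -value


def decode_alaw(samples: bytes) -> list[int]:
    """Decode a buffer of A-law bytes (assumed already stripped of any file header)."""
    return [alaw_to_linear(b) for b in samples]
```

At 8 kHz, each second of single-channel audio corresponds to 8,000 such bytes.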

2016 Subscription Members will receive two copies of this corpus provided they have submitted a completed copy of the IARPA User Agreement for Not-for-Profit Members or the IARPA User Agreement for For-Profit Members. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee under a research license.

(2) IARPA Babel Assamese Language Pack IARPA-babel102b-v0.5a was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 205 hours of Assamese conversational and scripted telephone speech collected in 2012 and 2013 along with corresponding transcripts.

The speech in this release represents three dialects spoken in Assam, a state in northeastern India. The gender distribution among speakers is approximately even; speakers' ages range from 16 years to 66 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments.

All audio data is presented as 8 kHz, 8-bit A-law encoded audio in SPHERE format. Transcripts are available in two versions: Assamese script and a romanization scheme developed by Appen Butler Hill, both encoded in UTF-8.

(3) GALE Phase 3 Arabic Broadcast News Speech Part 1 was developed by LDC and comprises approximately 132 hours of Arabic broadcast news speech collected in 2007 by LDC, MediaNet (Tunis, Tunisia) and MTC (Rabat, Morocco) during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) program.

Corresponding transcripts are released as GALE Phase 3 Arabic Broadcast News Transcripts Part 1 (LDC2016T17).

The recordings in this corpus are news broadcasts focusing principally on current events, drawn from various broadcasters including Abu Dhabi TV, Al Alam News Channel, Al Arabiya, Al Iraqiyah, Aljazeera, Al Ordiniyah, Dubai TV, Kuwait TV, Lebanese Broadcasting Corporation, Nile TV, Saudi TV and Syria TV.

This release contains 175 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Arabic speaker.
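
As a rough guide to storage and file length, the figures above can be turned into a back-of-the-envelope estimate. The sketch below uses only the approximate numbers stated in this announcement (132 hours, 175 files, 16000 Hz, 16-bit, single channel); the actual FLAC-compressed size will be smaller and depends on the audio content.

```python
def uncompressed_bytes(hours: float, sample_rate_hz: int = 16000,
                       bits_per_sample: int = 16, channels: int = 1) -> int:
    """Size of the audio as raw PCM, ignoring container headers and FLAC compression."""
    seconds = hours * 3600
    return int(seconds * sample_rate_hz * (bits_per_sample // 8) * channels)


total_bytes = uncompressed_bytes(132)   # about 15.2 GB (decimal) of raw PCM
avg_minutes = 132 * 60 / 175            # about 45 minutes per file on average

print(f"{total_bytes / 1e9:.1f} GB uncompressed, {avg_minutes:.1f} min per file")
```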

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(4) GALE Phase 3 Arabic Broadcast News Transcripts Part 1 was developed by LDC and contains transcriptions of approximately 132 hours of Arabic broadcast news speech collected in 2007 by LDC, MediaNet (Tunis, Tunisia) and MTC (Rabat, Morocco) during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) program.

Corresponding audio data is released as GALE Phase 3 Arabic Broadcast News Speech Part 1 (LDC2016S07).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 741,689 tokens. The transcripts were created with the LDC tool XTrans, which supports manual transcription and annotation of audio recordings. XTrans is available at https://www.ldc.upenn.edu/language-resources/tools/xtrans.
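
Because the transcripts are tab-delimited UTF-8 text, they can be read with a few lines of code. The sketch below is illustrative only: the column list, the convention of skipping lines that begin with ";;", and the whitespace-based token count are all assumptions made for the example, not the documented TDF layout. The documentation included with the release is the authority on the actual fields.

```python
import io

# ASSUMPTION: hypothetical column order for illustration; check the corpus documentation.
ASSUMED_COLUMNS = ["file", "channel", "start", "end", "speaker", "transcript"]


def read_tdf(stream, columns=ASSUMED_COLUMNS):
    """Read tab-delimited rows into dicts, skipping blank and ';;' comment lines."""
    rows = []
    for line in stream:
        line = line.rstrip("\n")
        if not line or line.startswith(";;"):   # assumed comment marker
            continue
        fields = line.split("\t")
        if len(fields) != len(columns):         # ignore rows that do not match the layout
            continue
        rows.append(dict(zip(columns, fields)))
    return rows


def count_tokens(rows):
    """Count whitespace-separated tokens in the transcript field (a simplification)."""
    return sum(len(row["transcript"].split()) for row in rows)


# Made-up sample data; a real file would be opened with encoding="utf-8".
sample = io.StringIO(
    ";; example comment line\n"
    "ex_file\t0\t0.00\t2.50\tspk1\tmarhaban bikum\n"
    "ex_file\t0\t2.50\t4.00\tspk2\tshukran\n"
)
rows = read_tdf(sample)
print(len(rows), count_tokens(rows))
```

A whitespace split will not necessarily reproduce the token total reported above, since the tokenization used for that figure is not described here.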

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and quick rich transcription specification (QRTR), both of which are included in the documentation with this release.

Linguistic Data Consortium

Monday, August 15, 2016

LDC August 2016 Newsletter