Fall 2016 Data Scholarship Program
LDC at Interspeech 2016
New Publications:
_______________________________________________________________________
Fall 2016 LDC Data Scholarship program - September 15 deadline approaching
Student applications for the Fall 2016 LDC
Data Scholarship program are being accepted now through Thursday, September 15,
2016, 11:59PM EST. The LDC Data Scholarship program provides university
students with access to LDC data at no cost. Students must complete an
application, which consists of a data use proposal and a letter of support from
their advisor.
For more information on application requirements and program rules, please visit the LDC Data Scholarship page.
Applicants can email their materials to the LDC Data Scholarship program.
LDC at Interspeech 2016
LDC will once again be exhibiting at Interspeech, held
this year September 9-12 in San Francisco, California. Stop by booth 17 to
learn more about recent developments at the Consortium and new publications.
Also, be on the lookout for the following
presentations featuring LDC work:
Automatic Analysis of
Phonetic Speech Style Dimensions: Neville Ryant and Mark Liberman (both LDC)
Friday 9 September, Oral Session, Bayview A, 11:00am
The Rhythmic Constraint
on Prosodic Boundaries in Mandarin Chinese Based on Corpora of Silent Reading
and Speech Perception: Wei Lai (UPenn), Jiahong Yuan (LDC), Ya Li (Chinese Academy of Sciences), Xiaoying Xu (Beijing Normal University) and Mark Liberman (LDC)
Friday 9 September, Oral Session, Bayview A, 11:00am
Pitch-range Perception:
the Dynamic Interaction Between Voice Quality and Fundamental Frequency: Jianjing Kuang
(UPenn) and Mark Liberman (LDC)
Saturday 10 September, Poster Session A, 10:00am
Phoneme, Phone Boundary,
and Tone in Automatic Scoring of Mandarin Proficiency: Jiahong Yuan and
Mark Liberman (both LDC)
Sunday 11 September, Poster Session A, 10:00am
LDC will post conference updates via
our Twitter feed and Facebook
page. We hope to see you there!
New Publications
(1) IARPA Babel Bengali Language Pack IARPA-babel103b-v0.4b was
developed by Appen for
the IARPA (Intelligence Advanced Research Projects Activity) Babel program.
It contains approximately 215 hours of Bengali conversational and scripted
telephone speech collected in 2011 and 2012 along with corresponding
transcripts.
The Babel program focuses on underserved
languages and seeks to develop speech recognition technology that can be
rapidly applied to any human language to support keyword search performance
over large amounts of recorded speech.
The Bengali speech in this release represents
that spoken in India by native speakers of Bengali born in India. The gender
distribution among speakers is approximately even; speakers' ages range from 16
years to 65 years. Calls were made using different telephones (e.g., mobile, landline)
from a variety of environments.
All audio data is presented as 8 kHz, 8-bit
A-law encoded audio in NIST SPHERE format. Transcripts are available in two
versions: the Bengali script and a romanization scheme developed by Appen
Butler Hill, both encoded in UTF-8.
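For readers who want to work with the 8-bit A-law samples directly, the following is a minimal Python sketch of standard G.711 A-law expansion to 16-bit linear PCM. It is an illustration, not part of the release: the function names are hypothetical, and the sketch assumes the SPHERE header has already been skipped and that the sample payload is not otherwise compressed.

```python
def alaw_to_linear(code):
    """Decode one 8-bit G.711 A-law code to a signed 16-bit linear sample."""
    code ^= 0x55                     # undo the even-bit inversion used by A-law
    mantissa = (code & 0x0F) << 4    # 4-bit mantissa, shifted into position
    segment = (code & 0x70) >> 4     # 3-bit segment (exponent)
    if segment == 0:
        magnitude = mantissa + 8
    else:
        magnitude = (mantissa + 0x108) << (segment - 1)
    # After the inversion, a set top bit marks a positive sample.
    return magnitude if code & 0x80 else -magnitude


def decode_alaw(payload):
    """Decode a bytes object of A-law samples (header already removed)."""
    return [alaw_to_linear(b) for b in payload]
```

At 8 kHz, each second of single-channel audio corresponds to 8,000 such bytes, so decode_alaw(payload[:8000]) would yield the first second of a single-channel recording.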
2016 Subscription Members will receive two
copies of this corpus provided they have submitted a completed copy of the
IARPA User Agreement for Not-for-Profit Members or the IARPA User Agreement for
For-Profit Members. 2016 Standard Members may request a copy as part of their
16 free membership corpora. Non-members may license this data for a fee under a research license.
*
(2) IARPA Babel
Assamese Language Pack IARPA-babel102b-v0.5a was developed
by Appen for the
IARPA (Intelligence Advanced Research Projects Activity) Babel program.
It contains approximately 205 hours of Assamese conversational and scripted
telephone speech collected in 2012 and 2013 along with corresponding
transcripts.
The Babel program focuses on underserved
languages and seeks to develop speech recognition technology that can be
rapidly applied to any human language to support keyword search performance
over large amounts of recorded speech.
The speech in this release represents three
dialects spoken in Assam, a state in northeastern India. The gender
distribution among speakers is approximately even; speakers' ages range from 16
years to 66 years. Calls were made using different telephones (e.g., mobile,
landline) from a variety of environments.
All audio data is presented as 8 kHz, 8-bit
A-law encoded audio in NIST SPHERE format. Transcripts are available in two
versions: Assamese script and a romanization scheme developed by Appen Butler
Hill, both encoded in UTF-8.
2016 Subscription Members will receive two
copies of this corpus provided they have submitted a completed copy of the
IARPA User Agreement for Not-for-Profit Members or the IARPA User Agreement for
For-Profit Members. 2016 Standard Members may request a copy as part of their
16 free membership corpora. Non-members may license this data for a fee under a research license.
*
(3) GALE Phase 3
Arabic Broadcast News Speech Part 1 was developed by LDC and
comprises approximately 132 hours of Arabic broadcast news speech collected
in 2007 by LDC, MediaNet (Tunis, Tunisia) and MTC (Rabat, Morocco) during Phase 3
of the DARPA GALE (Global Autonomous Language Exploitation) program.
Corresponding transcripts are released as
GALE Phase 3 Arabic Broadcast News Transcripts Part 1 (LDC2016T17).
The broadcast news recordings in this corpus
feature news broadcasts focusing principally on current events from various
broadcast programmers including Abu Dhabi TV, Al Alam News Channel, Al Arabiya,
Al Iraqiyah, Aljazeera, Al Ordiniyah, Dubai TV, Kuwait TV, Lebanese Broadcast
Corporation, Nile TV, Saudi TV and Syria TV.
This release contains 175 audio files
presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz
single-channel 16-bit PCM. Each file was audited by a native Arabic speaker.
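One way to confirm that a downloaded file matches the stated audio parameters is to read the STREAMINFO block at the start of the FLAC stream. The Python sketch below is illustrative only: the function names are hypothetical, and it assumes the file starts with the "fLaC" marker immediately followed by STREAMINFO, as the FLAC format specifies, with no tag data prepended.

```python
def parse_flac_streaminfo(header):
    """Extract basic audio parameters from the first 42 bytes of a FLAC stream."""
    if header[:4] != b"fLaC":
        raise ValueError("not a FLAC stream")
    if header[4] & 0x7F != 0:            # low 7 bits give the metadata block type
        raise ValueError("first metadata block is not STREAMINFO")
    body = header[8:42]                  # STREAMINFO body is 34 bytes long
    if len(body) < 34:
        raise ValueError("header too short")
    # Bytes 10-17 pack a 20-bit sample rate, 3-bit (channels - 1),
    # 5-bit (bits per sample - 1) and a 36-bit total sample count.
    packed = int.from_bytes(body[10:18], "big")
    return {
        "sample_rate": packed >> 44,
        "channels": ((packed >> 41) & 0x7) + 1,
        "bits_per_sample": ((packed >> 36) & 0x1F) + 1,
        "total_samples": packed & ((1 << 36) - 1),
    }


def matches_release_format(info):
    """Check for 16000 Hz, single-channel, 16-bit audio as described above."""
    return (info["sample_rate"], info["channels"], info["bits_per_sample"]) == (16000, 1, 16)
```

In practice the header bytes could come from open(path, "rb").read(42), where path is whichever .flac file is being checked.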
2016 Subscription Members will automatically
receive two copies of this corpus. 2016 Standard Members may request a copy as
part of their 16 free membership corpora. Non-members may license this data for
a fee.
*
(4) GALE Phase 3
Arabic Broadcast News Transcripts Part 1 was developed by LDC and
contains transcriptions of approximately 132 hours of Arabic broadcast news
speech collected in 2007 by LDC, MediaNet (Tunis, Tunisia) and MTC (Rabat,
Morocco) during Phase 3 of the DARPA GALE (Global Autonomous Language
Exploitation) program.
Corresponding audio data is released as GALE
Phase 3 Arabic Broadcast News Speech Part 1 (LDC2016S07).
The transcript files are in plain-text, tab-delimited
format (TDF) with UTF-8 encoding, and the transcribed data totals 741,689
tokens. The transcripts were created with LDC's XTrans tool, which supports
manual transcription and annotation of audio recordings. XTrans is available
at https://www.ldc.upenn.edu/language-resources/tools/xtrans.
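Because the transcripts are plain tab-delimited text, they can be loaded with a few lines of Python. The sketch below rests on assumptions that should be checked against the documentation shipped with the corpus: that the first non-comment line is a header whose cells look like "name;type", that lines beginning with ";;" are comments, and that one column is named "transcript". The function names are hypothetical, and whitespace splitting is only a rough token count that may not reproduce the figure quoted above.

```python
def read_tdf(text):
    """Parse tab-delimited transcript text into a list of row dictionaries."""
    header = None
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith(";;"):
            continue                                  # skip blanks and comments (assumed)
        fields = line.split("\t")
        if header is None:
            # Assumed header style: "name;type" per cell; keep only the name.
            header = [cell.split(";")[0] for cell in fields]
            continue
        rows.append(dict(zip(header, fields)))
    return rows


def count_tokens(rows, column="transcript"):
    """Count whitespace-separated tokens in the given column (assumed name)."""
    return sum(len(row.get(column, "").split()) for row in rows)
```

A file would typically be read with open(path, encoding="utf-8").read() before being passed to read_tdf, since the release states that the transcripts are UTF-8 encoded.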
The files in this corpus were transcribed by
LDC staff and/or by transcription vendors under contract to LDC. Transcribers
followed LDC's quick transcription guidelines (QTR) and quick rich
transcription specification (QRTR), both of which are included in the
documentation with this release.
2016 Subscription Members will automatically
receive two copies of this corpus. 2016 Standard Members may request a copy as
part of their 16 free membership corpora. Non-members may license this data for
a fee.