Fall 2017 LDC Data Scholarship program
LDC at Interspeech 2017
New Publications:
Multi-Language Conversational Telephone Speech 2011 -- South Asian
GALE Phase 4 Arabic Broadcast Conversation Speech
GALE Phase 4 Arabic Broadcast Conversation Transcripts
________________________________________________________________
Fall 2017 LDC Data Scholarship program -
September 15 deadline approaching
There is still time to apply to the Fall 2017
LDC Data Scholarship program. Applications will be accepted through Friday,
September 15, 2017. The LDC Data Scholarship program provides university
students with access to LDC data at no cost. Students must complete an
application that consists of a data use proposal and a letter of support from
their advisor.
For more information on application requirements and program rules, please visit the LDC Data Scholarship page.
Applicants can email their materials to the LDC Data Scholarship program.
LDC at Interspeech 2017
LDC will once again be exhibiting at Interspeech,
held this year August 20-24 in Stockholm, Sweden. Stop by booth 17 to
learn more about recent developments at the Consortium and new publications.
Also, be on the lookout for the following presentations featuring LDC work:
Speaker Comparison for Forensic and Investigative Applications III
LDC Executive Director, Chris Cieri, panelist for Topic A: “Process Map and Standardization”
Special Event Session, Wednesday August 23, 13:30-15:30, Hall B3
Call My Net Corpus: A Multilingual Corpus for Evaluation of Speaker Recognition Technology
Karen Jones, Stephanie Strassel, Kevin
Walker, David Graff, Jonathan Wright
Wednesday, August 23, 17:40-18:00 in the Aula Magna room
LDC will post conference updates via our Twitter feed and Facebook page.
We hope to see you there!
New publications:
(1) Multi-Language Conversational Telephone Speech 2011 -- South Asian was
developed by LDC and comprises approximately 118 hours of telephone
speech in five distinct language varieties of South Asia (i.e., the Indian
sub-continent): Bengali, Hindi, Punjabi, Tamil and Urdu. The data were
collected primarily to support research and technology evaluation in automatic
language identification, and portions of these telephone calls were used in the
NIST 2011 Language Recognition Evaluation (LRE). LRE 2011 focused on language
pair discrimination for 24 languages/dialects, some of which could be
considered mutually intelligible or closely related.
Participants were recruited by native speakers who contacted acquaintances in their social network. Those native speakers made one call of up to 15 minutes to each acquaintance. The data were collected using LDC's telephone collection infrastructure, which comprises three computer telephony systems. Human auditors labeled calls for callee gender, dialect type, and noise. Demographic information about the participants was not collected.
LDC has also released the following as part of the Multi-Language Conversational Telephone Speech 2011 series: Slavic Group (LDC2016S11) and Turkish (LDC2017S09).
Multi-Language Conversational Telephone Speech 2011 -- South Asian is distributed via web download.
2017 Subscription Members will receive copies of this
corpus. 2017 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for a fee.
*
(2) GALE Phase 4 Arabic Broadcast Conversation Speech was developed by LDC and
comprises approximately 75 hours of Arabic broadcast conversation speech
collected in 2008 and 2009 by LDC; MediaNet (Tunis, Tunisia); and MTC (Rabat,
Morocco) during Phase 4 of the DARPA GALE (Global Autonomous Language
Exploitation) Program.
Corresponding transcripts are released as GALE Phase 4
Arabic Broadcast Conversation Transcripts (LDC2017T12).
This release contains 83 audio files presented in
FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel
16-bit PCM. Each file was audited by a native Arabic speaker following Audit
Procedure Specification Version 2.0, which is included in this release.
The broadcast conversation recordings in this release
feature interviews, call-in programs and roundtable discussions focusing
principally on current events from the following sources: Al Alam News Channel,
based in Iran; Al Fayhaa, an Iraqi television channel; Al Hiwar, a regional
broadcast station based in the United Kingdom; Alhurra, a U.S.
government-funded regional broadcaster; Aljazeera, a regional broadcaster
located in Doha, Qatar; Al Ordiniyah, a national broadcast station in Jordan;
Dubai TV, a broadcast station in the United Arab Emirates; Lebanese Broadcasting
Corporation, a Lebanese television station; Saudi TV, a national television
station based in Saudi Arabia; Syria TV, the national television station in
Syria; and Tunisian National TV, a national television station in Tunisia.
GALE Phase 4 Arabic Broadcast Conversation Speech is distributed via web download.
2017 Subscription Members
will receive copies of this corpus. 2017 Standard Members may request a copy as
part of their 16 free membership corpora. Non-members may license this data for
a fee.
*
(3) GALE Phase 4 Arabic Broadcast Conversation Transcripts was developed by LDC and contains transcriptions of approximately 75 hours of Arabic broadcast conversation speech collected in 2008 and 2009 by LDC; MediaNet (Tunis, Tunisia); and MTC (Rabat, Morocco) during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) program.
Corresponding audio data is released as GALE Phase 4
Arabic Broadcast Conversation Speech (LDC2017S15).
The transcript files are in plain-text, tab-delimited
format (TDF) with UTF-8 encoding, and the transcribed data totals 475,211
tokens. The files in this corpus were transcribed by LDC staff and/or by transcription
vendors under contract to LDC. Transcribers followed LDC's quick transcription
guidelines (QTR) and quick rich transcription specification (QRTR), both of
which are included in the documentation with this release.
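For readers who want a quick sanity check on tab-delimited, UTF-8 transcript files like those described above, the following is a minimal sketch only. The column layout, the `;;` comment prefix, and the function name `count_tokens` are assumptions made for illustration and are not taken from the release documentation; check the TDF description shipped with the corpus for the real layout. A plain whitespace count like this will not necessarily reproduce the token total reported above.

```python
import csv
import io


def count_tokens(lines, text_column=-1, comment_prefix=";;"):
    """Count whitespace-separated tokens in one column of tab-delimited rows.

    text_column and comment_prefix are illustrative assumptions, not values
    confirmed by the corpus documentation.
    """
    total = 0
    # QUOTE_NONE keeps any quotation marks in the transcript text literal.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines and lines that start with the assumed comment prefix.
        if not row or row[0].startswith(comment_prefix):
            continue
        total += len(row[text_column].split())
    return total


# Hypothetical sample rows; the real files have their own columns and content.
sample = io.StringIO(
    ";; hypothetical comment line\n"
    "file_a\t0.00\t2.50\tspeaker1\tمرحبا بكم\n"
    "file_a\t2.50\t4.00\tspeaker2\tشكرا\n"
)
print(count_tokens(sample))
```

To run the same function on a file from disk, pass it a handle opened with `open(path, encoding="utf-8", newline="")`, where `path` is whichever transcript file you want to inspect.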
GALE Phase 4 Arabic Broadcast Conversation Transcripts is distributed via web download.
2017 Subscription Members will
receive copies of this corpus. 2017 Standard Members may request a copy as part
of their 16 free membership corpora. Non-members may license this data for a
fee.