Spring 2018 LDC Data Scholarship Program - deadline approaching
Lingo Boingo: a web portal to language games
Renew your LDC membership today
New Publications:
CHiME3
GALE Phase 4 Chinese Broadcast News Speech
GALE Phase 4 Chinese Broadcast News Transcripts
__________________________________________________________________________
Spring 2018 LDC Data Scholarship Program - deadline approaching
Students can apply for the Spring 2018 Data Scholarship
Program now through January 15, 2018. The LDC Data Scholarship program
provides students with access to LDC data at no cost. For more information on
application requirements and program rules, please visit LDC
Data Scholarships.
Lingo Boingo: a web portal to language games
LDC is pleased to announce a new collaborative project, Lingo Boingo (https://lingoboingo.org/), a web portal that brings
together new and existing language games that are fun to play and that provide
useful annotations and judgments for linguistic research. Gamers and grammar
lovers can choose from a list of challenging games, which will continue to
expand through the efforts of LDC and external collaborators. For more
information, contact jfiumara@ldc.upenn.edu. Start playing today!
Renew your LDC membership today
Membership Year 2018 (MY2018) is open for joining, and
discounts are available for those who keep their membership current and join
early in the year. Current MY2017 members who renew before March 1, 2018, will
receive a 10% discount on the membership fee. New or
returning organizations will receive a 5% discount through March 1.
In addition to receiving new publications, current year LDC
members also enjoy the benefit of licensing older data at reduced costs from
our Catalog of over 700 holdings; current year for-profit members may use most
data for commercial applications. Visit Join LDC for details
on membership, user accounts and payment.
Plans for MY2018 publications are in progress. Among the
expected releases are:
- Multilanguage conversational telephone speech: developed to support language identification research in related languages (Central Asian, Central European language groups)
- DIRHA (Distant-speech Interaction for Robust Home Applications): Wall Street Journal read speech with noise and reverberation, suitable for various multi-microphone signal processing and distant speech recognition tasks
- TRAD corpora: Chinese-French and Arabic-French parallel text (newswire, web data)
- IARPA Babel Language Packs (telephone speech and transcripts): languages include Cebuano, Guarani, Kazakh, Lithuanian, Telugu, Tok Pisin
- BOLT: discussion forum, SMS, word-aligned, and tagged data in all languages (Egyptian Arabic, English, Chinese)
- DEFT: Spanish Treebank (newswire, web data)
- RATS: Language Identification data set (Dari, Farsi, Levantine Arabic, Pashto, Urdu; degraded audio signals)
- TAC KBP: comprehensive English source and entity linked data (broadcast, telephone speech, newswire, web data)
- German children’s handwriting: longitudinal study of weekly writing in classroom setting with enhanced output for specific spelling patterns
New publications:
(1) CHiME3 was developed as part of The 3rd CHiME Speech Separation and Recognition
Challenge and contains approximately 342 hours of English speech and
transcripts from noisy environments and 50 hours of noisy environment audio.
The CHiME Challenges focus on distant-microphone automatic speech recognition
(ASR) in real-world environments. CHiME3 involved two types of data: speech
data recorded in very noisy environments (on a bus, in a cafe, in a pedestrian area,
and at a street junction) and noisy utterances generated by artificially mixing
clean speech data with noisy backgrounds.
Data is divided into training, development and test sets. All data is provided as
16-bit WAV files sampled at 16 kHz. The audio data consists of the background
noises, speech data enhanced using the baseline speech enhancement technique,
unsegmented noisy speech data, and segmented noisy speech data.
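As a quick sanity check after receiving the data, the stated audio format (16-bit WAV sampled at 16 kHz) can be verified with Python's standard `wave` module. The following is a minimal sketch, not part of the corpus tooling; the function name `check_wav_format` and any file paths passed to it are hypothetical.

```python
import wave


def check_wav_format(path, expected_rate=16000, expected_sample_width=2):
    """Return True if the WAV file matches the expected sample rate and width.

    expected_sample_width is in bytes, so 2 corresponds to 16-bit samples.
    The function name and defaults are illustrative assumptions based on the
    format described in the announcement (16-bit, 16 kHz).
    """
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == expected_rate
            and wav.getsampwidth() == expected_sample_width
        )
```

Calling `check_wav_format("some_file.wav")` on each file in a local copy would flag any file that does not match the described format; the path shown here is a placeholder.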
LDC has also released two CHiME2 corpora: CHiME2 Grid and CHiME2 WSJ0.
CHiME3 is distributed via USB drive.
2017 Subscription Members will receive copies of this corpus. 2017 Standard Members
may request a copy as part of their 16 free membership corpora. Non-members may
license this data for a fee.
*
(2) GALE Phase 4 Chinese Broadcast News Speech was developed by
LDC and comprises approximately 134 hours of Mandarin Chinese broadcast
news speech collected in 2008 by LDC and Hong Kong University of Science and
Technology (HKUST), Hong Kong, during Phase 4 of the DARPA GALE (Global
Autonomous Language Exploitation) Program.
Corresponding transcripts are released as GALE Phase 4 Chinese Broadcast News Transcripts
(LDC2017T18).
The broadcast news recordings in this release feature news
broadcasts focusing principally on current events from the following sources:
China Central TV (CCTV), a national and international broadcaster in Mainland
China; Phoenix TV, a Hong Kong-based satellite television station; and Voice of
America (VOA), a U.S. government-funded broadcast programmer.
This release contains 256 audio files presented in
FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel
16-bit PCM. Each file was audited by a native Chinese speaker following Audit
Procedure Specification Version 2.0, which is included in this release.
GALE Phase 4 Chinese Broadcast News Speech is distributed
via web download.
2017 Subscription Members will receive copies of this
corpus. 2017 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for a fee.
*
(3) GALE Phase 4 Chinese Broadcast News Transcripts was developed by LDC and contains transcriptions of approximately 134 hours of Chinese broadcast news speech collected in 2008 by LDC and Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program.
Corresponding audio data is released as GALE Phase 4 Chinese Broadcast News Speech
(LDC2017S25).
The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8
encoding, and the transcribed data totals 1,696,879 tokens. The transcripts
were created with the LDC tool, XTrans, which supports manual transcription and
annotation of audio recordings.
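Because the transcripts are plain-text, tab-delimited and UTF-8 encoded, they can be read with Python's standard `csv` module. The sketch below is illustrative only: the function name is hypothetical, the `;;` comment prefix is an assumption, and the meaning of each column is not stated in this announcement, so the documentation shipped with the corpus should be consulted for the actual layout.

```python
import csv


def read_tab_delimited(path, comment_prefix=";;"):
    """Read a UTF-8, tab-delimited text file into a list of field lists.

    Lines whose first field starts with comment_prefix are skipped; that
    prefix is an assumption, not something stated in the announcement.
    Blank lines are skipped as well.
    """
    rows = []
    with open(path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quotation marks in the transcript text literal.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for fields in reader:
            if not fields or fields[0].startswith(comment_prefix):
                continue
            rows.append(fields)
    return rows
```

Each returned row is a list of strings, one per tab-separated field, which leaves the interpretation of columns to the caller.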
GALE Phase 4 Chinese Broadcast News Transcripts is distributed via web download.
2017 Subscription Members will receive copies of this corpus. 2017 Standard Members
may request a copy as part of their 16 free membership corpora. Non-members may
license this data for a fee.