Fall 2018 Data Scholarship Program
New Publications:
_________________________________________________________________________
Fall 2018 LDC Data Scholarship Program
Student applications for the Fall 2018 LDC Data Scholarship
program are being accepted now through September 15, 2018. This scholarship
program provides university students with access to LDC data at no cost.
Students must complete an application that consists of a data use proposal and
a letter of support from their advisor.
For application requirements and program rules, please visit the LDC
Data Scholarship page.
New publications:
(1) CALLFRIEND Mandarin Chinese-Mainland Dialect Second Edition was developed
by LDC and
consists of approximately 24 hours of unscripted telephone conversations
between native speakers of the Mandarin Chinese dialect spoken in mainland
China. This second edition updates the audio files to wav format, simplifies
the directory structure and adds documentation and metadata. The first edition
is available as CALLFRIEND Mandarin Chinese-Mainland Dialect (LDC96S55).
All data was collected before July 1997. Participants could
speak with a person of their choice on any topic; most called family members
and friends. All calls originated in North America. The recorded conversations
last up to 30 minutes.
CALLFRIEND Mandarin Chinese-Mainland Dialect Second Edition
is distributed via web download.
2018 Subscription Members will receive copies of this
corpus. 2018 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for a fee.
*
(2) RATS Language Identification was developed by LDC and comprises
approximately 5,400 hours of Levantine Arabic, Farsi, Dari, Pashto and Urdu
conversational telephone speech with annotation of speech segments. The corpus
was created to provide training, development and initial test sets for the
Language Identification (LID) task in the DARPA RATS (Robust Automatic
Transcription of Speech) program.
The source audio is drawn from: (1) conversational telephone speech (CTS)
recordings, taken either from previous LDC CTS corpora or from CTS data
collected specifically for the RATS program from native speakers of Levantine
Arabic, Pashto, Urdu, Farsi and Dari; and (2) portions of VOA broadcast news
recordings, taken from data used in the 2009 NIST Language Recognition
Evaluation (LRE). The 2009 LRE Test Set is available from LDC as LDC2014S06.
CTS recordings were audited by annotators who listened to
short segments and determined whether the audio was in the target language.
Annotations on the audio files include start time, end time, speech activity
detection (SAD) label, SAD provenance, language ID and LID provenance.
RATS Language Identification is distributed via hard drive.
2018 Subscription Members will receive copies of this
corpus. 2018 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for a fee.
*
(3) TRAD Chinese-French Parallel Text -- Broadcast News was developed by ELDA as
part of the PEA-TRAD project. It contains French translations of a subset of
approximately 30,000 Chinese characters from GALE Phase 1 Chinese Broadcast
News Parallel Text - Part 3 (LDC2008T18).
The purpose of the PEA-TRAD project (Translation as a Support for Document
Analysis) was to develop speech-to-speech translation technology for multiple
languages (e.g., Arabic, Chinese, Pashto) from a variety of domains.
This release consists of 977 segments (translation units)
from 139 documents. The Chinese source file contains 33,571 characters and the
French reference translation contains 22,424 words. The source data is
Chinese broadcast news collected and translated into English by LDC for the DARPA
GALE (Global Autonomous Language Exploitation) program. Information about the
ELDA translation team, translation guidelines and validation results is
contained in the documentation accompanying this release.
TRAD Chinese-French Parallel Text -- Broadcast News is
distributed via web download.
2018 Subscription Members will receive copies of this corpus
provided they have submitted a completed copy of the special license agreement.
2018 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.