Applications are now being accepted through
Monday, September 15, 2014, 11:59PM EST for the Fall 2014 LDC
Data Scholarship program. The LDC Data Scholarship program
provides university students with access to LDC data at no cost.
This program is open to students pursuing both undergraduate and graduate studies in an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay. The selection process is highly competitive.
The application consists of two parts:
(1) Data Use Proposal. Applicants must submit a proposal describing their intended use of the data. The proposal should state which data the student plans to use and how the data will benefit their research project as well as information on the proposed methodology or algorithm.
Applicants should consult the LDC Catalog for a complete list of data distributed by LDC. Due to certain restrictions, a handful of LDC corpora are available only to members of the Consortium. Applicants are advised to select no more than two databases.
(2) Letter of Support. Applicants must submit one letter of support from their thesis adviser or department chair. The letter must confirm that the department or university lacks the funding to pay the full non-member fee for the data and verify the student's need for data.
For further information on application materials and program rules, please visit the LDC Data Scholarship page.
New publications
(1) 2009 NIST Language Recognition
Evaluation Test Set contains approximately 215
hours of conversational telephone speech and radio broadcast
conversation collected by LDC in the following 23 languages and
dialects: Amharic, Bosnian, Cantonese, Creole (Haitian),
Croatian, Dari, English (American), English (Indian), Farsi,
French, Georgian, Hausa, Hindi, Korean, Mandarin, Pashto,
Portuguese, Russian, Spanish, Turkish, Ukrainian, Urdu and
Vietnamese.
The goal of the NIST
(National Institute of Standards and Technology)
Language Recognition Evaluation (LRE)
is to establish the baseline of current performance capability
for language recognition of conversational telephone speech and
to lay the groundwork for further research efforts in the field.
NIST conducted language recognition evaluations in 1996, 2003, 2005 and 2007. The 2009 evaluation increased the
number of target languages. Most of the test data originated
from multilingual Voice of America (VOA) radio broadcasts
assessed as being of telephone bandwidth; the remainder was
conversational telephone speech. Further information regarding
this evaluation can be found in the evaluation plan, which is
included in the documentation for this release.
LDC released the prior LREs as:
2003 NIST Language Recognition Evaluation (LDC2006S31)
2005 NIST Language Recognition Evaluation (LDC2008S05)
2007 NIST Language Recognition Evaluation Test Set (LDC2009S04)
2007 NIST Language Recognition Evaluation Supplemental Training Set (LDC2009S05)
The VOA speech data was collected by LDC in
2000 and 2001 and constitutes approximately 75% of the test set.
The telephone speech was taken from LDC's Mixer 3 collection
recorded between 2005 and 2007.
All test speech segments are presented as a sampled data stream in standard 8-bit 8-kHz μ-law format. Each segment is stored separately in a single-channel SPHERE format file. The test segments contain three nominal durations of speech: 3 seconds, 10 seconds and 30 seconds. Actual speech durations vary, but were constrained to the ranges of 2-4 seconds, 7-13 seconds and 23-35 seconds, respectively.
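As a rough illustration of the format and duration figures above, the following Python sketch converts a count of audio bytes into seconds and maps a speech duration to its nominal class. The function names are hypothetical and are not part of the release; the sketch also assumes that the byte count excludes the SPHERE header and that the value passed to `nominal_duration` is the actual speech duration, not the file length.

```python
# Figures taken from the release description: 8-kHz sampling, 8-bit mu-law,
# one channel, so each sample occupies exactly one byte.
SAMPLE_RATE_HZ = 8000
BYTES_PER_SAMPLE = 1

# Nominal duration (seconds) -> allowed range of actual speech (seconds).
DURATION_RANGES = {3: (2.0, 4.0), 10: (7.0, 13.0), 30: (23.0, 35.0)}


def duration_seconds(num_audio_bytes):
    """Length of the audio in seconds (assumes the header is already excluded)."""
    return num_audio_bytes / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)


def nominal_duration(speech_seconds):
    """Return 3, 10 or 30 if the speech duration fits a range, else None."""
    for nominal, (low, high) in DURATION_RANGES.items():
        if low <= speech_seconds <= high:
            return nominal
    return None
```

For example, 24,000 bytes of audio correspond to 3 seconds, which falls in the 2-4 second range of the 3-second nominal class. A duration of 5 seconds falls in none of the ranges, so the function returns `None`.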
2009 NIST Language Recognition Evaluation
Test Set is distributed on 2 DVD-ROMs. 2014 Subscription Members will automatically
receive two copies of this data. 2014 Standard Members may
request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.
*
(2) GALE Arabic-English Word Alignment
Training Part 3 -- Web was developed by LDC and
contains 217,158 tokens of word aligned Arabic and English
parallel text enriched with linguistic tags. This material was
used as training data in the DARPA GALE (Global Autonomous
Language Exploitation) program.
Some approaches to statistical machine
translation include the incorporation of linguistic knowledge in
word aligned text as a means to improve automatic word alignment
and machine translation quality. This is accomplished with two
annotation schemes: alignment and tagging. Alignment identifies
minimum translation units and translation relations by using
minimum-match and attachment annotation approaches. The tagging
scheme defines a set of word tags and alignment link tags
to describe these translation units and relations.
Tagging adds contextual, syntactic and language-specific
features to the alignment annotation.
Other releases available in this series are:
GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web (LDC2012T16)
GALE Chinese-English Word Alignment and Tagging Training Part 2 -- Newswire (LDC2012T20)
GALE Chinese-English Word Alignment and Tagging Training Part 3 -- Web (LDC2012T24)
GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web (LDC2013T05)
GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 1 (LDC2013T23)
GALE Arabic-English Word Alignment Training Part 1 -- Newswire and Web (LDC2014T05)
GALE Arabic-English Word Alignment Training Part 2 -- Newswire (LDC2014T10)
This release consists of Arabic source web
data collected by LDC. The distribution by genre, words,
character tokens and segments appears below:
Language | Genre | Files | Words | CharTokens | Segments
Arabic | WB | 2,449 | 154,144 | 217,158 | 7,332
Note that word count is based on the
untokenized Arabic source, and token count is based on the
tokenized Arabic source.
The Arabic word alignment tasks consisted of
the following components:
Normalizing tokenized tokens as needed
Identifying different types of links
Identifying sentence segments not suitable for annotation
Tagging unmatched words attached to other words or phrases
GALE Arabic-English Word Alignment Training
Part 3 -- Web is distributed via web download. 2014 Subscription Members will automatically
receive two copies of this data on disc. 2014 Standard Members
may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(3) GALE Phase 2 Chinese Newswire Parallel
Text Part 1 was developed by LDC. Along with
other corpora, the parallel text in this release comprised
training data for Phase 2 of the DARPA GALE (Global Autonomous
Language Exploitation) Program. This corpus contains 117,173
tokens of Chinese source text and corresponding English
translations selected from newswire data collected by LDC in
2007 and transcribed by LDC or under its direction.
This release includes 167 source-translation
document pairs, comprising 117,173 tokens of translated data.
Data is drawn from four distinct Chinese newswire sources: China
News Service, Guangming Daily, People's Daily and People's
Liberation Army Daily.
The data was transcribed by LDC staff and/or
transcription vendors under contract to LDC in accordance with
Quick Rich Transcription guidelines developed by LDC.
Transcribers indicated sentence boundaries in addition to
transcribing the text. Data was manually selected for
translation according to several criteria, including linguistic
features, transcription features and topic features. The
transcribed and segmented files were then reformatted into a
human-readable translation format and assigned to translation
vendors. Translators followed LDC's Chinese to English
translation guidelines. Bilingual LDC staff performed quality
control procedures on the completed translations.
Source data and translations are distributed
in TDF format. TDF files are tab-delimited files containing one
segment of text along with meta information about that segment.
Each field in the TDF file is described in TDF_format.text. All
data are encoded in UTF-8.
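Because TDF files are tab-delimited and UTF-8 encoded, they can be read with general-purpose tools. The Python sketch below is a minimal example under stated assumptions: the function name and the two-field sample are hypothetical, and the real field layout is the one described in the release documentation, which is not reproduced here.

```python
import csv
import io


def read_tab_delimited(stream):
    """Yield each non-empty line of a tab-delimited stream as a list of fields."""
    # QUOTE_NONE keeps quotation marks in the text as literal characters.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:
            yield row


# Hypothetical two-field sample; actual TDF files carry more fields per segment.
sample = "seg-001\t这是一个例子。\nseg-002\tThis is an example.\n"
rows = list(read_tab_delimited(io.StringIO(sample)))
```

To read a file from disk, the same function could be given a handle opened with `open(path, encoding="utf-8", newline="")`, where `path` is a placeholder for the location of a TDF file.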
GALE Phase 2 Chinese Newswire Parallel Text
Part 1 is distributed via web download. 2014 Subscription Members will automatically receive two copies of
this data on disc. 2014 Standard Members may request a copy as
part of their 16 free membership corpora. Non-members may license
this data for a fee.