Spring 2014 LDC Data Scholarship recipients
Membership fee savings and publications pipeline
New LDC website enhancements coming soon
New publications:
LDC is pleased to announce the
student recipients of the Spring 2014 LDC Data Scholarship program! This program
provides university students with access to LDC data at no cost. Students were
asked to complete an application consisting of a proposal describing their
intended use of the data and a letter of support from their thesis
adviser. We received many solid applications and have chosen two proposals to support. The following students will receive no-cost copies of LDC data:
- Skye Anderson ~ Tulane University (USA), BA candidate,
Linguistics. Skye has been awarded a copy of LDC Standard Arabic
Morphological Analyzer (SAMA) Version 3.1 for her work in author
profiling.
- Hao Liu ~ University College London (UK), PhD
candidate, Speech, Hearing and Phonetic Sciences. Hao has been
awarded a copy of Switchboard-1 Release 2, and NXT Switchboard Annotations
for his work in prosody modeling.
Members can still save on 2014
membership fees, but time is running out. Any organization which joins or
renews membership for 2014 through Monday, March 3, 2014, is entitled to a 5%
discount. Organizations which held membership for MY2013 can receive a 10%
discount on fees provided they renew prior to March 3, 2014.
Planned publications for this year
include:
- 2009 NIST Language Recognition Evaluation ~
development data from VOA broadcast and CTS telephone speech in target and
non-target languages.
- ETS Corpus of Non-Native Written English ~ contains
1100 essays written for a college-entrance test sampled from eight prompts
(i.e., topics) with score levels
(low/medium/high) for each essay.
- GALE data ~ including Word Alignment, Broadcast Speech
& Transcripts, Parallel Text, Parallel Aligned Treebanks in Arabic,
Chinese, and English.
- Hispanic Accented English ~ contains approximately 30
hours of spontaneous speech and read utterances from non-native speakers
of English with corresponding transcripts.
- Multi-Channel Wall Street Journal Audio-Visual Corpus
(MC-WSJ-AV) ~ re-recording of parts of the WSJCAM0 using a number of
microphones as well as three recording conditions resulting in 18-20
channels of audio per recording.
- TAC KBP Reference Knowledge Base ~ TAC KBP aims to
develop and evaluate technologies for building and populating knowledge
bases (KBs) about named entities from unstructured text. KBP systems
must either populate an existing reference KB, or else build a KB from
scratch. The reference KB is based on an English Wikipedia
snapshot from October 2008 and contains a set of entities, each with a
canonical name and title for the Wikipedia page, an entity type, an
automatically parsed version of the data from the infobox in the entity's
Wikipedia article, and a stripped version of the text of the Wiki article.
- USC-SFI MALACH Interviews and Transcripts Czech ~
developed by The University of Southern California's Shoah Foundation
Institute (USC-SFI) and the University of West Bohemia as part of the
MALACH (Multilingual Access to Large Spoken ArCHives) Project. It contains
approximately 143 hours of interviews from 420 interviewees along with
transcripts and other documentation.
New LDC website enhancements coming soon
Look for LDC’s new website enhancements in the coming weeks. We've revamped our membership services to make it easier than ever for you to manage your membership and access data more quickly.
New publications
(1) GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 2
was developed by LDC and contains 141,058 tokens of word-aligned Arabic and
English parallel text with treebank annotations. This material was used as
training data in the DARPA GALE (Global Autonomous Language Exploitation)
program.
Parallel aligned treebanks are
treebanks annotated with morphological and syntactic structures aligned at the
sentence level and the sub-sentence level. Such data sets are useful for
natural language processing and related fields, including automatic word
alignment system training and evaluation, transfer-rule extraction, word sense
disambiguation, translation lexicon extraction and cultural heritage and
cross-linguistic studies. With respect to machine translation system
development, parallel aligned treebanks may improve system performance with
enhanced syntactic parsers, better rules and knowledge about language pairs and
reduced word error rate.
In this release, the source Arabic
data was translated into English. Arabic and English treebank annotations were
performed independently. The parallel texts were then word aligned. The
material in this corpus corresponds to a portion of the Arabic treebanked data
in Arabic Treebank - Broadcast News v1.0 (LDC2012T07).
The source data consists of Arabic
broadcast news programming collected by LDC in 2007 and 2008. All data is encoded
as UTF-8. A count of files, words, tokens and segments is below.
Language | Files | Words | Tokens | Segments
Arabic | 31 | 110,690 | 141,058 | 7,102
The purpose of the GALE word
alignment task was to find correspondences between words, phrases or groups of
words in a set of parallel texts. Arabic-English word alignment annotation
consisted of the following tasks:
- Identifying different types of links: translated
(correct or incorrect) and not translated (correct or incorrect)
- Identifying sentence segments not suitable for
annotation, e.g., blank segments, incorrectly-segmented segments, segments
with foreign languages
- Tagging unmatched words attached to other words or phrases
GALE Arabic-English Parallel Aligned
Treebank -- Broadcast News Part 2 is distributed via web download.
2014 Subscription Members will
automatically receive two copies of this data on disc. 2014 Standard Members
may request a copy as part of their 16 free membership corpora. Non-members
may license this data for a fee.
*
(2) King Saud University Arabic Speech Database was
developed by King Saud University and contains 590 hours of recorded Arabic speech from
male and female speakers. The utterances include read and spontaneous speech.
The recordings were conducted in varied environments representing quiet and
noisy settings.
The corpus was designed principally
for speaker recognition research. The speech sources are sentences, word lists,
prose and question and answer sessions. Read speech text includes the
following:
- Sets of sentences devised to cover allophones of each
phoneme, phonetic balance, and differentiation of accents.
- Word lists developed to minimize missing phonemes and
to represent nasals, fricatives, commonly used words, and numbers.
- Two paragraphs, one from the Quran and another from a
book, selected because they included all letters of the alphabet and were
easy to read.
Spontaneous speech was captured
through question and answer sessions between participants and project team
members. Speakers responded to questions on general topics such as the weather
and food.
Each speaker was recorded in three
different environments: a soundproof room, an office, and a cafeteria. The
recordings were collected via microphone and mobile phone and averaged between
16 and 19 minutes. The data was verified for missing recordings, problems with the
recording system or errors in the recording process.
King Saud University Arabic Speech
Database is distributed on one hard disk.
2014 Subscription Members will
receive a copy of this data provided that they have completed the User License Agreement. 2014 Standard Members may
request a copy as part of their 16 free membership corpora. Non-members
may license this data for a fee.
*
(3) NIST 2012 Open Machine Translation (OpenMT) Progress Test Five Language Source
was developed by the NIST Multimodal Information Group. This release contains the
evaluation sets (source data and human reference translations), DTD, scoring
software, and evaluation plan for the OpenMT 2012 test for Arabic, Chinese,
Dari, Farsi, and Korean to English on a parallel data set. The set is based on
a subset of the Arabic-to-English and Chinese-to-English progress tests from
the OpenMT 2008, 2009 and 2012 evaluations with new source data created by
humans based on the English reference translation. The package was compiled,
and scoring software was developed, at NIST, making use of newswire and web
data and reference translations developed by the Linguistic Data
Consortium and the Defense Language Institute Foreign Language Center.
The objective of the OpenMT
evaluation series is to support research in, and help advance the state of the
art of, machine translation (MT) technologies -- technologies that translate
text between human languages. Input may include all forms of text. The goal is
for the output to be an adequate and fluent translation of the original. The
2012 task included the evaluation of five language pairs: Arabic-to-English,
Chinese-to-English, Dari-to-English, Farsi-to-English and Korean-to-English in
two source data styles. For general information about the NIST OpenMT
evaluations, refer to the NIST OpenMT website.
This evaluation kit includes a
single Perl script (mteval-v13a.pl) that may be used to produce a translation
quality score for one (or more) MT systems. The script works by comparing the
system output translation with a set of (expert) reference translations of the
same source text. Comparison is based on finding sequences of words in the
reference translations that match word sequences in the system output
translation.
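The word-sequence comparison described above can be illustrated with a short Python sketch. This is not the logic of mteval-v13a.pl itself: the function names, the whitespace tokenization, and the example sentences are assumptions made for illustration, and the real script applies its own tokenization and combines several sequence lengths into a final score. The sketch only shows the core idea of counting how many word sequences (n-grams) in a system output also appear in the reference translations.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis, references, n):
    """Fraction of hypothesis n-grams that also occur in some reference.

    Each hypothesis n-gram is credited at most as many times as it appears
    in the single reference where it is most frequent (clipping).
    Tokenization is plain whitespace splitting, an assumption of this sketch.
    """
    hyp_counts = ngrams(hypothesis.split(), n)
    if not hyp_counts:
        return 0.0
    # For every n-gram, remember its highest count in any one reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return matched / sum(hyp_counts.values())


# Hypothetical system output and reference translations (not from the corpus).
system_output = "the cat sat on the mat"
reference_translations = ["the cat is on the mat", "there is a cat on the mat"]
unigram = clipped_precision(system_output, reference_translations, 1)
bigram = clipped_precision(system_output, reference_translations, 2)
print(f"1-gram precision: {unigram:.3f}, 2-gram precision: {bigram:.3f}")
```

Supplying more than one reference translation raises the chance that a legitimate wording in the system output finds a match, which is why the comparison is made against a set of references rather than a single one.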
This release consists of 20 files,
four for each of the five languages, presented in XML with an included DTD. The
four files are source and reference data in the following two styles:
- English-true: an English-oriented translation, which
requires that the text read well and not use any idiomatic expressions in
the foreign language to convey meaning, unless absolutely necessary.
- Foreign-true: a translation as close as possible to the
foreign language, as if the text had originated in that language.
NIST 2012 Open Machine Translation
(OpenMT) Progress Test Five Language Source is distributed via web download.
2014 Subscription Members will
automatically receive two copies of this data on disc. 2014 Standard Members
may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.