New publications:
GALE Arabic-English Word Alignment Training Part 2 -- Newswire
Hispanic-English Database
HyTER Networks of Selected OpenMT08/09 Progress Set Sentences
LDC at LREC 2014
LDC will attend the 9th Language Resources and Evaluation Conference (LREC 2014), hosted by ELRA, the European Language Resources Association. The conference will be held in Reykjavik, Iceland, from May 26-31 and features a broad range of sessions on language resources and human language technologies research. Ten LDC staff members will present current work on topics including the Language Application Grid project, collecting natural SMS and chat conversations in multiple languages, incorporating alternate translations into English translation treebanks, supporting HLT research with degraded audio data, developing an Egyptian Arabic Treebank and more.
New publications
(1) GALE Arabic-English Word Alignment Training Part 2 -- Newswire was developed by LDC and contains 162,359 tokens of word-aligned Arabic and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.
Some approaches to statistical machine translation incorporate linguistic knowledge into word-aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. The tagging scheme defines a set of word tags and alignment link tags that describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.
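As a rough illustration of how the two schemes fit together, the sketch below models an alignment link as a pair of token-position sets plus tags. This is not the corpus file format: the class name, field names, tag strings and placeholder tokens are all hypothetical and chosen only to show the idea of links that carry both a relation tag and per-word tags.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class AlignmentLink:
    """One translation unit: source positions linked to target positions."""
    source_ids: Tuple[int, ...]   # 0-based positions in the tokenized source segment
    target_ids: Tuple[int, ...]   # 0-based positions in the English segment
    link_tag: str                 # hypothetical label for the translation relation
    word_tags: Dict[int, str] = field(default_factory=dict)  # hypothetical tags keyed by target position


def uncovered(n_tokens: int, links: List[AlignmentLink], side: str) -> List[int]:
    """Return token positions on one side ("source" or "target") that no link covers."""
    covered = set()
    for link in links:
        ids = link.source_ids if side == "source" else link.target_ids
        covered.update(ids)
    return [i for i in range(n_tokens) if i not in covered]


# Toy segment pair with placeholder tokens (not real corpus data).
source = ["src0", "src1", "src2"]
target = ["tgt0", "tgt1", "tgt2", "tgt3"]

links = [
    AlignmentLink((0,), (0,), link_tag="EXAMPLE_LINK_A"),
    # An extra target word attached to the unit it belongs with,
    # marked here by a hypothetical word tag.
    AlignmentLink((1,), (1, 2), link_tag="EXAMPLE_LINK_B", word_tags={1: "EXAMPLE_ATTACHED"}),
]

print(uncovered(len(source), links, "source"))  # [2]
print(uncovered(len(target), links, "target"))  # [3]
```

Positions that remain uncovered are the kind of unmatched words an annotator would still need to link or tag.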
This release consists of Arabic
source newswire collected by LDC in 2004 - 2006 and 2008. The distribution by
genre, words, character tokens and segments appears below:
Language | Genre | Files | Words | CharTokens | Segments
Arabic | NW | 1,126 | 112,318 | 162,359 | 5,349
Note that word count is based on the
untokenized Arabic source, and token count is based on the tokenized Arabic
source.
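Because the word and token counts come from different versions of the source, it can help to look at how they relate. The short sketch below derives a few ratios from the published figures; the variable names are illustrative only, and the ratios are simple averages over the whole release rather than statistics reported by LDC.

```python
# Figures taken from the distribution table for this release.
files = 1_126
words = 112_318        # counted on the untokenized Arabic source
char_tokens = 162_359  # counted on the tokenized Arabic source
segments = 5_349

tokens_per_word = char_tokens / words        # average tokens produced per untokenized word
tokens_per_segment = char_tokens / segments  # average segment length in tokens
segments_per_file = segments / files         # average number of segments per file

print(f"tokens per word:    {tokens_per_word:.2f}")
print(f"tokens per segment: {tokens_per_segment:.2f}")
print(f"segments per file:  {segments_per_file:.2f}")
```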
The Arabic word alignment tasks
consisted of the following components:
- Identifying and correcting incorrectly tokenized tokens
- Identifying different types of links
- Identifying sentence segments not suitable for annotation, such as those that were blank, incorrectly segmented or contained other languages
- Tagging unmatched words attached to other words or phrases
GALE Arabic-English Word Alignment
Training Part 2 -- Newswire is distributed via web download.
2014 Subscription Members will
automatically receive two copies of this data on disc. 2014 Standard Members
may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.
*
(2) Hispanic-English Database contains approximately 30 hours of English and Spanish conversational and read speech with transcripts (24 hours) and metadata collected from 22 non-native English speakers between 1996 and 1998. The corpus was developed by Entropic Research Laboratory, Inc., a developer of speech recognition and speech synthesis software toolkits that was acquired by Microsoft in 1999.
Participants were adult native speakers of Spanish as spoken in Central America and South America who resided in the Palo Alto, California area, had lived in the United States for at least one year and demonstrated a basic ability to understand, read and speak English. They read a total of 2,200 sentences, 50 each in Spanish and English per speaker. The Spanish sentence prompts were a subset of the materials in LATINO-40 Spanish Read News, and the English sentence prompts were taken from the TIMIT database. Conversations were task-oriented, drawing on exercises similar to those used in English as a second language instruction and designed to engage the speakers in collaborative, problem-solving activities.
Read speech was recorded on two wideband channels with a Shure SM10A head-mounted microphone in a quiet laboratory environment. The conversational speech was simultaneously recorded on four channels, two of which were used to place phone calls to each subject in two separate offices and to record the incoming speech of the two channels into separate files. The audio was originally saved in the Entropic Audio (ESPS) format using a 16 kHz sampling rate and 16-bit samples. Audio files were converted from the ESPS format to FLAC-compressed .wav files. ESPS headers were removed and are presented in this release as *.hdr files that include demographic and technical data.
Transcripts were developed with the Entropic Annotator tool and are time-aligned with speaker turns. The transcription conventions were based on those used in the LDC Switchboard and CALLHOME collections. Transcript files are denoted with a .lab extension.
Hispanic-English Database is
distributed on 1 DVD-ROM.
2014 Subscription Members will
automatically receive two copies of this data. 2014 Standard Members may
request a copy as part of their 16 free membership corpora. Non-members
may license this data for a fee.
*
(3) HyTER Networks of Selected OpenMT08/09 Progress Set Sentences was developed by SDL and contains HyTER (Hybrid Translation Edit Rate) networks for 102 selected source Arabic and Chinese sentences from OpenMT08 and OpenMT09 Progress Set data. HyTER is an evaluation metric based on large reference networks created by an annotation tool that allows users to develop an exponential number of correct translations for a given sentence. Reference networks can be used as a foundation for developing improved machine translation evaluation metrics and for automating the evaluation of human translation efficiency.
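To see why a compact network can stand for an exponential number of translations, consider the simplified sketch below. It is not the HyTER network format or the annotation tool's output: the graph encoding, function names and toy phrases are all assumptions made for illustration. Each path from the start node to the end node spells out one reference translation, so a handful of alternatives at each step multiplies into many complete sentences.

```python
from functools import lru_cache
from typing import Dict, Iterator, List, Tuple

# Hypothetical encoding: node -> list of (phrase, next_node) edges.
Network = Dict[int, List[Tuple[str, int]]]


def count_paths(net: Network, start: int, end: int) -> int:
    """Count distinct start-to-end paths in an acyclic network without listing them."""
    @lru_cache(maxsize=None)
    def paths_from(node: int) -> int:
        if node == end:
            return 1
        return sum(paths_from(nxt) for _, nxt in net.get(node, []))
    return paths_from(start)


def expand(net: Network, start: int, end: int) -> Iterator[str]:
    """Yield every sentence spelled out by a start-to-end path."""
    if start == end:
        yield ""
        return
    for phrase, nxt in net.get(start, []):
        for rest in expand(net, nxt, end):
            yield (phrase + " " + rest).strip()


# Toy network with three choice points (invented phrases, not corpus data).
toy: Network = {
    0: [("the minister", 1), ("the secretary", 1)],
    1: [("said", 2), ("stated", 2), ("announced", 2)],
    2: [("on Monday", 3), ("Monday", 3)],
}

print(count_paths(toy, 0, 3))    # 12 paths from 7 edges (2 * 3 * 2)
print(next(expand(toy, 0, 3)))   # the minister said on Monday
```

Adding one more choice point with k alternatives multiplies the number of encoded translations by k while adding only k edges, which is the property that makes large reference networks practical to build by hand.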
The source material consists of
Arabic and Chinese newswire and web data collected by LDC in 2007. Annotators
created meaning-equivalent annotations under three annotation protocols. In the
first protocol, foreign language native speakers built English networks
starting from foreign language sentences. In the second, English native
speakers built English networks from the best translation of a foreign language
sentence as identified by NIST (National Institute of Standards and
Technology). In the third protocol, English native speakers built English
networks starting from the best translation, but those annotators also had
access to three additional, independently produced human translations. Networks
created by different annotators for each sentence were combined and evaluated.
This release includes the source sentences and four human reference translations produced by LDC in XML format, along with five machine translation system outputs representing a variety of system architectures and performance levels, and the human post-edited output of those systems, also in XML.
HyTER Networks of Selected
OpenMT08/09 Progress Set Sentences is distributed via web download.
2014 Subscription Members will
automatically receive two copies of this data on disc. 2014 Standard Members
may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.