Thursday, May 15, 2014

LDC May 2014 Newsletter

LDC at LREC 2014

New publications:
GALE Arabic-English Word Alignment Training Part 2 -- Newswire  
Hispanic-English Database  
HyTER Networks of Selected OpenMT08/09 Progress Set Sentences  



LDC at LREC 2014
LDC will attend the 9th Language Resources and Evaluation Conference (LREC2014), hosted by ELRA, the European Language Resources Association. The conference will be held in Reykjavik, Iceland, from May 26-31 and features a broad range of sessions on language resources and human language technologies research. Ten LDC staff members will present current work on topics including the Language Application Grid project, collecting natural SMS and chat conversations in multiple languages, incorporating alternate translations into English translation treebanks, supporting HLT research with degraded audio data, developing an Egyptian Arabic Treebank and more.

Following the conference, LDC’s presented papers and posters will be available on LDC’s Papers Page.


New publications
(1) GALE Arabic-English Word Alignment Training Part 2 -- Newswire was developed by LDC and contains 162,359 tokens of word-aligned Arabic and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.

Some approaches to statistical machine translation incorporate linguistic knowledge into word-aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. The tagging scheme defines a set of word tags and alignment link tags to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.

This release consists of Arabic source newswire collected by LDC in 2004 - 2006 and 2008. The distribution by genre, words, character tokens and segments appears below:

Language   Genre   Files   Words     CharTokens   Segments
Arabic     NW      1,126   112,318   162,359      5,349

Note that word count is based on the untokenized Arabic source, and token count is based on the tokenized Arabic source.

The Arabic word alignment tasks consisted of the following components:
  • Identifying and correcting incorrectly tokenized tokens
  • Identifying different types of links
  • Identifying sentence segments not suitable for annotation, such as those that were blank, incorrectly segmented or contained other languages
  • Tagging unmatched words attached to other words or phrases
GALE Arabic-English Word Alignment Training Part 2 -- Newswire is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.


*

(2) Hispanic-English Database contains approximately 30 hours of English and Spanish conversational and read speech with transcripts (24 hours) and metadata collected from 22 non-native English speakers between 1996 and 1998. The corpus was developed by Entropic Research Laboratory, Inc., a developer of speech recognition and speech synthesis software toolkits that was acquired by Microsoft in 1999.

Participants were adult native speakers of Spanish as spoken in Central America and South America who resided in the Palo Alto, California area, had lived in the United States for at least one year and demonstrated a basic ability to understand, read and speak English. They read a total of 2200 sentences, 50 each in Spanish and English per speaker. The Spanish sentence prompts were a subset of the materials in LATINO-40 Spanish Read News, and the English sentence prompts were taken from the TIMIT database. Conversations were task-oriented, drawing on exercises similar to those used in English as a second language instruction and designed to engage the speakers in collaborative, problem-solving activities.

Read speech was recorded on two wideband channels with a Shure SM10A head-mounted microphone in a quiet laboratory environment. The conversational speech was simultaneously recorded on four channels, two of which were used to place phone calls to each subject in two separate offices and to record the incoming speech of the two channels into separate files. The audio was originally saved in the Entropic Audio (ESPS) format using a 16 kHz sampling rate and 16-bit samples. Audio files were converted from the ESPS format to FLAC-compressed .wav files. ESPS headers were removed and are presented in this release as *.hdr files that include demographic and technical data.
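For a rough sense of scale, the uncompressed data rate implied by the stated sampling rate and sample size can be worked out directly. The following is a minimal sketch: the function name and the default of one channel per file are assumptions made for illustration, and the figures describe raw PCM audio, not the FLAC-compressed files in the release.

```python
# Raw PCM size implied by a 16 kHz sampling rate and 16-bit samples.
# The per-file channel count is an assumption, not taken from the release notes.
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2  # 16 bits


def pcm_bytes(duration_seconds, channels=1):
    """Return the number of bytes of uncompressed PCM audio."""
    return int(duration_seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * channels)


if __name__ == "__main__":
    per_hour = pcm_bytes(3600)
    print(f"One channel-hour of raw audio: {per_hour / 1e6:.1f} MB")
```

Since FLAC is lossless compression, the distributed files will be smaller than these raw figures; the compression ratio depends on the audio content.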

Transcripts were developed with the Entropic Annotator tool and are time-aligned with speaker turns. The transcription conventions were based on those used in the LDC Switchboard and CALLHOME collections. Transcript files are denoted with a .lab extension.

Hispanic-English Database is distributed on 1 DVD-ROM.

2014 Subscription Members will automatically receive two copies of this data. 2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(3) HyTER Networks of Selected OpenMT08/09 Progress Set Sentences was developed by SDL and contains HyTER (Hybrid Translation Edit Rate) networks for 102 selected source Arabic and Chinese sentences from OpenMT08 and OpenMT09 Progress Set data. HyTER is an evaluation metric based on large reference networks created by an annotation tool that allows users to develop an exponential number of correct translations for a given sentence. Reference networks can be used as a foundation for developing improved machine translation evaluation metrics and for automating the evaluation of human translation efficiency.
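To illustrate the idea of scoring against a reference network, the sketch below brute-forces a toy case: it enumerates every path through a very small network and reports the lowest word-level edit rate between a hypothesis and any path. This is not the format or algorithm used in the release. The slot-based network representation, the function names, the example sentences and the normalization by reference length are all assumptions made for illustration, and enumeration is only feasible because the toy network is tiny.

```python
from itertools import product


def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # drop a hypothesis word
                            curr[j - 1] + 1,      # add a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def expand_network(slots):
    """Yield every path through a toy network as a list of tokens.

    `slots` is a hypothetical representation: a list of positions,
    each holding a list of interchangeable phrases.
    """
    for choice in product(*slots):
        yield " ".join(choice).split()


def min_edit_rate(hypothesis, slots):
    """Lowest edit distance to any path, divided by that path's length."""
    hyp = hypothesis.split()
    return min(word_edit_distance(hyp, ref) / len(ref)
               for ref in expand_network(slots))


# Invented example network encoding 1 * 2 * 2 * 1 * 2 = 8 references.
TOY_NETWORK = [
    ["the"],
    ["minister", "secretary"],
    ["arrived in", "reached"],
    ["the capital"],
    ["today", "on Thursday"],
]

if __name__ == "__main__":
    print(min_edit_rate("the secretary reached the capital today", TOY_NETWORK))
    print(min_edit_rate("the minister came to the capital today", TOY_NETWORK))
```

Because the number of paths grows multiplicatively with each added alternative, a realistic network cannot be expanded this way; a practical implementation would need to search the network directly instead of listing its paths.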

The source material consists of Arabic and Chinese newswire and web data collected by LDC in 2007. Annotators created meaning-equivalent annotations under three annotation protocols. In the first protocol, foreign language native speakers built English networks starting from foreign language sentences. In the second, English native speakers built English networks from the best translation of a foreign language sentence as identified by NIST (National Institute of Standards and Technology). In the third protocol, English native speakers built English networks starting from the best translation, but those annotators also had access to three additional, independently produced human translations. Networks created by different annotators for each sentence were combined and evaluated.

This release includes the source sentences and four human reference translations produced by LDC in XML format, along with five machine translation system outputs representing a variety of system architectures and performance, and the human post-edited output of those systems also presented in XML.

HyTER Networks of Selected OpenMT08/09 Progress Set Sentences is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.