Friday, July 18, 2014

LDC July 2014 Newsletter


Fall 2014 Data Scholarship Program

New publications:
2009 NIST Language Recognition Evaluation Test Set
GALE Arabic-English Word Alignment Training Part 3 -- Web
GALE Phase 2 Chinese Newswire Parallel Text Part 1

Fall 2014 Data Scholarship Program


Applications are now being accepted through Monday, September 15, 2014, 11:59PM EST for the Fall 2014 LDC Data Scholarship program. The LDC Data Scholarship program provides university students with access to LDC data at no cost.
 

This program is open to students pursuing undergraduate or graduate studies at an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay. The selection process is highly competitive.
 

The application consists of two parts:
(1) Data Use Proposal. Applicants must submit a proposal describing their intended use of the data. The proposal should state which data the student plans to use and how the data will benefit their research project as well as information on the proposed methodology or algorithm.

 

Applicants should consult the LDC Catalog for a complete list of data distributed by LDC. Due to licensing restrictions, a handful of LDC corpora are available only to members of the Consortium. Applicants are advised to select no more than one or two databases.
 

(2) Letter of Support. Applicants must submit one letter of support from their thesis adviser or department chair. The letter must confirm that the department or university lacks the funding to pay the full non-member fee for the data and verify the student's need for the data.
 

 For further information on application materials and program rules, please visit the LDC Data Scholarship page.



New publications

(1) 2009 NIST Language Recognition Evaluation Test Set contains approximately 215 hours of conversational telephone speech and radio broadcast conversation collected by LDC in the following 23 languages and dialects: Amharic, Bosnian, Cantonese, Creole (Haitian), Croatian, Dari, English (American), English (Indian), Farsi, French, Georgian, Hausa, Hindi, Korean, Mandarin, Pashto, Portuguese, Russian, Spanish, Turkish, Ukrainian, Urdu and Vietnamese.


The goal of the NIST (National Institute of Standards and Technology) Language Recognition Evaluation (LRE) is to establish the baseline of current performance capability for language recognition of conversational telephone speech and to lay the groundwork for further research efforts in the field. NIST conducted language recognition evaluations in 1996, 2003, 2005 and 2007. The 2009 evaluation increased the number of target languages. In addition to conversational telephone speech, most of the test data originated from multilingual Voice of America (VOA) radio broadcasts assessed as being of telephone bandwidth. Further information regarding this evaluation can be found in the evaluation plan, which is included in the documentation for this release.


LDC released the prior LREs as:

2003 NIST Language Recognition Evaluation (LDC2006S31)
2005 NIST Language Recognition Evaluation (LDC2008S05)
2007 NIST Language Recognition Evaluation Test Set (LDC2009S04)
2007 NIST Language Recognition Evaluation Supplemental Training Set (LDC2009S05)

The VOA speech data was collected by LDC in 2000 and 2001 and constitutes approximately 75% of the test set. The telephone speech was taken from LDC's Mixer 3 collection recorded between 2005 and 2007.


All test speech segments are presented as a sampled data stream in standard 8-bit 8-kHz μ-law format. Each segment is stored separately in a single channel SPHERE format file. The test segments contain three nominal durations of speech: 3 seconds, 10 seconds and 30 seconds. Actual speech durations vary, but were constrained to be within the ranges of 2-4 seconds, 7-13 seconds and 23-35 seconds, respectively.
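The per-segment SPHERE files begin with a plain-text header recording the sample rate, encoding and channel count. As a minimal sketch for inspecting those headers (field handling is simplified, and shorten-compressed audio payloads require a tool such as sph2pipe):

```python
def read_sphere_header(path):
    """Parse the plain-text header of a NIST SPHERE file.

    SPHERE files start with the magic line 'NIST_1A', then the header
    size in bytes, then 'name -tN value' triples until 'end_head'.
    """
    with open(path, "rb") as f:
        if f.readline().strip() != b"NIST_1A":
            raise ValueError("not a SPHERE file")
        header_size = int(f.readline().strip())
        f.seek(0)
        header = f.read(header_size).decode("ascii", errors="replace")
    fields = {}
    for line in header.splitlines()[2:]:
        if line.strip() == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) == 3:
            name, ftype, value = parts
            if ftype.startswith("-i"):      # integer-typed field
                value = int(value)
            elif ftype.startswith("-r"):    # real-typed field
                value = float(value)
            fields[name] = value            # -sN fields stay as strings
    return fields
```

For the test segments described above, a header read this way would be expected to report an 8000 Hz sample rate and mu-law sample coding.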


2009 NIST Language Recognition Evaluation Test Set is distributed on 2 DVD-ROMs. 2014 Subscription Members will automatically receive two copies of this data. 2014 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.


*

(2) GALE Arabic-English Word Alignment Training Part 3 -- Web was developed by LDC and contains 217,158 tokens of word aligned Arabic and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.


Some approaches to statistical machine translation include the incorporation of linguistic knowledge in word aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. A set of word tags and alignment link tags are designed in the tagging scheme to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.


Other releases available in this series are:

GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web (LDC2012T16)
GALE Chinese-English Word Alignment and Tagging Training Part 2 -- Newswire (LDC2012T20)
GALE Chinese-English Word Alignment and Tagging Training Part 3 -- Web (LDC2012T24)
GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web (LDC2013T05)
GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 1 (LDC2013T23)
GALE Arabic-English Word Alignment Training Part 1 -- Newswire and Web (LDC2014T05)
GALE Arabic-English Word Alignment Training Part 2 -- Newswire (LDC2014T10)

This release consists of Arabic source web data collected by LDC. The distribution by genre, words, character tokens and segments appears below:



Language   Genre   Files   Words     CharTokens   Segments
Arabic     WB      2,449   154,144   217,158      7,332


Note that word count is based on the untokenized Arabic source, and token count is based on the tokenized Arabic source.


The Arabic word alignment tasks consisted of the following components:

  • Normalizing tokenized tokens as needed
  • Identifying different types of links
  • Identifying sentence segments not suitable for annotation
  • Tagging unmatched words attached to other words or phrases

GALE Arabic-English Word Alignment Training Part 3 -- Web is distributed via web download. 2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) GALE Phase 2 Chinese Newswire Parallel Text Part 1 was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains 117,173 tokens of Chinese source text and corresponding English translations selected from newswire data collected by LDC in 2007 and transcribed by LDC or under its direction.


This release includes 167 source-translation document pairs, comprising 117,173 tokens of translated data. Data is drawn from four distinct Chinese newswire sources: China News Service, Guangming Daily, People's Daily and People's Liberation Army Daily.


The data was transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with Quick Rich Transcription guidelines developed by LDC. Transcribers indicated sentence boundaries in addition to transcribing the text. Data was manually selected for translation according to several criteria, including linguistic features, transcription features and topic features. The transcribed and segmented files were then reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Chinese to English translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations.


Source data and translations are distributed in TDF format. TDF files are tab-delimited files containing one segment of text per line along with metadata about that segment. Each field in the TDF file is described in TDF_format.text. All data are encoded in UTF-8.
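A TDF file can be read with a few lines of Python. The column names below are placeholders for illustration only; the authoritative field list is the one given in TDF_format.text:

```python
import csv

# Illustrative column names only -- consult TDF_format.text in the
# release for the actual fields and their order.
COLUMNS = ["file", "channel", "start", "end", "speaker", "transcript"]

def read_tdf(path):
    """Yield one dict per text segment from a tab-delimited TDF file."""
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if not row or row[0].startswith(";;"):  # skip comment lines
                continue
            yield dict(zip(COLUMNS, row))
```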


GALE Phase 2 Chinese Newswire Parallel Text Part 1 is distributed via web download. 2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Thursday, June 19, 2014

LDC June 2014 Newsletter

LDC at ACL 2014: June 23-25, Baltimore, MD 
Early renewing members save on fees  

Commercial use and LDC data 


New publications:
Abstract Meaning Representation (AMR) Annotation Release 1.0

ETS Corpus of Non-Native Written English 
GALE Phase 2 Chinese Broadcast News Parallel Text Part 2

MADCAT Chinese Pilot Training Set



LDC at ACL 2014: June 23-25, Baltimore, MD
ACL has returned to North America and LDC is taking this opportunity to interact with top HLT researchers gathering in Baltimore, MD.  LDC’s exhibition table will feature information on new developments at the consortium and some interesting giveaways.

LDC’s Seth Kulick will present research results on “Parser Evaluation Using Derivation Trees: A Complement to evalb” (SP88) during Tuesday’s Long Paper, Short Paper, Poster & Dinner Session II (June 24, 16:50-19:20). The paper was coauthored by LDCers Ann Bies, Justin Mott and Mark Liberman and by Penn linguists Anthony Kroch and Beatrice Santorini.

LDC staff will also participate in the post-conference 2nd Workshop on EVENTS: Definition, Detection, Coreference and Representation on Friday, June 27, https://sites.google.com/site/wsevents2014/home with presentations at the poster session:

·      Inter-annotator Agreement for ERE annotation: Seth Kulick, Ann Bies and Justin Mott
·      A Comparison of the Events and Relations Across ACE, ERE, TAC-KBP, and FrameNet Annotation Standards: Stephanie Strassel, Zhiyi Song, Joe Ellis (all LDC) and Jacqueline Aquilar, Charley Beller, Paul McNamee and Benjamin Van Durme


Early renewing members save on fees

LDC's early renewal discount program has resulted in significant savings for Membership Year (MY) 2014 members! The 100 organizations that renewed their membership or joined early for MY2014 saved over US$60,000 on membership fees. MY2013 members can still take advantage of savings and are eligible for a 5% discount when renewing for MY2014. This discount will apply throughout 2014.

Organizations joining LDC can take advantage of membership benefits including free membership year data as well as discounts on older LDC corpora. For-profit members can use most LDC data for commercial applications. 



Commercial use and LDC data


For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit our Licensing page for further information, https://www.ldc.upenn.edu/data-management/using/licensing.


New publications


Abstract Meaning Representation (AMR) Annotation Release 1.0 was developed by LDC, SDL/Language Weaver, Inc., the University of Colorado's Center for Computational Language and Educational Research  and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of over 13,000 English natural language sentences from newswire, weblogs and web discussion forums.


AMR captures “who is doing what to whom” in a sentence. Each sentence is paired with a rooted, directed graph that represents its whole-sentence meaning. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independent of its syntax.
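As an illustration of the formalism (this example is the standard one from the AMR literature, not drawn from this corpus), the sentence “The boy wants to go” is annotated so that the variable b fills a role in both clauses:

```
(w / want-01
   :ARG0 (b / boy)
   :ARG1 (g / go-01
            :ARG0 b))
```

The reentrant variable b is what makes the representation a graph rather than a strict tree: the boy is both the wanter and the goer.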


The source data includes discussion forums collected for the DARPA BOLT program, Wall Street Journal and translated Xinhua news texts, various newswire data from NIST OpenMT evaluations and weblog data used in the DARPA GALE program. The following table summarizes the number of training, dev, and test AMRs for each dataset in the release. Totals are also provided by partition and dataset:


Dataset           Training    Dev   Test   Totals
BOLT DF MT            1061    133    133     1327
Weblog and WSJ           0    100    100      200
BOLT DF English       1703    210    229     2142
2009 Open MT           204      0      0      204
Xinhua MT              741     99     86      926
Totals                3709    542    548     4799


Abstract Meaning Representation (AMR) Annotation Release 1.0 is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc.  2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for US$300.

*

ETS Corpus of Non-Native Written English was developed by Educational Testing Service and comprises 12,100 English essays written by speakers of 11 non-English native languages as part of an international test of academic English proficiency, TOEFL (Test of English as a Foreign Language). The test includes reading, writing, listening, and speaking sections and is delivered by computer in a secure test center. This release contains 1,100 essays for each of the 11 native languages sampled from eight topics with information about the score level (low/medium/high) for each essay.


The corpus was developed with the specific task of native language identification in mind, but is likely to support tasks and studies in the educational domain, including grammatical error detection and correction and automatic essay scoring, in addition to a broad range of research studies in the fields of natural language processing and corpus linguistics. For the task of native language identification, the following division is recommended: 82% as training data, 9% as development data and 9% as test data, split according to the file IDs accompanying the data set.
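The official partition is defined by the file IDs shipped with the corpus. Purely as a sketch of how a deterministic 82/9/9 split over sorted IDs might look for other data (the function and its details are illustrative, not ETS's procedure):

```python
def split_by_file_id(file_ids, train=0.82, dev=0.09):
    """Deterministic train/dev/test split over sorted file IDs."""
    ids = sorted(file_ids)                 # sorting makes the split reproducible
    n_train = round(len(ids) * train)
    n_dev = round(len(ids) * dev)
    return (ids[:n_train],
            ids[n_train:n_train + n_dev],
            ids[n_train + n_dev:])
```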


The data is sampled from essays written in 2006 and 2007 by test takers whose native languages were Arabic, Chinese, French, German, Hindi, Italian, Japanese, Korean, Spanish, Telugu, and Turkish. Original raw files for 11,000 of the 12,100 tokenized files are included in this release along with prompts (topics) for the essays and metadata about the test takers’ proficiency level. The data is presented in UTF-8 formatted text files.


ETS Corpus of Non-Native Written English is distributed via web download. 


2014 Subscription Members will automatically receive two copies of this data on disc provided they have completed the user license agreement.  2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

GALE Phase 2 Chinese Broadcast News Parallel Text Part 2 was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Chinese source text and corresponding English translations selected from broadcast news (BN) data collected by LDC between 2005 and 2007 and transcribed by LDC or under its direction.


This release includes 30 source-translation document pairs, comprising 206,737 characters of translated material. Data is drawn from 12 distinct Chinese BN programs broadcast by China Central TV, a national and international broadcaster in Mainland China; New Tang Dynasty TV, a broadcaster based in the United States; and Phoenix TV, a Hong-Kong based satellite television station. The broadcast news recordings in this release focus principally on current events.


The data was transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with Quick Rich Transcription guidelines developed by LDC. Transcribers indicated sentence boundaries in addition to transcribing the text. Data was manually selected for translation according to several criteria, including linguistic features, transcription features and topic features. The transcribed and segmented files were then reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Chinese to English translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations.


GALE Phase 2 Chinese Broadcast News Parallel Text Part 2 is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc.  2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

MADCAT (Multilingual Automatic Document Classification Analysis and Translation) Chinese Pilot Training Set contains all training data created by LDC to support a Chinese pilot collection in the DARPA MADCAT Program. The data in this release consists of handwritten Chinese documents, scanned at high resolution and annotated for the physical coordinates of each line and token. Digital transcripts and English translations of each document are also provided, with the various content and annotation layers integrated in a single MADCAT XML output.


The goal of the MADCAT program was to automatically convert foreign text images into English transcripts. MADCAT Chinese pilot data was collected from Chinese source documents in three genres: newswire, weblog and newsgroup text. Chinese speaking "scribes" copied documents by hand, following specific instructions on writing style (fast, normal, careful), writing implement (pen, pencil) and paper (lined, unlined). Prior to assignment, source documents were processed to optimize their appearance for the handwriting task, which resulted in some original source documents being broken into multiple "pages" for handwriting. Each resulting handwritten page was assigned to up to five independent scribes, using different writing conditions.


The handwritten, transcribed documents were next checked for quality and completeness, then each page was scanned at a high resolution (600 dpi, greyscale) to create a digital version of the handwritten document. The scanned images were then annotated to indicate the physical coordinates of each line and token. Explicit reading order was also labeled, along with any errors produced by the scribes when copying the text.


The final step was to produce a unified data format that takes multiple data streams and generates a single MADCAT XML output file which contains all required information. The resulting madcat.xml file contains distinct components: a text layer that consists of the source text, tokenization and sentence segmentation; an image layer that consists of bounding boxes; a scribe demographic layer that consists of scribe ID and partition (train/test); and a document metadata layer.


This release includes 22,284 annotation files in both GEDI XML and MADCAT XML formats (gedi.xml and .madcat.xml) along with their corresponding scanned image files in TIFF format. The annotation results in GEDI XML files include ground truth annotations and source transcripts.


MADCAT (Multilingual Automatic Document Classification Analysis and Translation) Chinese Pilot Training Set is distributed on five DVD-ROMs.


2014 Subscription Members will automatically receive two copies of this data.  2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.




Thursday, May 15, 2014

LDC May 2014 Newsletter

LDC at LREC 2014

New publications:
GALE Arabic-English Word Alignment Training Part 2 -- Newswire  
Hispanic-English Database  
HyTER Networks of Selected OpenMT08/09 Progress Set Sentences  



LDC at LREC 2014
LDC will attend the 9th Language Resource Evaluation Conference (LREC2014), hosted by ELRA, the European Language Resource Association. The conference will be held in Reykjavik, Iceland from May 26-31 and features a broad range of sessions on language resource and human language technologies research. Ten LDC staff members will be presenting current work on topics including the language application grid project, collecting natural SMS and chat conversations in multiple languages, incorporating alternate translations into English translation treebanks, supporting HLT research with degraded audio data, developing an Egyptian Arabic Treebank and more.

Following the conference, LDC’s papers and posters will be available on LDC’s Papers Page.


New publications
(1) GALE Arabic-English Word Alignment Training Part 2 -- Newswire was developed by LDC and contains 162,359 tokens of word aligned Arabic and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.

Some approaches to statistical machine translation include the incorporation of linguistic knowledge in word aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. A set of word tags and alignment link tags are designed in the tagging scheme to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.

This release consists of Arabic source newswire collected by LDC from 2004 to 2006 and in 2008. The distribution by genre, words, character tokens and segments appears below:

Language   Genre   Files   Words     CharTokens   Segments
Arabic     NW      1,126   112,318   162,359      5,349

Note that word count is based on the untokenized Arabic source, and token count is based on the tokenized Arabic source.

The Arabic word alignment tasks consisted of the following components:
  • Identifying and correcting incorrectly tokenized tokens
  • Identifying different types of links
  • Identifying sentence segments not suitable for annotation, such as those that were blank, incorrectly-segmented or containing other languages
  • Tagging unmatched words attached to other words or phrases
GALE Arabic-English Word Alignment Training Part 2 -- Newswire is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.


*

(2) Hispanic-English Database contains approximately 30 hours of English and Spanish conversational and read speech with transcripts (24 hours) and metadata collected from 22 non-native English speakers between 1996 and 1998. The corpus was developed by Entropic Research Laboratory, Inc., a developer of speech recognition and speech synthesis software toolkits that was acquired by Microsoft in 1999.

Participants were adult native speakers of Spanish as spoken in Central America and South America who resided in the Palo Alto, California area, had lived in the United States for at least one year and demonstrated a basic ability to understand, read and speak English. They read a total of 2200 sentences, 50 each in Spanish and English per speaker. The Spanish sentence prompts were a subset of the materials in LATINO-40 Spanish Read News, and the English sentence prompts were taken from the TIMIT database. Conversations were task-oriented, drawing on exercises similar to those used in English second language instruction and designed to engage the speakers in collaborative, problem-solving activities.

Read speech was recorded on two wideband channels with a Shure SM10A head-mounted microphone in a quiet laboratory environment. The conversational speech was simultaneously recorded on four channels, two of which were used to place phone calls to each subject in two separate offices and to record each subject's incoming speech into a separate file. The audio was originally saved in the Entropic Audio (ESPS) format using a 16kHz sampling rate and 16-bit samples. Audio files were converted from the ESPS format to flac-compressed .wav files. ESPS headers were removed and are presented in this release as *.hdr files that include demographic and technical data.

Transcripts were developed with the Entropic Annotator tool and are time-aligned with speaker turns. The transcription conventions were based on those used in the LDC Switchboard and CALLHOME collections. Transcript files are denoted with a .lab extension.

Hispanic-English Database is distributed on 1 DVD-ROM.

2014 Subscription Members will automatically receive two copies of this data. 2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(3) HyTER Networks of Selected OpenMT08/09 Progress Set Sentences was developed by SDL and contains HyTER (Hybrid Translation Edit Rate) networks for 102 selected source Arabic and Chinese sentences from OpenMT08 and OpenMT09 Progress Set data. HyTER is an evaluation metric based on large reference networks created by an annotation tool that allows users to develop an exponential number of correct translations for a given sentence. Reference networks can be used as a foundation for developing improved machine translation evaluation metrics and for automating the evaluation of human translation efficiency.

The source material is comprised of Arabic and Chinese newswire and web data collected by LDC in 2007. Annotators created meaning-equivalent annotations under three annotation protocols. In the first protocol, foreign language native speakers built English networks starting from foreign language sentences. In the second, English native speakers built English networks from the best translation of a foreign language sentence as identified by NIST (National Institute of Standards and Technology). In the third protocol, English native speakers built English networks starting from the best translation, but those annotators also had access to three additional, independently produced human translations. Networks created by different annotators for each sentence were combined and evaluated.

This release includes the source sentences and four human reference translations produced by LDC in XML format, along with five machine translation system outputs representing a variety of system architectures and performance, and the human post-edited output of those systems also presented in XML.

HyTER Networks of Selected OpenMT08/09 Progress Set Sentences is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

Tuesday, April 15, 2014

LDC April 2014 Newsletter



(1) Domain-Specific Hyponym Relations was developed by the Shaanxi Province Key Laboratory of Satellite and Terrestrial Network Technology at Xi’an Jiaotong University, Xi’an, Shaanxi, China. It provides more than 5,000 English hyponym relations in five domains: data mining, computer networks, data structures, Euclidean geometry and microbiology. All hypernym and hyponym words were taken from Wikipedia article titles.

A hyponym relation is a word sense relation that is an IS-A relation. For example, dog is a hyponym of animal and binary tree is a hyponym of tree structure. Among the applications for domain-specific hyponym relations are taxonomy and ontology learning, query result organization in a faceted search and knowledge organization and automated reasoning in knowledge-rich applications.

The data is presented in XML format, and each file provides hyponym relations in one domain. Within each file, the term, Wikipedia URL, hyponym relation and the names of the hyponym and hypernym words are included. The distribution of terms and relations is set forth in the table below:

Dataset              Terms   Hyponym Relations
Data Mining            278                 364
Computer Network       336                 399
Data Structure         315                 578
Euclidean Geometry     455                 690
Microbiology         1,028               3,533

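Reading such a per-domain file is straightforward with the standard library. The element names used below are assumptions for illustration; consult the files in the release for the actual schema:

```python
import xml.etree.ElementTree as ET

def load_hyponym_relations(path):
    """Collect (hypernym, hyponym) pairs from one domain's XML file.

    Assumes <relation> elements with <hypernym>/<hyponym> children;
    the real tag names may differ -- check the release documentation.
    """
    pairs = []
    for rel in ET.parse(path).getroot().iter("relation"):
        pairs.append((rel.findtext("hypernym"), rel.findtext("hyponym")))
    return pairs
```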
Domain-Specific Hyponym Relations is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora.  This data is made available at no cost to LDC members and non-members under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 license.


*

(2) GALE Arabic-English Parallel Aligned Treebank -- Web Training was developed by LDC and contains 69,766 tokens of word aligned Arabic and English parallel text with treebank annotations. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.


Parallel aligned treebanks are treebanks annotated with morphological and syntactic structures aligned at the sentence level and the sub-sentence level. Such data sets are useful for natural language processing and related fields, including automatic word alignment system training and evaluation, transfer-rule extraction, word sense disambiguation, translation lexicon extraction and cultural heritage and cross-linguistic studies. With respect to machine translation system development, parallel aligned treebanks may improve system performance with enhanced syntactic parsers, better rules and knowledge about language pairs and reduced word error rate.

In this release, the source Arabic data was translated into English. Arabic and English treebank annotations were performed independently. The parallel texts were then word aligned.

LDC has previously released other corpora in the Arabic-English Parallel Aligned Treebank series.

This release consists of Arabic source web data (newsgroups, weblogs) collected by LDC in 2004 and 2005. All data is encoded as UTF-8. A count of files, words, tokens and segments is below.

Language   Files   Words    Tokens   Segments
Arabic     162     46,710   69,766   3,178

Note: Word count is based on the untokenized Arabic source, token count is based on the ATB-tokenized Arabic source.

The purpose of the GALE word alignment task was to find correspondences between words, phrases or groups of words in a set of parallel texts. Arabic-English word alignment annotation consisted of the following tasks:
  • Identifying different types of links: translated (correct or incorrect) and not translated (correct or incorrect)
  • Identifying sentence segments not suitable for annotation, e.g., blank segments, incorrectly-segmented segments, segments with foreign languages
  • Tagging unmatched words attached to other words or phrases

GALE Arabic-English Parallel Aligned Treebank -- Web Training is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(3) Multi-Channel WSJ Audio (MCWSJ) was developed by the Centre for Speech Technology Research at the University of Edinburgh and contains approximately 100 hours of recorded speech from 45 British English speakers. Participants read Wall Street Journal texts published in 1987-1989 in three recording scenarios: a single stationary speaker, two stationary overlapping speakers and one single moving speaker.

This corpus was designed to address the challenges of speech recognition in meetings, which often occur in rooms with non-ideal acoustic conditions and significant background noise, and may contain large sections of overlapping speech. Using headset microphones represents one approach, but meeting participants may be reluctant to wear them. Microphone arrays are another option. MCWSJ supports research in large vocabulary tasks using microphone arrays. The news sentences read by speakers are taken from WSJCAM0 Cambridge Read News, a corpus originally developed for large vocabulary continuous speech recognition experiments, which in turn was based on CSR-I (WSJ0) Complete, made available by LDC to support large vocabulary continuous speech recognition initiatives.

Speakers reading news text from prompts were recorded using a headset microphone, a lapel microphone and an eight-channel microphone array. In the single speaker scenario, participants read from six fixed positions. Fixed positions were assigned for the entire recording in the overlapping scenario. For the moving scenario, participants moved from one position to the next while reading.

Fifteen speakers were recorded for the single scenario, nine pairs for the overlapping scenario and nine individuals for the moving scenario. Each speaker read approximately 90 sentences.

Multi-Channel WSJ Audio is distributed on two DVD-ROMs.

2014 Subscription Members will receive a copy of this data provided that they have completed the User License Agreement for Multi-Channel WSJ Audio LDC2014S03. 2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

Monday, March 17, 2014

LDC March 2014 Newsletter

New publications:

GALE Arabic-English Word Alignment Training Part 1 -- Newswire and Web
GALE Phase 2 Chinese Broadcast News Parallel Text Part 1
USC-SFI MALACH Interviews and Transcripts Czech  


(1) GALE Arabic-English Word Alignment Training Part 1 -- Newswire and Web was developed by LDC and contains 344,680 tokens of word aligned Arabic and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.

Some approaches to statistical machine translation include the incorporation of linguistic knowledge in word aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. A set of word tags and alignment link tags are designed in the tagging scheme to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.
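
As a rough illustration of the two annotation schemes described above, tagged alignment links can be modeled as translation units connecting source and target token positions. The structure below is a hypothetical in-memory sketch, not the actual GALE annotation file format, and the tag values shown are illustrative assumptions:

```python
# Hypothetical representation of tagged word alignments; the GALE
# corpora store annotations in their own file format.
from dataclasses import dataclass, field

@dataclass
class AlignmentLink:
    source_indices: list   # positions of Arabic tokens in this translation unit
    target_indices: list   # positions of English tokens in this translation unit
    link_tag: str          # illustrative link tag, e.g. "COR" or "INC"
    translated: bool       # translated vs. not-translated link

@dataclass
class AlignedSentence:
    source_tokens: list
    target_tokens: list
    links: list = field(default_factory=list)

    def unaligned_source(self):
        """Return source tokens not covered by any alignment link."""
        covered = {i for link in self.links for i in link.source_indices}
        return [t for i, t in enumerate(self.source_tokens) if i not in covered]
```

Listing the source tokens left uncovered by any link is one way to surface the "unmatched words" that the tagging scheme attaches to other words or phrases.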

This release consists of Arabic source newswire and web data collected by LDC between 2006 and 2008. The distribution by genre, documents, words, character tokens and segments appears below:



Language   Genre   Docs     Words     CharTokens   Segments
Arabic     WB      119      59,696    81,620       4,383
Arabic     NW      717      198,621   263,060      8,423

Note that word count is based on the untokenized Arabic source, and token count is based on the tokenized Arabic source.

The Arabic word alignment tasks consisted of the following components:
  • Normalizing tokens as needed
  • Identifying different types of links
  • Identifying sentence segments not suitable for annotation
  • Tagging unmatched words attached to other words or phrases

GALE Arabic-English Word Alignment Training Part 1 -- Newswire and Web is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(2) GALE Phase 2 Chinese Broadcast News Parallel Text Part 1 was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Chinese source text and corresponding English translations selected from broadcast news (BN) data collected by LDC between 2005 and 2007 and transcribed by LDC or under its direction.
This release includes 30 source-translation document pairs, comprising 198,350 characters of translated material. The data is drawn from 11 distinct Chinese broadcast news sources. The broadcast news recordings in this release focus principally on current events.
The data was transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with Quick Rich Transcription guidelines developed by LDC. Transcribers indicated sentence boundaries in addition to transcribing the text. Data was manually selected for translation according to several criteria, including linguistic features, transcription features and topic features. The transcribed and segmented files were then reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC’s Chinese to English translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations.

Source data and translations are distributed in TDF format. TDF files are tab-delimited files containing one segment of text along with meta information about that segment. Each field in the TDF file is described in TDF_format.text. All data are encoded in UTF-8.
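
Since TDF files are tab-delimited with one segment per line, they can be read with ordinary text tooling. The sketch below shows one way to load them in Python; the field names and the ";;" comment-line convention are assumptions for illustration only, and the authoritative field layout is the TDF_format.text file shipped with the corpus:

```python
# Minimal sketch of reading a tab-delimited, UTF-8 TDF file.
# ASSUMED_FIELDS is illustrative; consult TDF_format.text in the
# release for the actual field names and order.
import csv

ASSUMED_FIELDS = ["file", "channel", "start", "end", "speaker", "transcript"]

def read_tdf(path):
    """Yield one dict per segment line in a TDF file."""
    with open(path, encoding="utf-8") as fh:
        for row in csv.reader(fh, delimiter="\t"):
            # Skip empty lines and (assumed) ";;"-prefixed comment lines.
            if not row or row[0].startswith(";;"):
                continue
            yield dict(zip(ASSUMED_FIELDS, row))
```

Because each segment carries its own metadata fields, source and translation TDF files can be paired segment-by-segment once loaded this way.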

GALE Phase 2 Chinese Broadcast News Parallel Text Part 1 is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*
(3) USC-SFI MALACH Interviews and Transcripts Czech was developed by The University of Southern California Shoah Foundation Institute (USC-SFI) and the University of West Bohemia as part of the MALACH (Multilingual Access to Large Spoken ArCHives) Project. It contains approximately 229 hours of interviews from 420 interviewees along with transcripts and other documentation.

Inspired by his experience making Schindler’s List, Steven Spielberg established the Survivors of the Shoah Visual History Foundation in 1994 to gather video testimonies from survivors and other witnesses of the Holocaust. Within several years, the Foundation’s Visual History Archive held nearly 52,000 video testimonies in 32 languages representing 56 countries. It is the largest archive of its kind in the world. In 2006, the Foundation became part of the Dana and David Dornsife College of Letters, Arts and Sciences at the University of Southern California in Los Angeles and was renamed as the USC Shoah Foundation Institute for Visual History and Education.

The goal of the MALACH project was to develop methods for improved access to large multinational spoken archives. The focus was advancing the state of the art of automatic speech recognition and information retrieval. The characteristics of the USC-SFI collection -- unconstrained, natural speech filled with disfluencies, heavy accents, age-related coarticulations, un-cued speaker and language switching and emotional speech -- were considered well-suited for that task. The work centered on five languages: English, Czech, Russian, Polish and Slovak.


LDC has also released USC-SFI MALACH Interviews and Transcripts English (LDC2012S05).

The speech data in this release was collected beginning in 1994 under a wide variety of conditions ranging from quiet to noisy. Original interviews were recorded on Sony Beta SP tapes, then digitized into a 3 MB/s MPEG-1 stream with 128 kb/s (44 kHz) stereo audio. The sound files in this release are single-channel, FLAC-compressed PCM WAV files sampled at 16 kHz.

Approximately 570 of the interviews collected by USC-SFI are in Czech, averaging approximately 2.25 hours each. The interview sessions in this release are divided into a training set (400 interviews) and a test set (20 interviews). The first fifteen minutes of the second tape of each training interview (approximately 100 hours in total) were transcribed in .trs format using Transcriber 1.5.1. The test interviews were transcribed completely. Thus the corpus consists of 229 hours of speech (186 hours of training material plus 43 hours of test data), of which 143 hours are transcribed (100 hours of training material plus 43 hours of test data). Certain interviews include speech from family members in addition to that of the subject and the interviewer. Accordingly, the corpus contains speech from more than 420 speakers, roughly equally distributed between males and females.
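
The hour totals stated above are internally consistent and can be checked with simple arithmetic:

```python
# Sanity-check the corpus totals: 400 training interviews with 15
# transcribed minutes each, plus 20 fully transcribed test interviews.
train_hours, test_hours = 186, 43
total_hours = train_hours + test_hours            # 229 hours of speech

transcribed_train_hours = 400 * 15 / 60           # 15 min per training interview
transcribed_total = transcribed_train_hours + test_hours

assert total_hours == 229
assert transcribed_train_hours == 100
assert transcribed_total == 143
```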

USC-SFI MALACH Interviews and Transcripts Czech is distributed on four DVD-ROMs.

2014 Subscription Members will automatically receive two copies of this data provided that they have submitted a completed copy of the User License Agreement for USC-SFI MALACH Interviews and Transcripts Czech (LDC2014S04). 2014 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.