Monday, March 16, 2015

LDC 2015 March Newsletter

Spring 2015 LDC Data Scholarship recipients

2001 HUB5 English Evaluation update

New publications:
_________________________________________________________________________

Spring 2015 LDC Data Scholarship recipients

Congratulations to the recipients of LDC's Spring 2015 data scholarships:

Christopher Kotfila ~ State University of New York, Albany (USA), PhD candidate, Informatics. Christopher has been awarded copies of Message Understanding Conference and ACE 2005 SpatialML for his work in named entity extraction.
  
Ilia Markov ~ National Polytechnic University (Mexico), PhD candidate, Computer Science. Ilia has been awarded a copy of the ETS Corpus of Non-Native Written English for his work in native language identification.

Matthew Nelson ~ Georgia State University (USA), MA candidate, Applied Linguistics. Matthew has been awarded copies of TIMIT and Nationwide Speech for his work in speaker perception.

Meladianos Polykarpos ~ Athens University of Economics and Business (Greece), PhD candidate, Informatics. Meladianos has been awarded copies of TDT5 Text and TDT5 Topics/Annotations for his work in information retrieval.

Benjamin Schloss ~ Pennsylvania State University (USA), PhD candidate, Psychology. Benjamin has been awarded a copy of the ETS Corpus of Non-Native Written English for his work in semantics.

For program information visit the Data Scholarship page.

2001 HUB5 English Evaluation update
2001 HUB5 English Evaluation (LDC2002S13) now includes corresponding transcriptions.  The transcripts are available as part of the web download for this data.  Additionally, all HUB5 English catalog entries have been updated to reflect LDC's current standards for documentation and metadata.

New publications:

(1) GALE Chinese-English Parallel Aligned Treebank -- Training was developed by LDC and contains 229,249 tokens of word-aligned Chinese and English parallel text with treebank annotations. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.

The Chinese source data was translated into English. Chinese and English treebank annotations were performed independently. The parallel texts were then word-aligned. The material in this release corresponds to portions of the Chinese treebanked data in Chinese Treebank 6.0 (LDC2007T36) (CTB), OntoNotes 3.0 (LDC2009T24) and OntoNotes 4.0 (LDC2011T03).

This release consists of Chinese source broadcast programming (China Central TV, Phoenix TV), newswire (Xinhua News Agency) and web data collected by LDC. The distribution by genre, words, character tokens, treebank tokens and segments appears below:

Genre   Files    Words  CharTokens  CTBTokens  Segments
bc         10   57,571      86,356     60,270     3,328
nw        172   64,337      96,505     57,722     2,092
wb         86   30,925      46,388     31,240     1,321
Total     268  152,833     229,249    149,232     6,741

Note that all token counts are based on the Chinese data only. One token is equivalent to one character and one word is equivalent to 1.5 characters.
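
The 1.5 characters-per-word equivalence can be checked against the table with a short sketch. The character-token counts are copied from the table above; the function name `estimate_words` and the choice to round to the nearest integer are assumptions made for illustration, not a description of how LDC computed the figures.

```python
# Character-token counts per genre, copied from the table above.
CHAR_TOKENS = {"bc": 86_356, "nw": 96_505, "wb": 46_388}

# Stated equivalence: one word corresponds to 1.5 characters.
CHARS_PER_WORD = 1.5


def estimate_words(char_tokens: int) -> int:
    """Estimate a word count from a character-token count.

    Rounding to the nearest integer is an assumption of this sketch.
    """
    return round(char_tokens / CHARS_PER_WORD)


words = {genre: estimate_words(n) for genre, n in CHAR_TOKENS.items()}
total_chars = sum(CHAR_TOKENS.values())
total_words = sum(words.values())

for genre, n in words.items():
    print(f"{genre}: {n:,} words")
print(f"Total: {total_words:,} words from {total_chars:,} character tokens")
```

Under these assumptions the estimates agree with the Words column (57,571, 64,337 and 30,925, totalling 152,833), which suggests the word counts are estimates derived from the character counts rather than independent tallies.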

The Chinese word alignment task consisted of the following components:
  • Identifying, aligning, and tagging eight different types of links
  • Identifying, attaching, and tagging local-level unmatched words
  • Identifying and tagging sentence/discourse-level unmatched words
  • Identifying and tagging all instances of Chinese 的 (DE) except when they were a part of a semantic link

This release contains nine types of files: Chinese raw source files, English raw translation files, Chinese character tokenized files, Chinese CTB tokenized files, English tokenized files, Chinese treebank files, English treebank files, character-based word alignment files, and CTB-based word alignment files.

GALE Chinese-English Parallel Aligned Treebank -- Training is distributed via web download.  2015 Subscription Members will automatically receive two copies of this corpus on disc.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(2) GALE Phase 3 and 4 Arabic Broadcast Conversation Parallel Text was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phases 3 and 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Modern Standard Arabic source text and corresponding English translations selected from broadcast conversation data collected by LDC between 2006 and 2008 and transcribed and translated by LDC or under its direction.

GALE Phase 3 and 4 Arabic Broadcast Conversation Parallel Text includes 55 source-translation document pairs, comprising 280,535 words of Arabic source text and its English translation. Data is drawn from 22 distinct Arabic programs broadcast between 2006 and 2008. Broadcast conversation programming is generally more interactive than traditional news broadcasts and includes talk shows, interviews, call-in programs and roundtables.

The files in this release were transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with the Quick Rich Transcription guidelines developed by LDC. Transcribers indicated sentence boundaries in addition to transcribing the text. The transcribed and segmented files were reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Arabic to English translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations.

GALE Phase 3 and 4 Arabic Broadcast Conversation Parallel Text is distributed via web download.  2015 Subscription Members will automatically receive two copies of this corpus on disc.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(3) Mandarin-English Code-Switching in South-East Asia was developed by Nanyang Technological University and Universiti Sains Malaysia in Singapore and Malaysia, respectively. It comprises approximately 192 hours of Mandarin-English code-switching speech from 156 speakers with associated transcripts.

Code-switching refers to the practice of shifting between languages or language varieties during conversation. This corpus focuses on the shift between Mandarin and English by Malaysian and Singaporean speakers. Speakers engaged in unscripted conversations and interviews. In the conversational speech segments, two speakers conversed freely with each other. The interviews consisted of questions from an interviewer and answers from an interviewee; only the interviewee's speech was recorded. Topics discussed include hobbies, friends, and daily activities.

The speakers were gender-balanced (49.7% female, 50.3% male) and between 19 and 33 years of age. Over 60% of the speakers were Singaporean; the rest were Malaysian.

The speech recordings were conducted in a quiet room using several microphones and recording devices. Details about the recording conditions are contained in the documentation provided with this release. The audio files in this corpus are 16 kHz, 16-bit recordings in FLAC-compressed WAV format, between 20 and 120 minutes in length.

Selected segments of the audio recordings were transcribed. Most of those segments contain code-switching utterances.

Mandarin-English Code-Switching in South-East Asia is distributed on two DVD-ROMs.  2015 Subscription Members will automatically receive two copies of this corpus.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.