Spring 2015 LDC Data Scholarship recipients
2001 HUB5 English Evaluation update
New publications:
_________________________________________________________________________
Spring 2015 LDC Data Scholarship recipients
Congratulations to the recipients of
LDC's Spring 2015 data scholarships:
Christopher Kotfila ~ State University of New York, Albany (USA), PhD candidate, Informatics. Christopher has been awarded copies of Message Understanding Conference and ACE 2005 SpatialML for his work in named entity extraction.
Ilia Markov ~ National Polytechnic University (Mexico), PhD candidate, Computer Science. Ilia has been awarded a copy of the ETS Corpus of Non-Native Written English for his work in native language identification.
Matthew Nelson ~ Georgia State University (USA), MA candidate, Applied Linguistics. Matthew has been awarded copies of TIMIT and Nationwide Speech for his work in speaker perception.
Meladianos Polykarpos ~ Athens University of Economics and Business (Greece), PhD candidate, Informatics. Meladianos has been awarded a copy of TDT5 Text and Topics/Annotations for his work in information retrieval.
Benjamin Schloss ~ Pennsylvania State University (USA), PhD candidate, Psychology. Benjamin has been awarded a copy of the ETS Corpus of Non-Native Written English for his work in semantics.
For program information visit the Data Scholarship page.
2001 HUB5 English Evaluation update
2001 HUB5 English Evaluation (LDC2002S13)
now includes corresponding transcriptions. The transcripts are
available as part of the web download for this data.
Additionally, all HUB5 English catalog entries have been updated
to reflect LDC's current standards for documentation and metadata.
New publications:
(1) GALE Chinese-English Parallel Aligned Treebank -- Training was developed by LDC and contains 229,249 tokens of
word-aligned Chinese and English parallel text with treebank annotations. This
material was used as training data in the DARPA GALE (Global Autonomous
Language Exploitation) program.
The Chinese source data was
translated into English. Chinese and English treebank annotations were
performed independently. The parallel texts were then word-aligned. The
material in this release corresponds to portions of the Chinese treebanked data
in Chinese Treebank 6.0 (LDC2007T36) (CTB),
OntoNotes 3.0 (LDC2009T24)
and OntoNotes 4.0 (LDC2011T03).
This release consists of Chinese
source broadcast programming (China Central TV, Phoenix TV), newswire (Xinhua
News Agency) and web data collected by LDC. The distribution by genre, words,
character tokens, treebank tokens and segments appears below:
Genre | Files | Words | CharTokens | CTBTokens | Segments
bc | 10 | 57,571 | 86,356 | 60,270 | 3,328
nw | 172 | 64,337 | 96,505 | 57,722 | 2,092
wb | 86 | 30,925 | 46,388 | 31,240 | 1,321
Total | 268 | 152,833 | 229,249 | 149,232 | 6,741
Note that all token counts are based on the Chinese data only. One token is equivalent to one character and one word is equivalent to 1.5 characters.
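As a rough illustration of the note above, the following Python sketch checks that the published per-genre counts add up to the Total row and that each Words figure matches the character count divided by 1.5, rounded to the nearest integer. The figures are copied from the table; the names COUNTS, TOTAL, estimated_words and column_totals are hypothetical and are not part of the corpus or its documentation. The rounding rule is an assumption that happens to reproduce the published numbers.

```python
# Counts copied from the genre table above (Chinese data only).
# All names here are illustrative; they do not come from the corpus itself.
COUNTS = {
    "bc": {"files": 10, "words": 57_571, "char_tokens": 86_356,
           "ctb_tokens": 60_270, "segments": 3_328},
    "nw": {"files": 172, "words": 64_337, "char_tokens": 96_505,
           "ctb_tokens": 57_722, "segments": 2_092},
    "wb": {"files": 86, "words": 30_925, "char_tokens": 46_388,
           "ctb_tokens": 31_240, "segments": 1_321},
}

TOTAL = {"files": 268, "words": 152_833, "char_tokens": 229_249,
         "ctb_tokens": 149_232, "segments": 6_741}

# Stated in the note: one word is equivalent to 1.5 characters.
CHARS_PER_WORD = 1.5


def estimated_words(char_tokens: int) -> int:
    """Estimate the word count from a character-token count.

    Assumption: the result is rounded to the nearest integer.
    """
    return round(char_tokens / CHARS_PER_WORD)


def column_totals(counts: dict) -> dict:
    """Sum every column across all genres."""
    totals: dict = {}
    for row in counts.values():
        for column, value in row.items():
            totals[column] = totals.get(column, 0) + value
    return totals


if __name__ == "__main__":
    print("Column sums match Total row:", column_totals(COUNTS) == TOTAL)
    for genre, row in COUNTS.items():
        print(genre, row["words"], estimated_words(row["char_tokens"]))
```

Under these assumptions, the Words column is derived from the character counts and is not an independent word segmentation, which is why it differs from the CTBTokens column.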
The word alignment annotation involved the following tasks:
- Identifying, aligning, and tagging eight different types of links
- Identifying, attaching, and tagging local-level unmatched words
- Identifying and tagging sentence/discourse-level unmatched words
- Identifying and tagging all instances of Chinese 的 (DE) except when they were a part of a semantic link
This release contains nine types of
files - Chinese raw source files, English raw translation files, Chinese
character tokenized files, Chinese CTB tokenized files, English tokenized
files, Chinese treebank files, English treebank files, character-based word
alignment files, and CTB-based word alignment files.
GALE Chinese-English Parallel
Aligned Treebank -- Training is distributed via web download. 2015
Subscription Members will automatically receive two copies of this corpus on
disc. 2015 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for a fee.
*
(2) GALE Phase 3 and 4 Arabic Broadcast Conversation Parallel Text was developed by LDC. Along with other corpora, the
parallel text in this release comprised training data for Phases 3 and 4 of the
DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus
contains Modern Standard Arabic source text and corresponding English
translations selected from broadcast conversation data collected by LDC between
2006 and 2008 and transcribed and translated by LDC or under its direction.
GALE Phase 3 and 4 Arabic Broadcast
Conversation Parallel Text includes 55 source-translation document pairs,
comprising 280,535 words of Arabic source text and its English translation.
Data is drawn from 22 distinct Arabic programs broadcast between 2006 and 2008.
Broadcast conversation programming is generally more interactive than
traditional news broadcasts and includes talk shows, interviews, call-in
programs and roundtables.
The files in this release were
transcribed by LDC staff and/or transcription vendors under contract to LDC in
accordance with the Quick Rich Transcription guidelines developed by LDC.
Transcribers indicated sentence boundaries in addition to transcribing the text.
The transcribed and segmented files were reformatted into a human-readable
translation format and assigned to translation vendors. Translators followed
LDC's Arabic to English translation guidelines. Bilingual LDC staff performed
quality control procedures on the completed translations.
GALE Phase 3 and 4 Arabic Broadcast
Conversation Parallel Text is distributed via web download. 2015
Subscription Members will automatically receive two copies of this corpus on
disc. 2015 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for a fee.
*
(3) Mandarin-English Code-Switching in South-East Asia was developed by Nanyang Technological University
and Universiti Sains Malaysia
in Singapore and Malaysia, respectively. It comprises approximately 192
hours of Mandarin-English code-switching speech from 156 speakers with
associated transcripts.
Code-switching refers to the
practice of shifting between languages or language varieties during
conversation. This corpus focuses on the shift between Mandarin and English by
Malaysian and Singaporean speakers. Speakers engaged in unscripted
conversations and interviews. In the conversational speech segments, two
speakers conversed freely with each other. The interviews consisted of
questions from an interviewer and answers from an interviewee; only the
interviewee's speech was recorded. Topics discussed include hobbies,
friends, and daily activities.
The speakers were gender-balanced
(49.7% female, 50.3% male) and between 19 and 33 years of age. Over 60% of the
speakers were Singaporean; the rest were Malaysian.
The speech recordings were conducted
in a quiet room using several microphones and recording devices. Details about
the recording conditions are contained in the documentation provided with this
release. The audio files in this corpus are 16 kHz, 16-bit recordings in
FLAC-compressed WAV format, between 20 and 120 minutes in length.
Selected
segments of the audio recordings were transcribed. Most of those segments
contain code-switching utterances.
Mandarin-English Code-Switching in South-East Asia is distributed on two DVD-ROMs. 2015 Subscription Members will automatically receive two copies of this corpus. 2015 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.