Linguistic Data Consortium: broadcast training

Monday, March 16, 2015

LDC 2015 March Newsletter

Spring 2015 LDC Data Scholarship recipients

2001 HUB5 English Evaluation update

New publications:

GALE Chinese-English Parallel Aligned Treebank -- Training

GALE Phase 3 and 4 Arabic Broadcast Conversation Parallel Text

Mandarin-English Code-Switching in South-East Asia

_________________________________________________________________________

Spring 2015 LDC Data Scholarship recipients

Congratulations to the recipients of LDC's Spring 2015 data scholarships:

Christopher Kotfila ~ State University of New York, Albany (USA), PhD candidate, Informatics. Christopher has been awarded copies of Message Understanding Conference and ACE 2005 SpatialML for his work in named entity extraction.

Ilia Markov ~ National Polytechnic University (Mexico), PhD candidate, Computer Science. Ilia has been awarded a copy of the ETS Corpus of Non-Native Written English for his work in native language identification.

Matthew Nelson ~ Georgia State University (USA), MA candidate, Applied Linguistics. Matthew has been awarded a copy of TIMIT and Nationwide Speech for his work in speaker perception.

Meladianos Polykarpos ~ Athens University of Economics and Business (Greece), PhD candidate, Informatics. Meladianos has been awarded a copy of TDT5 Text and Topics/Annotations for his work in information retrieval.

Benjamin Schloss ~ Pennsylvania State University (USA), PhD candidate, Psychology. Benjamin has been awarded a copy of the ETS Corpus of Non-Native Written English for his work in semantics.

For program information visit the Data Scholarship page.

2001 HUB5 English Evaluation update

2001 HUB5 English Evaluation (LDC2002S13) now includes corresponding transcriptions. The transcripts are available as part of the web download for this data. Additionally, all HUB5 English catalog entries have been updated to reflect LDC's current standards for documentation and metadata.

New publications:

(1) GALE Chinese-English Parallel Aligned Treebank -- Training was developed by LDC and contains 229,249 tokens of word aligned Chinese and English parallel text with treebank annotations. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.

The Chinese source data was translated into English. Chinese and English treebank annotations were performed independently. The parallel texts were then word aligned. The material in this release corresponds to portions of the Chinese treebanked data in Chinese Treebank 6.0 (LDC2007T36) (CTB), OntoNotes 3.0 (LDC2009T24) and OntoNotes 4.0 (LDC2011T03).

This release consists of Chinese source broadcast programming (China Central TV, Phoenix TV), newswire (Xinhua News Agency) and web data collected by LDC. The distribution by genre, words, character tokens, treebank tokens and segments appears below:

Genre	Files	Words	CharTokens	CTBTokens	Segments
bc	10	57,571	86,356	60,270	3,328
nw	172	64,337	96,505	57,722	2,092
wb	86	30,925	46,388	31,240	1,321
Total	268	152,833	229,249	149,232	6,741

Note that all token counts are based on the Chinese data only. One token is equivalent to one character and one word is equivalent to 1.5 characters.
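The relationship between the Words and CharTokens columns can be checked with a short Python sketch. The figures are copied from the table above. The function name and the use of rounding to the nearest integer are assumptions made for illustration; they reproduce the published numbers but are not a documented LDC procedure.

```python
# Character-token counts per genre, copied from the table above.
CHAR_TOKENS = {"bc": 86_356, "nw": 96_505, "wb": 46_388}

# Ratio stated in the note: one word is equivalent to 1.5 characters.
CHARS_PER_WORD = 1.5


def estimate_words(char_tokens):
    """Estimate a word count from a character-token count (assumed rounding)."""
    return round(char_tokens / CHARS_PER_WORD)


# Per-genre word estimates and the overall character total.
estimated = {genre: estimate_words(n) for genre, n in CHAR_TOKENS.items()}
total_chars = sum(CHAR_TOKENS.values())

for genre, words in estimated.items():
    print(genre, words)
print("Total", estimate_words(total_chars))
```

Under these assumptions the estimates match the Words column for each genre and for the total row.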

The Chinese word alignment task consisted of the following components:

Identifying, aligning, and tagging eight different types of links
Identifying, attaching, and tagging local-level unmatched words
Identifying and tagging sentence/discourse-level unmatched words
Identifying and tagging all instances of Chinese 的 (DE) except when they were a part of a semantic link

This release contains nine types of files: Chinese raw source files, English raw translation files, Chinese character tokenized files, Chinese CTB tokenized files, English tokenized files, Chinese treebank files, English treebank files, character-based word alignment files, and CTB-based word alignment files.

GALE Chinese-English Parallel Aligned Treebank -- Training is distributed via web download. 2015 Subscription Members will automatically receive two copies of this corpus on disc. 2015 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) GALE Phase 3 and 4 Arabic Broadcast Conversation Parallel Text was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phases 3 and 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Modern Standard Arabic source text and corresponding English translations selected from broadcast conversation data collected by LDC between 2006 and 2008 and transcribed and translated by LDC or under its direction.

GALE Phase 3 and 4 Arabic Broadcast Conversation Parallel Text includes 55 source-translation document pairs, comprising 280,535 words of Arabic source text and its English translation. Data is drawn from 22 distinct Arabic programs broadcast between 2006 and 2008. Broadcast conversation programming is generally more interactive than traditional news broadcasts and includes talk shows, interviews, call-in programs and roundtables.

The files in this release were transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with the Quick Rich Transcription guidelines developed by LDC. Transcribers indicated sentence boundaries in addition to transcribing the text. The transcribed and segmented files were reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Arabic to English translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations.

GALE Phase 3 and 4 Arabic Broadcast Conversation Parallel Text is distributed via web download. 2015 Subscription Members will automatically receive two copies of this corpus on disc. 2015 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) Mandarin-English Code-Switching in South-East Asia was developed by Nanyang Technological University and Universiti Sains Malaysia in Singapore and Malaysia, respectively. It comprises approximately 192 hours of Mandarin-English code-switching speech from 156 speakers with associated transcripts.

Code-switching refers to the practice of shifting between languages or language varieties during conversation. This corpus focuses on the shift between Mandarin and English by Malaysian and Singaporean speakers. Speakers engaged in unscripted conversations and interviews. In the conversational speech segments, two speakers conversed freely with each other. The interviews consisted of questions from an interviewer and answers from an interviewee; only the interviewee's speech was recorded. Topics discussed include hobbies, friends and daily activities.

The speakers were gender-balanced (49.7% female, 50.3% male) and between 19 and 33 years of age. Over 60% of the speakers were Singaporean; the rest were Malaysian.

The speech recordings were made in a quiet room using several microphones and recording devices. Details about the recording conditions are contained in the documentation provided with this release. The audio files in this corpus are 16 kHz, 16-bit recordings in FLAC-compressed WAV format, between 20 and 120 minutes in length.

Selected segments of the audio recordings were transcribed. Most of those segments contain code-switching utterances.

Mandarin-English Code-Switching in South-East Asia is distributed on two DVD-ROMs. 2015 Subscription Members will automatically receive two copies of this corpus. 2015 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Thursday, October 16, 2014

LDC 2014 October Newsletter

LDC at NWAV 43

LDC Data Scholarship Update

New publications:
Chinese Discourse Treebank 0.5
GALE Arabic-English Word Alignment -- Broadcast Training Part 2
United Nations Proceedings Speech

________________________________________________________________

LDC at NWAV 43

LDC will be exhibiting at the 43rd New Ways of Analyzing Variation Conference (NWAV 43) held this year October 23-26 in Chicago, Illinois. Please stop by our table in the Old Town Room on the third floor of the Hilton to learn more about the most recent developments at the Consortium and to check out our latest giveaways. As always, LDC will post conference updates via our Facebook page. We hope to see you in Chicago!

LDC Data Scholarship Update

LDC received many solid applications for the Fall 2014 LDC Data Scholarship Program. We are in the process of reviewing submissions and will announce recipients soon. The LDC Data Scholarship program provides university students with access to LDC data at no cost. Students were asked to complete an application consisting of a proposal describing their intended use of the data and a letter of support from their thesis adviser.

Data use proposals in this cycle included a range of research interests from opinion mining tagging to deceptive speech classification.

New publications

(1) Chinese Discourse Treebank 0.5 was developed at Brandeis University as part of the Chinese Treebank Project and consists of approximately 73,000 words of Chinese newswire text annotated for discourse relations. It follows the lexically grounded approach of the Penn Discourse Treebank (PDTB) (LDC2008T05) with adaptations based on the linguistic and statistical characteristics of Chinese text. Discourse relations are lexically anchored by discourse connectives (e.g., because, but, therefore), which are viewed as predicates that take abstract objects such as propositions, events and states as their arguments. Along with PDTB-style schemes for English, Turkish, Hindi and Czech, Chinese Discourse Treebank provides an additional perspective on how the PDTB approach can be extended for cross-lingual annotation of discourse relations.

Data was selected from the newswire material in Chinese Treebank 8.0 (LDC2013T21), specifically from Xinhua News Agency stories. There are approximately 5,500 annotation instances. Following the PDTB format, each annotation instance consists of 27 vertical-bar-delimited fields. The fields specify the attributes of the discourse relation as a whole, as well as the attributes of its two arguments. Not all fields are filled in this release. Filled fields are indicated by a pair of angle brackets; the remaining fields are placeholders for future releases.
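As a minimal sketch of how one such annotation instance might be read, the Python below splits a line on vertical bars and checks that it has 27 fields. The function names, the sample line and its placeholder values are hypothetical and are not taken from the corpus. The helper that lists non-empty positions is only a generic check; the documentation shipped with the release defines how filled fields are actually marked.

```python
EXPECTED_FIELDS = 27  # field count stated in the release description


def split_annotation(line):
    """Split one annotation line on vertical bars and check the field count."""
    fields = line.rstrip("\n").split("|")
    if len(fields) != EXPECTED_FIELDS:
        raise ValueError(f"expected {EXPECTED_FIELDS} fields, got {len(fields)}")
    return fields


def non_empty_positions(fields):
    """Return the zero-based positions of fields that contain any text."""
    return [i for i, field in enumerate(fields) if field.strip()]


# Hypothetical line: placeholder values in positions 0 and 3, all others empty.
sample_fields = [""] * EXPECTED_FIELDS
sample_fields[0] = "relation-type"
sample_fields[3] = "connective"
sample_line = "|".join(sample_fields)

fields = split_annotation(sample_line)
print(len(fields), non_empty_positions(fields))
```

Raising an error on an unexpected field count is a design choice here: it surfaces malformed lines early rather than silently misaligning the attributes of the relation and its two arguments.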

Chinese Discourse Treebank 0.5 is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) GALE Arabic-English Word Alignment -- Broadcast Training Part 2 was developed by LDC and contains 215,923 tokens of word aligned Arabic and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program. Some approaches to statistical machine translation include the incorporation of linguistic knowledge in word aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. A set of word tags and alignment link tags are designed in the tagging scheme to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.

This release consists of Arabic source broadcast news and broadcast conversation data collected by LDC from 2007 to 2009. The Arabic word alignment tasks consisted of the following components:

Normalizing tokenized tokens as needed

Identifying different types of links

Identifying sentence segments not suitable for annotation

Tagging unmatched words attached to other words or phrases

GALE Arabic-English Word Alignment -- Broadcast Training Part 2 is distributed via web download.

(3) United Nations Proceedings Speech was developed by the United Nations (UN) and contains approximately 8,500 hours of recorded proceedings in the six official UN languages: Arabic, Chinese, English, French, Russian and Spanish. The data was recorded in 2009-2012 from sessions 64-66 of the General Assembly (GA) and First Committee (FC) (Disarmament and International Security), and meetings 6434-6763 of the Security Council.

Recordings were made using a customized system following a daily, internally circulated instruction from the Meetings Management Section. Most of the subjects and information related to a particular meeting or session are published in a UN Journal.

Data is presented as either MP3 or FLAC-compressed WAV files. The files are 16-bit, single-channel recordings sampled at either 22,050 Hz or 8,000 Hz, organized by committee and session number, then language. The folder labeled "Floor" indicates the microphone used by the particular speaker. Those files may include other languages, for instance, if the speaker's language was not among the six official UN languages.

United Nations Proceedings Speech is distributed on one hard drive.

2014 Subscription Members will receive one copy of this data, provided they have completed the user license agreement. 2014 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.