Monday, February 18, 2013

LDC February 2013 Newsletter



Spring 2013 LDC Data Scholarship Recipients! 

LDC is pleased to announce the student recipients of the Spring 2013 LDC Data Scholarship program! This program provides university students with access to LDC data at no cost. Students were asked to complete an application consisting of a proposal describing their intended use of the data and a letter of support from their thesis adviser. We received many solid applications and have chosen three proposals to support. The following students will receive no-cost copies of LDC data:
Salima Harrat - Ecole Supérieure d’informatique (ESI) (Algeria). Salima has been awarded a copy of Arabic Treebank: Part 3 for her work in diacritization restoration.

Maulik C. Madhavi - Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT), Gandhinagar (India). Maulik has been awarded a copy of Switchboard Cellular Part 1 Transcribed Audio and Transcripts and 1997 HUB4 English Evaluation Speech and Transcripts for his work in spoken term detection.

Shereen M. Oraby - Arab Academy for Science, Technology, and Maritime Transport (Egypt). Shereen has been awarded a copy of Arabic Treebank: Part 1 for her work in subjectivity and sentiment analysis.

Please join us in congratulating our student recipients! The next LDC Data Scholarship program is scheduled for the Fall 2013 semester.

Membership Fee Savings and Publications Pipeline 

Time is quickly running out to save on membership fees for Membership Year 2013 (MY2013)! Any organization that joins or renews membership for 2013 through Friday, March 1, 2013, is entitled to a 5% discount on membership fees. Organizations that held membership for MY2012 can receive a 10% discount on fees provided they renew prior to March 1, 2013.

Many publications for MY2013 are still in development. The planned publications for the upcoming months include:
GALE data ~ continuing releases of all languages (Arabic, Chinese, English), genres (Broadcast News, Broadcast Conversation, Newswire and Web Data) and tasks (Parallel Text, Word Alignment, Parallel Aligned Treebanks, Parallel Sentences, Audio and Transcripts).
Hispanic Accented English Database ~ 30 hours of conversational speech data from non-native speakers of English, with approximately 24 hours (80% of the data) closely transcribed. The speech in this release was collected from 22 non-native, Hispanic speakers of English and consists of spontaneous speech and read utterances. The read speech is divided equally between English and Spanish.
NIST 2012 Open Machine Translation Progress Tests ~ contains the evaluation sets (source data and human reference translations), DTD, scoring software, and evaluation plan for the OpenMT12 test for Arabic, Chinese, Dari, Farsi, and Korean to English on a parallel data set. This set is based on a subset of the Arabic-to-English and Chinese-to-English Progress tests from the NIST Open Machine Translation 2008, 2009, and 2012 evaluations, with new source data created from the English human reference translations. The original data consists of newswire and web data.
NIST Open Machine Translation 2008 to 2012 Progress Test Sets ~ contains the evaluation sets (source data and human reference translations), DTD, scoring software, and evaluation plans for the Arabic-to-English and Chinese-to-English Progress tests of the NIST Open Machine Translation 2008, 2009, and 2012 Evaluations.  The test sets consist of newswire and web data.
OntoNotes 5.0 ~ multiple genres of English, Chinese, and Arabic text annotated for syntax, predicate argument structure and shallow semantics.
UN Parallel Text ~ contains the text of United Nations parliamentary documents in Arabic, Chinese, English, French, Russian, and Spanish from 1993 through 2007. The data is provided in two formats:  (1) raw text: the raw text is very close to what was extracted from the word processing documents, converted to UTF-8 encoding,  and (2) word-aligned text: the word-aligned text has been normalized, tokenized, aligned at the sentence-level, further broken into sub-sentential "chunk-pairs", and then aligned at the word-level.
2013 Subscription Members are automatically sent all MY2013 data as it is released.  2013 Standard Members are entitled to request 16 corpora for free from MY2013. Non-members may license most data for research use. Visit our Announcements page for information on pricing.

New LDC Podcast, LDC Executive Director, Christopher Cieri

The LDC blog has a new podcast in LDC’s 20th Anniversary series. This edition features LDC’s Executive Director, Christopher Cieri. In this podcast, Chris reflects on the road that took him to LDC, some of his early responsibilities and recent consortium activities.

Click here for Chris’ podcast. Other podcasts will be published via the LDC blog, so stay tuned to that space.

New publications

(1) GALE Phase 2 Arabic Broadcast Conversation Speech Part 1 was developed by LDC and comprises approximately 123 hours of Arabic broadcast conversation speech collected in 2006 and 2007 by LDC as part of the DARPA GALE (Global Autonomous Language Exploitation) Program. Broadcast audio for the DARPA GALE program was collected at LDC’s Philadelphia, PA, USA facilities and at three remote collection sites.

The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources for a total of over 30,000 hours of collected broadcast audio over the life of the program.

LDC's local broadcast collection system is highly automated, easily extensible and robust, and is capable of collecting, processing and evaluating hundreds of hours of content from several dozen sources per day. The broadcast material is served to the system by a set of free-to-air (FTA) satellite receivers, commercial direct satellite systems (DSS) such as DirecTV, direct broadcast satellite (DBS) receivers, and cable television (CATV) feeds. The mapping between receivers and recorders is dynamic and modular; all signal routing is performed under computer control, using a 256x64 A/V matrix switch. Programs are recorded in a high bandwidth A/V format and are then processed to extract audio, to generate keyframes and compressed audio/video, to produce time-synchronized closed captions (in the case of North American English) and to generate automatic speech recognition (ASR) output.

The broadcast conversation recordings in this release feature interviews, call-in programs and round table discussions focusing principally on current events from several sources. This release contains 143 audio files presented as 16000 Hz, single-channel, 16-bit PCM .wav files. Each file was audited by a native Arabic speaker following Audit Procedure Specification Version 2.0, which is included in this release. The broadcast auditing process served three principal goals: as a check on the operation of LDC's broadcast collection system equipment by identifying failed, incomplete or faulty recordings; as an indicator of broadcast schedule changes by identifying instances when the incorrect program was recorded; and as a guide for data selection by retaining information about a program's genre, data type and topic.
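
For readers who want to confirm that a downloaded file matches the stated audio format (16000 Hz, single-channel, 16-bit PCM), the following is a minimal sketch using Python's standard `wave` module. The function names and the file path passed in are illustrative assumptions, not part of the release.

```python
import wave

# Format stated for this release: mono, 16-bit (2 bytes per sample), 16000 Hz.
EXPECTED = {"channels": 1, "sample_width_bytes": 2, "sample_rate_hz": 16000}


def describe_wav(path):
    """Read the header of a PCM .wav file and return its basic properties."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate if rate else 0.0,
        }


def matches_release_format(info):
    """True if the file has the channel count, sample width and rate above."""
    return all(info[key] == value for key, value in EXPECTED.items())
```

Calling `matches_release_format(describe_wav("some_file.wav"))` on each file (the file name here is hypothetical) gives a quick sanity check before further processing.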

GALE Phase 2 Arabic Broadcast Conversation Speech Part 1 is distributed on 4 DVDs.
2013 Subscription Members will automatically receive two copies of this data. 2013 Standard Members may request a copy as part of their 16 free membership corpora.

*

(2) GALE Phase 2 Arabic Broadcast Conversation Transcripts - Part 1 was developed by LDC and contains transcriptions of approximately 123 hours of Arabic broadcast conversation speech collected in 2006 and 2007 by LDC; MediaNet (Tunis, Tunisia); and MTC (Rabat, Morocco) during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) program. The source broadcast conversation recordings feature interviews, call-in programs and round table discussions focusing principally on current events from several sources.

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 752,747 tokens. The transcripts were created with the LDC-developed transcription tool, XTrans, a multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings. 
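
As a rough illustration of working with tab-delimited, UTF-8 transcript files, the sketch below counts whitespace-separated tokens in one column. The column layout of the TDF files is not described here, so the `text_column` index and the `skip_rows` header count are assumptions the caller must supply after consulting the release documentation; the tokenization is also only an approximation and may not reproduce the published token total.

```python
import csv


def count_tokens(tdf_path, text_column, skip_rows=0):
    """Count whitespace-separated tokens in one column of a tab-delimited file.

    text_column: zero-based index of the transcript text field (assumed; check
                 the release documentation for the real layout).
    skip_rows:   number of leading header or comment rows to ignore (assumed).
    """
    total = 0
    with open(tdf_path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quotation marks in the transcript as literal text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for index, row in enumerate(reader):
            if index < skip_rows or len(row) <= text_column:
                continue
            total += len(row[text_column].split())
    return total
```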

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and quick rich transcription specification (QRTR), both of which are included in the documentation with this release. QTR transcription consists of quick (near-)verbatim, time-aligned transcripts plus speaker identification with minimal additional mark-up. It does not include sentence unit annotation. QRTR annotation adds structural information such as topic boundaries and manual sentence unit annotation to the core components of a quick transcript. Files with QTR in the filename were developed using QTR transcription; files with QRTR in the filename were developed using QRTR transcription.

GALE Phase 2 Arabic Broadcast Conversation Transcripts - Part 1 is distributed via web download. 2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora.
*

(3) NIST 2012 Open Machine Translation (OpenMT) Evaluation was developed by the NIST Multimodal Information Group. This release contains source data, reference translations and scoring software used in the NIST 2012 OpenMT evaluation, specifically for the Chinese-to-English language pair track. The package was compiled and scoring software was developed at NIST, making use of Chinese newswire and web data and reference translations collected and developed by LDC. The objective of the OpenMT evaluation series is to support research in, and help advance the state of the art of, machine translation (MT) technologies -- technologies that translate text between human languages. Input may include all forms of text. The goal is for the output to be an adequate and fluent translation of the original.

The 2012 task was to evaluate five language pairs: Arabic-to-English, Chinese-to-English, Dari-to-English, Farsi-to-English and Korean-to-English. This release consists of the material used in the Chinese-to-English language pair track. For more general information about the NIST OpenMT evaluations, please refer to the NIST OpenMT website.

This evaluation kit includes a single Perl script (mteval-v13a.pl) that may be used to produce a translation quality score for one (or more) MT systems. The script works by comparing the system output translation with a set of (expert) reference translations of the same source text. Comparison is based on finding sequences of words in the reference translations that match word sequences in the system output translation.
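
The idea of matching word sequences between a system translation and several references can be sketched as follows. This is a simplified illustration in Python, not a reimplementation of mteval-v13a.pl: it only counts clipped n-gram matches in the style of BLEU-type metrics, uses plain whitespace tokenization, and all function names are assumptions made for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_matches(system, references, n):
    """Return (matched, total) n-gram counts for one system segment.

    Each system n-gram is credited at most as many times as it appears in
    the single reference where it is most frequent ("clipping").
    """
    system_counts = ngrams(system.split(), n)
    max_reference = Counter()
    for reference in references:
        for gram, count in ngrams(reference.split(), n).items():
            max_reference[gram] = max(max_reference[gram], count)
    matched = sum(min(count, max_reference[gram])
                  for gram, count in system_counts.items())
    return matched, sum(system_counts.values())
```

Dividing `matched` by `total` gives an n-gram precision for the segment; a full metric would combine several values of n and account for translation length.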

This release contains 222 documents with corresponding source and reference files, the latter of which contain four independent human reference translations of the source data. The source data consists of Chinese newswire and web data collected by LDC in 2011. A portion of the web data concerned the topic of food and was treated as a restricted domain. The table below displays statistics by source, genre, documents, segments and source tokens.

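Source                    | Genre    | Documents | Segments | Source Tokens
--------------------------|----------|-----------|----------|--------------
Chinese General           | Newswire | 45        | 400      | 18184
Chinese General           | Web Data | 28        | 420      | 15181
Chinese Restricted Domain | Web Data | 149       | 2184     | 48422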

The token counts for Chinese data are "character" counts, which were obtained by counting tokens matching the Unicode-based regular expression "\w". The Python "re" module was used to obtain those counts.
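
A character count of this kind can be approximated with a few lines of Python. The sketch below assumes Python 3, where `\w` in a string pattern matches Unicode word characters by default (letters, CJK characters, digits and the underscore, but not punctuation or whitespace). The exact script and flags used to produce the published counts are not given, so the function name and details are assumptions.

```python
import re

# One match per Unicode "word" character; punctuation and spaces are skipped.
WORD_CHAR = re.compile(r"\w")


def count_characters(text):
    """Count characters matching \\w, as a rough per-character token count."""
    return len(WORD_CHAR.findall(text))
```

For example, `count_characters("我爱北京。")` counts the four Chinese characters and ignores the full stop.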

NIST 2012 Open Machine Translation (OpenMT) Evaluation is distributed via web download. 2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora.