Linguistic Data Consortium: April 2013

Checking in with LDC Data Scholarship Recipients

New publications:
GALE Phase 2 Chinese Broadcast Conversation Speech
GALE Phase 2 Chinese Broadcast Conversation Transcripts
NIST 2008-2012 Open Machine Translation (OpenMT) Progress Test Sets

Checking in with LDC Data Scholarship Recipients

The LDC Data Scholarship program provides college and university students with access to LDC data at no cost. Students are asked to complete an application that consists of a proposal describing their intended use of the data and a letter of support from their thesis adviser. LDC introduced the Data Scholarship program during the Fall 2010 semester. Since that time, more than thirty individual students and student research groups have been awarded no-cost copies of LDC data for their research endeavors. Here is an update on the work of a few of the student recipients:

Leili Javadpour - Louisiana State University (USA), Engineering Science. Leili was awarded copies of BBN Pronoun Coreference and Entity Type Corpus (LDC2005T33) and Message Understanding Conference (MUC) 7 (LDC2001T02) for her work in pronominal anaphora resolution. Leili's research involves a learning approach for pronominal anaphora resolution in unstructured text. In this approach, machine learning is applied to a set of new features selected from other computational linguistic research. She evaluated her approach on the BBN Pronoun Coreference and Entity Type Corpus and obtained encouraging results of 89%. Leili's future plans involve evaluating the approach on Message Understanding Conference (MUC) 7 as well as on other genres of annotated text such as stories and conversation transcripts.

Olga Nickolaevna Ladoshko - National Technical University of Ukraine “KPI” (Ukraine), graduate student, Acoustics and Acoustoelectronics. Olga was awarded copies of NTIMIT (LDC93S2) and STC-TIMIT 1.0 (LDC2008S03) for her research in automatic speech recognition for Ukrainian. Olga used NTIMIT in the first phase of her research; one problem she investigated was the influence of telephone communication channels on the reliability of phoneme recognition under different parametrizations and configurations of speech recognition systems built with HTK tools. The second phase involves using NTIMIT to test the algorithm for determining voice in non-stationary noise. Her future work with STC-TIMIT 1.0 will include an experiment to develop an improved speech recognition algorithm, allowing for increased accuracy under noisy conditions.

Genevieve Sapijaszko - University of Central Florida (USA), PhD candidate, Electrical and Computer Engineering. Genevieve was awarded copies of TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1) and YOHO Speaker Verification (LDC94S16) for her work in digital signal processing. Her experiment used vector quantization (VQ) and Euclidean distance to recognize a speaker's identity from features extracted from the speech signal by the following methods: RCC, MFCC, MFCC + ΔMFCC, LPC, LPCC, PLPCC and RASTA PLPCC. Based on the results, in a noise-free environment MFCC (at an average of 94%) is the best feature extraction method when used in conjunction with the VQ model. The addition of ΔMFCC showed no significant improvement in the recognition rate. When comparing three phrases of differing length, the longer two phrases had very similar recognition rates, but the shorter phrase at 0.5 seconds had a noticeably lower recognition rate across methods. When comparing recognition time, MFCC was also faster than other methods. Genevieve and her research team concluded that MFCC in a noise-free environment was the best method in terms of recognition rate and recognition time.

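
For readers unfamiliar with VQ-based speaker identification, the general idea can be sketched as follows. This is a minimal illustration and not Genevieve's actual system: it trains one codebook per speaker with plain k-means and picks the speaker whose codebook gives the lowest average Euclidean distortion. Synthetic 2-D vectors stand in for real features such as MFCC, and all function names and parameter values are hypothetical.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def train_codebook(vectors, size, iterations=10, seed=0):
    """Build a VQ codebook with plain k-means (Lloyd's algorithm)."""
    rng = random.Random(seed)
    codebook = rng.sample(vectors, size)
    for _ in range(iterations):
        clusters = [[] for _ in codebook]
        for v in vectors:
            nearest = min(range(size), key=lambda i: euclidean(v, codebook[i]))
            clusters[nearest].append(v)
        # Move each codeword to its cluster mean; keep it if the cluster is empty.
        codebook = [
            [sum(col) / len(c) for col in zip(*c)] if c else cw
            for c, cw in zip(clusters, codebook)
        ]
    return codebook


def distortion(vectors, codebook):
    """Average distance from each vector to its nearest codeword."""
    return sum(min(euclidean(v, cw) for cw in codebook) for v in vectors) / len(vectors)


def identify(vectors, codebooks):
    """Return the speaker whose codebook yields the lowest distortion."""
    return min(codebooks, key=lambda name: distortion(vectors, codebooks[name]))


def fake_features(center, n, rng):
    """Synthetic stand-in for per-frame features (assumption, not real MFCC)."""
    return [[rng.gauss(c, 0.5) for c in center] for _ in range(n)]


rng = random.Random(1)
codebooks = {
    "speaker_a": train_codebook(fake_features([0.0, 0.0], 200, rng), size=4),
    "speaker_b": train_codebook(fake_features([5.0, 5.0], 200, rng), size=4),
}
test_utterance = fake_features([5.0, 5.0], 50, rng)
print(identify(test_utterance, codebooks))  # speaker_b
```

In a real system the synthetic vectors would be replaced by features extracted from each frame of the speech signal, which is where the choice among the methods listed above matters.
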
John Steinberg - Temple University (USA), MS candidate, Electrical and Computer Engineering. John was awarded copies of CALLHOME Mandarin Chinese Lexicon (LDC96L15) and CALLHOME Mandarin Chinese Transcripts (LDC96T16) for his work in speech recognition. John used the CALLHOME Mandarin Lexicon and Transcripts to investigate the integration of Bayesian nonparametric techniques into speech recognition systems. These techniques are able to detect the underlying structure of the data and theoretically generate better acoustic models than typical parametric approaches such as hidden Markov models (HMMs). His work investigated using one such model, Dirichlet process mixtures, in conjunction with three variational Bayesian inference algorithms for acoustic modeling. The scope of his work was limited to a phoneme classification problem since John's goal was to determine the viability of these algorithms for acoustic modeling.

One goal of his research group is to develop a speech recognition system that is robust to variations in the acoustic channel. The group is also interested in building acoustic models that generalize well across languages. For these reasons, both CALLHOME English and CALLHOME Mandarin data were used to help determine if these new Bayesian nonparametric models were prone to any language-specific artifacts. These two languages, though phonetically very different, did not yield significantly different performance. Furthermore, one variational inference algorithm, accelerated variational Dirichlet process mixtures (AVDPM), was found to perform well on extremely large data sets.
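
As background on why Dirichlet process mixtures do not need a fixed number of components, the mixture weights can be described by the standard stick-breaking construction. The sketch below draws truncated stick-breaking weights; it illustrates the prior only and is not John's implementation or any of the variational inference algorithms he evaluated. The function name, concentration value and truncation level are assumptions chosen for illustration.

```python
import random


def stick_breaking_weights(alpha, truncation, seed=0):
    """Draw mixture weights from a truncated stick-breaking process.

    Each step breaks off a Beta(1, alpha) fraction of the stick that is left.
    Returns the weights and the leftover mass beyond the truncation level.
    """
    rng = random.Random(seed)
    weights = []
    remaining = 1.0
    for _ in range(truncation):
        fraction = rng.betavariate(1.0, alpha)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    return weights, remaining


weights, leftover = stick_breaking_weights(alpha=2.0, truncation=20)
for index, weight in enumerate(weights[:5], start=1):
    print(f"component {index}: {weight:.3f}")
print(f"mass beyond truncation: {leftover:.6f}")
```

Smaller values of alpha concentrate the mass on a few components, while larger values spread it over many, so the effective number of components is governed by the prior and the data rather than fixed in advance.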

New publications

(1) GALE Phase 2 Chinese Broadcast Conversation Speech (LDC2013S04) was developed by LDC and comprises approximately 120 hours of Chinese broadcast conversation speech collected in 2006 and 2007 by LDC and Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. Corresponding transcripts are released as GALE Phase 2 Chinese Broadcast Conversation Transcripts (LDC2013T08).

Broadcast audio for the GALE program was collected at the Philadelphia, PA, USA facilities of LDC and at three remote collection sites: HKUST (Chinese); Medianet, Tunis, Tunisia (Arabic); and MTC, Rabat, Morocco (Arabic). The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources, for a total of over 30,000 hours of collected broadcast audio over the life of the program.

The broadcast conversation recordings in this release feature interviews, call-in programs and roundtable discussions focusing principally on current events from the following sources: Anhui TV, a regional television station in Mainland China, Anhui Province; China Central TV (CCTV), a national and international broadcaster in Mainland China; Hubei TV, a regional broadcaster in Mainland China, Hubei Province; and Phoenix TV, a Hong Kong-based satellite television station. A table showing the number of programs and hours recorded from each source is contained in the readme file.

This release contains 202 audio files presented in Waveform Audio File format (.wav), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Chinese speaker following Audit Procedure Specification Version 2.0, which is included in this release. The broadcast auditing process served three principal goals: as a check on the operation of the broadcast collection system equipment by identifying failed, incomplete or faulty recordings; as an indicator of broadcast schedule changes by identifying instances when the incorrect program was recorded; and as a guide for data selection by retaining information about the genre, data type and topic of a program.
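
A quick way to confirm that a file matches the stated format (16000 Hz, single-channel, 16-bit PCM) is to read its header with Python's standard wave module. This is a minimal sketch; the helper names and the example path are hypothetical and do not refer to actual files in the release.

```python
import wave

# Format stated for this release: mono, 16-bit (2 bytes per sample), 16000 Hz.
EXPECTED = {"channels": 1, "sample_width_bytes": 2, "sample_rate_hz": 16000}


def describe_wav(path):
    """Read basic header information from a PCM .wav file."""
    with wave.open(path, "rb") as wav:
        frame_rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": frame_rate,
            "duration_seconds": wav.getnframes() / frame_rate,
        }


def matches_release_format(info):
    """True if the header values agree with the format stated above."""
    return all(info[key] == value for key, value in EXPECTED.items())


# Usage (hypothetical path):
# info = describe_wav("example_recording.wav")
# print(info, matches_release_format(info))
```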

GALE Phase 2 Chinese Broadcast Conversation Speech is distributed on 4 DVD-ROMs. 2013 Subscription Members will automatically receive two copies of this data. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) GALE Phase 2 Chinese Broadcast Conversation Transcripts (LDC2013T08) was developed by LDC and contains transcriptions of approximately 120 hours of Chinese broadcast conversation speech collected in 2006 and 2007 by LDC and Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. Corresponding audio data is released as GALE Phase 2 Chinese Broadcast Conversation Speech (LDC2013S04).

The source broadcast conversation recordings feature interviews, call-in programs and roundtable discussions focusing principally on current events from the following sources: Anhui TV, a regional television station in Mainland China, Anhui Province; China Central TV (CCTV), a national and international broadcaster in Mainland China; Hubei TV, a regional broadcaster in Mainland China, Hubei Province; and Phoenix TV, a Hong Kong-based satellite television station.

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 1,523,373 tokens. The transcripts were created with the LDC-developed transcription tool, XTrans, a multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings.

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC’s quick transcription guidelines (QTR) and quick rich transcription specification (QRTR), both of which are included in the documentation with this release. QTR transcription consists of quick (near-)verbatim, time-aligned transcripts plus speaker identification with minimal additional mark-up. It does not include sentence unit annotation. QRTR annotation adds structural information such as topic boundaries and manual sentence unit annotation to the core components of a quick transcript. Files with QTR in the filename were developed using QTR transcription. Files with QRTR in the filename were developed using QRTR transcription.
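
Because the transcription style is encoded in the filename and the files are tab-delimited UTF-8 text, a short script can sort files by style and read their rows. This is a sketch under stated assumptions: the function names and the example filenames are hypothetical, and no particular column layout is assumed because the column definitions are given in the release documentation rather than here.

```python
import csv


def transcription_style(filename):
    """Classify a transcript file as QTR or QRTR from its filename."""
    name = filename.upper()
    if "QRTR" in name:
        return "QRTR"
    if "QTR" in name:
        return "QTR"
    return "unknown"


def read_tdf_rows(path):
    """Yield each non-empty row of a tab-delimited UTF-8 file as a list of fields."""
    with open(path, encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if row:
                yield row


# Hypothetical filenames, for illustration only.
for example in ("EXAMPLE_PROGRAM_A.qtr.tdf", "EXAMPLE_PROGRAM_B.qrtr.tdf"):
    print(example, "->", transcription_style(example))
```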

GALE Phase 2 Chinese Broadcast Conversation Transcripts is distributed via web download. 2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) NIST 2008-2012 Open Machine Translation (OpenMT) Progress Test Sets (LDC2013T07) was developed by NIST Multimodal Information Group. This release contains the evaluation sets (source data and human reference translations), DTD, scoring software, and evaluation plans for the Arabic-to-English and Chinese-to-English progress test sets for the NIST OpenMT 2008, 2009, and 2012 evaluations. The test data remained unseen between evaluations and was reused unchanged each time. The package was compiled, and scoring software was developed, at NIST, making use of Chinese and Arabic newswire and web data and reference translations collected and developed by LDC.

The objective of the OpenMT evaluation series is to support research in, and help advance the state of the art of, machine translation (MT) technologies -- technologies that translate text between human languages. Input may include all forms of text. The goal is for the output to be an adequate and fluent translation of the original.

The MT evaluation series started in 2001 as part of the DARPA TIDES (Translingual Information Detection, Extraction and Summarization) program. Beginning with the 2006 evaluation, the evaluations have been driven and coordinated by NIST as NIST OpenMT. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities in MT. The OpenMT evaluations are intended to be of interest to all researchers working on the general problem of automatic translation between human languages. To this end, they are designed to be simple, to focus on core technology issues and to be fully supported. For more general information about the NIST OpenMT evaluations, please refer to the NIST OpenMT website.

This evaluation kit includes a single Perl script (mteval-v13a.pl) that may be used to produce a translation quality score for one (or more) MT systems. The script works by comparing the system output translation with a set of (expert) reference translations of the same source text. Comparison is based on finding sequences of words in the reference translations that match word sequences in the system output translation.
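
The idea of matching word sequences between system output and reference translations can be illustrated with clipped n-gram precision. The sketch below is a simplified illustration of that idea only; it is not the scoring performed by mteval-v13a.pl, and the function names and example sentences are made up for this purpose.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous word sequence of length n."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, references, n):
    """Fraction of candidate n-grams that also occur in some reference.

    Each n-gram is credited at most as many times as it appears in the
    single reference where it is most frequent.
    """
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return matched / sum(cand_counts.values())


system_output = "the cat sat on the mat".split()
reference_translations = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]
print(clipped_precision(system_output, reference_translations, 1))  # 5 of 6 unigrams match
print(clipped_precision(system_output, reference_translations, 2))  # 3 of 5 bigrams match
```

Having several independent reference translations, as in this release, gives a system credit for any of several valid wordings.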

This release contains 257 documents (2,748 segments) with corresponding source and reference files, the latter of which contain four independent human reference translations of the source data. The source data comprises Arabic and Chinese newswire and web data collected by LDC in 2007. The table below displays statistics by source, genre, documents, segments and source tokens.

Source	Genre	Documents	Segments	Source Tokens
Arabic	Newswire	84	784	20039
Arabic	Web Data	51	594	14793
Chinese	Newswire	82	688	26923
Chinese	Web Data	40	682	19112
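
The overall totals implied by the table can be checked with a few lines of Python. The figures are copied from the rows above; the variable names are arbitrary.

```python
# (source, genre, documents, segments, source tokens), copied from the table.
rows = [
    ("Arabic", "Newswire", 84, 784, 20039),
    ("Arabic", "Web Data", 51, 594, 14793),
    ("Chinese", "Newswire", 82, 688, 26923),
    ("Chinese", "Web Data", 40, 682, 19112),
]

fields = ("documents", "segments", "source_tokens")
totals = {name: sum(row[i + 2] for row in rows) for i, name in enumerate(fields)}
print(totals)  # {'documents': 257, 'segments': 2748, 'source_tokens': 80867}
```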

NIST 2008-2012 Open Machine Translation (OpenMT) Progress Test Sets is distributed via web download. 2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Linguistic Data Consortium

Monday, April 15, 2013