Linguistic Data Consortium: October 2013

Fall 2013 LDC Data Scholarship Recipients

New publications:

GALE Phase 2 Chinese Broadcast News Speech

GALE Phase 2 Chinese Broadcast News Transcripts

OntoNotes Release 5.0

Fall 2013 LDC Data Scholarship Recipients

LDC is pleased to announce the student recipients of the Fall 2013 LDC Data Scholarship program. This program provides university and college students with access to LDC data at no cost. Students were asked to complete an application consisting of a proposal describing their intended use of the data and a letter of support from their thesis adviser. We received many solid applications and have chosen six proposals to support. The following students will receive no-cost copies of LDC data:

Shamama Afnan - Clemson University (USA), MS candidate, Electrical Engineering. Shamama has been awarded a copy of 2008 NIST Speaker Recognition Training and Test data for her work in speaker recognition.

Seyedeh Firoozabadi - University of Connecticut (USA), PhD candidate, Biomedical Engineering. Seyedeh has been awarded a copy of TIDIGITS and TI-46 Word for her work in speech recognition.

Lei Liu - Beijing Foreign Studies University (China), PhD candidate, Foreign Language Education. Lei has been awarded a copy of Treebank-3 and Prague Czech-English Dependency Treebank 2.0 for his work in parsing.

Monisankha Pal - Indian Institute of Technology, Kharagpur (India), PhD candidate, Electronics and Electrical Communication Engineering. Monisankha has been awarded a copy of CSR-I (WSJ0) and CSR-II (WSJ1) for his work in speaker recognition.

Sachin Pawar - Indian Institute of Technology, Bombay (India), PhD candidate, Computer Science and Engineering. Sachin has been awarded a copy of ACE 2004 Multilingual Training Corpus for his work in named-entity recognition.

Sergio Silva - Federal University of Rio Grande do Sul (Brazil), MS candidate, Computer Science. Sergio has been awarded a copy of 2004 and 2005 Spring NIST Rich Transcription data for his work in diarization.

New publications

(1) GALE Phase 2 Chinese Broadcast News Speech was developed by LDC and comprises approximately 126 hours of Mandarin Chinese broadcast news speech collected in 2006 and 2007 by LDC and Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

Corresponding transcripts are released as GALE Phase 2 Chinese Broadcast News Transcripts (LDC2013T20).

Broadcast audio for the GALE program was collected at LDC's Philadelphia, PA, USA facilities and at three remote collection sites: HKUST (Chinese), Medianet (Tunis, Tunisia) (Arabic), and MTC (Rabat, Morocco) (Arabic). The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours of programming per week from more than 50 broadcast sources, for a total of over 30,000 hours of collected broadcast audio over the life of the program.

The recordings in this release feature news broadcasts focusing principally on current events from the following sources: Anhui TV, a regional television station in Anhui Province, Mainland China; China Central TV (CCTV), a national and international broadcaster in Mainland China; and Phoenix TV, a Hong Kong-based satellite television station.

This release contains 248 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Chinese speaker following Audit Procedure Specification Version 2.0, which is included in this release. The broadcast auditing process served three principal goals: as a check on the operation of the broadcast collection system equipment, by identifying failed, incomplete or faulty recordings; as an indicator of broadcast schedule changes, by identifying instances when the incorrect program was recorded; and as a guide for data selection, by retaining information about a program's genre, data type and topic.
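As a rough sanity check on the format stated above, the uncompressed size of the audio can be estimated from the sample rate, bit depth and channel count. The sketch below assumes exactly 126 hours of audio (the release says "approximately") and ignores file header overhead; the function name is illustrative only.

```python
def pcm_size_bytes(hours, sample_rate_hz=16000, bits_per_sample=16, channels=1):
    """Return the size in bytes of uncompressed PCM audio of the given duration."""
    seconds = hours * 3600
    bytes_per_second = sample_rate_hz * (bits_per_sample // 8) * channels
    return seconds * bytes_per_second

# Assumption: exactly 126 hours; the release states "approximately 126 hours".
size = pcm_size_bytes(126)
print(f"{size / 1e9:.1f} GB uncompressed")  # prints "14.5 GB uncompressed"
```

FLAC compression is lossless, so decoding the files recovers audio in this format; the compressed files themselves are smaller, by an amount that depends on the content.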

GALE Phase 2 Chinese Broadcast News Speech is distributed on two DVD-ROMs. 2013 Subscription Members will automatically receive two copies of this data. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Nonmembers may license this data for a fee.

(2) GALE Phase 2 Chinese Broadcast News Transcripts was developed by LDC and contains transcriptions of approximately 110 hours of Chinese broadcast news speech collected in 2006 and 2007 by LDC and Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

Corresponding audio data is released as GALE Phase 2 Chinese Broadcast News Speech (LDC2013S08).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 1,593,049 tokens. The transcripts were created with the LDC-developed transcription tool, XTrans, a multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings.
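Because the transcripts are tab-delimited UTF-8 text, they can be read with standard tools. The following is a minimal sketch, not the official reader: it assumes that the first non-comment line is a header naming the columns (optionally with a ";type" suffix such as "start;float") and that lines beginning with ";;" are comments. The column names and sample rows below are hypothetical, so the actual layout should be checked against the documentation included in the release.

```python
import csv
import io

def read_tdf(stream):
    """Yield one dict per data row of a tab-delimited transcript stream.

    Assumptions (verify against the release documentation):
    - the first non-comment line is a header naming the columns,
      optionally with a ';type' suffix such as 'start;float';
    - lines starting with ';;' are comments.
    """
    header = None
    # QUOTE_NONE keeps quotation marks in transcripts as literal characters.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith(";;"):
            continue  # skip blank lines and comments
        if header is None:
            header = [name.split(";")[0] for name in row]
            continue
        yield dict(zip(header, row))

# Hypothetical two-segment sample; a real file would be opened with
# open(path, encoding="utf-8", newline="").
sample = io.StringIO(
    "file;unicode\tstart;float\tend;float\tspeaker;unicode\ttranscript;unicode\n"
    ";;hypothetical comment line\n"
    "EXAMPLE_FILE\t0.00\t2.50\tspeaker1\t大家好\n"
    "EXAMPLE_FILE\t2.50\t5.10\tspeaker2\t欢迎收看\n"
)
rows = list(read_tdf(sample))
total_seconds = sum(float(r["end"]) - float(r["start"]) for r in rows)
print(len(rows), "segments,", round(total_seconds, 2), "seconds")
```

Reading the column names from the header rather than hard-coding them keeps the sketch usable even if the real files carry more fields than the five shown here.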

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC’s quick transcription guidelines (QTR) and quick rich transcription specification (QRTR), both of which are included in the documentation with this release. QTR transcription consists of quick (near-)verbatim, time-aligned transcripts plus speaker identification with minimal additional mark-up. It does not include sentence unit annotation. QRTR annotation adds structural information such as topic boundaries and manual sentence unit annotation to the core components of a quick transcript.

GALE Phase 2 Chinese Broadcast News Transcripts is distributed via web download. 2013 Subscription Members will automatically receive two copies of this data. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Nonmembers may license this data for a fee.

(3) OntoNotes Release 5.0 is the final release of the OntoNotes project, a collaborative effort between BBN Technologies, the University of Colorado, the University of Pennsylvania and the University of Southern California's Information Sciences Institute. The goal of the project was to annotate a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference).

OntoNotes Release 5.0 contains the content of earlier releases -- OntoNotes Release 1.0 LDC2007T21, OntoNotes Release 2.0 LDC2008T04, OntoNotes Release 3.0 LDC2009T24 and OntoNotes Release 4.0 LDC2011T03 -- and adds source data from, and/or additional annotations for, newswire (News), broadcast news (BN), broadcast conversation (BC), telephone conversation (Tele) and web data (Web) in English and Chinese and newswire data in Arabic. Also contained is English pivot text (Old Testament and New Testament text). This cumulative publication consists of 2.9 million words.

The OntoNotes project built on two time-tested resources, following the Penn Treebank for syntax and the Penn PropBank for predicate-argument structure. Its semantic representation includes word sense disambiguation for nouns and verbs, with some word senses connected to an ontology, and coreference.

Documents describing the annotation guidelines and the routines for deriving various views of the data from the database are included in the documentation directory of this release. The annotation is provided both in separate text files for each annotation layer (Treebank, PropBank, word sense, etc.) and in the form of an integrated relational database (ontonotes-v5.0.sql.gz) with a Python API to provide convenient cross-layer access.

OntoNotes Release 5.0 is distributed on one DVD-ROM. 2013 Subscription Members will automatically receive two copies of this data. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Nonmembers may license this data at no charge, subject to shipping and handling fees.

Linguistic Data Consortium

Wednesday, October 16, 2013

LDC October 2013 Newsletter