Fall 2013 LDC Data Scholarship Recipients
New publications
Fall 2013 LDC Data Scholarship Recipients
LDC is pleased to announce the student recipients of the Fall 2013 LDC Data Scholarship program. This program provides university and college students with access to LDC data at no cost. Students were asked to complete an application consisting of a proposal describing their intended use of the data and a letter of support from their thesis adviser. We received many solid applications and have chosen six proposals to support. The following students will receive no-cost copies of LDC data:
Shamama Afnan - Clemson University (USA), MS candidate, Electrical Engineering. Shamama has been awarded a copy of 2008 NIST Speaker Recognition Training and Test data for her work in speaker recognition.
Seyedeh Firoozabadi - University of Connecticut (USA), PhD candidate, Biomedical Engineering. Seyedeh has been awarded a copy of TIDIGITS and TI-46 Word for her work in speech recognition.
Lei Liu - Beijing Foreign Studies University (China), PhD candidate, Foreign Language Education. Lei has been awarded a copy of Treebank-3 and Prague Czech-English Dependency Treebank 2.0 for his work in parsing.
Monisankha Pal - Indian Institute of Technology, Kharagpur (India), PhD candidate, Electronics and Electrical Communication Engineering. Monisankha has been awarded a copy of CSR-I (WSJ0) and CSR-II (WSJ1) for his work in speaker recognition.
Sachin Pawar - Indian Institute of Technology, Bombay (India), PhD candidate, Computer Science and Engineering. Sachin has been awarded a copy of ACE 2004 Multilingual Training Corpus for his work in named-entity recognition.
Sergio Silva - Federal University of Rio Grande do Sul (Brazil), MS candidate, Computer Science. Sergio has been awarded a copy of 2004 and 2005 Spring NIST Rich Transcription data for his work in diarization.
New publications
(1) GALE Phase 2 Chinese Broadcast News Speech was developed by LDC and comprises approximately 126 hours of Mandarin Chinese broadcast news speech collected in 2006 and 2007 by the Linguistic Data Consortium (LDC) and Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program.
Corresponding
transcripts are released as GALE Phase 2 Chinese Broadcast News
Transcripts (LDC2013T20).
Broadcast
audio for the GALE program was collected at LDC's Philadelphia,
PA USA facilities and at three remote collection sites: HKUST
(Chinese), Medianet (Tunis, Tunisia) (Arabic), and MTC (Rabat,
Morocco) (Arabic). The combined local and outsourced broadcast
collection supported GALE at a rate of approximately 300 hours
per week of programming from more than 50 broadcast sources for
a total of over 30,000 hours of collected broadcast audio over
the life of the program.
The broadcast news recordings in this release focus principally on current events and come from the following sources: Anhui TV, a regional television station in Anhui Province, Mainland China; China Central TV (CCTV), a national and international broadcaster in Mainland China; and Phoenix TV, a Hong Kong-based satellite television station.
This release contains 248 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Chinese speaker following Audit Procedure Specification Version 2.0, which is included in this release. The broadcast auditing process served three principal goals: as a check on the operation of the broadcast collection system equipment by identifying failed, incomplete or faulty recordings; as an indicator of broadcast schedule changes by identifying instances when the incorrect program was recorded; and as a guide for data selection by retaining information about a program's genre, data type and topic.
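For a rough sense of scale, the stated audio format implies the uncompressed size worked out below. This is only a back-of-the-envelope sketch: it assumes exactly 126 hours of audio (the release gives this figure as approximate) and ignores file headers and FLAC compression, so the distributed files will be smaller.

```python
# Back-of-the-envelope size of the uncompressed PCM audio.
# Assumption: exactly 126 hours; the release says "approximately 126 hours".
SAMPLE_RATE_HZ = 16_000   # 16000 Hz
BYTES_PER_SAMPLE = 2      # 16-bit PCM
CHANNELS = 1              # single-channel


def pcm_bytes(hours: float) -> int:
    """Return the raw PCM size in bytes for the given duration in hours."""
    seconds = hours * 3600
    return int(seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


total_bytes = pcm_bytes(126)
print(f"{total_bytes / 1e9:.1f} GB uncompressed")  # prints "14.5 GB uncompressed"
```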
GALE Phase 2 Chinese Broadcast News Speech is distributed on two DVD-ROMs. 2013 Subscription Members will automatically receive two copies of this data. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Nonmembers may license this data for a fee.
*
(2) GALE Phase 2 Chinese Broadcast News Transcripts was developed by LDC and contains transcriptions of approximately 110 hours of Chinese broadcast news speech collected in 2006 and 2007 by LDC and Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program.
Corresponding
audio data is released as GALE Phase 2 Chinese Broadcast News
Speech (LDC2013S08).
The
transcript files are in plain-text, tab-delimited format (TDF)
with UTF-8 encoding, and the transcribed data totals 1,593,049
tokens. The transcripts were created with the LDC-developed
transcription tool, XTrans, a multi-platform, multilingual, multi-channel
transcription tool that supports manual transcription and
annotation of audio recordings.
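Because the transcripts are plain-text, tab-delimited and UTF-8 encoded, they can be read with standard tools. The following is a minimal sketch using Python's csv module. The column names are hypothetical placeholders chosen for illustration, not the actual TDF field list, which should be taken from the documentation included with the release.

```python
import csv
import io

# Hypothetical column names, for illustration only; the real TDF field
# list is defined in the corpus documentation, not in this announcement.
EXAMPLE_COLUMNS = ["file", "start", "end", "speaker", "transcript"]


def read_tdf(stream, columns):
    """Yield one dict per non-empty tab-delimited row of a text stream."""
    # QUOTE_NONE keeps quotation marks inside transcripts as literal text.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:  # csv.reader returns [] for a blank line
            continue
        yield dict(zip(columns, row))


# Small in-memory sample standing in for a transcript file.
sample = io.StringIO("show_001\t0.00\t2.35\tspeaker1\t大家好\n")
for segment in read_tdf(sample, EXAMPLE_COLUMNS):
    print(segment["speaker"], segment["transcript"])
```

For a file on disk, the same function could be given a stream opened with open(path, encoding="utf-8", newline="").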
The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC’s quick transcription guidelines (QTR) and quick rich transcription specification (QRTR), both of which are included in the documentation with this release. QTR transcription consists of quick (near-)verbatim, time-aligned transcripts plus speaker identification with minimal additional mark-up. It does not include sentence unit annotation. QRTR annotation adds structural information, such as topic boundaries and manual sentence unit annotation, to the core components of a quick transcript.
GALE Phase 2 Chinese Broadcast News Transcripts is distributed via web download. Subscription Members will automatically receive two copies of this data. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Nonmembers may license this data for a fee.
*
(3) OntoNotes Release 5.0 is the final release of the OntoNotes project, a collaborative effort between BBN Technologies, the University of Colorado, the University of Pennsylvania and the University of Southern California's Information Sciences Institute. The goal of the project was to annotate a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, Usenet newsgroups, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference).
OntoNotes Release 5.0 contains the content of earlier releases -- OntoNotes Release 1.0 LDC2007T21, OntoNotes Release 2.0 LDC2008T04, OntoNotes Release 3.0 LDC2009T24 and OntoNotes Release 4.0 LDC2011T03 -- and adds source data from, and/or additional annotations for, newswire (News), broadcast news (BN), broadcast conversation (BC), telephone conversation (Tele) and web data (Web) in English and Chinese and newswire data in Arabic. Also contained is English pivot text (Old Testament and New Testament text). This cumulative publication consists of 2.9 million words.
The
OntoNotes project built on two time-tested resources, following
the Penn Treebank for syntax and the Penn PropBank for predicate-argument structure. Its semantic
representation includes word sense disambiguation for nouns and
verbs, with some word senses connected to an ontology, and
coreference.
Documents
describing the annotation guidelines and the routines for
deriving various views of the data from the database are
included in the documentation directory of this release. The
annotation is provided both in separate text files for each
annotation layer (Treebank, PropBank, word sense, etc.) and in
the form of an integrated relational database
(ontonotes-v5.0.sql.gz) with a Python API to provide convenient
cross-layer access.
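Before loading the database, it can be useful to see which tables the dump defines. The sketch below scans a gzipped SQL dump for CREATE TABLE statements using only the Python standard library. It assumes the dump is plain SQL text containing such statements, and it does not use the Python API shipped with the release, whose interface is not described here. The function name list_tables is a hypothetical helper, not part of that API.

```python
import gzip
import re

# Captures the table name in statements such as: CREATE TABLE `name` (
CREATE_TABLE = re.compile(
    r"CREATE TABLE\s+(?:IF NOT EXISTS\s+)?[`\"]?(\w+)", re.IGNORECASE
)


def list_tables(dump_path):
    """Return table names from CREATE TABLE statements in a gzipped SQL dump."""
    tables = []
    # Read as text line by line so the whole dump is never held in memory.
    with gzip.open(dump_path, "rt", encoding="utf-8", errors="replace") as dump:
        for line in dump:
            match = CREATE_TABLE.search(line)
            if match:
                tables.append(match.group(1))
    return tables
```

With the release file in the current directory, list_tables("ontonotes-v5.0.sql.gz") would return the table names in the order they appear in the dump.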
OntoNotes Release 5.0 is distributed on one DVD-ROM. Subscription Members will automatically receive two copies of this data. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Nonmembers may license this data at no charge, subject to shipping and handling fees.