Renew your LDC membership today
Reduced fees for Treebank-2
and Treebank-3
LDC to close for Winter Break
New publications:
Renew your LDC membership today
Membership Year 2015 (MY2015) discounts are available for those who keep their membership current and join early in the year. Check here for further information including our planned publications for MY2015.
Now is also a good time to consider joining LDC for the current and open membership years, MY2014 and MY2013. MY2014 offers members an impressive 37 publications which include UN speech data, 2009 NIST LRE test set, 2007 ACE multilingual data, and multi-channel WSJ audio. MY2013 remains open through the end of the 2014 calendar year and its publications include Mixer 6 speech, Greybeard, UN parallel text and CSC Deceptive Speech as well as updates to Chinese Treebank and Chinese Proposition Bank. For full descriptions of these data sets, visit our Catalog.
The deadline for the Spring 2015 LDC Data
Scholarship Program is right around the corner! Student
applications are being accepted now through January 15, 2015,
11:59PM EST. The LDC Data Scholarship program provides university
students with access to LDC data at no cost. This program is open
to students pursuing undergraduate or graduate studies in an
accredited college or university. LDC Data Scholarships are not
restricted to any particular field of study; however, students
must demonstrate a well-developed research agenda and a bona fide
inability to pay.
Students will need to complete an application which consists of a data use proposal and letter of support from their adviser. For further information on application materials and program rules, please visit the LDC Data Scholarship page.
Students can email their applications to the LDC Data Scholarships program. Decisions will be sent by email from the same address.
Treebank-2 (LDC95T7) and
Treebank-3 (LDC99T42)
are now available to non-members at reduced fees, US$1500 for
Treebank-2 and US$1700 for Treebank-3, reductions of 52% and 47%,
respectively.
LDC to close for Winter Break
LDC will be closed from December 25, 2014
through January 2, 2015 in accordance with the University of
Pennsylvania Winter Break Policy. Our offices will reopen on
January 5, 2015. Requests for membership renewals and corpora
received during Winter Break will be processed at that time.
Best wishes for a relaxing holiday season!
New publications
(1) Benchmarks for Open Relation Extraction was developed by the University of Alberta and contains annotations for approximately 14,000 sentences from The New York Times Annotated Corpus (LDC2008T19) and Treebank-3 (LDC99T42). The corpus provides benchmarks for the task of open relation extraction (ORE), along with sample extractions from ORE methods and evaluation scripts for computing a method's precision and recall.
ORE attempts to extract as many relations as are described in a corpus without relying on relation-specific training data. The traditional approach to relation extraction requires substantial training effort for each relation of interest, which can be impractical for massive collections such as those found on the web. Open relation extraction offers an alternative by extracting previously unseen relations as they are encountered. It does not require training data for any particular relation, making it suitable for applications that require a large (or even unknown) number of relations. Results published in the ORE literature are often not comparable due to the lack of reusable annotations and differences in evaluation methodology. The goal of this benchmark data set is to provide annotations that are flexible enough to evaluate a wide range of methods.
Binary and n-ary relations were extracted from
the text sources. Sentences were annotated for binary relations
manually and automatically. In the manual annotation, annotators
identified two entities and, if a relation existed between them, a
trigger (a single token indicating the relation). A window of tokens
allowed to appear in a relation string was also specified, including
modifiers of the trigger and prepositions connecting triggers to
their arguments. For each sentence annotated with two entities, a
system must extract a string representing the relation between
them. The evaluation method deemed an extraction as correct if it
contained the trigger and allowed tokens only. The automatic
annotator identified pairs of entities and a trigger of the
relation between them; the evaluation script for that experiment
deemed an extraction correct if it contained the annotated
trigger. For n-ary relations, sentences were annotated with one
relation trigger and all of its arguments. An extracted argument
was deemed correct if it was annotated in the sentence.
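The binary-relation scoring rules described above (an extraction is correct if it contains the annotated trigger and consists only of allowed tokens) can be sketched roughly as follows. This is an illustrative sketch, not the corpus's actual evaluation script; the function names and data shapes are assumptions for demonstration.

```python
# Sketch of trigger-based scoring for binary open relation extraction.
# An extracted relation string counts as correct if it contains the
# annotated trigger and uses only tokens from the allowed window.

def is_correct(extraction, trigger, allowed_tokens):
    """Return True if the extraction contains the trigger and
    consists only of allowed tokens."""
    tokens = extraction.split()
    return trigger in tokens and all(t in allowed_tokens for t in tokens)

def precision_recall(extractions, annotations):
    """extractions: list of (sentence_id, relation_string) from a system.
    annotations: dict sentence_id -> (trigger, allowed_tokens) for
    sentences annotated with a relation."""
    correct = 0
    for sid, rel in extractions:
        if sid in annotations:
            trigger, allowed = annotations[sid]
            if is_correct(rel, trigger, allowed):
                correct += 1
    precision = correct / len(extractions) if extractions else 0.0
    recall = correct / len(annotations) if annotations else 0.0
    return precision, recall
```

Under this rule, an extraction that paraphrases the relation but omits the trigger token is scored as incorrect, which is what makes annotations reusable across systems with different output vocabularies.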
Benchmarks for Open Relation Extraction is
distributed via web download.
2014 Subscription Members will automatically
receive two copies of this data provided they have completed a
copy of the user
agreement. 2014 Standard Members may request a copy as part
of their 16 free membership corpora. Non-members may license this
data for a fee.
(2) Fisher and CALLHOME
Spanish--English Speech Translation was developed at Johns
Hopkins University and contains English reference translations
and speech recognizer output (in various forms) that complement
the LDC Fisher Spanish (LDC2010T04) and CALLHOME Spanish audio
and transcript releases (LDC96T17). Together, they make a
four-way parallel text dataset representing approximately 38
hours of speech, with defined training, development, and
held-out test sets.
The
source data are the Fisher Spanish and CALLHOME Spanish corpora
developed by LDC, comprising transcribed telephone conversations
between (mostly native) Spanish speakers in a variety of
dialects. The Fisher Spanish data set consists of 819
transcribed conversations on an assortment of provided topics
primarily between strangers, resulting in approximately 160
hours of speech aligned at the utterance level, with 1.5 million
tokens. The CALLHOME Spanish corpus comprises 120 transcripts of
spontaneous conversations primarily between friends and family
members, resulting in approximately 20 hours of speech aligned
at the utterance level, with just over 200,000 words (tokens) of
transcribed text.
Translations
were obtained by crowdsourcing using Amazon's Mechanical Turk,
after which the data was split into training, development, and
test sets. The CALLHOME data set defines its own data splits,
organized into train, devtest, and evltest, which were retained
here. For the Fisher material, four data splits were produced: a
large training section and three test sets. These test sets
correspond to portions of the data where four translations
exist.
Fisher and CALLHOME Spanish--English Speech
Translation is distributed via web download.
2014
Subscription Members will automatically receive two copies of
this data on disc. 2014 Standard Members may request a copy as
part of their 16 free membership corpora. Non-members may
license this data for a fee.
*
(3) GALE Phase 3
Chinese Broadcast Conversation Speech Part 1 was developed
by LDC and comprises approximately 126 hours of Mandarin
Chinese broadcast conversation speech collected in 2007 by LDC and
the Hong Kong University of Science and Technology (HKUST), Hong Kong,
during Phase 3 of the DARPA GALE (Global Autonomous Language
Exploitation) Program.
Corresponding transcripts are released as GALE
Phase 3 Chinese Broadcast Conversation Transcripts Part 1 (LDC2014T28).
Broadcast audio for the GALE program was
collected at LDC’s Philadelphia, PA USA facilities and at three
remote collection sites: HKUST (Chinese), Medianet (Tunis,
Tunisia) (Arabic), and MTC (Rabat, Morocco) (Arabic). The combined
local and outsourced broadcast collection supported GALE at a rate
of approximately 300 hours per week of programming from more than
50 broadcast sources for a total of over 30,000 hours of collected
broadcast audio over the life of the program. HKUST collected
Chinese broadcast programming using its internal recording system
and a portable broadcast collection platform designed by LDC and
installed at HKUST in 2006.
The broadcast conversation recordings in this
release feature interviews, call-in programs, and roundtable
discussions focusing principally on current events from the
following sources: Anhui TV, a regional television station in
Anhui Province, China; Beijing TV, a national television station
in China; China Central TV (CCTV), a Chinese national and
international broadcaster; Hubei TV, a regional broadcaster in
Hubei Province, China; and Phoenix TV, a Hong Kong-based satellite
television station.
This release contains 217 audio files presented
in FLAC-compressed
Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit
PCM. Each file was audited by a native Chinese speaker following
Audit Procedure Specification Version 2.0 which is included in
this release. The broadcast auditing process served three
principal goals: as a check on the operation of the broadcast
collection system equipment by identifying failed, incomplete or
faulty recordings, as an indicator of broadcast schedule changes
by identifying instances when the incorrect program was recorded,
and as a guide for data selection by retaining information about a
program’s genre, data type and topic.
GALE Phase 3 Chinese Broadcast Conversation
Speech Part 1 is distributed on two DVD-ROMs.
2014 Subscription Members will automatically receive two copies of this data. 2014 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(4) GALE Phase 3
Chinese Broadcast Conversation Transcripts Part 1 was
developed by LDC and contains transcriptions of approximately 126
hours of Chinese broadcast conversation speech collected in 2007
by LDC and the Hong Kong University of Science and Technology (HKUST),
Hong Kong, during Phase 3 of the DARPA GALE (Global Autonomous Language
Exploitation) Program.
Corresponding audio data is released as GALE
Phase 3 Chinese Broadcast Conversation Speech Part 1 (LDC2014S09).
The source broadcast conversation recordings
feature interviews, call-in programs and roundtable discussions
focusing principally on current events from the following sources:
Anhui TV, a regional television station in Anhui Province, China;
Beijing TV, a national television station in China; China Central
TV (CCTV), a Chinese national and international broadcaster; Hubei
TV, a regional television station in Hubei Province, China; and
Phoenix TV, a Hong Kong-based satellite television station.
The transcript files are in plain-text,
tab-delimited format (TDF) with UTF-8 encoding, and the
transcribed data totals 1,556,904 tokens. The transcripts were
created with the LDC-developed transcription tool, XTrans, a
multi-platform, multilingual, multi-channel transcription tool
that supports manual transcription and annotation of audio
recordings. XTrans is available at
https://www.ldc.upenn.edu/language-resources/tools/xtrans.
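As a rough illustration of working with tab-delimited transcript files like these, the sketch below parses TDF-style lines into segment records. The column layout shown (file, channel, start, end, speaker, transcript) is an assumption for demonstration only; the documentation shipped with the release defines the exact field order.

```python
import csv
import io

def read_tdf(text):
    """Parse tab-delimited transcript content into segment dicts.
    Lines beginning with ';;' are treated as comment/metadata lines.
    NOTE: the column order assumed here is illustrative, not the
    release's authoritative field layout."""
    segments = []
    reader = csv.reader(
        (line for line in io.StringIO(text) if not line.startswith(";;")),
        delimiter="\t",
    )
    for row in reader:
        if len(row) < 6:
            continue  # skip malformed or header rows
        segments.append({
            "file": row[0],
            "channel": int(row[1]),
            "start": float(row[2]),  # segment start time in seconds
            "end": float(row[3]),    # segment end time in seconds
            "speaker": row[4],
            "transcript": row[5],    # UTF-8 transcript text
        })
    return segments
```

Because the files are UTF-8, opening them with an explicit `encoding="utf-8"` argument avoids mojibake for the Chinese transcript text.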
The files in this corpus were transcribed by
LDC staff and/or by transcription vendors under contract to LDC.
Transcribers followed LDC's quick transcription guidelines (QTR)
and quick rich transcription specification (QRTR) both of which
are included in the documentation with this release. QTR
transcription consists of quick (near-) verbatim, time-aligned
transcripts plus speaker identification with minimal additional
mark-up. It does not include sentence unit annotation. QRTR
annotation adds structural information such as topic boundaries
and manual sentence unit annotation to the core components of a
quick transcript. Files with QTR in the filename were transcribed
under the QTR guidelines; files with QRTR in the filename follow the
QRTR specification.
GALE Phase 3 Chinese Broadcast Conversation
Transcripts Part 1 is distributed via web download.
2014 Subscription Members will
automatically receive two copies of this data on disc. 2014
Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for
a fee.