Early renewing members save again
Commercial use and LDC data
New publications:
Coordination Annotation for the Penn Treebank
GALE Phase 3 Chinese Broadcast Conversation Speech Part 2
GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 2
SenSem Lexicons
Early renewing members save again
LDC's early renewal discount program has resulted in substantial
savings for current year members. The 110 organizations that
renewed their membership or joined early for Membership Year 2015
(MY2015) saved over US$65,000 on membership fees. MY2014 members
are still eligible for a 5% discount when renewing through 2015.
LDC membership benefits include free membership year data as well as discounts on older corpora. For-profit members can use most LDC data for commercial applications.
Commercial use and LDC data
For-profit organizations are reminded that an
LDC membership is a prerequisite for obtaining a commercial
license to almost all LDC databases. Non-member organizations,
including non-member for-profit organizations, cannot use LDC data
to develop or test products for commercialization, nor can they
use LDC data in any commercial product or for any commercial
purpose. LDC data users should consult corpus-specific license
agreements for limitations on the use of certain corpora. Visit
our Licensing page for further information.
New publications
(1) Coordination
Annotation for the Penn Treebank is a stand-off
annotation for the Wall Street Journal portion of Treebank-3
(PTB3) (LDC99T42) developed by researchers at the University of
Düsseldorf and Indiana University. It marks all tokens that have
a coordinating
function (potentially among other functions).
Coordination is a syntactic structure that
links together two or more elements known as conjuncts or
conjoins. The presence of coordination is often signaled by the
appearance of a coordinator (coordinating conjunction), such as
"and", "or" and "but" in English.
This annotation is presented in a single UTF-8
plain text tab-separated values (TSV) file with columns as follows:
section: Penn Treebank WSJ section number
file: Number of file within section
sentence: Number of sentence (starting with 0)
token: Number of token (starting with 0)
annotation: "P" if the token is a coordinating punctuation, "O" otherwise
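As an illustration of the column layout described above, here is a minimal Python sketch for reading such a file. It is not part of the release: the names CoordToken, read_annotations and coordinating are hypothetical, the sample rows are invented, and the sketch assumes the five columns appear in the listed order. Because the description does not say whether the file has a header row, rows whose sentence or token field is not numeric are skipped.

```python
import csv
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class CoordToken:
    """One row of the stand-off annotation (hypothetical container)."""
    section: str     # WSJ section number, kept as text to preserve leading zeros
    file: str        # number of the file within the section
    sentence: int    # sentence index, starting with 0
    token: int       # token index, starting with 0
    annotation: str  # "P", "O", or any other label present in the file


def read_annotations(lines: Iterable[str]) -> Iterator[CoordToken]:
    """Yield one CoordToken per well-formed tab-separated row."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 5:
            continue  # blank or malformed line
        section, file_no, sentence, token, annotation = row
        if not (sentence.isdigit() and token.isdigit()):
            continue  # assumed to be a header row
        yield CoordToken(section, file_no, int(sentence), int(token), annotation)


def coordinating(tokens: Iterable[CoordToken]) -> List[CoordToken]:
    """Keep only tokens whose annotation is something other than "O"."""
    return [t for t in tokens if t.annotation != "O"]


# Invented sample rows, for illustration only.
sample = ["02\t00\t0\t0\tO", "02\t00\t0\t1\tP", "02\t00\t0\t2\tO"]
for tok in coordinating(read_annotations(sample)):
    print(tok.section, tok.file, tok.sentence, tok.token, tok.annotation)
```

To read the real file, an open file handle (opened with encoding="utf-8" and newline="") can be passed in place of the sample list. The section, file, sentence and token values can then be used to look up the corresponding tokens in a local copy of PTB3.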
Coordination Annotation for the Penn Treebank
is available at no cost to all licensees of PTB3 and
appears in their download queue associated with LDC99T42 as
penn_coordination_anno_LDC2015T08.tgz.
*
(2) GALE Phase 3
Chinese Broadcast Conversation Speech Part 2 was developed
by LDC and comprises approximately 112 hours of Mandarin
Chinese broadcast conversation speech collected in 2007 and 2008
by LDC and Hong Kong University of Science and Technology (HKUST), Hong
Kong, during Phase 3 of the DARPA GALE (Global Autonomous Language
Exploitation) Program. Corresponding transcripts are released as
GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 2 (LDC2015T09).
Broadcast audio for the GALE program was
collected at LDC’s Philadelphia, PA USA facilities and at three
remote collection sites. The combined local and outsourced
broadcast collection supported GALE at a rate of approximately 300
hours per week of programming from more than 50 broadcast sources
for a total of over 30,000 hours of collected broadcast audio over
the life of the program.
The broadcast conversation recordings in this
release feature interviews, call-in programs, and roundtable
discussions focusing principally on current events from the
following sources: Beijing TV, China Central TV, Hubei TV, Phoenix
TV and Voice of America.
This release contains 209 audio files presented
in FLAC-compressed Waveform Audio File format (.flac),
16000 Hz single-channel 16-bit PCM. Each file was audited by a
native Chinese speaker following Audit Procedure Specification
Version 2.0, which is included in
this release. The broadcast auditing process served three
principal goals: as a check on the operation of the broadcast
collection system equipment by identifying failed, incomplete or
faulty recordings, as an indicator of broadcast schedule changes
by identifying instances when the incorrect program was recorded,
and as a guide for data selection by retaining information about a
program’s genre, data type and topic.
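For readers planning storage or processing, the audio parameters above allow a rough size estimate. The Python sketch below is only back-of-the-envelope arithmetic: it takes the approximate 112-hour total at face value, the function names are hypothetical, and it computes the size of the decoded PCM audio rather than the FLAC-compressed files, which will be smaller.

```python
SAMPLE_RATE_HZ = 16_000   # from the release description
BYTES_PER_SAMPLE = 2      # 16-bit PCM
CHANNELS = 1              # single channel
TOTAL_HOURS = 112         # approximate figure from the release description
FILE_COUNT = 209


def uncompressed_bytes(hours: float) -> int:
    """Size of the decoded PCM audio for the given duration, in bytes."""
    seconds = hours * 3600
    return int(seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


def mean_file_minutes(hours: float, files: int) -> float:
    """Average duration per file, in minutes."""
    return hours * 60 / files


total = uncompressed_bytes(TOTAL_HOURS)
print(f"Decoded PCM size: about {total / 1e9:.1f} GB")
print(f"Mean file length: about {mean_file_minutes(TOTAL_HOURS, FILE_COUNT):.1f} minutes")
```

Under these assumptions the decoded audio occupies roughly 13 GB and the average recording runs a little over half an hour.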
GALE Phase 3 Chinese Broadcast Conversation
Speech Part 2 is distributed on DVD. 2015 Subscription Members
will automatically receive two copies of this corpus. 2015
Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for
a fee.
*
(3) GALE Phase 3
Chinese Broadcast Conversation Transcripts Part 2 was
developed by LDC and contains transcriptions of approximately 112
hours of Chinese broadcast conversation speech collected in 2007
and 2008 by LDC and Hong Kong University of Science and Technology
(HKUST), Hong Kong, during Phase 3 of the DARPA GALE (Global
Autonomous Language Exploitation) Program. Corresponding audio
data is released as GALE Phase 3 Chinese Broadcast Conversation
Speech Part 2 (LDC2015S06).
The transcript files are in plain-text,
tab-delimited format (TDF) with UTF-8 encoding, and the
transcribed data totals 1,388,236 tokens. The transcripts were
created with the LDC-developed transcription tool, XTrans,
a multi-platform, multilingual, multi-channel transcription tool
that supports manual transcription and annotation of audio
recordings.
The files in this corpus were transcribed by
LDC staff and/or by transcription vendors under contract to LDC.
Transcribers followed LDC's quick transcription guidelines (QTR)
and quick rich transcription specification (QRTR), both of which
are included in the documentation with this release. QTR
transcription consists of quick (near-) verbatim, time-aligned
transcripts plus speaker identification with minimal additional
mark-up. It does not include sentence unit annotation. QRTR
annotation adds structural information such as topic boundaries
and manual sentence unit annotation to the core components of a
quick transcript. Files with QTR in the filename were
developed using QTR transcription; files with QRTR in the filename
were developed using QRTR transcription.
GALE Phase 3 Chinese Broadcast Conversation
Transcripts Part 2 is distributed via web download. 2015
Subscription Members will automatically receive two copies of this
corpus on disc. 2015 Standard Members may request a copy as part
of their 16 free membership corpora. Non-members may license this
data for a fee.
*
(4) SenSem (Sentence
Semantics) Lexicons was developed by GRIAL, the Linguistic
Applications Inter-University Research Group that includes the
following Spanish institutions: the Universitat Autònoma de
Barcelona, the Universitat de Barcelona, the Universitat de Lleida
and the Universitat Oberta de Catalunya. It contains feature
descriptions for
approximately 1,300 Spanish verbs and 1,300 Catalan verbs in the
SenSem Databank (LDC2015T02).
GRIAL's work focuses on resources for applied linguistics,
including lexicography, translation and natural language
processing.
The verb features for each language consist of
two groups: those codified manually, including definition, WordNet synset, Aktionsart,
arguments and semantic functions; and those extracted
automatically from the SenSem Databank. Among the latter are verb
frequency, semantic construction, syntactic categories and
constituent order. The verbs analyzed correspond to the 250 most
frequent verbs in Spanish and 320 lemmas in Catalan. Further
information about the SenSem project can be obtained from the GRIAL website. Data
is presented in a single XML file per language.
SenSem Lexicons is distributed via web
download.
2015 Subscription Members will automatically
receive two copies of this corpus on disc. 2015 Standard Members
may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee. This data is made
available to LDC not-for-profit members and all non-members under
the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 license
and to LDC for-profit members under the terms of the For-Profit
Membership Agreement.