There is still time for not-for-profit and
government organizations to create a custom data collection of
eight corpora from among LDC’s 2013 releases. Selection options include:
1993-2007 United Nations Parallel Text, Chinese Treebank 8.0, CSC
Deceptive Speech, GALE Arabic and Chinese speech and text
releases, Greybeard, MADCAT training data, NIST 2012 Open Machine
Translation (OpenMT) evaluation and progress sets, and more. The 2013 Data Pack
is available for a flat rate of $3500 through September 15, 2015.
To license the Data Pack and select eight
corpora, log in or register for an LDC user account and
add the 2013 Data Pack and each of the eight data sets to your bin.
Follow the check-out procedure, sign all applicable user
agreements and select payment via wire transfer, purchase order or
check. LDC will adjust the invoice total to reflect the data pack
fee.
To pay via credit card, add the 2013 Data Pack
to your bin and check out using the system prompts. At the
completion of the transaction, send an email to ldc@ldc.upenn.edu indicating
the eight data sets to include in your order.
New publications:
(1) CIEMPIESS
(Corpus de Investigación en Español de México del Posgrado de
Ingeniería Eléctrica y Servicio Social) was developed by the Speech
Processing Laboratory of the Faculty of Engineering at the National Autonomous University of
Mexico (UNAM) and consists of approximately 18 hours of
Mexican Spanish radio speech, associated transcripts, pronouncing
dictionaries and language models. The goal of this work was to
create acoustic models for automatic speech recognition.
For more information and documentation see the
CIEMPIESS-UNAM Project website.
The speech recordings are from 43 one-hour FM
radio programs broadcast by Radio IUS, a UNAM radio station.
They consist of spontaneous
conversations between a radio moderator and guests, principally
about legal issues. Approximately 78% of the speakers were male
and 22% were female.
The recordings were transcribed using PRAAT, a tool
designed for phonetics research. The transcripts are in Mexbet, a
phonetic alphabet designed for Mexican Spanish based on Worldbet
(Hieronymus, 1994). Plain text transcripts, TextGrid-format time
labels and files useful for performing experiments with the SPHINX3
recognition software are also included.
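For readers who want to inspect the time labels outside PRAAT, the following is a minimal sketch of pulling labelled intervals out of a TextGrid file. It assumes the files use PRAAT's long text format, in which each interval carries `xmin`, `xmax` and `text` fields; the corpus documentation should be checked for the actual format, tier names and file encoding. The function name `read_intervals` is hypothetical and not part of the corpus or of any toolkit.

```python
import re

# Matches one interval entry in Praat's long text format (assumed here).
# Praat escapes a double quote inside a label by doubling it.
INTERVAL_RE = re.compile(
    r'intervals \[\d+\]:\s*'
    r'xmin = (?P<start>[-+\d.eE]+)\s*'
    r'xmax = (?P<end>[-+\d.eE]+)\s*'
    r'text = "(?P<label>(?:[^"]|"")*)"'
)


def read_intervals(textgrid_text):
    """Return (start, end, label) tuples for every non-empty interval.

    The caller is responsible for decoding the file to a string; the
    encoding used by the corpus files is not assumed here.
    """
    intervals = []
    for match in INTERVAL_RE.finditer(textgrid_text):
        label = match.group("label").replace('""', '"').strip()
        if label:  # skip silent or unlabelled stretches
            intervals.append(
                (float(match.group("start")), float(match.group("end")), label)
            )
    return intervals
```

Intervals from all interval tiers are returned in file order, so a file with several tiers would need an extra step to tell them apart.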
CIEMPIESS is distributed via web download.
2015 Subscription Members will automatically
receive two copies of this corpus. 2015 Standard Members may
request a copy as part of their 16 free membership corpora.
Non-members may license this data at no cost under the LDC
User Agreement for Non-Members.
*
(2) GALE Phase 4
Chinese Broadcast Conversation Parallel Sentences was
developed by LDC. Along with other corpora, the parallel text in
this release comprised training data for Phase 4 of the DARPA GALE
(Global Autonomous Language Exploitation) Program. This corpus
contains Chinese source sentences and corresponding English
translations selected from broadcast conversation data collected
by LDC in 2008 and transcribed and translated by LDC or under its
direction.
GALE Phase 4 Chinese Broadcast Conversation
Parallel Sentences includes 109 source-translation document pairs,
comprising 63,829 tokens of Chinese source text and its English
translation. Data is drawn from 17 distinct Chinese programs
broadcast in 2008 from Beijing TV, China Central TV, Hubei TV and
Voice of America. Broadcast conversation programming is more
interactive than traditional news broadcasts and includes talk
shows, interviews, call-in programs and roundtable discussions.
The programs in this release focus on current events topics.
The data was transcribed by LDC staff and/or
transcription vendors under contract to LDC in accordance with the
Quick Rich Transcription guidelines developed by LDC. Selected
files were reformatted into a human-readable translation format
and assigned to translation vendors. Translators followed LDC's
Chinese to English translation guidelines and were provided with
the full source documents containing the target sentences for
their reference. Bilingual LDC staff performed quality control
procedures on the completed translations.
GALE Phase 4 Chinese Broadcast Conversation
Parallel Sentences is distributed via web download.
2015 Subscription Members will automatically
receive two copies of this corpus. 2015 Standard Members may
request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.
*
(3) RST Signalling
Corpus was developed at Simon Fraser University and contains
annotations for signalling information added to RST Discourse
Treebank (LDC2002T07).
RST Discourse Treebank (RST-DT) is a collection of English news
texts annotated for rhetorical relations under the RST (Rhetorical
Structure Theory) framework. In RST Signalling Corpus, information
about textual signals -- such as although, because, thus -- and
about signals such as tense, lexical chains or punctuation was added as
an annotation layer to examine how rhetorical relations are
signalled in discourse.
The source data consists of 385 Wall Street
Journal news articles from the Penn Treebank
annotated for rhetorical relations in RST Discourse Treebank. As
in RST-DT, the data in this release is divided into a training set
(347 articles) and a test set (38 articles).
The signalling annotation in this data set was
performed using the UAM CorpusTool version 2.8.12. Files are
presented as UTF-8 encoded XML and plain text. The corpus is divided
into three annotation sub-directories: training, test and full. All
sub-directories include source, metadata, signalling annotation,
and DTD files.
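As a starting point for exploring the release, the sketch below walks the three sub-directories and tallies the XML files by root element. Only the sub-directory names (training, test, full) come from the description above. The `.xml` file extension, the nesting of files beneath each sub-directory and the function name `summarize_annotations` are assumptions made for illustration, so the corpus documentation should be consulted for the actual layout.

```python
from pathlib import Path
import xml.etree.ElementTree as ET

# Sub-directory names as given in the corpus description.
SUBSETS = ("training", "test", "full")


def summarize_annotations(corpus_root):
    """Count XML files per sub-directory, grouped by root element tag."""
    summary = {}
    for subset in SUBSETS:
        subset_dir = Path(corpus_root) / subset
        root_tags = {}
        if subset_dir.is_dir():
            # Assumption: annotation files carry an ".xml" extension and may
            # sit at any depth below the sub-directory.
            for xml_path in sorted(subset_dir.rglob("*.xml")):
                tag = ET.parse(str(xml_path)).getroot().tag
                root_tags[tag] = root_tags.get(tag, 0) + 1
        summary[subset] = root_tags
    return summary
```

Calling `summarize_annotations` on the top-level corpus directory returns one dictionary per sub-directory, which gives a quick check that the training and test sets contain the expected number of documents.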
RST Signalling Corpus is distributed via web
download.
2015 Subscription Members will automatically
receive two copies of this corpus. 2015 Standard Members may
request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.