New publications:
_______________________________________________________________
Commercial use and LDC data
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit our Licensing page for more information.
New Corpora
(1) Chinese Treebank 9.0 consists of approximately two million words of annotated and parsed text from Chinese newswire, government documents, magazine articles, various broadcast news and broadcast conversation programs, web newsgroups, weblogs, discussion forums, chat messages and transcribed conversational telephone speech. This new data set in the Chinese Treebank series adds more annotated web data and two new genres: chat messages and transcribed telephone speech.
There are 3,726 text files in this release, containing 132,076 sentences, 2,084,387 words and 3,247,331 characters (hanzi or foreign). The data is provided in UTF-8 encoding, and the annotation uses Penn Treebank-style labeled brackets. The data is provided in four different formats: raw text, word segmented, POS-tagged, and syntactically bracketed. All files were automatically verified and manually checked.
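As a rough illustration of how the bracketed format relates to the word-segmented and POS-tagged formats, the following Python sketch reads one Penn Treebank-style bracketed tree and recovers the tagged words from it. The sample sentence and the function names are invented for illustration and are not taken from the corpus. The sketch also assumes that every bracket carries a label, so actual release files may need extra handling (for example, for unlabeled outer brackets or file-level markup).

```python
import re

def tokenize(text):
    # Split into "(", ")", and runs of characters that are neither
    # whitespace nor parentheses (labels and words).
    return re.findall(r"\(|\)|[^\s()]+", text)

def parse(tokens):
    """Parse one bracketed tree into nested (label, children) tuples."""
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]          # assumption: every node has a label
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])   # a word (leaf string)
                pos += 1
        pos += 1                     # consume the closing bracket
        return (label, children)

    return node()

def tagged_words(tree):
    """Collect (word, POS tag) pairs from the tree, left to right."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(tagged_words(child))
    return pairs

# Hypothetical sample, not from the corpus.
sample = "(IP (NP (PN 我)) (VP (VV 喜欢) (NP (NN 书))))"
pairs = tagged_words(parse(tokenize(sample)))
segmented = " ".join(word for word, _ in pairs)
print(pairs)
print(segmented)
```

Joining the words alone gives a word-segmented view of the sentence, and the (word, tag) pairs correspond to a POS-tagged view, which is why the three annotated formats can be seen as successively richer layers over the same raw text.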
Chinese Treebank 9.0 is distributed via web download.
2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(2) CHM150 (Corpus Hecho en México 150) was developed by the Speech Processing Laboratory of the Faculty of Engineering at the National Autonomous University of Mexico (UNAM) and consists of approximately 1.63 hours of Mexican Spanish speech, associated transcripts, and speaker metadata. The goal of this work was to support spoken term detection and forensic speaker identification.
This corpus consists of Mexican Spanish microphone speech from 75 male speakers and 75 female speakers, recorded in a quiet office environment. Speakers could answer pre-selected open questions or describe a particular painting shown to them on a computer monitor. Speaker metadata in this release includes age, gender, place of birth, place of residence and parents' nationalities.
CHM150 is distributed via web download.
2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. This data is being made available at no cost to non-member organizations under a research license.
*
(3) GALE Phase 4 Arabic Weblog Parallel Sentences was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Modern Standard Arabic source text and corresponding English translations, selected from newsgroup and weblog data collected by LDC and translated by LDC or under its direction.
The data includes 1,067 source-translation document pairs, comprising 68,346 words (Arabic source) of translated data.
Sentences were selected for translation in two steps. First, files were chosen using sentence selection scripts provided by GALE program participants SRI International and IBM. The output was then manually reviewed by LDC staff to eliminate problematic sentences.
Selected files were reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Arabic to English translation guidelines and were provided with the full source documents containing the target sentences for their reference. Bilingual LDC staff performed quality control procedures on the completed translations.
GALE Phase 4 Arabic Weblog Parallel Sentences is distributed via web download.
2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.