New publications:
__________________________________________________________________
(1) BOLT Chinese Discussion Forum Parallel Training Data was developed by LDC and
consists of 1,876,799 tokens of Chinese discussion forum data collected for the
DARPA BOLT program along with their corresponding English translations.
The BOLT (Broad Operational Language Translation)
program developed machine translation and information retrieval for less formal
genres, focusing particularly on user-generated content. LDC supported the BOLT
program by collecting informal data sources -- discussion forums, text
messaging and chat -- in Chinese, Egyptian Arabic and English. The collected
data was translated and annotated for various tasks including word alignment,
treebanking, propbanking and co-reference.
The source data in this release
consists of discussion forum threads harvested from the Internet by LDC using a
combination of manual and automatic processes. The full source data collection
is released as BOLT Chinese Discussion Forums (LDC2016T05). Word-aligned and tagged data is released as BOLT
Chinese-English Word Alignment and Tagging - Discussion Forum Training (LDC2016T19).
BOLT Chinese Discussion Forum Parallel Training Data is distributed via web
download.
2017 Subscription Members will automatically receive copies of this corpus. 2017
Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.
*
(2) IARPA Babel Swahili Language Pack IARPA-babel202b-v1.0d was developed
by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel
program. It contains approximately 200 hours of Swahili conversational and
scripted telephone speech collected from 2012 to 2014, along with corresponding
transcripts.
The Babel program focuses on underserved languages and
seeks to develop speech recognition technology that can be rapidly applied to
any human language to support keyword search performance over large amounts of
recorded speech.
The Swahili speech in this release represents
that spoken in the Nairobi dialect region of Kenya.
The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 65
years. Calls were made using different telephones (e.g., mobile,
landline) from a variety of environments including the street, a home or
office, a public place, and inside a vehicle.
Transcripts are encoded in UTF-8.
IARPA Babel Swahili Language Pack IARPA-babel202b-v1.0d is distributed via web
download.
2017 Subscription Members will receive copies of this corpus provided they have
submitted a completed copy of the special license agreement. 2017 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.
*
(3) Noisy TIMIT Speech was developed by the Florida Institute of Technology and contains approximately 322 hours of speech from the
TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1) modified with different additive noise levels. Only the audio has been
modified; the original arrangement of the TIMIT corpus is still as described by
the TIMIT documentation.
The additive noise types are white, pink, blue, red, violet and babble noise, with
levels varying in 5 dB (decibel) steps from 5 to 50 dB. The color noise types
were generated artificially using MATLAB. The babble noise was selected from a
random segment of recorded babble speech scaled relative to the power of the
original TIMIT audio signal.
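As a rough illustration of scaling noise relative to the power of a clean signal,
the sketch below mixes a noise source into a signal at each level from 5 to 50 dB.
It is not the procedure used to build the corpus (the color noise was generated in
MATLAB). It assumes the stated levels are signal-to-noise ratios, and the function
names, the sine-tone stand-in for speech and the Gaussian white noise source are
all hypothetical.

```python
import math
import random


def power(samples):
    """Mean squared amplitude of a signal."""
    return sum(s * s for s in samples) / len(samples)


def add_noise_at_snr(signal, noise, snr_db):
    """Scale `noise` so the signal-to-noise ratio equals `snr_db`, then add it."""
    if len(noise) < len(signal):
        raise ValueError("noise must be at least as long as the signal")
    noise = noise[:len(signal)]
    # SNR_dB = 10 * log10(P_signal / P_noise), so the target noise power is
    # P_signal / 10^(SNR_dB / 10).
    target_noise_power = power(signal) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / power(noise))
    return [s + gain * n for s, n in zip(signal, noise)]


def measured_snr_db(clean, noisy):
    """Recover the SNR in dB from a clean signal and its noisy version."""
    residual = [y - x for x, y in zip(clean, noisy)]
    return 10 * math.log10(power(clean) / power(residual))


# Hypothetical stand-ins: a one-second 440 Hz tone in place of speech
# (16 kHz sampling assumed) and seeded Gaussian white noise.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
white = [rng.gauss(0.0, 1.0) for _ in range(16000)]

# Levels in 5 dB steps from 5 to 50 dB, as described above.
snr_levels = list(range(5, 55, 5))
noisy_versions = {snr: add_noise_at_snr(clean, white, snr) for snr in snr_levels}
```

Calling measured_snr_db(clean, noisy_versions[20]) should return a value very close
to 20, which is a quick way to check that the scaling behaves as intended.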
Noisy TIMIT Speech is distributed via web download.
2017 Subscription Members will automatically receive copies of this corpus. 2017
Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.
*
(4) GALE English-Chinese Parallel Aligned Treebank -- Training was developed by LDC
and contains 196,123 tokens of word-aligned English and Chinese parallel text
with treebank annotations. This material was used as training data in the DARPA
GALE (Global Autonomous Language Exploitation) program.
Parallel aligned treebanks are treebanks annotated with morphological and syntactic
structures aligned at the sentence level and the sub-sentence level. Such data
sets are useful for natural language processing and related fields, including
automatic word alignment system training and evaluation, transfer-rule extraction,
word sense disambiguation, translation lexicon extraction and cultural heritage
and cross-linguistic studies. With respect to machine translation system
development, parallel aligned treebanks may improve system performance with
enhanced syntactic parsers, better rules and knowledge about language pairs and
reduced word error rate.
The English source data was translated into Chinese. Chinese and English treebank
annotations were performed independently. The parallel texts were then word
aligned. The material in this release corresponds to portions of the treebanked
data in OntoNotes 3.0 (LDC2009T24) and
OntoNotes 4.0 (LDC2011T03).
This release consists of English source broadcast programming (CNN, NBC/MSNBC) and
web data collected by LDC in 2005 and 2006.
GALE English-Chinese Parallel Aligned Treebank -- Training is distributed via
web download.
2017 Subscription Members will automatically receive copies of this corpus. 2017
Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.