LDC Membership Discounts for MY2017 Still Available
Join LDC now while membership savings are still available. 2016 members receive a
10% discount when renewing before March 1, 2017, or a 5% discount when renewing
any time in 2017. Non-consecutive members and new members receive a 5% discount
when renewing before March 1, 2017. Membership remains the most economical way to
access LDC releases. This year’s planned publications include the 2010 NIST Speaker
Recognition Evaluation data set, Multilanguage Conversational Telephone Speech,
Noisy TIMIT, IARPA Babel Language Packs, RATS Keyword Spotting, BOLT parallel and
word-aligned data in all languages, and more. Browse the Members pages for details
on membership options and benefits.
New Corpora
(1) Arabic Speech Recognition Pronunciation Dictionary was developed by the Qatar
Computing Research Institute. It contains approximately two million pronunciation
entries for 526,000 Modern Standard Arabic words, for an average of 3.84
pronunciations for each grapheme word. The dictionary was developed from news
archive resources, including the Arabic news website Aljazeera.net. The selected
words were those that occurred more than once in the news collection.
Arabic Speech Recognition Pronunciation Dictionary is distributed via web download.
2017 Subscription Members will receive copies of this corpus. 2017 Standard Members
may request a copy as part of their 16 free membership corpora. Non-members may
license this data for a fee.
*
(2) IARPA Babel Vietnamese Language Pack IARPA-babel107b-v0.7 was developed by Appen
for the IARPA (Intelligence Advanced Research Projects Activity) Babel program.
It contains approximately 201 hours of Vietnamese conversational and scripted
telephone speech collected in 2012 along with corresponding transcripts.
The Babel program focuses on underserved languages and
seeks to develop speech recognition technology that can be rapidly applied to
any human language to support keyword search performance over large amounts of
recorded speech.
The Vietnamese speech in this release represents that spoken in the North,
North-Central, Central and Southern dialect regions in Vietnam. The gender
distribution among speakers is approximately equal; speakers' ages range from
16 years to 64 years. Calls were made using different telephones (e.g., mobile,
landline) from a variety of environments including the street, a home or office,
a public place, and inside a vehicle. Transcripts are encoded in UTF-8.
IARPA Babel Vietnamese Language Pack IARPA-babel107b-v0.7 is distributed via web
download. 2017 Subscription Members will receive copies of this corpus provided
they have submitted a completed copy of the special license agreement. 2017
Standard Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.
*
(3) MWE-Aware English Dependency Corpus was developed by the Nara
Institute of Science and Technology Computational Linguistics Laboratory and consists of
English compound function words annotated in dependency format. The data is
derived from the Wall Street Journal portion of OntoNotes Release 5.0 (LDC2013T19).
Compound function words are a type of multiword expression
(MWE). MWEs are groups of tokens that can be treated as a single semantic or
syntactic unit. Doing so facilitates natural language processing tasks such as
constituency and dependency parsing.
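To illustrate what treating an MWE as a single unit can look like in practice, here is a minimal Python sketch. It is not the corpus's annotation scheme or the developers' method: the small lexicon, the `merge_mwes` helper, and the underscore-joined output are all assumptions made for illustration.

```python
def merge_mwes(tokens, mwe_lexicon):
    """Greedy longest-match merge of known MWEs into single tokens.

    `mwe_lexicon` is a hypothetical set of lowercase token tuples;
    it is not taken from the corpus itself.
    """
    max_len = max((len(m) for m in mwe_lexicon), default=1)
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in mwe_lexicon:
                merged.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No MWE starts here; keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical compound function words, chosen only for this example.
MWE_LEXICON = {("according", "to"), ("as", "well", "as"), ("because", "of")}

tokens = "Prices rose because of demand , according to analysts".split()
print(merge_mwes(tokens, MWE_LEXICON))
```

After merging, a parser sees "because_of" and "according_to" as single function words rather than as separate tokens whose internal structure it would otherwise have to analyze.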
MWE-Aware English Dependency Corpus is distributed via web download.
2017 Subscription Members will receive copies of this corpus. 2017 Standard Members
may request a copy as part of their 16 free membership corpora. Non-members may
license this data for a fee.
*
(4) GALE Phase 3 and 4 Chinese Web Parallel Text was developed by LDC and
contains Chinese source text and corresponding English translations selected
from weblog and newsgroup data collected by LDC and translated by LDC or under
its direction.
The data includes 88 source-translation document pairs, comprising 67,514 tokens
of Chinese source text and its English translation.
GALE Phase 3 and 4 Chinese Web Parallel Text is distributed via web download.
2017 Subscription Members will receive copies of this corpus. 2017 Standard Members
may request a copy as part of their 16 free membership corpora. Non-members may
license this data for a fee.