Join LDC for Membership Year 2018
Spring 2018 Data Scholarship Program
Commercial use and LDC data
IARPA Babel Kurmanji Kurdish Language Pack IARPA-babel205b-v1.0a
TAC KBP Chinese Cross-lingual Entity Linking - Comprehensive Training & Evaluation Data 2011-2014
____________________________________________________________________
Join LDC for Membership Year 2018
Membership Year 2018 (MY2018) is open for joining, and discounts are available for those who keep their membership current and join early in the year. Through March 1, 2018, current MY2017 members who renew will receive a 10% discount off the membership fee. New or returning organizations will receive a 5% discount through March 1.
In addition to receiving new publications, current year LDC members also enjoy the benefit of licensing older data at reduced costs from our Catalog of over 700 holdings; current year for-profit members may use most data for commercial applications.
Plans for MY2018 publications are in progress. Among the expected releases are:
- Multilanguage conversational telephone speech: developed to support language identification research in related languages (Central Asian, Central European language groups)
- DIRHA (Distant-speech Interaction for Robust Home Applications): Wall Street Journal read speech with noise and reverberation, suitable for various multi-microphone signal processing and distant speech recognition tasks
- TRAD corpora: Chinese-French and Arabic-French parallel text (newswire, web data)
- IARPA Babel Language Packs (telephone speech and transcripts): languages include Cebuano, Guarani, Kazakh, Lithuanian, Telugu, Tok Pisin
- BOLT: discussion forum, SMS, word-aligned, and tagged data in all languages (Egyptian Arabic, English, Chinese)
- DEFT: Spanish Treebank (newswire, web data)
- RATS: Language Identification data set (Dari, Farsi, Levantine Arabic, Pashto, Urdu; degraded audio signals)
- TAC KBP: comprehensive English source and entity linked data (broadcast, telephone speech, newswire, web data)
- German children’s handwriting: longitudinal study of weekly writing in a classroom setting with enhanced output for specific spelling patterns
Visit Join LDC for details on membership, user accounts and payment.
Spring 2018 Data Scholarship Program
Applications are now being accepted through January 15, 2018 for the Spring 2018 LDC Data Scholarship program, which provides university students with no-cost access to LDC data. Consult the LDC Data Scholarship page for more information about program rules and submission requirements.
Commercial use and LDC data
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.
New publications:
(1) ASpIRE Development and Development Test Sets was developed for the Automatic Speech recognition In Reverberant Environments (ASpIRE) Challenge sponsored by IARPA (the Intelligence Advanced Research Projects Activity). It contains approximately 226 hours of English speech with transcripts and scoring files.
The audio data is a subset of Mixer 6 Speech (LDC2013S03), audio recordings of interviews, transcript readings and conversational telephone speech collected by LDC in 2009 and 2010 from native English speakers local to the Philadelphia area. The transcripts were developed by Appen for the ASpIRE challenge.
Data is divided into development and development test sets.
ASpIRE Development and Development Test Sets is distributed via web download.
2017 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(2) CIEMPIESS (Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social) Light was developed by the Speech Processing Laboratory of the Faculty of Engineering at the National Autonomous University of Mexico (UNAM) and consists of approximately 18 hours of Mexican Spanish radio and television speech and associated transcripts. The goal of this work was to create acoustic models for automatic speech recognition. For more information and documentation see the CIEMPIESS-UNAM Project website.
CIEMPIESS Light is an updated version of CIEMPIESS, released by LDC as LDC2015S07. This "light" version contains speech and transcripts presented in a revised directory structure that allows for use with the Kaldi toolkit.
The audio files are in 16 kHz, 16-bit PCM FLAC format, and transcripts are presented as UTF-8 encoded plain text.
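Users who want to confirm the stated audio format before training may find a quick header check useful. The following is a minimal sketch, not part of the corpus or its documentation: it reads the sample rate, channel count and bit depth from the STREAMINFO block at the start of a FLAC stream, using only the Python standard library. The function name read_flac_streaminfo is an illustrative assumption, and the sketch assumes the file begins directly with the "fLaC" marker (no prepended tags).

```python
from typing import Tuple


def read_flac_streaminfo(data: bytes) -> Tuple[int, int, int]:
    """Return (sample_rate_hz, channels, bits_per_sample) from FLAC header bytes.

    Hypothetical helper for illustration; expects at least the first
    26 bytes of a FLAC stream.
    """
    if len(data) < 26:
        raise ValueError("need at least 26 bytes of header")
    if data[:4] != b"fLaC":
        raise ValueError("not a FLAC stream")
    # Byte 4: last-block flag (top bit) and block type (low 7 bits).
    if data[4] & 0x7F != 0:
        raise ValueError("first metadata block is not STREAMINFO")
    # The STREAMINFO body starts at byte 8. Its first 10 bytes hold the
    # block-size and frame-size fields; the next 64 bits pack the sample
    # rate (20 bits), channels - 1 (3 bits), bits per sample - 1 (5 bits)
    # and the total sample count (36 bits).
    packed = int.from_bytes(data[18:26], "big")
    sample_rate = packed >> 44
    channels = ((packed >> 41) & 0x7) + 1
    bits_per_sample = ((packed >> 36) & 0x1F) + 1
    return sample_rate, channels, bits_per_sample
```

To check a file, read its first bytes in binary mode (for example, open(path, "rb").read(42)) and compare the first and third returned values against 16000 and 16.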
CIEMPIESS Light is distributed via web download.
2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(3) IARPA Babel Kurmanji Kurdish Language Pack IARPA-babel205b-v1.0a was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 203 hours of Kurmanji Kurdish conversational and scripted telephone speech collected in 2013 and 2014 along with corresponding transcripts.
The Kurmanji Kurdish speech in this release represents that spoken in the southeastern and eastern Anatolian regions of Turkey. The gender distribution among speakers is approximately 37% female and 63% male; speakers' ages range from 16 years to 70 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.
IARPA Babel Kurmanji Kurdish Language Pack IARPA-babel205b-v1.0a is distributed via web download.
2017 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(4) TAC KBP Chinese Cross-lingual Entity Linking - Comprehensive Training & Evaluation Data 2011-2014 was developed by LDC and contains training and evaluation data produced in support of the TAC KBP Chinese Cross-lingual Entity Linking tasks in 2011, 2012, 2013 and 2014. It includes queries and gold standard entity type information, Knowledge Base links, and equivalence class clusters for NIL entities along with the source documents for the queries, specifically, English and Chinese newswire, discussion forum and web data. The corresponding knowledge base is available as TAC KBP Reference Knowledge Base (LDC2014T16).
The goal of TAC KBP’s entity linking track is to measure systems’ ability to determine whether an entity, specified by a query, has a matching node in a reference knowledge base and if so, to create a link between the two. If there is no matching node, entity linking systems are required to cluster the mention together with others referencing the same entity. More information about the TAC KBP Entity Linking task and other TAC KBP evaluations can be found on the NIST TAC website.
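The link-or-cluster decision described above can be illustrated with a toy sketch. It is not the TAC KBP evaluation format or any participant's method: it uses exact, case-insensitive name matching as a stand-in for a real linker, ignores the cross-lingual aspect entirely, and the function name link_queries, the query IDs, the node ID "E0001" and the "NIL0001"-style cluster labels are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def link_queries(
    queries: List[Tuple[str, str]], kb: Dict[str, str]
) -> Dict[str, str]:
    """Map each query ID to a KB node ID, or to a shared NIL cluster ID.

    queries: (query_id, mention_text) pairs.
    kb: lowercased entity name -> KB node ID (toy knowledge base).
    """
    nil_clusters: Dict[str, str] = {}  # normalized mention -> NIL cluster ID
    results: Dict[str, str] = {}
    for query_id, mention in queries:
        key = mention.strip().lower()
        if key in kb:
            # A matching node exists: link the query to it.
            results[query_id] = kb[key]
        else:
            # No matching node: group with other mentions of the same
            # (normalized) name under one NIL cluster.
            if key not in nil_clusters:
                nil_clusters[key] = f"NIL{len(nil_clusters) + 1:04d}"
            results[query_id] = nil_clusters[key]
    return results
```

For example, with kb = {"beijing": "E0001"}, the queries ("Q1", "Beijing"), ("Q2", "Foo Corp") and ("Q3", "foo corp") would yield a link to E0001 for Q1 and a single shared NIL cluster for Q2 and Q3.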
TAC KBP Chinese Cross-lingual Entity Linking - Comprehensive Training & Evaluation Data 2011-2014 is distributed via web download.
2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.