Renew your LDC membership today
Spring 2017 LDC Data Scholarship Program - deadline approaching
LDC to close for Winter Break
New publications:
_____________________________________________________________________
Renew your LDC membership today
Membership Year 2017 (MY2017) is open for joining, and discounts are
available for those who keep their membership current and join early in the
year. Current MY2016 members who renew before March 1, 2017 will receive a
10% discount on the membership fee. New or returning organizations will
receive a 5% discount through March 1.
In addition to receiving new publications, current year LDC members also enjoy the benefit of licensing older data at reduced costs from our Catalog of almost 700 holdings; current year for-profit members may use most data for commercial applications.
Plans for MY2017 publications are in progress. Among the expected releases are:
- 2010 NIST Speaker Recognition Evaluation data set
- Multilanguage conversational telephone speech: developed to support language identification research in related languages
- UCLA High Speed Laryngeal Database: audio recordings and high-speed video endoscopic images of the vocal folds while sustaining vowels
- Noisy TIMIT: TIMIT with added artificial noise
- CHiME shared task data: noisy read WSJ speech
- First Year Law Students’ Memoranda: memos to a hypothetical court with annotations
- IARPA Babel Language Packs: languages include Vietnamese, Haitian Creole, Zulu, Kazakh and Lithuanian
- BOLT: source, parallel and word-aligned data in all languages
- RATS Keyword Spotting data set
- GALE Phases 3 and 4: all tasks and languages
And don’t forget, MY2016 and MY2015 are still open for joining. MY2015 can
be joined through December 31, 2016 and includes data such as RATS Speech
Activity Detection and updates to Penn Treebank. MY2016 will remain open
through December 31, 2017 and includes data such as BOLT Chinese Discussion
Forums, IARPA Babel Language Packs and Multi-Language Conversational
Telephone Speech – Slavic Group. For full descriptions of these data sets,
visit our Catalog.
Visit Join LDC for details on membership, user accounts and payment.
Spring 2017 LDC Data Scholarship Program - deadline approaching
Students can apply for the Spring 2017 Data Scholarship Program
now through January 16, 2017, 11:59PM EST. The LDC Data Scholarship
program provides undergraduate and graduate students with access to LDC data at
no cost.
For more information on application requirements and program rules, please
visit LDC Data Scholarships. Students can email their applications to the
LDC Data Scholarships program. Decisions will be sent by email from the same
address.
LDC to close for Winter Break
LDC will be closed from Monday, December 26, 2016 through Monday, January 2, 2017
in accordance with the University of Pennsylvania Winter Break Policy. Our
offices will reopen on Tuesday, January 3, 2017. Requests received for
membership renewals and corpora during the Winter Break will not be processed
until the week of January 3.
New Corpora
(1)
Bamanankan Lexicon was developed by LDC and contains 5,978
entries of the Bamanankan language presented as a Bamanankan-English lexicon
and a Bamanankan-French lexicon. It is the third publication in an LDC project
to build an electronic dictionary of three Mandekan languages: Mawukakan,
Maninkakan and Bamanankan. These are Eastern Manding languages in the Mande
Group of the Niger-Congo language family. LDC released a Mawukakan Lexicon (LDC2005L01) in 2005 and a Maninkakan Lexicon (LDC2013L01) in 2013.
This lexicon is presented using a Latin-based
transcription system because the Latin alphabet is familiar to the majority of
Mandekan language speakers and it is expected to facilitate the work of
researchers interested in this resource.
Bamanankan Lexicon is distributed via web download.
2016 Subscription Members will receive copies of this corpus. 2016 Standard Members
may request a copy as part of their 16 free membership corpora. Non-members may
license this data for a fee.
*
(2)
IARPA Babel Tagalog Language Pack IARPA-babel106-v0.2g was developed by Appen for the IARPA (Intelligence Advanced
Research Projects Activity) Babel program. It contains approximately 213 hours
of Tagalog conversational and scripted telephone speech collected in 2012 along
with corresponding transcripts.
The Babel program focuses on underserved languages and seeks to develop speech
recognition technology that can be rapidly applied to any human language to
support keyword search performance over large amounts of recorded speech.
The Tagalog speech in this release represents the varieties spoken in the
North, Central and South dialect regions of the Philippines.
The gender distribution among speakers is approximately equal; speakers' ages range from 16 to 65 years.
Calls were made using different telephones (e.g., mobile, landline) from a
variety of environments including the street, a home or office, a public place,
and inside a vehicle.
Transcripts are encoded in UTF-8.
IARPA Babel Tagalog Language Pack IARPA-babel106-v0.2g is distributed via web download.
2016 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of
the special license agreement. 2016 Standard Members may request a copy
as part of their 16 free membership corpora. Non-members may license this data
for a fee.
*
(3)
More information about the TAC KBP Entity Linking task and other TAC KBP evaluations can be found on the NIST TAC website.
TAC KBP Spanish Cross-lingual Entity Linking - Comprehensive
Training and Evaluation Data 2012-2014 is distributed via web download.
2016 Subscription Members will receive copies of this corpus. 2016 Standard Members
may request a copy as part of their 16 free membership corpora. Non-members may
license this data for a fee.
*
(4)
GALE Phase 4 Arabic Newswire Parallel Sentences was developed by LDC and contains Modern
Standard Arabic source text and corresponding English translations selected
from newswire data collected by LDC in 2008 and translated by LDC or under its
direction.
This release includes 393 source-translation document pairs drawn from six distinct
newswire sources, comprising 62,669 tokens of Arabic source text and its
English translation. Source data and translations are distributed in TDF
format. All data is encoded in UTF-8.
GALE Phase 4 Arabic Newswire Parallel Sentences is distributed via web download.
2016 Subscription Members will receive copies of this corpus. 2016 Standard Members
may request a copy as part of their 16 free membership corpora. Non-members may
license this data for a fee.