In this newsletter:
Join LDC for Membership Year 2017
Commercial use and LDC data
Spring 2017 Data Scholarship Program
LDC closed November 24-25 for US Thanksgiving Holiday
New publications:
Join LDC for Membership Year 2017
Organizations engaged in language-related research, education and technology development are invited to join LDC for Membership Year (MY) 2017. Consortium members enjoy unparalleled access and continuing rights to new data releases and to an archive of close to 700 holdings.
Membership fees have not increased for 2017. In addition, discounts are available for organizations who keep their membership current and for those who join before March 1, 2017.
• MY 2016 members receive a 10% discount if they renew their membership before March 1, 2017. After March 1, MY 2016 members receive a 5% discount if they renew their membership at any time in 2017.
• New members and returning former members receive a 5% discount off the membership fee if they join/renew before March 1, 2017.
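The discount rules above can be sketched as a small calculation. This is only an illustration, not LDC's billing logic: the function names and the base fee are hypothetical, and the sketch assumes that March 1, 2017 itself counts as "after" the early deadline, which the newsletter does not spell out.

```python
from datetime import date
from decimal import Decimal

# "before March 1, 2017" per the newsletter; treating March 1 itself as
# past the deadline is an assumption made for this sketch.
EARLY_DEADLINE = date(2017, 3, 1)


def discount_percent(was_my2016_member: bool, join_date: date) -> int:
    """Return the MY 2017 discount (in percent) under the stated rules."""
    early = join_date < EARLY_DEADLINE
    if was_my2016_member:
        if early:
            return 10
        # Renewal later in 2017 still earns the smaller discount.
        return 5 if join_date.year == 2017 else 0
    # New members and returning former members: early joiners only.
    return 5 if early else 0


def discounted_fee(base_fee: Decimal, was_my2016_member: bool,
                   join_date: date) -> Decimal:
    """Apply the discount to a base fee (the fee value is hypothetical)."""
    pct = discount_percent(was_my2016_member, join_date)
    return base_fee * (100 - pct) / 100
```

For example, with a placeholder fee of 1000, an MY 2016 member renewing on January 15, 2017 would pay 900 under these assumptions. Actual fees and terms are on the Join LDC page.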
Plans for MY 2017 publications are in progress. Among the expected releases are:
• 2010 NIST Speaker Recognition Evaluation data set
• Multilanguage conversational telephone speech: developed to support language identification research in related languages
• UCLA High Speed Laryngeal Database: audio recordings and high-speed videoendoscopic images of the vocal folds while sustaining vowels
• Noisy TIMIT: TIMIT with added artificial noise
• CHiME shared task data: noisy read WSJ speech
• First Year Law Students’ Memoranda: memos to a hypothetical court with annotations
• IARPA Babel Language Packs: languages include Vietnamese, Haitian Creole, Zulu, Kazakh and Lithuanian
• BOLT: source, parallel and word-aligned data in all languages
• RATS Keyword Spotting data set
• GALE Phases 3 and 4: all tasks and languages
Visit Join LDC for details on membership, user accounts and payment.
Commercial use and LDC data
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.
Spring 2017 Data Scholarship Program
Applications are now being accepted through January 15, 2017 for the Spring 2017 LDC Data Scholarship program, which provides university students with no-cost access to LDC data. Consult the LDC Data Scholarship page for further information about program rules and submission requirements.
LDC closed November 24-25 for US Thanksgiving Holiday
LDC will be closed on Thursday, November 24, 2016 and Friday, November 25, 2016 in observance of the US Thanksgiving Holiday. The office will reopen on Monday, November 28, 2016.
New Corpora
(1) JANA: A Human-Human Dialogues Corpus for Egyptian Dialect was developed by researchers at Cairo University. This is a special release in addition to the LDC scheduled corpora for membership year 2016, available under separate terms.
This corpus consists of 82 transcribed dialogues from call center inquiries annotated for dialogue acts. Data was collected from call centers for banks, airlines and mobile network providers in the form of spontaneous spoken telephone dialogues (52) and instant messaging dialogues (30) amounting to over 20,000 words.
Not-for-profit organizations may license this data set for a fee under the LDC Not-for-Profit Membership Agreement or under the LDC User Agreement for Non-Members for use in linguistic research, education and non-commercial technology development. For-profit organizations may license this data for a fee under a commercial license.
(2) Multi-Language Conversational Telephone Speech 2011 – Slavic Group was developed by LDC and comprises approximately 60 hours of telephone speech in Polish, Russian and Ukrainian. The data was collected to support research and technology evaluation in automatic language identification, specifically language pair discrimination for closely related languages/dialects.
Calls were made using LDC’s telephone collection infrastructure. Human auditors labeled calls for gender, dialect type and noise. Audio data is presented in FLAC-compressed MS-WAV (RIFF) file format. Each uncompressed file is two channels, recorded at 8000 samples/second with samples stored as 16-bit signed integers.
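As a rough illustration of the audio format described above, the following sketch checks that an already-decompressed WAV file has two channels, an 8000 Hz sample rate and 16-bit samples. It uses only Python's standard `wave` module and assumes the FLAC layer has been removed beforehand (for example with the reference `flac` decoder); the function name and any file paths are hypothetical and not part of the corpus documentation.

```python
import wave

# Format stated in the newsletter for each uncompressed file.
EXPECTED_CHANNELS = 2
EXPECTED_RATE_HZ = 8000
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16-bit samples


def check_wav(path):
    """Return (matches_expected_format, duration_in_seconds) for a WAV file."""
    with wave.open(path, "rb") as wav:
        ok = (
            wav.getnchannels() == EXPECTED_CHANNELS
            and wav.getframerate() == EXPECTED_RATE_HZ
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
        )
        # One frame holds one sample per channel.
        duration = wav.getnframes() / wav.getframerate()
    return ok, duration
```

At these settings, uncompressed audio occupies 8000 × 2 channels × 2 bytes = 32,000 bytes per second, or roughly 115 MB per hour of recording.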
Multi-Language Conversational Telephone Speech 2011 – Slavic Group is distributed via web download.
2016 Subscription Members will receive copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
(3) IARPA Babel Georgian Language Pack IARPA-babel404b-v1.0a was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 190 hours of Georgian conversational and scripted telephone speech collected in 2014-2015 along with corresponding transcripts.
The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.
The Georgian speech in this release represents that spoken in the Eastern and Western dialect regions in Georgia. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 73 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle. Transcripts are encoded in UTF-8.
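Because the transcripts are UTF-8, a reader should decode them explicitly rather than rely on a platform default. The sketch below reads a file with strict UTF-8 decoding and counts characters in the main Unicode Georgian block (U+10A0 to U+10FF). The function names are hypothetical, and the assumption that the transcripts use characters from that block is not stated in the newsletter.

```python
def read_transcript(path):
    """Read a transcript as strict UTF-8; invalid bytes raise UnicodeDecodeError."""
    with open(path, encoding="utf-8", errors="strict") as handle:
        return handle.read()


def count_georgian_chars(text):
    """Count characters in the Unicode Georgian block (U+10A0..U+10FF)."""
    return sum(1 for ch in text if "\u10a0" <= ch <= "\u10ff")
```

A file that fails to decode, or that contains no characters from the expected block, is a quick signal that it was opened with the wrong encoding or is not a transcript at all.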
IARPA Babel Georgian Language Pack IARPA-babel404b-v1.0a is distributed via web download.
2016 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
(4) GALE Phase 3 and 4 Chinese Newswire Parallel Text was developed by LDC and contains Chinese source text and corresponding English translations selected from newswire data collected by LDC in 2007-2008 and translated by LDC or under its direction.
This release includes 367 source-translation document pairs drawn from five distinct newswire sources, comprising 210,048 tokens of Chinese source text and its English translation. Source data and translations are distributed in TDF format. All data is encoded in UTF-8.
GALE Phase 3 and 4 Chinese Newswire Parallel Text is distributed via web download.