LDC celebrates 25 years
LDC data and commercial technology development
New publications:
_________________________________________________________________________
LDC celebrates 25 years
April 2017 marks the beginning of LDC’s 25th year
as the leader in language resource development and distribution. Founded in
1992, the Consortium has grown from a data repository to a vibrant data center
that creates, shares and archives language resources. The Catalog continues to
grow, boasting over 700 titles in more than 90 languages. With the support of
members, licensees, sponsors and collaborators, LDC has distributed over 120,000
copies of data to more than 3,500 organizations worldwide. Our heartfelt thanks
for your support as we continue our mission to provide large quantities of
diverse data, research program support and high-quality member services.
LDC data and commercial technology development
Any organization wishing to use LDC data to develop or test products for
commercialization, or to use LDC data in any commercial product or for any
commercial purpose, must first license the data as a For-Profit Member. Once
the data is licensed under the
For-Profit Membership, the organization retains perpetual rights to use the
data for commercial technology development. LDC data users should consult
corpus-specific license agreements for limitations on the use of certain
corpora. Visit our Licensing page for more information.
New Corpora
(1) 2010 NIST Speaker Recognition Evaluation Test Set was developed by LDC and
NIST (National Institute of Standards and Technology). It contains 2,255 hours
of American English telephone speech and interview speech recorded over a
microphone channel, all used as test data in the NIST-sponsored 2010 Speaker
Recognition Evaluation (SRE).
The telephone speech segments include two-channel excerpts
of approximately 10 seconds and 5 minutes. There are also summed-channel
excerpts in the range of 5 minutes. The microphone excerpts are 3-15 minutes in
duration. As in prior evaluations, intervals of silence were not removed.
The 2010 evaluation includes not only conversational
telephone speech (CTS) recorded over ordinary telephone channels for the core training
and test conditions, but also CTS and conversational interview speech recorded
over a room microphone channel. Unlike prior evaluations, some of the
conversational telephone-style speech was collected in a manner intended to
produce particularly high, or particularly low, vocal effort on the part of the
speaker of interest. In addition to evaluation data, this package also contains
answer keys, trial and train files, development data and evaluation
documentation.
2010 NIST Speaker Recognition Evaluation Test Set is
distributed via hard drive.
2017 Subscription Members will receive copies of this
corpus. 2017 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for a fee.
*
(2) BOLT Egyptian Arabic SMS/Chat and Transliteration was developed
by LDC and consists of naturally-occurring Short Message Service (SMS) and Chat
(CHT) data collected through data donations and live collection involving
native speakers of Egyptian Arabic. The corpus contains 5,691 conversations
totaling 1,029,248 words across 262,026 messages. Messages were natively
written in either Arabic orthography or romanized Arabizi. A total of 1,856
Arabizi conversations (287,022 words) were transliterated from the original
romanized Arabizi script into standard Arabic orthography and then reviewed,
corrected and normalized by LDC annotators according to "Conventional
Orthography for Dialectal Arabic" (CODA).
The BOLT (Broad Operational Language Translation)
program developed machine translation and information retrieval for less formal
genres, focusing particularly on user-generated content. LDC supported the BOLT
program by collecting informal data sources -- discussion forums, text
messaging and chat -- in Chinese, Egyptian Arabic and English. The collected
data was translated and annotated for various tasks including word alignment,
treebanking, propbanking and co-reference.
BOLT Egyptian Arabic SMS/Chat and Transliteration is distributed via web
download.
2017 Subscription Members will receive copies of this
corpus. 2017 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for a fee.
*
(3) CHiME2 Grid was developed as part of The 2nd CHiME
Speech Separation and Recognition Challenge and contains
approximately 120 hours of English speech from a noisy living room environment.
The CHiME Challenges focus on distant-microphone automatic speech recognition
(ASR) in real-world environments.
CHiME2 Grid reflects the small-vocabulary track of the CHiME2 Challenge. The
target utterances were taken from the Grid corpus and consist of 34 speakers
reading simple 6-word sequences. The data is divided into training, development
and test sets.
CHiME2 Grid is distributed via web download.
2017 Subscription Members will receive copies of this
corpus. 2017 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for a fee.