LDC data and commercial technology development
New Publications:
BOLT English Treebank - Discussion Forum
Polish Speech Database
2016 NIST Speaker Recognition Evaluation Test Set
______________________________________________________________
Membership Year 2020 Publication Preview
The 2020 Membership Year is just around the corner and plans for next year’s publications are in progress. Among the expected releases are:
Abstract Meaning Representation (AMR) Annotation Release 3.0: semantic treebank of over 59,000 English natural language sentences from broadcast conversations, newswire, weblogs and web discussion forums; updates the second version (LDC2017T10) with new annotations
TAC KBP: English sentiment slot filling, surprise slot filling, nugget detection and coreference, and event argument data in all languages (English, Chinese and Spanish)
DEFT Chinese ERE: Chinese discussion forum data annotated for entities, relations and events
LibriVox Spanish: 73 hours of Spanish audiobook read speech and transcripts
IARPA Babel Language Packs (telephone speech and transcripts): languages include Dholuo, Javanese and Mongolian
HAVIC Med Training data: web video, metadata, and annotations for developing multimedia systems
RATS Speaker Identification: conversational telephone speech in Levantine Arabic, Pashto, Urdu, Farsi and Dari on degraded audio signals with annotation of speech segments for speaker identification
BOLT: discussion forums, SMS/chat, conversational telephone speech, word-aligned, tagged and co-reference data in all languages (Chinese, Egyptian Arabic, and English)
LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a pre-requisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.
______________________________________________________________
New publications:
(1) BOLT English Treebank - Discussion Forum was developed by LDC and consists of 268,907 tokens of English web discussion forum data with part-of-speech and syntactic structure annotations collected for the DARPA BOLT (Broad Operational Language Translation) program.
Part-of-speech and treebank annotation conformed to Penn Treebank II style, incorporating changes to those guidelines that were developed under the GALE (Global Autonomous Language Exploitation) program. Supplementary guidelines for English treebanks and web text are included with this release.
The source data is English discussion forum web text collected by LDC in 2011 and 2012. A subset of that data -- 702 files representing 268,907 tokens -- was selected for the treebank and annotated for word-level tokenization, part-of-speech and syntactic structure. The unannotated English source data is released as BOLT English Discussion Forums (LDC2017T11).
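For readers unfamiliar with Penn Treebank style bracketing, the following is a minimal Python sketch of reading one bracketed parse and listing its (word, part-of-speech) pairs. The example sentence is invented for illustration and is not taken from the corpus; the function names `parse_tree` and `tagged_leaves` are hypothetical, and nothing here assumes the actual file layout of this release.

```python
import re

def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Leaves are plain strings. An unlabeled outer bracket, as in
    "( (S ...))", is given the empty label "".
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        if tokens[pos] == "(":
            label = ""          # unlabeled wrapper bracket
        else:
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                # consume the closing bracket
        return (label, children)

    return node()

def tagged_leaves(tree):
    """Return (word, tag) pairs for every preterminal in the tree."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(tagged_leaves(child))
    return pairs

# Invented example sentence, not corpus data.
example = "(S (NP (PRP I)) (VP (VBP agree) (PP (IN with) (NP (DT this) (NN post)))) (. .))"
pairs = tagged_leaves(parse_tree(example))
```

Here `pairs` holds one entry per token, so its length gives a simple token count for the sentence. Real treebank files may also contain empty elements (tagged `-NONE-` in Penn Treebank conventions), which this sketch would count as ordinary tokens.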
BOLT English Treebank - Discussion Forum is distributed via web download.
2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(2) Polish Speech Database was developed by VoiceLab and consists of 263,424 utterances of Polish speech data from 200 speakers, totaling approximately 280 hours, and corresponding transcripts.
Data collection was performed in Poland. Speakers were asked to record themselves reading text on a website for at least 60 minutes from their home computer while using a headset. The read text consisted of sentences covering most speech sounds in Polish.
This release includes speaker metadata. There were 103 male speakers and 97 female speakers, ranging from 15 to 60 years of age; most speakers were in the 15 to 30 age range.
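As a rough sanity check on the figures above, the short Python sketch below derives per-utterance and per-speaker averages from the stated totals. The 280-hour total is approximate, so the derived values are estimates only; the variable names are illustrative.

```python
# Figures stated in the corpus description.
utterances = 263_424
speakers = 200
total_hours = 280          # approximate
male, female = 103, 97

# Derived averages (estimates, since total_hours is approximate).
avg_utterance_seconds = total_hours * 3600 / utterances
hours_per_speaker = total_hours / speakers
utterances_per_speaker = utterances / speakers

print(f"average utterance length: {avg_utterance_seconds:.2f} s")
print(f"average recording time per speaker: {hours_per_speaker:.1f} h")
print(f"average utterances per speaker: {utterances_per_speaker:.0f}")
```

On these numbers an utterance averages a little under 4 seconds, and each speaker contributed about 1.4 hours on average, which is consistent with the requested minimum of 60 minutes per speaker.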
Polish Speech Database is distributed via web download.
2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(3) 2016 NIST Speaker Recognition Evaluation Test Set was developed by LDC and NIST (National Institute of Standards and Technology) and contains approximately 340 hours of short segments of Tagalog, Cantonese, Cebuano and Mandarin telephone speech used as development and test data in the NIST-sponsored 2016 Speaker Recognition Evaluation (SRE).
As in previous evaluations, SRE16 focused on telephone speech recorded over a variety of handset types for the training and test conditions. In addition to development and evaluation data, this corpus also contains trial lists, their associated keys, tables containing metadata information, and evaluation documentation.
The telephone speech data was drawn from the Call My Net 2015 Corpus collected by LDC. Native speakers of Tagalog, Cantonese, Cebuano or Mandarin (220 unique speakers) made a total of ten telephone calls each to people within their existing social networks. Speakers were encouraged to use different telephone instruments in a variety of acoustic settings and were instructed to talk for 8 to 10 minutes per call on a topic of their choice. All conversations were collected outside North America.
2016 NIST Speaker Recognition Evaluation Test Set is distributed via web download.
2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.