Spring 2018 LDC Data Scholarship recipients
LDC data and commercial technology development
New Publications:
Multi-Language Conversational Telephone Speech 2011 -- Central Asian
LORELEI Amharic Representative Language Pack - Monolingual and Parallel Text
TAC KBP Comprehensive English Source Corpora 2009-2014
IARPA Babel Tok Pisin Language Pack IARPA-babel207b-v1.0e
______________________________________________________________
Only two weeks left to enjoy 2018 membership discounts
There is still time to save on 2018 membership fees.
Through March 1, all organizations receive a discount on the 2018 membership
fee (up to 10%) when they choose to join or renew.
For more information on membership benefits, visit Join LDC.
Spring 2018 LDC Data Scholarship recipients
Congratulations to the recipients of LDC's Spring 2018 Data Scholarship:
Margarida Madaleno: London School of Economics, PhD
Economic Geography. Madaleno is awarded a copy of Treebank 3 for her research
in emotional well-being.
Gary Munnelly: Trinity College Dublin, PhD Computer
Science and Statistics. Munnelly is awarded a copy of the New York Times Annotated Corpus
for his research in named entity recognition and disambiguation in cultural
heritage data sets.
Barlian Henryanu Prasetio: University of Miyazaki, PhD
Environmental Robotics. Prasetio is awarded copies of SUSAS and SUSAS Transcripts for his
work in voice stress recognition systems.
For information about the program, visit the Data
Scholarship page.
LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership
is a prerequisite for obtaining a commercial license to almost all LDC
databases. Non-member organizations, including non-member for-profit
organizations, cannot use LDC data to develop or test products for
commercialization, nor can they use LDC data in any commercial product or for
any commercial purpose. LDC data users should consult corpus-specific license
agreements for limitations on the use of certain corpora. Visit the Licensing
page for further information.
New publications:
(1) Multi-Language Conversational Telephone Speech 2011 – Central Asian
was developed by LDC and comprises approximately 37 hours of telephone
speech in three distinct language varieties of Central Asia: Dari, Farsi
and Pashto.
The data were collected primarily to support research and technology
evaluation in automatic language identification, and portions of these
telephone calls were used in the NIST 2011 Language Recognition Evaluation (LRE).
Participants were recruited by native speakers who contacted acquaintances in
their social network. Those native speakers made one call of up to 15 minutes
to each acquaintance.
LDC has also released the following as part of the Multi-Language
Conversational Telephone Speech 2011 series:
- Slavic Group (LDC2016S11)
- Turkish (LDC2017S09)
- South Asian (LDC2017S14)
Multi-Language Conversational
Telephone Speech 2011 – Central Asian is distributed via web
download.
2018 Subscription Members will receive copies of this
corpus. 2018 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for a fee.
*
(2) LORELEI Amharic Representative Language Pack - Monolingual and Parallel
Text was developed by LDC and comprises approximately 25 million words of
monolingual Amharic text, approximately 600,000 of which are translated into
English. Another 80,000 words are translated from English into Amharic.
The LORELEI (Low Resource Languages for Emergent Incidents)
Program is concerned with building human language technology for low resource
languages in the context of emergent situations like natural disasters or
disease outbreaks.
Data was collected in the following genres:
discussion forums, news, reference, social network and weblog. Both monolingual
text collection and parallel text creation involved a combination of manual and
automatic methods, which are detailed in the included documentation. All
harvested content was initially converted from its original HTML form into a
relatively uniform XML format. Also included in this release are two tools: one
to recreate original source data from the processed XML material and the other
to condition text data users download from Twitter.
LORELEI Amharic Representative Language Pack -
Monolingual and Parallel Text is distributed via web download.
2018 Subscription Members will receive copies of this
corpus. 2018 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for a fee.
*
(3) TAC KBP Comprehensive English Source Corpora 2009-2014 was developed by
LDC and contains the 3,877,207 English source documents used in support of the
TAC KBP tasks from 2009 to 2014. The Text Analysis Conference (TAC) is a series
of workshops organized by the National Institute of Standards and Technology
(NIST). TAC was
developed to encourage research in natural language processing and related
applications by providing a large test collection, common evaluation
procedures, and a forum for researchers to share their results. Through its
various evaluations, the Knowledge Base Population (KBP) track of TAC
encourages the development of systems that can match entities mentioned in
natural texts with those appearing in a knowledge base and extract novel
information about entities from a document collection and add it to a new or
existing knowledge base.
The source data consists of newswire,
broadcast material, and web text collected by LDC. Documents are released as a
collection of zip files for overall compactness, and ease and efficiency of
use. When unpacked, the documents are all UTF-8 text files with a basic markup
structure.
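For readers who want to work with the documents without unpacking every archive to disk, the following is a minimal Python sketch, not an LDC-supplied tool. It assumes only what is stated above (zip files containing UTF-8 text); the function name iter_documents, the sample file name and the sample markup are hypothetical and are not taken from the corpus documentation.

```python
import io
import zipfile
from typing import Iterator, Tuple


def iter_documents(archive) -> Iterator[Tuple[str, str]]:
    """Yield (member_name, text) for each file in a zip archive.

    `archive` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # skip directory entries
            with zf.open(info) as fh:
                # Assumption from the release notes: documents are UTF-8 text.
                yield info.filename, fh.read().decode("utf-8")


if __name__ == "__main__":
    # Stand-in archive built in memory; the file name and markup below are
    # hypothetical placeholders, not actual corpus content.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("sample/doc_0001.txt", "<doc>\nExample text.\n</doc>\n")
    buf.seek(0)
    for name, text in iter_documents(buf):
        print(name, len(text))
```

Reading members one at a time keeps memory use low, which may matter for a collection of this size. The actual archive layout and markup are described in the documentation included with the release.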
TAC KBP Comprehensive English
Source Corpora 2009-2014 is distributed via web download.
2018 Subscription Members will
receive copies of this corpus. 2018 Standard Members may request a copy as part
of their 16 free membership corpora. Non-members may license this data for a
fee.
*
(4) IARPA Babel Tok Pisin Language Pack IARPA-babel207b-v1.0e was developed
by Appen for the IARPA (Intelligence Advanced Research Projects Activity)
Babel program.
It contains approximately 200 hours of Tok Pisin conversational and scripted
telephone speech collected in 2013 along with corresponding transcripts.
The Tok Pisin speech in this release represents that spoken in the
Papuan dialect region of Papua New Guinea. The gender distribution among
speakers is approximately equal; speakers' ages range from 16 years to 65
years. Calls were made using different telephones (e.g., mobile, landline) from
a variety of environments including the street, a home or office, a public place,
and inside a vehicle.
IARPA Babel Tok Pisin Language Pack
IARPA-babel207b-v1.0e is available via web download.
2018 Subscription Members will receive copies of this
corpus provided they have submitted a completed copy of the special license
agreement. 2018 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for a fee.