LDC data and commercial technology development
New Publications:
IARPA Babel Cebuano Language Pack IARPA-babel301b-v2.0b
__________________________________________________________________________
LDC Catalog certified as CoreTrustSeal data repository
LDC is pleased to announce that the Catalog has been awarded the CoreTrustSeal for recognition as a
trustworthy data repository. This means that the Catalog meets a series of
standards covering data access, rights management, curation, and storage
developed by the ICSU World Data System and the Data Seal of Approval. LDC
joins the other 136 certified repositories around the globe in the commitment
to promote sustainable and trustworthy data infrastructures.
LDC data and commercial technology development
For-profit organizations are reminded that LDC
membership is a prerequisite for obtaining a commercial license to almost all
LDC databases. Non-member organizations, including non-member for-profit
organizations, cannot use LDC data to develop or test products for
commercialization, nor can they use LDC data in any commercial product or for
any commercial purpose. LDC data users should consult corpus-specific license
agreements for limitations on the use of certain corpora. Visit the Licensing
page for further information.
New publications:
(1) BOLT Chinese SMS/Chat was developed by LDC and consists of
naturally-occurring Short Message Service (SMS) and Chat (CHT) data collected
through data donations and live collection involving native speakers of
Chinese. The corpus contains 14,877 conversations totaling 3,005,810 words
across 497,543 messages.
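As a rough sense of scale, the three totals above imply average conversation and message lengths. The short sketch below only divides the published figures; the variable names are illustrative, and the averages say nothing about how lengths are actually distributed in the corpus.

```python
# Totals as stated in the corpus description above.
conversations = 14_877
messages = 497_543
words = 3_005_810

# Simple averages derived from those totals (illustrative only).
messages_per_conversation = messages / conversations
words_per_message = words / messages
words_per_conversation = words / conversations

print(f"messages per conversation: {messages_per_conversation:.1f}")
print(f"words per message:         {words_per_message:.1f}")
print(f"words per conversation:    {words_per_conversation:.1f}")
```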
The BOLT (Broad
Operational Language Translation) program developed machine translation and
information retrieval for less formal genres, focusing particularly on
user-generated content. LDC supported the BOLT program by collecting informal
data sources – discussion forums, text messaging, and chat – in Chinese,
Egyptian Arabic, and English. The collected data was translated and annotated
for various tasks including word alignment, treebanking, propbanking, and
co-reference. The data in this release was collected using two methods: new
collection via LDC's collection platform, and donation of SMS or chat archives
from BOLT collection participants.
BOLT Chinese SMS/Chat is distributed via web download.
2018 Subscription Members will automatically receive
copies of this corpus. 2018 Standard Members may request a copy as part of
their 16 free membership corpora. Non-members may license this data for a
fee.
*
(2) Multi-Language Conversational Telephone Speech 2011 -- Central European
was developed by LDC and comprises approximately 44 hours of telephone speech
in two distinct language varieties of Central Europe: Czech and Slovak.
The data were collected primarily to support research and
technology evaluation in automatic language identification, and portions of
these telephone calls were used in the NIST 2011 Language Recognition
Evaluation (LRE). Participants were recruited by native
speakers who contacted acquaintances in their social networks. Those native
speakers made one call of up to 15 minutes to each acquaintance. Human auditors
labeled the calls for callee gender, dialect type, and noise.
LDC has also released the following as part of the
Multi-Language Conversational Telephone Speech 2011 series:
· Slavic Group (LDC2016S11)
· Turkish (LDC2017S09)
· South Asian (LDC2017S14)
· Central Asian (LDC2018S03)
Multi-Language Conversational Telephone Speech 2011 --
Central European is distributed via web download.
2018 Subscription Members will automatically receive
copies of this corpus. 2018 Standard Members may request a copy as part of
their 16 free membership corpora. Non-members may license this data for a
fee.
*
(3) TAC KBP English Entity Linking - Comprehensive Training and Evaluation
Data 2009-2013 was developed by LDC and contains training and evaluation data
produced in support of the TAC KBP English Entity Linking tasks in 2009, 2010,
2011, 2012, and 2013. It includes queries
and gold standard entity type information, Knowledge Base links, and
equivalence class clusters for NIL entities. Also included are the source
documents for the queries, specifically, English newswire, discussion forum,
and web data. The corresponding knowledge base is available as TAC KBP
Reference Knowledge Base (LDC2014T16).
Also included in this package are the results of an Entity Linking IAA
(Inter-Annotator Agreement) study conducted in 2010.
TAC KBP encourages the development of systems that can
match entities mentioned in natural texts with those appearing in a knowledge
base, extract novel information about entities from a document collection,
and add it to a new or existing knowledge base. English Entity Linking was
first conducted as part of the 2009 TAC KBP evaluations. Its goal is to measure
systems' ability to determine whether an entity, specified by a query, has a
matching node in a reference knowledge base (KB) and, if so, to create a link
between the two. If there is no matching node for a query entity in the KB,
entity linking (EL) systems are required to cluster the mention together with
others referencing
the same entity.
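The link-or-cluster decision described above can be sketched in a few lines of Python. This is a toy illustration, not the evaluation's method or any participant's system: it assumes exact matching on a normalized name, whereas real systems also use the context in the source document. The `Query` class, the `link_queries` function, and the ID formats are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    query_id: str  # hypothetical query identifier
    mention: str   # entity name string as it appears in the source document


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


def link_queries(queries, kb):
    """Map each query ID to a KB node ID or to a NIL cluster ID.

    `kb` is assumed to be a dict from normalized entity name to KB node ID.
    Queries with no KB match that share a normalized name are placed in the
    same NIL cluster (a deliberately naive clustering rule).
    """
    links = {}
    nil_clusters = {}
    for q in queries:
        key = normalize(q.mention)
        if key in kb:
            # A matching node exists: link the query to it.
            links[q.query_id] = kb[key]
        else:
            # No matching node: reuse or create a NIL cluster for this name.
            if key not in nil_clusters:
                nil_clusters[key] = f"NIL{len(nil_clusters) + 1:04d}"
            links[q.query_id] = nil_clusters[key]
    return links
```

For example, with a one-entry knowledge base `{"barack obama": "E001"}`, a query for "Barack Obama" is linked to `E001`, while two queries for "John Smith" and "john smith" land in the same NIL cluster.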
TAC KBP English Entity Linking - Comprehensive Training
and Evaluation Data 2009-2013 is distributed via web download.
2018 Subscription Members will automatically receive
copies of this corpus. 2018 Standard Members may request a copy as part of
their 16 free membership corpora. Non-members may license this data for a
fee.
*
(4) IARPA Babel Cebuano Language Pack IARPA-babel301b-v2.0b was developed by
Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel
program.
It contains approximately 191 hours of Cebuano conversational and scripted telephone
speech collected in 2013 and 2014 along with corresponding transcripts.
The Cebuano speech in this release represents that spoken
in the Cebu-North Kana, Sialo, and Mindanao dialect regions of the Philippines.
The gender distribution among speakers is approximately equal; speakers' ages
range from 16 years to 75 years. Calls were made using different telephones
(e.g., mobile, landline) from a variety of environments including the street, a
home or office, a public place, and inside a vehicle.
IARPA Babel Cebuano Language Pack IARPA-babel301b-v2.0b
is available via web download.
2018 Subscription Members will receive copies of this
corpus provided they have submitted a completed copy of the special license agreement.
2018 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.