LREC Workshop on Citizen Linguistics - Deadline Extended
New Publications:
TAC KBP English Event Argument - Training and Evaluation Data 2014-2015
Chinese CogBank
Machine Reading Phase 1 IC Training Data
IARPA Babel Dholuo Language Pack IARPA-babel403b-v1.0b
__________________________________________________________________________
Only two weeks left to enjoy 2020 membership discounts
There is still time to save on 2020 membership fees. Through March 2, all organizations receive a discount on the 2020 membership fee (up to 10%) when they choose to join or renew. For more information on membership benefits, visit Join LDC.
LREC Workshop on Citizen Linguistics - Deadline Extended
LDC researchers and their colleagues are organizing a workshop on Citizen Linguistics and Language Resource Development at LREC 2020 (Language Resources and Evaluation Conference), to take place on May 16, 2020. The workshop includes an open call for papers in language-related citizen science, a tutorial on using the new LanguageARC.org citizen linguistics portal, and a special session on best papers using LanguageARC. The Call for Papers deadline has been extended until February 24, 2020.
__________________________________________________________________________
New publications:
(1) TAC KBP English Event Argument - Training and Evaluation Data 2014-2015 was developed by LDC and contains training and evaluation data produced in support of the 2014 TAC KBP English Event Argument Extraction Pilot and Evaluation tasks and the 2015 English Event Argument Extraction and Linking Training and Evaluation tasks.
The Event Argument Extraction and Linking task required systems
to extract event arguments (entities or attributes playing a role in an event)
from unstructured text, indicate the role they play in an event, and link the
arguments appearing in the same event to each other. Since the extracted
information must be suitable as input to a knowledge base, systems constructed
tuples indicating the event type, the role played by the entity in the event,
and the most canonical mention of the entity from the source document. The
event types and roles were drawn from an externally-specified ontology of 31
event types, which included financial transactions, communication events, and
attacks.
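
The tuple structure described above can be pictured with a minimal Python sketch. This is an illustration only, not the corpus file format: the field names, the `event_id` linking key, and the event type and role labels are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EventArgument:
    event_type: str         # illustrative label, not taken from the corpus
    role: str               # role the entity plays in the event
    canonical_mention: str  # most canonical mention in the source document
    event_id: str           # hypothetical key linking arguments of one event


def link_arguments(arguments):
    """Group arguments that appear in the same event."""
    linked = defaultdict(list)
    for arg in arguments:
        linked[arg.event_id].append(arg)
    return dict(linked)


# Invented example data for illustration.
arguments = [
    EventArgument("Conflict.Attack", "Attacker", "the rebels", "e1"),
    EventArgument("Conflict.Attack", "Target", "the convoy", "e1"),
    EventArgument("Transaction.Transfer-Money", "Giver", "the bank", "e2"),
]
events = link_arguments(arguments)
```

Here `events["e1"]` holds the two arguments of the same attack event, which is the kind of grouping the linking step asks systems to produce.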
This corpus includes source documents, manual runs, assessments,
and event hoppers, a form of identity coreference for events (2015 only). Source data is English newswire and
discussion forum text collected by LDC.
TAC KBP English
Event Argument - Training and Evaluation Data 2014-2015 is distributed via web
download.
2020 Subscription
Members will automatically receive copies of this corpus. 2020 Standard Members
may request a copy as part of their 16 free membership corpora. Non-members may
license this data for a fee.
*
(2) Chinese CogBank is a database of cognitive properties of Chinese words intended for use in metaphor understanding and generation. It consists of 232,497 "word-property" pairs, which comprise 83,104 words and 100,195 properties. Each "word-property" type also has an associated frequency that can serve as a functional measure of the importance of a property.
The data was collected via the Chinese search engine Baidu.com. The original collection consisted of 1,258,430 types (5,637,500 tokens) of "word-adjective" pairs that were reduced in Chinese CogBank to 232,497 "word-property" pairs after a series of manual checks.
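
As a rough sketch of how the frequency attached to each pair might be used to rank a word's properties, consider the Python below. The record layout, the `top_properties` helper, the English placeholder words, and the counts are all hypothetical; the actual database contains Chinese words and may be organized differently.

```python
# Hypothetical (word, property, frequency) records; values are invented.
records = [
    ("fox", "cunning", 120),
    ("fox", "quick", 45),
    ("ice", "cold", 300),
    ("ice", "hard", 80),
]


def top_properties(records, word, k=1):
    """Return the k most frequent properties recorded for a word."""
    matches = [(prop, freq) for w, prop, freq in records if w == word]
    matches.sort(key=lambda pair: pair[1], reverse=True)
    return [prop for prop, _ in matches[:k]]
```

Calling `top_properties(records, "ice", k=2)` returns the two properties with the highest counts for that word, treating frequency as a proxy for importance.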
2020 Subscription
Members will automatically receive copies of this corpus. 2020 Standard Members
may request a copy as part of their 16 free membership corpora. Non-members may
license this data for a fee.
*
(3) Machine Reading Phase 1 IC Training Data was developed by LDC for use in the DARPA (Defense Advanced Research Projects Agency) Machine Reading program. It contains 248 English source documents and 116 standoff annotation files, annotated with instances of explicit relations and their arguments, as well as some non-explicit relations.
The Machine Reading
program aimed to develop automated reading systems to bridge the gap between
knowledge contained in natural language texts and knowledge accessible to
formal reasoning systems. The reading systems designed by program participants
were required to extract and reason about facts from text in multiple domains.
The data in this
release constitutes the training data for the IC (Core Domain) task, which
tested the core domain by extracting information about Entities (people,
organizations, geopolitical entities) and their involvement in four types of
Relations (Attack Relations, Biographical Relations, Affiliation Relations and
Family Relations), as described in newswire text. This information was then
aligned with an IC Use Cases ontology that would allow automated reasoning
about the extracted Entities and Relations.
Machine Reading
Phase 1 IC Training Data is distributed via web download.
2020 Subscription
Members will automatically receive copies of this corpus. 2020 Standard Members
may request a copy as part of their 16 free membership corpora. Non-members may
license this data for a fee.
*
(4) IARPA Babel Dholuo Language Pack IARPA-babel403b-v1.0b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 204 hours of Dholuo conversational and scripted telephone speech collected in 2014 and 2015 along with corresponding transcripts.
The Dholuo speech in
this release represents the South Nyanza and Trans-Yala dialect regions of Kenya.
The gender distribution among speakers is approximately equal; speakers' ages
range from 16 years to 65 years. Calls were made using different telephones
(e.g., mobile, landline) from a variety of environments including the street, a
home or office, a public place, and inside a vehicle.
IARPA Babel Dholuo
Language Pack IARPA-babel403b-v1.0b is distributed via web download.
2020 Subscription
Members will receive copies of this corpus provided they have submitted a
completed copy of the special license agreement. 2020 Standard Members may
request a copy as part of their 16 free membership corpora. Non-members may
license this data for a fee.
*