Monday, February 17, 2020

LDC 2020 February Newsletter

Only two weeks left to enjoy 2020 membership discounts
LREC Workshop on Citizen Linguistics - Deadline Extended 

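New Publications:
TAC KBP English Event Argument - Training and Evaluation Data 2014-2015
Chinese CogBank
Machine Reading Phase 1 IC Training Data
IARPA Babel Dholuo Language Pack IARPA-babel403b-v1.0b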
__________________________________________________________________________ 
Only two weeks left to enjoy 2020 membership discounts 

There is still time to save on 2020 membership fees. Through March 2, all organizations receive a discount on the 2020 membership fee (up to 10%) when they choose to join or renew. For more information on membership benefits, visit Join LDC.

LREC Workshop on Citizen Linguistics - Deadline Extended

LDC researchers and their colleagues are organizing a workshop on Citizen Linguistics and Language Resource Development at LREC 2020 (Language Resources and Evaluation Conference), to take place on May 16, 2020. The workshop includes an open call for papers in language-related citizen science, a tutorial on using the new LanguageARC.org citizen linguistics portal and a special session on best papers using LanguageARC. The call for papers deadline has been extended to February 24, 2020.
_______________________________________________________________________

New publications:
 

(1) TAC KBP English Event Argument - Training and Evaluation Data 2014-2015 was developed by LDC and contains training and evaluation data produced in support of the 2014 TAC KBP English Event Argument Extraction Pilot and Evaluation tasks and the 2015 English Event Argument Extraction and Linking Training and Evaluation tasks.

The Event Argument Extraction and Linking task required systems to extract event arguments (entities or attributes playing a role in an event) from unstructured text, indicate the role they play in an event, and link the arguments appearing in the same event to each other. Since the extracted information must be suitable as input to a knowledge base, systems constructed tuples indicating the event type, the role played by the entity in the event, and the most canonical mention of the entity from the source document. The event types and roles were drawn from an externally-specified ontology of 31 event types, which included financial transactions, communication events, and attacks. 
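
The tuple structure described above can be pictured with a minimal Python sketch. The field names, label strings, and the example sentence are invented for illustration; they are not the task's official submission format or ontology labels.

```python
from typing import NamedTuple


class EventArgument(NamedTuple):
    """One extracted argument; field names are illustrative assumptions."""
    event_type: str         # placeholder label, not an official ontology name
    role: str               # role the entity plays in the event
    canonical_mention: str  # most canonical mention of the entity in the document


# Invented example for the sentence "Rebels attacked the convoy near the border."
attack_arguments = [
    EventArgument("attack", "attacker", "rebels"),
    EventArgument("attack", "target", "the convoy"),
    EventArgument("attack", "place", "the border"),
]

# Linking: arguments that belong to the same event are grouped together.
# The "event-1" identifier is hypothetical.
linked_events = {"event-1": attack_arguments}

roles = [arg.role for arg in linked_events["event-1"]]
print(roles)
```

Grouping the tuples under one event identifier is only one possible way to represent the linking step; the corpus documentation defines the actual formats.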

This corpus includes source documents, manual runs, assessments, and event hoppers, a form of identity coreference for events (2015 only).  Source data is English newswire and discussion forum text collected by LDC. 

TAC KBP English Event Argument - Training and Evaluation Data 2014-2015 is distributed via web download. 

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) Chinese CogBank is a database of cognitive properties of Chinese words intended for use in metaphor understanding and generation. It consists of 232,497 "word-property" pairs, comprising 83,104 words and 100,195 properties. Each "word-property" type also has an associated frequency, which can serve as a functional measure of the importance of a property.
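
As a rough sketch of how frequency can act as an importance measure, the Python snippet below ranks the properties of each word by frequency. The entries are invented English stand-ins (the real data is Chinese), and the in-memory layout is an assumption rather than the corpus file format.

```python
from collections import defaultdict

# Invented toy entries: (word, property, frequency).
pairs = [
    ("snow", "cold", 45),
    ("snow", "white", 120),
    ("fox", "cunning", 80),
]


def properties_by_importance(entries):
    """Group properties by word, most frequent property first."""
    table = defaultdict(list)
    for word, prop, freq in entries:
        table[word].append((prop, freq))
    return {
        word: sorted(props, key=lambda item: item[1], reverse=True)
        for word, props in table.items()
    }


ranked = properties_by_importance(pairs)
print(ranked["snow"])
```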

The data was collected via the Chinese search engine Baidu.com. The original collection consisted of 1,258,430 types (5,637,500 tokens) of "word-adjective" pairs that were reduced in Chinese CogBank to 232,497 "word-property" pairs after a series of manual checks.

Chinese CogBank is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.



(3) Machine Reading Phase 1 IC Training Data was developed by LDC for use in the DARPA (Defense Advanced Research Projects Agency) Machine Reading program. It contains 248 English source documents and 116 standoff annotation files, annotated with instances of explicit relations and their arguments, as well as some non-explicit relations.
  
The Machine Reading program aimed to develop automated reading systems to bridge the gap between knowledge contained in natural language texts and knowledge accessible to formal reasoning systems. The reading systems designed by program participants were required to extract and reason about facts from text in multiple domains.
  
The data in this release constitutes the training data for the IC (Core Domain) task, which tested the core domain by extracting information about Entities (people, organizations, geopolitical entities) and their involvement in four types of Relations (Attack Relations, Biographical Relations, Affiliation Relations and Family Relations), as described in newswire text. This information was then aligned with an IC Use Cases ontology that would allow automated reasoning about the extracted Entities and Relations. 

Machine Reading Phase 1 IC Training Data is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(4) IARPA Babel Dholuo Language Pack IARPA-babel403b-v1.0b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 204 hours of Dholuo conversational and scripted telephone speech collected in 2014 and 2015 along with corresponding transcripts. 

The Dholuo speech in this release represents the South Nyanza and Trans-Yala dialect regions of Kenya. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 65 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle. 

IARPA Babel Dholuo Language Pack IARPA-babel403b-v1.0b is distributed via web download. 

2020 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Wednesday, January 15, 2020

LDC 2020 January Newsletter

Renew Your LDC Membership Today
LREC Workshop on Citizen Linguistics – Call for Papers

New Publications: 
Abstract Meaning Representation (AMR) Annotation Release 3.0
Database of Word Level Statistics – Mandarin 
LibriVox Spanish 
________________________________________________________________________

Renew Your LDC Membership Today

Join LDC for membership year (MY) 2020 while membership savings are still available. Now through March 2, 2020, renewing MY2019 members receive a 10% discount on the 2020 membership fee. New or returning member organizations receive a 5% discount. This year’s planned publications include Mixer 4 and 5 Speech (English telephone speech and interviews), IARPA Babel Language Packs (telephone speech and transcripts in underserved languages), and data from BOLT, DEFT, RATS, TAC KBP and more. Membership remains the most economical way to access LDC releases. Visit Join LDC for details on membership options and benefits.

LREC Workshop on Citizen Linguistics

LDC researchers and their colleagues are organizing a workshop on Citizen Linguistics and Language Resource Development at LREC 2020 (Language Resources and Evaluation Conference), to take place on May 16, 2020. The workshop includes an open call for papers in language-related citizen science, a tutorial on using the new LanguageARC.org citizen linguistics portal and a special session on best papers using LanguageARC.
________________________________________________________________________  

New publications:

(1) Abstract Meaning Representation (AMR) Annotation Release 3.0 was developed by LDC, SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of over 59,255 English natural language sentences from broadcast conversations, newswire, weblogs, web discussion forums, fiction and web text. This release updates Abstract Meaning Representation 2.0 (LDC2017T10) with new data, more annotations on new and prior data, new or improved PropBank-style frames, enhanced quality control, and multi-sentence annotations.

AMR captures "who is doing what to whom" in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree-structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independent of its syntax.

Abstract Meaning Representation (AMR) Annotation Release 3.0 is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.  

(2) Database of Word Level Statistics – Mandarin was developed by The Hong Kong Polytechnic University. It provides descriptive and statistical lexical characteristics for words and nonwords of Mandarin Chinese. It is designed for researchers particularly concerned with language processing of isolated words. Invariant characteristics include each item's lexicality, SAMPA, pinyin, IPA transcription, lexical tone, syllable structure, syllable length, pinyin length, segment length, dominant PoS, lexical frequency of the dominant PoS, percent of that dominant PoS, and other PoSes associated with the given item.

Database of Word Level Statistics – Mandarin is distributed via web download.

2020 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) LibriVox Spanish consists of approximately 73 hours of Spanish read speech and transcripts. The audio data was taken from Spanish audiobooks developed by LibriVox, a non-profit project that creates audiobooks from public domain works. The transcripts were developed for this release.
  
The audio consists of sentences from 300 books read by 154 speakers (77 men and 77 women), representing native and non-native Spanish read speech. Audio files were manually segmented and are between three and ten seconds in length. Native Spanish speakers transcribed the audio data.

LibriVox Spanish is distributed via web download. 

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.