Linguistic Data Consortium: October 2018

In this newsletter:

Fall 2018 LDC Data Scholarship Recipients
Membership Year 2019 Publication Preview

New Publications:
Concretely Annotated English Gigaword
TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation Data 2009-2014
TRAD Arabic-French Parallel Text -- Newswire
__________________________________________________________________________

Fall 2018 LDC Data Scholarship Recipients

Congratulations to the recipients of LDC's Fall 2018 Data Scholarships:

Utkrist Adhikari: University of Bonn (Germany); M.Sc., Computer Science. Utkrist is awarded a copy of Treebank-2 for his research in named entity recognition, supersense tagging, and semantic role labeling.

Vitaliya Remneva: Higher School of Economics, National Research University (Russia); M.Sc., System and Software Engineering. Vitaliya is awarded a copy of ETS Corpus of Non-Native Written English for her work in author profiling through natural language processing.

Tian Xiaoyu: Shanghai International Studies University (China); MA, Linguistics. Tian is awarded a copy of Tagged Chinese Gigaword Version 2.0 for her research in causative construction variations in Mainland Chinese, Taiwan Chinese, and Singapore Chinese.

W. Victor H. Yarlott: Florida International University (US); Ph.D., School of Computing and Information Sciences. Victor is awarded a copy of ACE 2005 Multilingual Training Corpus for his research in relation extraction.

For information about the program, visit the Data Scholarship page.

Membership Year 2019 Publication Preview

The 2019 Membership Year is fast approaching and plans for next year’s publications are in progress. Among the expected releases are:

SRI Speech-Based Collaborative Learning Corpus: speech from over 100 US middle school students performing collaborative learning tasks; includes audio recordings, orthographic transcriptions, manual annotation of collaboration, and related documentation
Chinese Abstract Meaning Representation (AMR): developed by Nanjing Normal University and Brandeis University, semantic representation of approximately 10,000 Chinese sentences following the basic principles of AMR using web source data from Chinese Treebank 8.0 (LDC2013T21)
Multilanguage conversational telephone speech: developed to support language identification research in related languages (Arabic, East Asian, English, Mandarin)
TAC KBP: English entity discovery and linking, nugget detection and event argument data, Chinese slot-filling data
IARPA Babel Language Packs (telephone speech and transcripts): languages include Amharic, Guarani, Igbo, and Lithuanian
HAVIC Med Progress Test data: web video, metadata, and annotations for developing multimedia systems
BOLT: discussion forums, SMS, word-aligned and tagged data in all languages (Chinese, Egyptian Arabic, English)

Check your inbox in the coming weeks for more information about membership renewal.

New Publications:

(1) Concretely Annotated English Gigaword was developed by Johns Hopkins University's Human Language Technology Center of Excellence. It adds multiple kinds and instances of automatically generated syntactic, semantic, and coreference annotations to English Gigaword Fifth Edition (LDC2011T07). Concrete is a schema for representing structured, hierarchical, and overlapping linguistic annotations. This release includes the output of multiple tools for the same annotation types, represented as different annotation theories under a shared tokenization.

Concretely Annotated English Gigaword contains the nearly ten million documents (over four billion words) of the original English Gigaword Fifth Edition, which consists of newswire stories from seven sources collected by LDC between 1994 and 2010.

Concretely Annotated English Gigaword is distributed via hard drive.

2018 Subscription Members will automatically receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Any organization that licensed English Gigaword Fifth Edition (LDC2011T07) or Annotated English Gigaword (LDC2012T21) may request a copy of Concretely Annotated English Gigaword for a media fee. Non-members may license this data for a fee.

(2) TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation Data 2009-2014 was developed by LDC and contains training and evaluation data produced in support of the TAC KBP Slot Filling evaluation track conducted from 2009 to 2014.

The Text Analysis Conference (TAC) is a series of workshops organized by the National Institute of Standards and Technology (NIST). TAC was developed to encourage research in natural language processing and related applications by providing a large test collection, common evaluation procedures, and a forum for researchers to share their results.

The regular English Slot Filling evaluation track involved mining information about entities from text. In completing the task, participating systems and LDC annotators searched a corpus for information on certain attributes (slots) of person and organization entities and attempted to return all valid answers (slot fillers) in the source collection. For more information about English Slot Filling, please refer to the 2014 track home page.

This release contains queries, the 'manual runs' (human-produced responses to the queries), and the final rounds of assessment results.

TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation Data 2009-2014 is distributed via web download.

2018 Subscription Members will automatically receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) TRAD Arabic-French Parallel Text -- Newswire was developed by ELDA as part of the PEA-TRAD project. It contains French translations of a subset of approximately 20,000 Arabic words from NIST 2008 Open Machine Translation (OpenMT) Evaluation (LDC2010T21). The purpose of the PEA-TRAD project (Translation as a Support for Document Analysis) was to develop speech-to-speech translation technology for multiple languages (e.g., Arabic, Chinese, Pashto) from a variety of domains.

This release consists of 813 segments (translation units) from 74 documents. The Arabic source file contains 19,902 words and the French reference translation contains 29,104 words. The source data is Arabic newswire text collected and translated into English by LDC. Information about the ELDA translation team, translation guidelines, and validation results is contained in the documentation accompanying this release.

TRAD Arabic-French Parallel Text -- Newswire is distributed via web download.

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Linguistic Data Consortium

Monday, October 15, 2018

LDC 2018 October Newsletter