Linguistic Data Consortium: German

Thursday, April 15, 2021

LDC April 2021 Newsletter

New Publications:

X-SRL: Parallel Cross-lingual Semantic Role Labeling
TAC KBP English Sentiment Slot Filling – Comprehensive Training and Evaluation Data 2013-2014
_____________________________________________________________________________

New Publications:

(1) X-SRL: Parallel Cross-lingual Semantic Role Labeling was developed by Heidelberg University, Department of Computational Linguistics and the Leibniz Institute for the German Language (IDS). It consists of approximately three million words of German, French and Spanish text annotated for semantic role labeling. The texts are translations of the English portion of 2009 CoNLL Shared Task Part 2 (LDC2012T04). All sentences have annotations for verbal predicates and share the original English PropBank label set across the four languages.

The 2009 CoNLL Shared Task combined syntactic dependency annotations with semantic dependencies that model the roles of both verbal and nominal predicates. The following English data was used in the shared task:

Treebank-2 (LDC95T7): over one million words of annotated English newswire and other text developed by the University of Pennsylvania
Proposition Bank I (LDC2004T14): semantic annotation of newswire text from Treebank-2 developed by the University of Pennsylvania
NomBank v 1.0 (LDC2008T23): argument structure for instances of common nouns in Treebank-2 and Treebank-3 (LDC99T42), developed by New York University

For X-SRL, the English source data was automatically translated using DeepL. Automatic tokenization, lemmatization, part-of-speech tagging and syntactic parsing were then applied to the text. The data was divided into train, development and test partitions. Semantic labels were transferred for the train and development sections, and the test sentences were validated for translation quality, alignment, label transfer, and filtering.

X-SRL: Parallel Cross-lingual Semantic Role Labeling is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) TAC KBP English Sentiment Slot Filling – Comprehensive Training and Evaluation Data 2013-2014 was developed by LDC and contains training and evaluation data produced in support of the 2013 and 2014 TAC KBP Sentiment Slot Filling tracks. The data in this release includes queries, manual runs (human-produced query responses), and assessment results for human- and system-produced query responses. Source data was English news and web text.

The regular English Slot Filling track involved mining information about entities from text using a specified set of "slots", or attributes. The goal of the Sentiment Slot Filling task was to evaluate the quality of detectors for positive and negative sentiment.

TAC KBP English Sentiment Slot Filling – Comprehensive Training and Evaluation Data 2013-2014 is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Wednesday, April 15, 2020

LDC 2020 April Newsletter

New Publications:

2018 NIST Speaker Recognition Evaluation Test Set
Abstract Meaning Representation 2.0 - Four Translations
TAC KBP English Temporal Slot Filling - Comprehensive Training and Evaluation Data 2011 and 2013
________________________________________________________________

New Publications:

(1) 2018 NIST Speaker Recognition Evaluation Test Set was developed by LDC and NIST (National Institute of Standards and Technology) and contains approximately 396 hours of Tunisian Arabic telephone recordings and English web video speech used as development and test data in the NIST-sponsored 2018 Speaker Recognition Evaluation (SRE). This release also contains answer keys, trial and train files, development data and evaluation documentation.

The SRE task is speaker detection, that is, to determine whether a specified target speaker is speaking during a segment of speech. In addition to the traditional focus on conversational telephone speech recorded over a variety of handset types for the training and test conditions, SRE18 added VOIP (voice over IP) data and audio from video.
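As a rough illustration of the speaker detection decision described above, the sketch below treats each trial as a comparison between two fixed-length vectors, one for the target speaker and one for the test segment, and accepts the trial when their cosine similarity meets a threshold. The vector representation, the function names and the threshold value are all assumptions made for this example; they are not taken from the SRE18 evaluation plan or from the contents of this release.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cannot compare a zero vector")
    return dot / (norm_a * norm_b)


def is_target_speaker(enrollment_vec, test_vec, threshold=0.5):
    """Accept the trial when similarity meets the threshold.

    Both vectors and the default threshold are hypothetical; a real
    system would derive them from audio and tune the threshold.
    """
    return cosine_similarity(enrollment_vec, test_vec) >= threshold
```

In practice a system would report a score for each trial rather than only a yes/no decision, so that the threshold can be chosen afterwards.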

The telephone speech data was drawn from the Call My Net 2 (CMN2) collection conducted by LDC in Tunisia, in which recruited Tunisian Arabic speakers made multiple calls to friends or relatives for conversations lasting 8-10 minutes. The speech segments include PSTN (public switched telephone network) and VOIP data.

The English audio was sampled from amateur web videos collected by LDC as part of the Video Annotation for Speech Technology (VAST) project.

2018 NIST Speaker Recognition Evaluation Test Set is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) Abstract Meaning Representation 2.0 - Four Translations was developed by researchers at the University of Edinburgh, School of Informatics and consists of Spanish, German, Italian and Mandarin Chinese translations of the test split of Abstract Meaning Representation (AMR) Annotation Release 2.0 (LDC2017T10), for a total of 5,484 translated sentences (1,371 sentences per language).

AMR Annotation Release 2.0 is a semantic treebank of over 39,000 English natural language sentences from broadcast conversations, newswire and web text. The translated data in this release was designed for use in cross-lingual parsing.

The source sentences were drawn from material collected by LDC, specifically, discussion forum text from the DARPA BOLT and DARPA DEFT programs, transcripts and English translations of Mandarin Chinese broadcast news programming, Wall Street Journal text, translated Xinhua news texts, various newswire texts from NIST OpenMT evaluations and weblog data from the DARPA GALE program.

Abstract Meaning Representation 2.0 - Four Translations is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) TAC KBP English Temporal Slot Filling - Comprehensive Training and Evaluation Data 2011 and 2013 was developed by LDC and contains training and evaluation data produced in support of the TAC KBP English Temporal Slot Filling tasks in 2011 and 2013. This release includes queries, manual runs produced by LDC annotators, and the final rounds of assessment results.

The goal of the Temporal Slot Filling task was to identify and capture temporal information in text indicating when a given relation between a slot filling query entity and filler held true. This built upon the technology developed for regular Slot Filling which involved mining information about entities from text.
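One way to picture the output of such a task is a record that attaches date bounds to an (entity, slot, filler) relation. The sketch below is only an illustration: the class, its field names and the example values are hypothetical, and the four-bound layout (earliest and latest possible start, earliest and latest possible end) is an assumption for this example rather than a description of the file formats in this release.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class TemporalFill:
    """A relation plus optional bounds on when it started and ended."""
    entity: str
    slot: str
    filler: str
    start_earliest: Optional[date] = None
    start_latest: Optional[date] = None
    end_earliest: Optional[date] = None
    end_latest: Optional[date] = None

    def certainly_held_on(self, day: date) -> bool:
        """True only if the bounds guarantee the relation held on `day`."""
        if self.start_latest is None or self.end_earliest is None:
            return False
        return self.start_latest <= day <= self.end_earliest

    def possibly_held_on(self, day: date) -> bool:
        """False only if the bounds rule out the relation holding on `day`."""
        if self.start_earliest is not None and day < self.start_earliest:
            return False
        if self.end_latest is not None and day > self.end_latest:
            return False
        return True


# Invented example: employment that began some time in 2001
# and ended some time in 2005.
fill = TemporalFill(
    entity="Jane Doe",
    slot="per:employee_of",
    filler="Example Corp",
    start_earliest=date(2001, 1, 1),
    start_latest=date(2001, 12, 31),
    end_earliest=date(2005, 1, 1),
    end_latest=date(2005, 12, 31),
)
```

Keeping separate bounds for the start and the end lets a record express partial knowledge, for example a text that says only that a relation had begun by a given year.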

The corresponding source data collections of English newswire, broadcast material and web text are included in TAC KBP Comprehensive English Source Corpora 2009-2014 (LDC2018T03). The corresponding Knowledge Base (KB) for much of the data - a 2008 snapshot of Wikipedia - is available in TAC KBP Reference Knowledge Base (LDC2014T16).

TAC KBP English Temporal Slot Filling - Comprehensive Training and Evaluation Data 2011 and 2013 is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.