Linguistic Data Consortium: October 2017

LDC Awards Fall Data Scholarships

Membership Year 2018 Publication Preview

New Publications:

RATS Keyword Spotting

English Web Treebank Propbank

Ancient Chinese Corpus

MWE-Aware English Dependency Corpus Version 2.0

LDC Awards Fall Data Scholarships

LDC is pleased to award fifteen data scholarships to students this fall. Recipients are from eight countries and a variety of academic disciplines. Twenty unique data sets are being awarded to support the students’ work in diverse applications, including machine translation, text summarization (including abstractive summarization using recurrent neural networks), speech recognition for multiple languages, semantic role labeling for social data, speaker recognition for forensic applications, and more. Please look to LDC’s social media pages for upcoming announcements highlighting each recipient and their intended research. Congratulations to all of our recipients!

Membership Year 2018 Publication Preview

The 2018 Membership Year is just around the corner and plans for next year’s publications are in progress. Among the expected releases are:

Multilanguage conversational telephone speech: developed to support language identification research in related languages (Central Asian, Central European language groups)
DIRHA (Distant-speech Interaction for Robust Home Applications): Wall Street Journal read speech with noise and reverberation, suitable for various multi-microphone signal processing and distant speech recognition tasks
TRAD corpora: Chinese-French and Arabic-French parallel text (newswire, web data)
IARPA Babel Language Packs (telephone speech and transcripts): languages include Cebuano, Guarani, Kazakh, Lithuanian, Telugu, Tok Pisin
BOLT: discussion forum, SMS, word-aligned, and tagged data in all languages (Egyptian Arabic, English, Chinese)
DEFT: Spanish Treebank (newswire, web data)
RATS Language Identification data set (Dari, Farsi, Levantine Arabic, Pashto, Urdu; degraded audio signals)
TAC KBP: comprehensive English source and entity linked data (broadcast, telephone speech, newswire, web data)
German children’s handwriting (longitudinal study of weekly writing in classroom setting with enhanced output for specific spelling patterns)

Check your inbox in the coming weeks for more information about membership renewal.

New publications:

(1) RATS Keyword Spotting was developed by LDC and comprises approximately 3,100 hours of Levantine Arabic and Farsi conversational telephone speech with automatic and manual annotation of speech segments, transcripts, and keywords generated from transcript content. The corpus was created to provide training, development, and initial test sets for the keyword spotting (KWS) task in the DARPA RATS (Robust Automatic Transcription of Speech) program.

The source audio consists of conversational telephone speech recordings collected by LDC: (1) data collected for the RATS program from Levantine Arabic and Farsi speakers; and (2) material from Levantine Arabic QT Training Data Set 5, Speech (LDC2006S29) and CALLFRIEND Farsi Second Edition Speech (LDC2014S01). Transcripts of calls were either newly produced or already available from the source corpora. Potential target keywords were selected from the transcripts based on word frequencies, so that each fell within a range of target-word likelihood per hour of speech. The selected words were manually reviewed to confirm that each was a regular word or multi-word expression of more than three syllables.
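The frequency-based selection step described above can be illustrated with a rough sketch. This is not LDC's actual procedure: the function name, the thresholds, and the assumption that a transcript is already a flat list of word tokens are all hypothetical, and the manual review for syllable count is not modeled.

```python
from collections import Counter


def select_candidate_keywords(tokens, total_hours, min_per_hour, max_per_hour):
    """Return {word: rate} for words whose occurrences per hour of speech
    fall within [min_per_hour, max_per_hour].

    All parameter names and thresholds are illustrative assumptions.
    """
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    counts = Counter(tokens)
    selected = {}
    for word, count in counts.items():
        rate = count / total_hours  # occurrences per hour of speech
        if min_per_hour <= rate <= max_per_hour:
            selected[word] = rate
    return selected
```

Words that are too frequent or too rare fall outside the range and are dropped; the surviving candidates would then go to manual review.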

RATS Keyword Spotting is distributed via hard drive.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) English Web Treebank Propbank was developed by University of Colorado Boulder - CLEAR (Computational Language and Education Research) and provides predicate-argument structure annotation for English Web Treebank (LDC2012T13).

The goal of Propbank (or proposition bank) annotation is to develop annotations with information about basic semantic propositions. English Web Treebank Propbank provides semantic role annotation and predicate sense disambiguation for roughly 50,000 predicates, corresponding to all verbs, all adjectives in equational clauses, and all nouns considered to be predicative. Mark-up is in the "unified" propbank annotation format, which combines representations in nouns, verbs, and adjectives. The source data consists of weblogs, newsgroups, email, reviews, and questions-answers.

English Web Treebank Propbank is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) Ancient Chinese Corpus was developed at Nanjing Normal University. It contains word-segmented and part-of-speech tagged text from Zuozhuan, an ancient Chinese work believed to date from the Warring States Period (475-221 BC). This release is part of a continuing project to develop a large, part-of-speech tagged ancient Chinese corpus. It consists of 180,000 Chinese characters and 195,000 segment units (including words and punctuation). The part-of-speech tag set was developed by Nanjing Normal University and contains 17 tags. The data is presented as UTF-8 plain text files in traditional Chinese script.

Ancient Chinese Corpus is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(4) MWE-Aware English Dependency Corpus Version 2.0 was developed by the Nara Institute of Science and Technology Computational Linguistics Laboratory and consists of English compound function words annotated in dependency format. The data is derived from OntoNotes Release 5.0 (LDC2013T19).

Version 2.0 adds annotations of named entities (persons, locations, organizations) into dependency trees that are aware of compound function words. Version 1.0 is available from LDC as MWE-Aware English Dependency Corpus (LDC2017T01).

MWEs (multiword expressions) were identified in OntoNotes' phrase structure trees and each MWE was established as a single subtree. Those phrase structure subtrees were then converted to a dependency structure (the Stanford dependencies) in CoNLL format. The data is split into 1,728 phrase structure trees stored as *.parse files and a single 14-column, tab-separated dependency file in *.conll format. Both file types are encoded as UTF-8.
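For readers who want to load the dependency file, the following is a minimal sketch of reading a tab-separated CoNLL-style file. It assumes the usual CoNLL convention that one token occupies one line and that blank lines separate sentences; the meaning of each of the 14 columns is not stated here, so rows are kept as plain lists of strings. The function name is hypothetical.

```python
def read_conll_sentences(lines, expected_columns=14):
    """Group tab-separated token rows into sentences.

    Assumes blank lines separate sentences (a common CoNLL convention,
    not confirmed for this corpus). Each token is returned as a list of
    column strings.
    """
    sentences = []
    current = []
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) != expected_columns:
            raise ValueError(
                f"line {line_number}: expected {expected_columns} columns, "
                f"got {len(fields)}"
            )
        current.append(fields)
    if current:
        sentences.append(current)
    return sentences
```

A file would be passed in as an open handle, for example `open(path, encoding="utf-8")`, where `path` points to the *.conll file in a local copy of the corpus.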

MWE-Aware English Dependency Corpus Version 2.0 is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Linguistic Data Consortium

Wednesday, October 18, 2017

LDC October 2017 Newsletter