LDC Awards Fall Data Scholarships
Membership Year 2018 Publication Preview
New Publications:
RATS Keyword Spotting
English Web Treebank Propbank
Ancient Chinese Corpus
MWE-Aware English Dependency Corpus Version 2.0
_________________________________________________________________________
LDC Awards Fall Data Scholarships
LDC is pleased to award fifteen data scholarships to students this fall. Recipients are from eight countries and a variety of academic disciplines. Twenty unique data sets were awarded to the students for their work in diverse applications including machine translation, abstractive text summarization using recurrent neural networks, speech recognition for multiple languages, semantic role labeling for social data, text summarization, speaker recognition for forensic applications, and more. Please look to LDC’s social media pages for upcoming announcements highlighting each recipient and their intended research. Congratulations to all of our recipients!
Membership Year 2018 Publication Preview
The 2018 Membership Year is just around the corner, and plans for next year’s publications are in progress. Among the expected releases are:
- Multilanguage conversational telephone speech: developed to support language identification research in related languages (Central Asian, Central European language groups)
- DIRHA (Distant-speech Interaction for Robust Home Applications): Wall Street Journal read speech with noise and reverberation, suitable for various multi-microphone signal processing and distant speech recognition tasks
- TRAD corpora: Chinese-French and Arabic-French parallel text (newswire, web data)
- IARPA Babel Language Packs (telephone speech and transcripts): languages include Cebuano, Guarani, Kazakh, Lithuanian, Telugu, Tok Pisin
- BOLT: discussion forum, SMS, word-aligned, and tagged data in all languages (Egyptian Arabic, English, Chinese)
- DEFT: Spanish Treebank (newswire, web data)
- RATS Language Identification data set (Dari, Farsi, Levantine Arabic, Pashto, Urdu; degraded audio signals)
- TAC KBP: comprehensive English source and entity linked data (broadcast, telephone speech, newswire, web data)
- German children’s handwriting (longitudinal study of weekly writing in classroom setting with enhanced output for specific spelling patterns)
Check your inbox in the coming weeks for more information about membership renewal.
New publications:
(1) RATS Keyword Spotting was developed by LDC and comprises approximately 3,100 hours of Levantine Arabic and Farsi conversational telephone speech with automatic and manual annotation of speech segments, transcripts, and keywords generated from transcript content. The corpus was created to provide training, development, and initial test sets for the keyword spotting (KWS) task in the DARPA RATS (Robust Automatic Transcription of Speech) program.
The source audio consists of conversational telephone speech recordings collected by LDC: (1) data collected for the RATS program from Levantine Arabic and Farsi speakers; and (2) material from Levantine Arabic QT Training Data Set 5, Speech (LDC2006S29) and CALLFRIEND Farsi Second Edition Speech (LDC2014S01). Transcripts of calls were either newly produced or available from the source corpora. Potential target keywords were selected from the transcripts based on word frequencies, so that each fell within a target range of occurrences per hour of speech. The selected words were manually reviewed to confirm that each was a regular word or multi-word expression of more than three syllables.
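The frequency-based selection step described above can be illustrated with a minimal sketch. This is not LDC's actual procedure: the function name `candidate_keywords`, the whitespace tokenization, and the rate bounds are all hypothetical, chosen only to show how a per-hour occurrence range might filter a vocabulary before manual review.

```python
from collections import Counter


def candidate_keywords(transcripts, total_hours, min_per_hour, max_per_hour):
    """Return words whose occurrences per hour of speech fall within a range.

    All parameters are illustrative assumptions; the announcement does not
    state the thresholds or tokenization used for the corpus.
    """
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    # Count every whitespace-separated token across all transcripts.
    counts = Counter(word for text in transcripts for word in text.split())
    # Keep words whose rate (occurrences / hour) lies inside the range.
    return sorted(
        word
        for word, n in counts.items()
        if min_per_hour <= n / total_hours <= max_per_hour
    )
```

A list produced this way would still need the manual review mentioned above, since a frequency filter alone cannot check syllable count or whether a candidate is a well-formed expression.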
RATS Keyword Spotting is distributed via hard drive.
2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(2) English Web Treebank Propbank was developed by the University of Colorado Boulder - CLEAR (Computational Language and Education Research) and provides predicate-argument structure annotation for English Web Treebank (LDC2012T13).
The goal of Propbank (or proposition bank) annotation is to develop annotations with information about basic semantic propositions. English Web Treebank Propbank provides semantic role annotation and predicate sense disambiguation for roughly 50,000 predicates, corresponding to all verbs, all adjectives in equational clauses, and all nouns considered to be predicative. Mark-up is in the "unified" propbank annotation format, which combines representations for nouns, verbs, and adjectives. The source data consists of weblogs, newsgroups, email, reviews, and question-answers.
English Web Treebank Propbank is distributed via web download.
2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(3) Ancient Chinese Corpus was developed at Nanjing Normal University. It contains word-segmented and part-of-speech tagged text from Zuozhuan, an ancient Chinese work believed to date from the Warring States Period (475-221 BC). This release is part of a continuing project to develop a large, part-of-speech tagged ancient Chinese corpus. It consists of 180,000 Chinese characters and 195,000 segment units (including words and punctuation). The part-of-speech tag set was developed by Nanjing Normal University and contains 17 tags. The data is presented as UTF-8 plain text files in traditional Chinese script.
Ancient Chinese Corpus is distributed via web download.
2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(4) MWE-Aware English Dependency Corpus Version 2.0 was developed by the Nara Institute of Science and Technology Computational Linguistics Laboratory and consists of English compound function words annotated in dependency format. The data is derived from OntoNotes Release 5.0 (LDC2013T19).
Version 2.0 adds annotations of named entities (persons, locations, organizations) into dependency trees that are aware of compound function words. Version 1.0 is available from LDC as MWE-Aware English Dependency Corpus (LDC2017T01).
MWEs (multiword expressions) were identified in OntoNotes' phrase structure trees and each MWE was established as a single subtree. Those phrase structure subtrees were then converted to a dependency structure (the Stanford dependencies) in CoNLL format. The data consists of 1,728 phrase structure trees in *.parse files and a single 14-column, tab-separated dependency file (*.conll). Both file types are encoded as UTF-8.
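As a rough illustration of how a 14-column, tab-separated dependency file might be loaded, here is a minimal sketch. It assumes, as is common for CoNLL-style files but not confirmed by this announcement, that each token occupies one line and that sentences are separated by blank lines. The function name `read_conll` is hypothetical, and no meaning is assigned to individual columns because the column layout is not described here.

```python
def read_conll(lines, expected_columns=14):
    """Group tab-separated token rows into sentences.

    Assumes blank lines separate sentences (an assumption, not a documented
    property of the corpus). Each token is returned as a list of strings.
    """
    sentences, current = [], []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the sentence collected so far.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) != expected_columns:
            raise ValueError(
                f"line {line_no}: expected {expected_columns} columns, "
                f"got {len(fields)}"
            )
        current.append(fields)
    # Keep a final sentence that is not followed by a blank line.
    if current:
        sentences.append(current)
    return sentences
```

The function accepts any iterable of lines, so it can be called on a file object opened with `encoding="utf-8"`, matching the encoding stated above. Checking the column count on every row is a deliberate choice: a mismatch usually signals a wrong file or a damaged line, and failing early is easier to debug than silently misaligned fields.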
MWE-Aware English Dependency Corpus Version 2.0 is distributed via web download.
2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.