In this newsletter:
Recent Collaborations
New publications:
______________________________________________________________
Recent Collaborations
Collaborations play an important role in many LDC
activities. Over the past twenty-five years, LDC has partnered, consulted, and
otherwise “collaborated” with a variety of organizations to advance research
community goals. Recently, LDC partnered with Oxford Wave Research to integrate its latest speech
technology into data collection and annotation processes. LDC also supports the
Hearables Challenge sponsored by the National Science
Foundation by creating and distributing training and test corpora. Finally, LDC
Executive Director Chris Cieri is working with international colleagues to plan
LREC2018 as a member of the Conference Programme Committee.
LDC welcomes new collaborations. Let us know what interests
you and how we can work together. Contact LDC
to begin the conversation.
New publications:
(1) IARPA Babel Lao
Language Pack IARPA-babel203b-v3.1a was developed by Appen for the
IARPA (Intelligence Advanced Research Projects Activity) Babel program. It
contains approximately 207 hours of
Lao conversational and scripted telephone speech collected in 2013 along with
corresponding transcripts.
The Babel program focuses on underserved languages and
seeks to develop speech recognition technology that can be rapidly applied to
any human language to support keyword search performance over large amounts of
recorded speech.
The Lao speech in this release represents that spoken in the Vientiane
dialect region in Laos. The
gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 60
years. Calls were made using different telephones (e.g., mobile,
landline) from a variety of environments including the street, a home or
office, a public place, and inside a vehicle.
IARPA Babel Lao Language Pack IARPA-babel203b-v3.1a is distributed via web
download.
2017 Subscription Members will receive copies of this
corpus provided they have submitted
a completed copy of the special license agreement. 2017 Standard Members
may request a copy as part of their 16 free membership corpora. Non-members may
license this data for a fee.
*
(2) Multi-Language
Conversational Telephone Speech 2011 -- Turkish was developed by LDC
and consists of approximately 18 hours of telephone speech in Turkish. The
data was collected primarily to support research and technology evaluation in
automatic language identification, and portions of these telephone calls were
used in the NIST 2011 Language Recognition Evaluation (LRE).
Participants were recruited by native speakers who contacted
acquaintances in their social networks. Each native speaker made one call of up
to 15 minutes to each acquaintance. The data was collected using LDC's
telephone collection infrastructure, which comprises three computer telephony
systems. Human auditors labeled calls for callee gender, dialect type and
noise. Demographic information about the participants was not collected.
LDC has also released Multi-Language Conversational
Telephone Speech 2011 -- Slavic Group (LDC2016S11).
Multi-Language Conversational Telephone Speech 2011 --
Turkish is distributed via web download.
2017 Subscription Members will
automatically receive copies of this corpus. 2017 Standard Members may request
a copy as part of their 16 free membership corpora. Non-members may license
this data for a fee.
*
(3) Phrase
Detectives Corpus was developed by the School
of Computer Science and Electronic Engineering at the University of Essex and
consists of approximately 19,012 words across 40 documents
anaphorically annotated by the Phrase Detectives game,
an online interactive "game-with-a-purpose" designed to collect data
about English anaphoric coreference.
The documents in the corpus are taken from Wikipedia articles
and from narrative text in Project
Gutenberg. The annotations comprise a gold standard version created by
multiple experts and a set created by a large non-expert crowd (via the
Phrase Detectives game).
The data was annotated according to a prevalent
linguistically-oriented approach for anaphora used in several tasks, including
OntoNotes Release 5.0 (LDC2013T19),
SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple
Languages (LDC2011T01)
and The ARRAU Corpus of Anaphoric Information (LDC2013T22).
Phrase Detectives Corpus is distributed via web download.
2017 Subscription Members will
automatically receive copies of this corpus. 2017 Standard Members may request
a copy as part of their 16 free membership corpora. Non-members may license
this data at no cost.
*
(4) The
EventStatus Corpus was developed by researchers at Texas A&M University, Stanford University and The University of Utah. It consists of
approximately 3,000 English and 1,500 Spanish news articles about civil unrest
events annotated with temporal tags.
This corpus was designed to support the study of the
temporal and aspectual properties of major events, that is, whether an event
has already happened, is currently happening or may happen in the future. Since
it focuses on a single domain (civil unrest events), it may be appropriate for
tasks such as event extraction and temporal question answering.
The relevant news articles were sourced from English
Gigaword Fifth Edition (LDC2011T07)
and Spanish Gigaword Third Edition (LDC2011T12). The civil
unrest events include protests, demonstrations, marches and strikes.
The EventStatus Corpus is distributed via web download.
2017 Subscription Members will
automatically receive copies of this corpus. 2017 Standard Members may request
a copy as part of their 16 free membership corpora. Non-members may license
this data for a fee.