In this newsletter:
New Publications:
__________________________________________________________________________
(1) BOLT Information Retrieval Comprehensive Training and Evaluation was
developed by LDC and consists of all data produced in support of the
Information Retrieval (IR) task within the DARPA Broad Operational Language
Translation (BOLT) Program, including annotations, source documents and
scoring software.
The BOLT IR task sought to support development of
systems that could take as input a natural language English query sentence,
return relevant responses to that query from a large corpus of informal
documents in the three BOLT languages (Arabic, Chinese, and English) and
translate responses from non-English documents into English. This release
contains (1)
natural-language IR queries, system responses to queries, and
manually-generated assessment judgments for system responses; (2) discussion
forum source documents in Arabic, Chinese and English; (3) scoring software for
each evaluation phase; and (4) experimental data developed in Phase 2.
BOLT Information Retrieval Comprehensive Training and
Evaluation is distributed via web download.
2018 Subscription Members will automatically receive copies
of this corpus. 2018 Standard Members may request a copy as part of their 16
free membership corpora. Non-members may license this data for a fee.
*
(2) HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation was
developed by LDC and comprises approximately 53 hours of user-generated
videos with
annotation and metadata. To advance multimodal event detection and related
technologies, LDC developed, in collaboration with NIST (the National Institute of Standards and
Technology), a large, heterogeneous, annotated multimodal corpus for HAVIC
(the Heterogeneous Audio Visual Internet Collection) that was used in the
NIST-sponsored MED (Multimedia Event Detection) task for several years.
HAVIC MED Event E051-E060 is a subset of that corpus: a collection of event
videos for the HAVIC Project originally released to support the 2016
Multimedia Event Detection task.
The data consists of videos of various events (event videos)
and videos completely unrelated to events (background videos) harvested by a
large team of human annotators. Each event video was manually annotated with a
set of judgments describing its event properties and other salient features.
Background videos were labeled with topic and genre categories.
HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation
is distributed via web download.
2018 Subscription Members will automatically receive copies
of this corpus. 2018 Standard Members may request a copy as part of their 16
free membership corpora. Non-members may license this data for a fee.
*
(3) Multi-Language Conversational Telephone Speech 2011 -- Spanish was
developed by LDC and comprises approximately 23 hours of telephone speech
in Spanish.
The data were collected primarily to support research and
technology evaluation in automatic language identification, and portions of
these telephone calls were used in the NIST 2011 Language Recognition
Evaluation (LRE).
Participants were recruited by native speakers who contacted acquaintances in
their social network. Those native speakers made one call of up to 15 minutes
to each acquaintance. Human auditors labeled the calls for callee gender, dialect
type, and noise.
LDC has also released the following as part of the
Multi-Language Conversational Telephone Speech 2011 series:
- Slavic Group (LDC2016S11)
- Turkish (LDC2017S09)
- South Asian (LDC2017S14)
- Central Asian (LDC2018S03)
- Central European (LDC2018S08)
Multi-Language Conversational Telephone Speech 2011 -- Spanish
is distributed via web download.
2018 Subscription Members will automatically receive copies
of this corpus. 2018 Standard Members may request a copy as part of their 16
free membership corpora. Non-members may license this data for a fee.
*
(4) IARPA Babel Kazakh Language Pack IARPA-babel302b-v1.0a was developed by
Appen for the IARPA
(Intelligence Advanced Research Projects Activity) Babel program.
It contains approximately 203 hours of Kazakh conversational and scripted
telephone speech collected in 2013 and 2014 along with corresponding
transcripts.
The Kazakh speech in this release represents that spoken in
the Northeastern and Southern dialect regions of Kazakhstan. The gender
distribution among speakers is approximately equal; speakers' ages range from
16 years to 64 years. Calls were made using different telephones (e.g., mobile,
landline) from a variety of environments including the street, a home or
office, a public place, and inside a vehicle.
IARPA Babel Kazakh Language Pack IARPA-babel302b-v1.0a is
available via web download.
2018 Subscription Members will receive copies of this corpus
provided they have submitted a completed copy of the special license agreement.
2018 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.