LDC at ACL 2017
Fall 2017 Data Scholarship Program
New corpora:
BOLT English Discussion Forums
IARPA Babel Tamil Language Pack IARPA-babel204b-v1.1b
KSUEmotions
Metalogue Multi-Issue Bargaining Dialogue
_________________________________________________________________________
LDC at ACL 2017: July 31-August 2, Vancouver, Canada
ACL has returned to North America, and LDC is taking this opportunity to interact with
top HLT researchers gathering in Vancouver, Canada. Stop by our exhibition table to
learn more about recent developments at the Consortium and new publications.
Fall 2017 Data Scholarship Program
Student applications for the Fall 2017 LDC Data Scholarship program are being accepted
now through Friday, September 15, 2017, 11:59PM EST. The LDC Data
Scholarship program provides university students with access to LDC data at no
cost. Students must complete an application consisting of a data use
proposal and a letter of support from their advisor.
New corpora
(1) BOLT English Discussion Forums was developed by
LDC and consists of 830,440 discussion forum threads in English harvested from
the Internet using a combination of manual and automatic processes.
The BOLT (Broad Operational Language Translation) program developed machine translation and
information retrieval for less formal genres, focusing particularly on
user-generated content in Chinese, Egyptian Arabic and English. The collected
data was translated and annotated for various tasks including word alignment,
treebanking, propbanking and co-reference.
The material in this release represents the unannotated English source data in the
discussion forum genre.
Collection was seeded based on the results of manual data scouting by native
speaker annotators. When multiple threads from a forum were submitted, the
entire forum was automatically harvested and added to the collection. Only a
small portion of the threads included in this release were manually reviewed,
and it is expected that there may be some offensive or otherwise undesired
content as well as some threads that contain a large amount of non-English
content. Language identification was performed on all threads in this corpus
(using CLD2).
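For users who want to screen out the threads with a large amount of non-English content, the following is a minimal sketch of one possible approach, not the procedure LDC used. It assumes each thread is available as a list of post strings and that a language detector is supplied as a callable returning a language code and a reliability flag (a CLD2 binding could fill that role). The names non_english_fraction, flag_threads and the 0.5 threshold are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical detector interface: text -> (language_code, is_reliable).
# A wrapper around a CLD2 binding could be plugged in here.
Detector = Callable[[str], Tuple[str, bool]]


def non_english_fraction(posts: Iterable[str], detect: Detector) -> float:
    """Return the share of characters in posts not reliably detected as English."""
    total = 0
    non_english = 0
    for post in posts:
        length = len(post)
        if length == 0:
            continue  # empty posts carry no language evidence
        code, reliable = detect(post)
        total += length
        if not (reliable and code == "en"):
            non_english += length
    return non_english / total if total else 0.0


def flag_threads(
    threads: Dict[str, List[str]],
    detect: Detector,
    threshold: float = 0.5,  # assumed cutoff, chosen only for illustration
) -> List[str]:
    """Return IDs of threads whose non-English fraction exceeds the threshold."""
    return [
        thread_id
        for thread_id, posts in threads.items()
        if non_english_fraction(posts, detect) > threshold
    ]
```

Weighting by character count rather than by post count keeps a single long non-English post from being masked by several short English replies. The threshold should be tuned to the intended use.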
BOLT English Discussion Forums is distributed via web
download.
2017 Subscription Members will automatically receive
copies of this corpus. 2017 Standard Members may request a copy as part of
their 16 free membership corpora. Non-members may license this data for a fee.
*
(2) IARPA Babel Tamil Language Pack IARPA-babel204b-v1.1b was
developed by Appen for the IARPA (Intelligence Advanced Research Projects
Activity) Babel program. It contains 200 hours of Tamil
conversational and scripted telephone speech collected in 2012 and 2013 along
with corresponding transcripts.
The Babel program focuses on underserved languages and
seeks to develop speech recognition technology that can be rapidly applied to
any human language to support keyword search performance over large amounts of
recorded speech.
The Tamil speech in this release represents that spoken in the Northern, Central,
Southern and Western
dialect regions of the Indian state of Tamil Nadu. The gender distribution
among speakers is approximately equal; speakers' ages range from 16 years to 65
years. Calls were made using different telephones (e.g., mobile, landline) from
a variety of environments including the street, a home or office, a public
place, and inside a vehicle.
IARPA Babel Tamil Language Pack
IARPA-babel204b-v1.1b is distributed via web download.
2017 Subscription Members will receive copies of this
corpus provided they have submitted
a completed copy of the special license agreement. 2017 Standard Members
may request a copy as part of their 16 free membership corpora. Non-members may
license this data for a fee.
*
(3) KSUEmotions was developed by King Saud University (KSU) and contains
approximately five hours of emotional Modern Standard Arabic (MSA)
speech from 23 subjects. Speakers were from three countries: Yemen, Saudi
Arabia and Syria.
Subjects read MSA sentences from newswire text in the following emotions: neutral,
anger, sadness, happiness, surprise, and interrogative (asking a question). Human
reviewers then listened to the recordings to identify the emotion they heard. Audio
was recorded in each participant's home.
KSUEmotions is distributed via web download.
2017 Subscription Members will automatically receive
copies of this corpus. 2017 Standard Members may request a copy as part of
their 16 free membership corpora. Non-members may license this data for a fee.
*
(4) Metalogue Multi-Issue Bargaining Dialogue was developed by the Metalogue Consortium
under the European Community's Seventh Framework Programme for Research and Technological
Development. This release consists of approximately 2.5 hours of semantically annotated
English dialogue data that includes speech and transcripts.
The goal of the Metalogue project was to develop a dialogue system with flexible dialogue
management that supports the system's behavior in setting goals, choosing
strategies and monitoring various processes. Six subjects
(undergraduates between 19 and 25 years of age) were involved in a multi-issue
bargaining scenario in which a representative of a city council and a
representative of small business owners negotiated the implementation of new
anti-smoking regulations. The negotiation involved four issues, each with four
or five options. Participants received a preference profile for each scenario
and negotiated for an agreement with the highest value based on their preference
information. Negotiators were not allowed to accept an agreement with a
negative value or to share their preference profiles with other participants.
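As an illustration of the scenario rules, here is a minimal sketch of how an agreement might be scored against one negotiator's preference profile. It assumes the value of an agreement is the sum of the values of the options chosen on each issue. The release description does not state that formula, and the issue names, option names and numbers below are invented for the example.

```python
from typing import Mapping

# Hypothetical preference profile: issue -> option -> value for one negotiator.
Profile = Mapping[str, Mapping[str, int]]

# Invented example with four issues, each offering four or five options.
EXAMPLE_PROFILE: Profile = {
    "issue_a": {"a1": 4, "a2": 2, "a3": 0, "a4": -3},
    "issue_b": {"b1": -2, "b2": 0, "b3": 1, "b4": 3, "b5": 5},
    "issue_c": {"c1": 3, "c2": 1, "c3": -1, "c4": -4},
    "issue_d": {"d1": -5, "d2": -2, "d3": 0, "d4": 2},
}


def agreement_value(profile: Profile, agreement: Mapping[str, str]) -> int:
    """Sum the profile values of the chosen options (additive scoring is an assumption)."""
    missing = set(profile) - set(agreement)
    if missing:
        raise ValueError(f"agreement lacks a choice for: {sorted(missing)}")
    return sum(profile[issue][agreement[issue]] for issue in profile)


def is_acceptable(profile: Profile, agreement: Mapping[str, str]) -> bool:
    """Apply the scenario rule that a negative-valued agreement may not be accepted."""
    return agreement_value(profile, agreement) >= 0
```

With the invented profile above, choosing a1, b3, c2 and d3 scores 4 + 1 + 1 + 0 = 6 and is acceptable, while choosing a4, b1, c3 and d1 scores -11 and would have to be rejected.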
The dialogue speech was captured with two headset microphones and saved in 16 kHz, 16-bit
mono linear PCM FLAC format. Transcripts were produced semi-automatically, using an
automatic speech recognizer followed by manual correction.
Metalogue Multi-Issue Bargaining Dialogue is distributed via
web download.
2017 Subscription Members will automatically receive
copies of this corpus. 2017 Standard Members may request a copy as part of
their 16 free membership corpora. Non-members may license this data for a fee.