LDC at ACL 2017
Fall 2017 Data Scholarship Program
New corpora:
BOLT English Discussion Forums
IARPA Babel Tamil Language Pack IARPA-babel204b-v1.1b
KSUEmotions
Metalogue Multi-Issue Bargaining Dialogue
_________________________________________________________________________
LDC at ACL 2017: July 31-August 2, Vancouver, Canada
ACL has returned to North America, and LDC is taking this opportunity to interact with top HLT researchers gathering in Vancouver, Canada. Stop by our exhibition table to learn more about recent developments at the Consortium and new publications.
Fall 2017 Data Scholarship Program
Student applications for the Fall 2017 LDC Data Scholarship program are being accepted now through Friday, September 15, 2017, 11:59 PM EST. The LDC Data Scholarship program provides university students with access to LDC data at no cost. Students must complete an application consisting of a data use proposal and a letter of support from their advisor.
For more information on application requirements and program rules, please visit the LDC Data Scholarship page.
New corpora
(1) BOLT English Discussion Forums
The BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.
The material in this release represents the unannotated English source data in the discussion forum genre. Collection was seeded based on the results of manual data scouting by native speaker annotators. When multiple threads from a forum were submitted, the entire forum was automatically harvested and added to the collection. Only a small portion of the threads included in this release were manually reviewed, and it is expected that there may be some offensive or otherwise undesired content as well as some threads that contain a large amount of non-English content. Language identification was performed on all threads in this corpus (using CLD2).
BOLT English Discussion Forums is distributed via web download.
2017 Subscription Members will automatically receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(2) IARPA Babel Tamil Language Pack IARPA-babel204b-v1.1b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains 200 hours of Tamil conversational and scripted telephone speech collected in 2012 and 2013 along with corresponding transcripts.
The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.
The Tamil speech in this release represents that spoken in the Northern, Central, Southern and Western dialect regions of the Indian state of Tamil Nadu. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 65 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.
IARPA Babel Tamil Language Pack IARPA-babel204b-v1.1b is distributed via web download.
2017 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(3) KSUEmotions was developed by King Saud University (KSU) and contains approximately five hours of emotional Modern Standard Arabic (MSA) speech from 23 subjects. Speakers were from three countries: Yemen, Saudi Arabia and Syria.
KSUEmotions is distributed via web download.
2017 Subscription Members will automatically receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(4) Metalogue Multi-Issue Bargaining Dialogue
The goal of the Metalogue project was to develop a dialogue system with flexible dialogue management that supports the system's behavior in setting goals, choosing strategies and monitoring various processes. Six unique subjects (undergraduates between 19 and 25 years of age) were involved in a multi-issue bargaining scenario in which a representative of a city council and a representative of small business owners negotiated the implementation of new anti-smoking regulations. The negotiation involved four issues, each with four or five options. Participants received a preference profile for each scenario and negotiated for an agreement with the highest value based on their preference information. Negotiators were not allowed to accept an agreement with a negative value or to share their preference profiles with other participants.
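The scoring rule described above can be illustrated with a minimal Python sketch. It rests on assumptions that are not part of the corpus documentation: the issue and option names, the numeric values, and the idea that an agreement's value is the simple sum of the chosen options' values are all hypothetical and used only for illustration.

```python
from typing import Dict

# Hypothetical preference profile for one negotiator:
# issue -> option -> value. Names and numbers are invented for illustration.
Profile = Dict[str, Dict[str, int]]

EXAMPLE_PROFILE: Profile = {
    "issue_1": {"a": 4, "b": 1, "c": -2, "d": -4},
    "issue_2": {"a": 3, "b": 0, "c": -1, "d": -3, "e": -5},
    "issue_3": {"a": 2, "b": 1, "c": -1, "d": -2},
    "issue_4": {"a": 5, "b": 2, "c": -2, "d": -4},
}


def agreement_value(profile: Profile, agreement: Dict[str, str]) -> int:
    """Sum the values of the chosen options (additive scoring is an assumption)."""
    if set(agreement) != set(profile):
        raise ValueError("an agreement must pick exactly one option per issue")
    return sum(profile[issue][option] for issue, option in agreement.items())


def may_accept(profile: Profile, agreement: Dict[str, str]) -> bool:
    """Negotiators may not accept an agreement with a negative value."""
    return agreement_value(profile, agreement) >= 0


compromise = {"issue_1": "b", "issue_2": "c", "issue_3": "a", "issue_4": "b"}
bad_deal = {"issue_1": "d", "issue_2": "e", "issue_3": "d", "issue_4": "d"}
```

With this profile, `compromise` has value 4 and is acceptable, while `bad_deal` has value -15 and must be rejected. Each negotiator would hold a separate, private profile, since profiles could not be shared.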
The dialogue speech was captured with two headset microphones and saved in 16kHz, 16-bit mono linear PCM FLAC format. Transcripts were produced semi-automatically, using an automatic speech recognizer followed by manual correction.
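As a rough way to confirm that a downloaded file matches the stated audio format, the sketch below reads the STREAMINFO metadata block that opens a standard FLAC stream, using only the Python standard library. The function names are hypothetical, and the sketch assumes the file starts directly with the `fLaC` marker (no prepended tags).

```python
def read_flac_streaminfo(data: bytes) -> dict:
    """Parse the STREAMINFO block from the first 42 bytes of a FLAC stream."""
    if data[:4] != b"fLaC":
        raise ValueError("not a FLAC stream (missing 'fLaC' marker)")
    block_type = data[4] & 0x7F  # low 7 bits hold the block type; 0 = STREAMINFO
    block_len = int.from_bytes(data[5:8], "big")
    body = data[8:8 + 34]
    if block_type != 0 or block_len < 34 or len(body) < 34:
        raise ValueError("first metadata block is not a complete STREAMINFO")
    # Bytes 10..17 of the block pack four fields into 64 bits:
    # 20-bit sample rate, 3-bit (channels - 1),
    # 5-bit (bits per sample - 1), 36-bit total sample count.
    packed = int.from_bytes(body[10:18], "big")
    return {
        "sample_rate": packed >> 44,
        "channels": ((packed >> 41) & 0x7) + 1,
        "bits_per_sample": ((packed >> 36) & 0x1F) + 1,
        "total_samples": packed & ((1 << 36) - 1),
    }


def matches_stated_format(info: dict) -> bool:
    """Check for 16 kHz, 16-bit, mono audio as described for this corpus."""
    return (
        info["sample_rate"] == 16000
        and info["bits_per_sample"] == 16
        and info["channels"] == 1
    )
```

To use it, open a file in binary mode, pass the first 42 bytes to `read_flac_streaminfo`, and hand the result to `matches_stated_format`. For decoding the audio itself, a dedicated FLAC library would be needed; this sketch only inspects the header.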
Metalogue Multi-Issue Bargaining Dialogue is distributed via web download.
2017 Subscription Members will automatically receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.