Linguistic Data Consortium: July 2017

LDC at ACL 2017

Fall 2017 Data Scholarship Program

New corpora:

IARPA Babel Tamil Language Pack IARPA-babel204b-v1.1b
KSUEmotions
Metalogue Multi-Issue Bargaining Dialogue
_________________________________________________________________________

LDC at ACL 2017: July 31-August 2, Vancouver, Canada

ACL has returned to North America and LDC is taking this opportunity to interact with top HLT researchers gathering in Vancouver, Canada. Stop by our exhibition table to learn more about recent developments at the Consortium and new publications.

Fall 2017 Data Scholarship Program

Student applications for the Fall 2017 LDC Data Scholarship program are being accepted now through Friday, September 15, 2017, 11:59PM EST. The LDC Data Scholarship program provides university students with access to LDC data at no cost. Students must complete an application which consists of a data use proposal and letter of support from their advisor.

For more information on application requirements and program rules, please visit the LDC Data Scholarship page.

Applicants can email their materials to the LDC Data Scholarship program.

New corpora

(1) BOLT English Discussion Forums was developed by LDC and consists of 830,440 discussion forum threads in English harvested from the Internet using a combination of manual and automatic processes.

The BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.

The material in this release represents the unannotated English source data in the discussion forum genre. Collection was seeded based on the results of manual data scouting by native speaker annotators. When multiple threads from a forum were submitted, the entire forum was automatically harvested and added to the collection. Only a small portion of the threads included in this release were manually reviewed, and it is expected that there may be some offensive or otherwise undesired content as well as some threads that contain a large amount of non-English content. Language identification was performed on all threads in this corpus (using CLD2).

BOLT English Discussion Forums is distributed via web download.

2017 Subscription Members will automatically receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) IARPA Babel Tamil Language Pack IARPA-babel204b-v1.1b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains 200 hours of Tamil conversational and scripted telephone speech collected in 2012 and 2013 along with corresponding transcripts.

The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

The Tamil speech in this release represents that spoken in the Northern, Central, Southern and Western dialect regions of the Indian state of Tamil Nadu. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 65 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Tamil Language Pack IARPA-babel204b-v1.1b is distributed via web download.

2017 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) KSUEmotions was developed by King Saud University (KSU) and contains approximately five hours of emotional Modern Standard Arabic (MSA) speech from 23 subjects. Speakers were from three countries: Yemen, Saudi Arabia and Syria.

Subjects read MSA sentences from newswire text in the following emotions: neutral, anger, sadness, happiness, surprise, and interrogative (asking a question). Human reviewers then listened to the recordings to identify the emotion they heard. Audio was recorded in each participant's home.

KSUEmotions is distributed via web download.

(4) Metalogue Multi-Issue Bargaining Dialogue was developed by the Metalogue Consortium under the European Community's Seventh Framework Programme for Research and Technological Development. This release consists of approximately 2.5 hours of semantically annotated English dialogue data that includes speech and transcripts.

The goal of the Metalogue project was to develop a dialogue system with flexible dialogue management to enable the system's behavior in setting goals, choosing strategies and monitoring various processes. Six unique subjects (undergraduates between 19 and 25 years of age) were involved in a multi-issue bargaining scenario in which a representative of a city council and a representative of small business owners negotiated the implementation of new anti-smoking regulations. The negotiation involved four issues, each with four or five options. Participants received a preference profile for each scenario and negotiated for an agreement with the highest value based on their preference information. Negotiators were not allowed to accept an agreement with a negative value or to share their preference profiles with other participants.

The dialogue speech was captured with two headset microphones and saved in 16kHz, 16-bit mono linear PCM FLAC format. Transcripts were produced semi-automatically, using an automatic speech recognizer followed by manual correction.

Metalogue Multi-Issue Bargaining Dialogue is distributed via web download.

Linguistic Data Consortium

Wednesday, July 19, 2017

LDC July 2017 Newsletter