Linguistic Data Consortium: LDC May 2017 Newsletter

In this newsletter:

Recent Collaborations

New publications:

IARPA Babel Lao Language Pack IARPA-babel203b-v3.1a

Multi-Language Conversational Telephone Speech 2011 -- Turkish

Phrase Detectives Corpus

The EventStatus Corpus

______________________________________________________________

Recent Collaborations

Collaborations play an important role in many LDC activities. Over the past twenty-five years, LDC has partnered, consulted, and otherwise “collaborated” with a variety of organizations to advance research community goals. Recently, LDC partnered with Oxford Wave Research to integrate its latest speech technology into data collection and annotation processes. LDC also supports the Hearables Challenge sponsored by the National Science Foundation by creating and distributing training and test corpora. Finally, LDC Executive Director Chris Cieri is working with international colleagues to plan LREC2018 as a member of the Conference Programme Committee.

LDC welcomes new collaborations. Let us know what interests you and how we can work together. Contact LDC to begin the conversation.

New publications:

(1) IARPA Babel Lao Language Pack IARPA-babel203b-v3.1a was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 207 hours of Lao conversational and scripted telephone speech collected in 2013 along with corresponding transcripts.

The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

The Lao speech in this release represents that spoken in the Vientiane dialect region in Laos. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 60 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Lao Language Pack IARPA-babel203b-v3.1a is distributed via web download.

2017 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) Multi-Language Conversational Telephone Speech 2011 -- Turkish was developed by LDC and comprises approximately 18 hours of telephone speech in Turkish. The data was collected primarily to support research and technology evaluation in automatic language identification, and portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation (LRE).

Participants were recruited by native speakers who contacted acquaintances in their social network. Those native speakers made one call, up to 15 minutes, to each acquaintance. The data was collected using LDC's telephone collection infrastructure, comprised of three computer telephony systems. Human auditors labeled calls for callee gender, dialect type and noise. Demographic information about the participants was not collected.

LDC has also released Multi-Language Conversational Telephone Speech 2011 -- Slavic Group (LDC2016S11).

Multi-Language Conversational Telephone Speech 2011 -- Turkish is distributed via web download.

2017 Subscription Members will automatically receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) Phrase Detectives Corpus was developed by the School of Computer Science and Electronic Engineering at the University of Essex and consists of approximately 19,012 words across 40 documents anaphorically annotated by the Phrase Detectives game, an online interactive "game-with-a-purpose" designed to collect data about English anaphoric coreference.

The documents in the corpus are taken from Wikipedia articles and from narrative text in Project Gutenberg. The annotations comprise a gold standard version created by multiple experts, as well as a set created by a large non-expert crowd (via the Phrase Detectives game).

The data was annotated according to a prevalent linguistically oriented approach for anaphora used in several tasks, including OntoNotes Release 5.0 (LDC2013T19), SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages (LDC2011T01) and The ARRAU Corpus of Anaphoric Information (LDC2013T22).

Phrase Detectives Corpus is distributed via web download.

(4) The EventStatus Corpus was developed by researchers at Texas A&M University, Stanford University and The University of Utah. It consists of approximately 3,000 English and 1,500 Spanish news articles about civil unrest events annotated with temporal tags.

This corpus was designed to support the study of the temporal and aspectual properties of major events, that is, whether an event has already happened, is currently happening or may happen in the future. Since it focuses on a single domain (civil unrest events), it may be appropriate for tasks such as event extraction and temporal question answering.

The relevant news articles were sourced from English Gigaword Fifth Edition (LDC2011T07) and Spanish Gigaword Third Edition (LDC2011T12). The civil unrest events include protests, demonstrations, marches and strikes.

The EventStatus Corpus is distributed via web download.

Linguistic Data Consortium

Monday, May 15, 2017
