Linguistic Data Consortium: HAVIC MED Event

Monday, April 15, 2019

LDC 2019 April Newsletter

LDC at ICASSP 2019

LDC data and commercial technology development

New Publications:
BOLT Egyptian-English Word Alignment -- Discussion Forum Training
Chinese Abstract Meaning Representation 1.0
HAVIC MED Progress Test -- Videos, Metadata and Annotation

____________________________________________________________

LDC at ICASSP 2019
LDC will be exhibiting at ICASSP 2019, held this year May 12-17 in Brighton, UK. Stop by booth 5 to learn more about recent developments at the Consortium and new publications.

LDC will post conference updates via our Twitter feed and Facebook page. We hope to see you there!

LDC data and commercial technology development

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

(1) BOLT Egyptian-English Word Alignment -- Discussion Forum Training was developed by LDC and consists of 400,448 words of Egyptian Arabic and English parallel text enhanced with linguistic tags to indicate word relations.

The source data in this release consists of discussion forum threads harvested from the Internet by LDC using a combination of manual and automatic processes and is released as BOLT Arabic Discussion Forums (LDC2018T10).

The BOLT word alignment task was built on treebank annotation. Egyptian source tree tokens for word alignment were automatically extracted from tree files of BOLT Egyptian Arabic Treebank annotation on the discussion forum data. Human annotators then followed LDC guidelines to link words and phrases in Arabic to those in English.

BOLT Egyptian-English Word Alignment -- Discussion Forum Training is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) Chinese Abstract Meaning Representation 1.0 was developed by Brandeis University and Nanjing Normal University and consists of semantic representations of a set of Chinese sentences from the weblog and discussion forum portions of Chinese Treebank 8.0 (LDC2013T21). Annotations were applied to 10,149 sentences; a further 176 sentences were left unannotated.

Abstract Meaning Representation (AMR) captures "who is doing what to whom" in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree structure. Chinese AMR is based on the annotation methodology developed for English with adaptations for handling specific Chinese phenomena. The goal of the Chinese AMR project is to create a large aligned AMR corpus, of which this data set is the first release. For more information about the project, see the Chinese AMR homepage.
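As a rough illustration of the "who is doing what to whom" idea, the sketch below encodes the standard English textbook AMR example, "The boy wants to go", as a list of triples and reads the agent of each predicate off the graph. The sentence, the triple layout and the helper names (`concepts`, `who_does_what`, `reentrant_nodes`) are assumptions made for illustration; they are not taken from the Chinese AMR corpus or its file format.

```python
# Illustrative only: "The boy wants to go" as AMR-style triples.
# PENMAN form: (w / want-01 :ARG0 (b / boy) :ARG1 (g / go-01 :ARG0 b))
# The triple layout and helper names below are hypothetical, not the corpus format.

TRIPLES = [
    ("w", "instance", "want-01"),  # w is a wanting event
    ("b", "instance", "boy"),      # b is a boy
    ("g", "instance", "go-01"),    # g is a going event
    ("w", "ARG0", "b"),            # the boy is the one who wants
    ("w", "ARG1", "g"),            # what is wanted is the going
    ("g", "ARG0", "b"),            # the boy is also the one who goes
]


def concepts(triples):
    """Map each variable to the concept it is an instance of."""
    return {src: tgt for src, role, tgt in triples if role == "instance"}


def who_does_what(triples):
    """Return {predicate concept: agent concept} for every ARG0 edge."""
    names = concepts(triples)
    return {
        names[src]: names[tgt]
        for src, role, tgt in triples
        if role == "ARG0"
    }


def reentrant_nodes(triples):
    """Return variables that fill a role in more than one place."""
    incoming = {}
    for src, role, tgt in triples:
        if role != "instance":
            incoming[tgt] = incoming.get(tgt, 0) + 1
    return sorted(var for var, count in incoming.items() if count > 1)


if __name__ == "__main__":
    print(who_does_what(TRIPLES))
    print(reentrant_nodes(TRIPLES))
```

In this toy graph the variable `b` is the agent of both predicates, which is how a single representation records that the person who wants and the person who goes are the same.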

Chinese Abstract Meaning Representation 1.0 is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) HAVIC MED Progress Test -- Videos, Metadata and Annotation was developed by LDC and consists of approximately 3,650 hours of user-generated videos with annotation and metadata.

In a collaboration with NIST (the National Institute of Standards and Technology) to advance multimodal event detection and related technologies, LDC developed a large, heterogeneous, annotated multimodal corpus for HAVIC (the Heterogeneous Audio Visual Internet Collection) that was used in the NIST-sponsored MED (Multimedia Event Detection) task for several years. HAVIC MED Progress Test is a subset of that corpus, specifically, a collection of event and background videos originally released to support the 2012-2015 MED tasks.

This release consists of videos of various events (event videos) and videos completely unrelated to events (background videos) harvested by a large team of human annotators. Each event video was manually annotated with a set of judgments describing its event properties and other salient features. Background videos were labeled with topic and genre categories.

HAVIC MED Progress Test -- Videos, Metadata and Annotation is distributed via hard drive.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. This corpus is a members-only release and is not available for non-member licensing. Contact ldc@ldc.upenn.edu for information about membership.

Monday, September 17, 2018

LDC 2018 September Newsletter

In this newsletter:

New Publications:

BOLT Information Retrieval Comprehensive Training and Evaluation

HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation

Multi-Language Conversational Telephone Speech 2011 -- Spanish

IARPA Babel Kazakh Language Pack IARPA-babel302b-v1.0a

__________________________________________________________________________

New publications:

(1) BOLT Information Retrieval Comprehensive Training and Evaluation was developed by LDC and consists of all data produced in support of the Information Retrieval (IR) task within the DARPA Broad Operational Language Translation (BOLT) Program, including annotations, source documents and scoring software.

The BOLT IR task sought to support development of systems that could take as input a natural language English query sentence, return relevant responses to that query from a large corpus of informal documents in the three BOLT languages (Arabic, Chinese, and English) and translate responses from non-English documents into English. This release contains (1) natural-language IR queries, system responses to queries, and manually-generated assessment judgments for system responses; (2) discussion forum source documents in Arabic, Chinese and English; (3) scoring software for each evaluation phase; and (4) experimental data developed in Phase 2.

BOLT Information Retrieval Comprehensive Training and Evaluation is distributed via web download.

2018 Subscription Members will automatically receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation was developed by LDC and consists of approximately 53 hours of user-generated videos with annotation and metadata.

In a collaboration with NIST (the National Institute of Standards and Technology) to advance multimodal event detection and related technologies, LDC developed a large, heterogeneous, annotated multimodal corpus for HAVIC (the Heterogeneous Audio Visual Internet Collection) that was used in the NIST-sponsored MED (Multimedia Event Detection) task for several years. HAVIC MED Event E051-E060 is a subset of that corpus, specifically, a collection of event videos originally released to support the 2016 MED task.

The data consists of videos of various events (event videos) and videos completely unrelated to events (background videos) harvested by a large team of human annotators. Each event video was manually annotated with a set of judgments describing its event properties and other salient features. Background videos were labeled with topic and genre categories.

HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation is distributed via web download.

(3) Multi-Language Conversational Telephone Speech 2011 -- Spanish was developed by LDC and consists of approximately 23 hours of telephone speech in Spanish.

The data were collected primarily to support research and technology evaluation in automatic language identification, and portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation (LRE). Participants were recruited by native speakers who contacted acquaintances in their social network. Those native speakers made one call of up to 15 minutes to each acquaintance. Human auditors labeled the calls for callee gender, dialect type, and noise.

LDC has also released the following as part of the Multi-Language Conversational Telephone Speech 2011 series:

Slavic Group (LDC2016S11)
Turkish (LDC2017S09)
South Asian (LDC2017S14)
Central Asian (LDC2018S03)
Central European (LDC2018S08)

Multi-Language Conversational Telephone Speech 2011 -- Spanish is distributed via web download.

(4) IARPA Babel Kazakh Language Pack IARPA-babel302b-v1.0a was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 203 hours of Kazakh conversational and scripted telephone speech collected in 2013 and 2014 along with corresponding transcripts.

The Kazakh speech in this release represents that spoken in the Northeastern and Southern dialect regions of Kazakhstan. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 64 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Kazakh Language Pack IARPA-babel302b-v1.0a is available via web download.

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.