Linguistic Data Consortium: HAVIC MED

Thursday, August 18, 2022

LDC August 2022 Newsletter

Fall 2022 LDC Data Scholarship Program

30th Anniversary Highlight: The LDC Gigawords

New publication:

HAVIC MED Novel 2 Test – Videos, Metadata and Annotation

Fall 2022 LDC Data Scholarship Program

Student applications for the Fall 2022 LDC Data Scholarship program are being accepted now through September 15, 2022. This program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and letter of support from their advisor. For application requirements and program rules, visit the LDC Data Scholarships page.

30th Anniversary Highlight: The LDC Gigawords

Giga: a combining form meaning “billion,” used in the formation of compound words (Source: https://www.dictionary.com/browse/giga-)

LDC’s Gigaword corpora are a natural outgrowth of its vast decades-long multi-language newswire collection. Newswire data was originally collected, annotated, and distributed for use in many sponsored projects and was also released through the LDC catalog in tailored data sets. Then came the idea of making LDC’s entire newswire collection available by language with a simple, minimal markup to support a broad range of NLP/HLT tasks. The first Arabic, Chinese and English gigaword editions were released in 2003; subsequent cumulative releases through fifth editions in 2011 represent LDC’s newswire collection spanning 1994-2010 in those languages. French and Spanish gigawords were first published in 2006, culminating in the release of third editions in 2011, likewise covering newswire collected by LDC through 2010.

The community has used, and continues to use, these data sets in numerous ways. Automatic text summarization is a favorite, and current work in this area applies deep learning principles (see, e.g., Gao et al. 2020, English). Gigawords are also useful for text source classification (Huang et al. 2003, Chinese), information extraction (Lan et al. 2020, Arabic), knowledge extraction and distributional semantics (Napoles et al. 2012, English) and natural language understanding (Ganitkevitch 2013, English), among other fields. Recent variations like the annotated and concretely annotated English gigawords add syntactic, semantic, and coreference annotations to this billion-word text collection.

All Gigaword corpora are available for licensing by Consortium members and non-members. Visit Obtaining Data for more information.

New publication:

HAVIC MED Novel 2 Test – Videos, Metadata and Annotation comprises 6,200 hours of user-generated videos with annotation and metadata developed by LDC for the 2015 NIST Multimedia Event Detection tasks. The data consists of videos of various events (event videos) and videos completely unrelated to events (background videos). Each event video was manually annotated with judgments describing its event properties and other salient features. Background videos were labeled with topic and genre categories.

HAVIC MED Novel 2 Test – Videos, Metadata and Annotation is distributed via web download.

2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. This corpus is a members-only release and is not available for non-member licensing. Contact ldc@ldc.upenn.edu for information about membership.

Wednesday, March 16, 2022

LDC March 2022 Newsletter

LDC data and commercial technology development

New publications:

AttImam

HAVIC MED Novel 1 Test – Videos, Metadata and Annotation

_______________________________________________________________

LDC data and commercial technology development

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

(1) AttImam was developed by Al-Imam Mohammad Ibn Saud Islamic University and consists of approximately 2,000 attribution relations applied to Arabic newswire text from Arabic Treebank: Part 1 v 4.1 (LDC2010T13). Attribution refers to the process of reporting or assigning an utterance to the correct speaker.

The source Arabic newswire was collected by LDC from Agence France Presse articles published in 2000. Files were annotated by native Arabic speakers and contain the following elements:

Cue: the lexical anchor that connects the source with the content.
Source: the entity or the agent that owns the content.
Content: the basic element expressing the claim or the reported news.
General Features: these can include such features as attribution style (direct or indirect), determinacy (factual or non-factual), and purpose (e.g., assertion, expression).

AttImam is distributed via web download.

2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) HAVIC MED Novel 1 Test – Videos, Metadata and Annotation comprises 3,800 hours of user-generated videos with annotation and metadata developed by LDC for the 2015 NIST Multimedia Event Detection tasks. The data consists of videos of various events (event videos) and videos completely unrelated to events (background videos). Each event video was manually annotated with judgments describing its event properties and other salient features. Background videos were labeled with topic and genre categories.

HAVIC MED Novel 1 Test – Videos, Metadata and Annotation is distributed via web download.