Fall 2022 LDC Data Scholarship Program
30th Anniversary Highlight: The LDC Gigawords
New publication:
HAVIC MED Novel 2 Test – Videos, Metadata and Annotation
Fall 2022 LDC Data Scholarship Program
Student applications for the Fall 2022 LDC Data Scholarship program are being accepted now through September 15, 2022. This program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and a letter of support from their advisor. For application requirements and program rules, visit the LDC Data Scholarships page.
30th Anniversary Highlight: The LDC Gigawords
Giga: a combining form meaning “billion,” used in the formation of compound words (Source: https://www.dictionary.com/browse/giga-)
The community has used, and continues to use, these data sets in numerous ways. Automatic text summarization is a favorite application, and current work in this area applies deep learning methods (see, e.g., Gao et al. 2020, English). Gigawords are also useful for text source classification (Huang et al. 2003, Chinese), information extraction (Lan et al. 2020, Arabic), knowledge extraction and distributional semantics (Napoles et al. 2012, English), and natural language understanding (Ganitkevitch 2013, English), among other areas. Recent variations such as the annotated and concretely annotated English Gigawords add syntactic, semantic, and coreference annotations to this billion-word text collection.
All Gigaword corpora are available for licensing by Consortium members and non-members. Visit Obtaining Data for more information.
New publication:
HAVIC MED Novel 2 Test – Videos, Metadata and Annotation comprises 6,200 hours of user-generated videos with annotation and metadata developed by LDC for the 2015 NIST Multimedia Event Detection tasks. The data consists of videos of various events (event videos) and videos completely unrelated to events (background videos). Each event video was manually annotated with judgments describing its event properties and other salient features. Background videos were labeled with topic and genre categories.
2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. This corpus is a members-only release and is not available for non-member licensing. Contact ldc@ldc.upenn.edu for information about membership.