LDC data and commercial technology development
New publications:
AIDA Scenario 1 Evaluation Topic Source Data, Annotation and Assessment
LORELEI Hindi Representative Language Pack
LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.
New publications:
Mixer 7 English Speech was developed by LDC and contains 12,321 hours of audio recordings of interviews, transcript readings and conversational telephone speech involving 222 distinct English speakers. This material was collected by LDC in 2010-2011 as part of the Mixer project, and the recordings were used in the 2012 NIST SRE test set.
Recruited speakers were connected through a robot operator to carry on casual conversations on a pre-set topic lasting up to 10 minutes. Participants also visited LDC’s human subjects collection lab equipped with a 14-microphone array where they participated in interviews and transcript readings and conducted telephone calls under varying conditions. Selected speaker metadata was also collected.
2025 members can access this corpus through their LDC accounts. This corpus is a members-only release and is not available for non-member licensing. Contact ldc@ldc.upenn.edu for information about membership.
AIDA Scenario 1 Evaluation Topic Source Data, Annotation and Assessment contains 10,522 documents, annotations for 386 of those documents, and assessment results covering 77,965 responses in 1,525 of those documents. Annotation was performed in three steps: (1) within-document labeling of scenario-related entities, relations and events; (2) cross-document coreference annotation, linking information elements to a knowledge base; and (3) indication of any relationship between labeled events or relations and hypotheses about the scenario. In the assessment phase, LDC annotators reviewed and judged system response files to give evaluation organizers a means of scoring submissions. Assessment tasks included zero-hop assessment, class-based assessment, graph assessment and hypothesis assessment.
The DARPA AIDA (Active Interpretation of Disparate Alternatives) program aimed to develop a multi-hypothesis semantic engine to generate explicit alternative interpretations of events, situations and trends from a variety of unstructured sources. LDC supported AIDA by collecting, creating and annotating multimodal linguistic resources in multiple languages.
2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.
The knowledge base used for entity linking annotation in LORELEI Hindi Representative Language Pack is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).
2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.