Linguistic Data Consortium: April 2024

Monday, April 15, 2024

LDC April 2024 Newsletter

New publications:

LoReHLT Hausa Representative Language Pack

AIDA Scenario 2 Practice Topic Source Data
_______________________________________________________________

New publications:

LoReHLT Hausa Representative Language Pack was developed by LDC and consists of approximately 4.4 million words of Hausa monolingual text, 86,000 Hausa words translated from English data, and 30 minutes of Hausa audio recordings. Approximately 96,000 words were annotated for named entities, and over 13,000 words received full entity annotation, including nominals and pronouns. Noun-phrase chunking was applied to more than 7,400 words. Over 9,600 words were labeled with simple semantic annotation. Topic annotation was applied to the audio recordings. Data was collected from discussion forums, news, reference sources, social networks, amateur web audio recordings, and weblogs.

LoReHLT was a companion project of the DARPA LORELEI program. The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low-resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

AIDA Scenario 2 Practice Topic Source Data was developed by LDC and consists of 1,500 root documents (text, image, and video) from English, Russian, and Spanish web sources. Each phase of the AIDA program centered on a specific scenario, or broad topic area, with related subtopics designated as either practice subtopics or evaluation subtopics. The Phase 2 scenario focused on the socioeconomic and political crisis in Venezuela since 2010. This corpus constitutes the full set of topic-focused documents for Phase 2 practice subtopics.

The DARPA AIDA (Active Interpretation of Disparate Alternatives) program aimed to develop a multi-hypothesis semantic engine to generate explicit alternative interpretations of events, situations and trends from a variety of unstructured sources. LDC supported AIDA by collecting, creating and annotating multimodal linguistic resources in multiple languages.

The knowledge base for entity detection and linking annotation for all AIDA Scenario 1 and 2 corpora is available separately as AIDA Scenario 1 and 2 Reference Knowledge Base (LDC2023T10).

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.