Linguistic Data Consortium: LDC April 2026 Newsletter

Wednesday, April 15, 2026

LDC April 2026 Newsletter

New publications:

DEFT Chinese and English Light and Rich ERE Parallel Annotation

MATERIAL Tagalog-English Language Pack

LORELEI Somali Representative Language Pack

____________________________________________________________________

New publications:

DEFT Chinese and English Light and Rich ERE Parallel Annotation was developed by LDC and consists of 179 Chinese discussion forum documents and their English translations annotated for entities, relations, and events (ERE). Light ERE annotation labels mentions of a target set of entity types, together with the relations and events between and among those entities, including coreference. Rich ERE annotation expands the types and tagging in the entity, relation, and event annotation tasks and replaces strict event coreference with a more loosely defined event hopper annotation. All 179 Chinese-English document pairs were annotated following Light ERE annotation guidelines; a subset of 171 pairs was also labeled with Rich ERE annotation. The source data and English translations were drawn from BOLT Chinese Discussion Forum Parallel Training Data (LDC2017T05), originally collected and translated by LDC under the DARPA BOLT program.

DARPA's Deep Exploration and Filtering of Text (DEFT) program aimed to address remaining capability gaps in state-of-the-art natural language processing technologies related to inference, causal relationships and anomaly detection. LDC supported the DEFT program by collecting, creating and annotating a variety of data sources.

2026 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

MATERIAL Tagalog-English Language Pack was developed by Appen for the IARPA MATERIAL program and contains 100 hours of Tagalog conversational telephone speech, transcripts, English translations, annotations and queries. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments. Transcripts cover approximately 30% of the speech files, 2% of which were translated into English. This release also includes domain annotations, English queries and their relevance annotations.

The MATERIAL program focused on underserved languages, with the ultimate goal of building cross-language information retrieval systems that find speech and text content using English search queries.

2026 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.

LORELEI Somali Representative Language Pack contains over 13 million words of Somali monolingual text, 800,000 words of which were translated into English, and 106,000 Somali words translated from English data. Approximately 73,000 words were annotated for simple named entities, around 23,000 words were annotated for full entities (including nominals and pronouns), and over 10,000 words were covered by noun phrase chunking annotation. Data was collected from discussion forums, news, reference material, social networks, and weblogs.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

2026 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
