Linguistic Data Consortium: LDC April 2025 Newsletter

Tuesday, April 15, 2025

LDC launches upgraded, mobile-friendly website

Connect with LDC on Bluesky

New publications:
DEFT Spanish Light and Rich ERE Annotation
MATERIAL Kazakh-English Language Pack

____________________________________________________________

LDC launches upgraded, mobile-friendly website
We are pleased to announce the launch of the newly upgraded LDC main website: https://www.ldc.upenn.edu/. Designed with a modern layout, the site now offers an improved experience across all devices. While the LDC Catalog, LDC user accounts, and LDC Submissions are not affected by this upgrade, they are now more accessible than ever from any page on the site. We invite you to explore the website and enjoy a smoother, more intuitive LDC web experience.

Connect with LDC on Bluesky
In addition to Facebook, X and LinkedIn, you can now connect with LDC on the microblogging platform Bluesky. Follow us today for the latest news, announcements and corpus releases from the Consortium.

New publications:

DEFT Spanish Light and Rich ERE Annotation was developed by LDC and consists of 158 Spanish discussion forum and newswire documents annotated for entities, relations, and events (ERE). Light ERE annotation labels entity mentions for a target set of entity types, along with the relations and events between and among those entities, including coreference. Rich ERE annotation expands the types and tagging in the entity, relation, and event annotation tasks and replaces strict event coreference with a more loosely defined event hopper annotation. The source data consists of Spanish newswire text and Latin American discussion forum data from DEFT Spanish Treebank LDC2018T01. Of the 158 documents, 128 were annotated following the Light ERE annotation guidelines and 154 were labeled with Rich ERE annotation; 124 documents carry both.

DARPA's Deep Exploration and Filtering of Text (DEFT) program aimed to address remaining capability gaps in state-of-the-art natural language processing technologies related to inference, causal relationships and anomaly detection. LDC supported the DEFT program by collecting, creating and annotating a variety of data sources.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

MATERIAL Kazakh-English Language Pack was developed by Appen for the IARPA MATERIAL program and contains 57 hours of Kazakh conversational telephone speech, transcripts, English translations, annotations, and queries. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments. Transcripts cover approximately 17% of the speech files, and all transcripts were translated into English. This release also includes English queries and their relevance annotations.

The MATERIAL program focused on underserved languages, with the ultimate goal of building cross-language information retrieval systems that find speech and text content using English search queries.

2025 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.
