LDC at Interspeech 2025
Fall 2025 LDC data scholarship program
New publications:
Mixer 6 – CHiME 8 Transcribed Calls and Interviews
Abstract Meaning Representation 2.0 – Machine Translations
KAIROS Phase 1 Quizlet
________________________________________________________
LDC will be exhibiting at Interspeech 2025, held this year August 17-21 in Rotterdam, the Netherlands. Stop by our booth to say hello and learn about the latest developments at the Consortium. Also be on the lookout for the following presentations, posters and special sessions featuring LDC work:
Comparative Evaluation of Acoustic Feature Extraction Tools for Clinical Speech Analysis
Monday, August 18, 11:00-13:00 - Area5-Oral1 - Speech Analysis, Detection and Classification 1
Reasoning-Based Approach with Chain-of-Thought for Alzheimer’s Detection Using Speech and Large Language Models
Tuesday, August 19, 13:30-15:30 - Area1-Poster2B - Databases and Progress in Methodology
Special Session: Challenges in Speech Collection, Curation and Annotation
Wednesday, August 20, 13:30-15:30 - Area14-SS7 – Part 1
Wednesday, August 20, 16:00-18:00 - Area14-SS8 – Part 2
TELVID: A Multilingual Multi-modal Corpus for Speaker Recognition
Thursday, August 21, 13:30-15:30 - Area4-Oral8 – Speaker Recognition
LDC also supported the Interspeech 2025 URGENT Challenge, which aims to bring more attention to constructing Universal, Robust and Generalizable speech EnhancemeNT models.
LDC will post conference updates via our social media platforms. We look forward to seeing you in Rotterdam!
Fall 2025 LDC data scholarship program
Student applications for the Fall 2025 LDC data scholarship program are being accepted now through September 15, 2025. This program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and letter of support from their advisor. For application requirements and program rules, visit the LDC Data Scholarships page.
New publications:
Mixer 6 - CHiME 8 Transcribed Calls and Interviews was developed for the 7th and 8th CHiME (Computational Hearing in Multisource Environments) challenges. It contains 80 hours of English interviews and telephone speech from Mixer 6 Speech (LDC2013S03), along with transcripts developed for the CHiME challenges, divided into training, development and test sets. This data was used in CHiME 7 Task 1 and CHiME 8 Task 1, both of which focused on transcription and segmentation across varied recording conditions, such as interviews, meetings, and dinner parties, with an emphasis on generalization across recording device types and array topologies.
The data includes audio from Mixer 6 Speech recorded on 13 microphones, for a total of 1,063 hours of audio (corresponding to 80 hours of speech). The development and test sets are speaker-disjoint from the training data and consist of fully transcribed, multi-microphone interviews. Each transcript segment is labeled with the speaker, the uttered text, and the segment's start and end times in seconds.
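As a rough illustration of the per-segment structure described above (speaker label, uttered text, start and end times in seconds), the sketch below reads a small JSON-style transcript and sums segment durations. The field names and values here are purely hypothetical; the actual key names and file layout in the CHiME release may differ.

```python
import json

# Hypothetical transcript fragment; field names are illustrative only
# and may not match the actual CHiME 8 transcript format.
segments_json = """[
  {"speaker": "spk01", "words": "hello how are you",
   "start_time": 12.34, "end_time": 14.02},
  {"speaker": "spk02", "words": "fine thanks",
   "start_time": 14.50, "end_time": 15.60}
]"""

def total_speech_seconds(raw: str) -> float:
    """Sum segment durations (end minus start) over a JSON transcript string."""
    return sum(s["end_time"] - s["start_time"] for s in json.loads(raw))

print(round(total_speech_seconds(segments_json), 2))
```

A tally like this is a common first sanity check when loading a new transcribed corpus, since total segment duration should track the documented hours of speech.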
2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
*
Abstract Meaning Representation 2.0 - Machine Translations was developed at the University of Edinburgh, School of Informatics and the University of Zurich, Department of Computational Linguistics. It consists of automatic translations into Spanish, German, Italian and Mandarin Chinese of the source English sentences and the professionally-translated Spanish, German, Italian and Mandarin Chinese sentences in Abstract Meaning Representation 2.0 - Four Translations (LDC2020T07). The translations were generated with Google Translate between May 2018 and March 2024.
The source English sentences are a subset (1,371 sentences) of the sentences contained in Abstract Meaning Representation (AMR) Annotation Release 2.0 (LDC2017T10), a semantic treebank of over 39,000 English natural language sentences from broadcast conversations, newswire and web text.
Translations run from each of the five languages (English, Spanish, German, Italian and Mandarin Chinese) to each of the other four, covering all 20 language pairs. The dataset contains the 1,371 source sentences in each language; each sentence is paired with its professional translation and multiple dated machine translations from Google Translate.
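The count of 20 language pairs follows directly from choosing an ordered (source, target) pair of distinct languages among the five; a quick sketch:

```python
from itertools import permutations

# The five languages covered by the release; each ordered pair of
# distinct languages is one translation direction.
langs = ["English", "Spanish", "German", "Italian", "Mandarin Chinese"]
pairs = list(permutations(langs, 2))
print(len(pairs))
```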
*
KAIROS Phase 1 Quizlet was developed by LDC and contains English and Spanish text, video and image data, along with annotations, used for pre-evaluation research and system development during Phase 1 of the DARPA KAIROS program. KAIROS Quizlets were a series of narrowly defined tasks designed to explore specific evaluation objectives, enabling KAIROS system developers to exercise individual system components on a small data set prior to the full program evaluation. This corpus contains the complete set of Quizlet data used in Phase 1, which focused on two real-world complex events (CEs) within the Improvised Explosive Device bombing scenario: CE1001 (2018 Caracas drone attack) and CE1002 (Utah High School backpack bombing).
Source data was collected from the web: 30 root web pages were processed, yielding 29 text files, 216 image files and 5 video files. Annotation included labeling scenario-relevant events and relations in each document to develop a structured representation of temporally ordered events, relations and arguments, and generating a reference knowledge graph.
The DARPA KAIROS (Knowledge-directed Artificial Intelligence Reasoning Over Schemas) program aimed to build technology capable of understanding and reasoning about complex real-world events in order to provide actionable insights to end users. KAIROS systems utilized formal event representations in the form of schema libraries that specified the steps, preconditions and constraints for an open set of complex events; schemas were then used in combination with event extraction to characterize and make predictions about real-world events in a large multilingual, multimedia corpus.
2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.