Linguistic Data Consortium: LDC June 2026 Newsletter

Maintaining LDC organization user accounts

LDC data and commercial technology development

New publications:

KAIROS Phase 1 Evaluation Source Data, Annotation, and Assessment

Multi-Language Conversational Telephone Speech 2014 – Spanish & Portuguese

LORELEI Multiway Translated Text

________________________________________________________________

Maintaining LDC organization user accounts
LDC encourages organization account administrators to review their LDC organization user accounts at least annually to remove users who are no longer affiliated with the organization. Users no longer affiliated with an organization cannot continue to access LDC data through the organization’s LDC account. As stated in LDC’s membership agreements and license agreements, LDC data cannot be shared outside the member/licensing organization. LDC reserves the right to deactivate user accounts if any suspicious activity is detected. Visit the User Accounts page for further information on user types and privileges.

LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

KAIROS Phase 1 Evaluation Source Data, Annotation, and Assessment was developed by LDC and contains the English and Spanish source data (text, video, images), manual annotations, reference knowledge graphs, the system output assessed during the evaluation, and human assessment results from the Phase 1 evaluation of the DARPA KAIROS Program. The Phase 1 evaluation focused on nine complex events in the improvised explosive bombing scenario and two surprise complex events in the mass shooting scenario.

Source data for each complex event consisted of 10-15 multimodal English and Spanish documents, including both event-relevant documents and off-topic distractors. Manual annotation and assessment of event-relevant documents for 10 complex events are included in this release. Scenario-relevant events and relations were labeled for each document to develop a structured representation of the temporally ordered events, relations and arguments that expressed the scenario-relevant events in each complex event. A reference knowledge graph (Graph G) was developed for each event; systems were expected to match Graph G with a given schema library. Assessment data includes human assessment judgments and the system output that was manually assessed for the end-to-end evaluation task.

The DARPA KAIROS (Knowledge-directed Artificial Intelligence Reasoning Over Schemas) program aimed to build technology capable of understanding and reasoning about complex real-world events in order to provide actionable insights to end users. KAIROS systems utilized formal event representations in the form of schema libraries that specified the steps, preconditions and constraints for an open set of complex events; schemas were then used in combination with event extraction to characterize and make predictions about real-world events in a large multilingual, multimedia corpus.

2026 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Multi-Language Conversational Telephone Speech 2014 – Spanish & Portuguese was developed by LDC and comprises 123 hours of Spanish and Portuguese telephone speech. The data was collected to support research and technology evaluation in automatic language identification; portions of these recordings were used in the NIST 2015 and 2017 language recognition evaluations. The collection focused on language pair discrimination for 20 languages/dialects, some of which could be considered mutually intelligible or closely related.

This corpus contains 569 recordings covering Brazilian Portuguese, Caribbean Spanish, European Spanish and Latin American Spanish. Participants were recruited by native speakers who contacted acquaintances in their social networks. Those native speakers made one call of up to 8 minutes to each acquaintance. Human auditors labeled the calls for language, quality, callee gender, dialect type and noise.

2026 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

LORELEI Multiway Translated Text consists of a fixed set of English texts (around 100,000 words) translated into 24 languages. It was developed by LDC for the DARPA LORELEI Program; the translations were included in the LORELEI representative language packs created by LDC in 2016-2019.

The common word set was composed of English news documents (50%), LORELEI-domain English news documents (25%), and a phrasebook and elicitation corpus (25%). The phrasebook contained everyday colloquial phrases. The elicitation corpus was designed to represent linguistic structures. Texts were translated by a combination of professional translators and crowd-sourced translators.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations like natural disasters or disease outbreaks. Linguistic resources for LORELEI include Representative Language Packs and Incident Language Packs for over two dozen low resource languages, comprising data, annotations, basic natural language processing tools, lexicons and grammatical resources. Representative languages were selected to provide broad typological coverage, while incident languages were selected to evaluate system performance on a language whose identity was disclosed at the start of the evaluation.

2026 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Linguistic Data Consortium

Monday, June 15, 2026
