Linguistic Data Consortium: LDC June 2025 Newsletter

Monday, June 16, 2025

LDC data and commercial technology development

New publications:

Chinese Sentence Pattern Structure Treebank
IWSLT 2022-2023 Shared Task Training, Development and Test Set
KAIROS Schema Learning Complex Event Annotation

_______________________________________________________________________

LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

Chinese Sentence Pattern Structure Treebank was developed at Beijing Normal University and Peking University. It contains 5,016 sentences and 119,627 tokens syntactically annotated following the concept of sentence constituent analysis, which emphasizes sentence pattern structure. The source data consists of 27 chapters extracted from modern Mandarin and ancient Chinese works. There are three annotation layers: lexical sense and structural mode for dynamic words; syntactic structure for clauses; and inter-clause relations within complex sentences and sentence clusters. These structures can be visualized using the Jbw-viewer tool, which is included in the release.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

IWSLT 2022-2023 Shared Task Training, Development and Test Set was developed by LDC and contains 210 hours of Tunisian Arabic conversational telephone speech, transcripts, English translations, speaker metadata, and documentation. This material constitutes the training, development and test data used in the International Conference on Spoken Language Translation (IWSLT) Dialectal Speech Translation task (2022) and the Dialectal and Low-resource track (2023).

The telephone speech was collected by LDC in 2016-2017 from native speakers of Tunisian Arabic in Tunis. Speakers were recruited to make telephone calls to people in their social networks under a variety of noise conditions and using a variety of handsets. Transcripts are orthographic, follow Buckwalter transliteration, and cover 175 hours of the collected speech. IPA transcripts were added to a subset of the data.

All transcribed segments were translated into English.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

KAIROS Schema Learning Complex Event Annotation was developed by LDC to support the DARPA KAIROS program. It contains English and Spanish text, audio, video and image data labeled for 93 real-world complex events with event, relation and argument annotations linking to document provenance. Source data was collected from the web; 3,431 root web pages were collected and processed, yielding 1,919 text data files, 24,019 image files, 1,472 video files and 16 audio files.

The DARPA KAIROS (Knowledge-directed Artificial Intelligence Reasoning Over Schemas) program aimed to build technology capable of understanding and reasoning about complex real-world events in order to provide actionable insights to end users. KAIROS systems utilized formal event representations in the form of schema libraries that specified the steps, preconditions and constraints for an open set of complex events; schemas were then used in combination with event extraction to characterize and make predictions about real-world events in a large multilingual, multimedia corpus.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
