Monday, June 16, 2025

LDC June 2025 Newsletter

LDC data and commercial technology development


New publications:

Chinese Sentence Pattern Structure Treebank
IWSLT 2022-2023 Shared Task Training, Development and Test Set
KAIROS Schema Learning Complex Event Annotation

_______________________________________________________________________


LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

Chinese Sentence Pattern Structure Treebank was developed at Beijing Normal University and Peking University. It contains 5,016 sentences and 119,627 tokens syntactically annotated following the concept of sentence constituent analysis, which emphasizes sentence pattern structure. The source data consists of 27 chapters extracted from modern Mandarin and ancient Chinese works. There are three annotation layers: lexical sense and structural mode for dynamic words; syntactic structure for clauses; and inter-clause relations within complex sentences and sentence clusters. These structures can be visualized using the Jbw-viewer tool, which is included in the release.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

*

IWSLT 2022-2023 Shared Task Training, Development and Test Set was developed by LDC and contains 210 hours of Tunisian Arabic conversational telephone speech, transcripts, English translations, speaker metadata, and documentation. This material constitutes the training, development and test data used in the International Conference on Spoken Language Translation (IWSLT) Dialectal Speech Translation task (2022) and the Dialectal and Low-resource track (2023).

The telephone speech was collected by LDC in 2016-2017 from native speakers of Tunisian Arabic in Tunis. Speakers were recruited to make telephone calls to people in their social networks using a variety of handsets under a range of noise conditions. Transcripts are orthographic, following Buckwalter transliteration, and cover 175 hours of the collected speech. IPA transcripts were added to a subset of the data.

All transcribed segments were translated into English.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

*

KAIROS Schema Learning Complex Event Annotation was developed by LDC to support the DARPA KAIROS program. It contains English and Spanish text, audio, video and image data labeled for 93 real-world complex events with event, relation and argument annotations linking to document provenance. Source data was collected from the web; 3431 root web pages were collected and processed, yielding 1919 text data files, 24019 image files, 1472 video files and 16 audio files.

The DARPA KAIROS (Knowledge-directed Artificial Intelligence Reasoning Over Schemas) program aimed to build technology capable of understanding and reasoning about complex real-world events in order to provide actionable insights to end users. KAIROS systems utilized formal event representations in the form of schema libraries that specified the steps, preconditions and constraints for an open set of complex events; schemas were then used in combination with event extraction to characterize and make predictions about real-world events in a large multilingual, multimedia corpus.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.


Thursday, May 15, 2025

LDC May 2025 Newsletter

New publications:

BOLT CTS CALLFRIEND CALLHOME Mandarin Chinese Audio

BOLT CTS CALLFRIEND CALLHOME Mandarin Chinese Transcripts and Translations

___________________________________________________________

New publications:

BOLT CTS CALLFRIEND CALLHOME Mainland Mandarin Chinese Audio was developed by LDC and consists of 93 hours of speech from 236 unscripted telephone conversations between native speakers of the Mandarin Chinese dialect spoken in mainland China. The calls were collected by LDC in the CALLFRIEND and CALLHOME series where participants called family members or close friends and spoke on topics of their choice. Around 60% of the recordings (141 calls) are publicly released for the first time. The remaining 95 recordings were previously published by LDC in various CALLFRIEND, CALLHOME and HUB5 Mandarin datasets. The data is divided into training, development, and evaluation partitions.

The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, conversational telephone speech, text messaging and chat -- in Chinese, Egyptian Arabic and English. The material in this release represents the unannotated Chinese source conversational telephone speech. The telephone data was transcribed, translated, and annotated for various tasks in the BOLT program including word alignment, treebanking, and co-reference.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

*

BOLT CTS CALLFRIEND CALLHOME Mainland Mandarin Chinese Transcripts and Translations contains transcripts and corresponding English translations for the conversational telephone speech in BOLT CTS CALLFRIEND CALLHOME Mandarin Chinese Audio and was developed by LDC to support the DARPA BOLT program. 

Transcribers were required to produce a verbatim transcript of all speech within a file using simplified Chinese orthography and to add minimal markup to capture salient features of the speech. Some transcripts include redactions for potential personally identifying information. All speech data was transcribed and is divided into training, development, and evaluation partitions.

The goal of the BOLT translation task was to translate the Chinese transcripts into fluent English while preserving the meaning present in the original Chinese text. Transcripts in the development and evaluation partitions received first pass and gold standard translations. 89% of the transcripts were translated into English.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee. 

Tuesday, April 15, 2025

LDC April 2025 Newsletter

LDC launches upgraded, mobile-friendly website

Connect with LDC on Bluesky


New publications:

DEFT Spanish Light and Rich ERE Annotation

MATERIAL Kazakh-English Language Pack

____________________________________________________________


LDC launches upgraded, mobile-friendly website
We are pleased to announce the launch of the newly upgraded LDC main website: https://www.ldc.upenn.edu/. Designed with a modern layout, the site now offers an improved experience across all devices. While the LDC Catalog, LDC user accounts, and LDC Submissions are not affected by this upgrade, they are now more accessible than ever from any page on the site. We invite you to explore the website and enjoy a smoother, more intuitive LDC web experience. 

Connect with LDC on Bluesky
In addition to Facebook, X and LinkedIn, you can now connect with LDC on the microblogging platform Bluesky. Follow us today for the latest news, announcements and corpus releases from the Consortium.


New publications:

DEFT Spanish Light and Rich ERE Annotation was developed by LDC and consists of 158 Spanish discussion forum and newswire documents annotated for entities, relations, and events (ERE). Light ERE annotation labels mentions of a target set of entity types, along with relations and events between and among those entities, including coreference. Rich ERE annotation expands types and tagging in the entities, relations, and events annotation tasks and replaces strict event coreference with a more loosely defined event hopper annotation. The source data consists of Spanish newswire text and Latin American discussion forum data from DEFT Spanish Treebank LDC2018T01. 128 documents were annotated following Light ERE annotation guidelines. 154 files were labeled with Rich ERE annotation, 124 of which were also labeled with Light ERE annotation.

DARPA's Deep Exploration and Filtering of Text (DEFT) program aimed to address remaining capability gaps in state-of-the-art natural language processing technologies related to inference, causal relationships and anomaly detection. LDC supported the DEFT program by collecting, creating and annotating a variety of data sources.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

*

MATERIAL Kazakh-English Language Pack was developed by Appen for the IARPA MATERIAL program and contains 57 hours of Kazakh conversational telephone speech, transcripts, English translations, annotations, and queries. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments. Transcripts cover approximately 17% of the speech files, all of which were translated into English. This release also includes English queries and their relevance annotations. 


The MATERIAL program focused on underserved languages with the ultimate goal of building cross-language information retrieval systems to find speech and text content using English search queries.

2025 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.

Monday, March 17, 2025

LDC March 2025 Newsletter

LDC data and commercial technology development

New publications:

2015 NIST Language Recognition Evaluation Test Set

The Xi’an Multi-Language Learner Corpus

_________________________________________________________________

LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

2015 NIST Language Recognition Evaluation Test Set was developed by LDC and NIST. It contains the evaluation test set for the 2015 NIST Language Recognition Evaluation (LRE), approximately 867 hours of conversational telephone speech (CTS) and broadcast narrowband speech (BNBS) collected by LDC in 20 languages over 6 clusters of related languages: Arabic (Egyptian, Iraqi, Levantine, Maghrebi, Modern Standard Arabic); Spanish (Caribbean, European, Latin American, Brazilian Portuguese); English (British, Indian, General American English); Chinese (Cantonese, Mandarin, Min Nan, Wu); Slavic (Polish, Russian); and French (West African, Haitian Creole).

The CTS data includes 8-15 minute calls between individuals in the same social networks, as well as telephone speech from the IARPA Babel series collected in 2012-2013 from speakers using a range of phone types in diverse settings with varying noise conditions. The BNBS data was collected by LDC from streaming and satellite radio programming, focusing on programs that included narrowband speech (e.g., call-ins to a talk show).

The goal of NIST's LREs is to establish the baseline of current performance capability for CTS language recognition and to lay the groundwork for further research efforts. LRE15 expanded the range of test segment durations and added a test condition that allowed systems to make use of unrestricted training data when developing models.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

*

The Xi’an Multi-Language Learner Corpus was developed by Xi’an International Studies University (XISU) and consists of 526 argumentative essays in 15 languages by Chinese L1 university students studying second languages, along with student metadata and writing prompts. It was developed to support second language learner research and to provide a database for cross-linguistic comparison of second languages.

Data was collected in 2023 and 2024 from students at XISU and Yunnan Minzu University (YMU) who were linguistics majors or studying one of the foreign languages available at XISU and YMU. Off-topic essays and incomplete texts were excluded.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Monday, February 17, 2025

LDC February 2025 Newsletter

LDC at LT4All 2025

LDC membership discounts expire March 3

Spring 2025 data scholarship recipients

New publications:

AIDA Scenario 3 Practice Topic Source Data and Annotation

MATERIAL Georgian-English Language Pack

______________________________________________________________________

LDC at LT4All 2025 
LDC is pleased to be a sponsor of the 2nd International Conference on Language Technologies for All (LT4All 2025), February 24-26, 2025, organized by ELRA and SIGUL, the ELRA/ISCA Special Interest Group on Under-resourced Languages, in partnership with UNESCO as part of the International Decade of Indigenous Languages (2022-2032). The conference theme, "Advancing Humanism through Language Technologies," focuses on community empowerment within the larger discussion of the many ways technology impacts language communities. The conference will also commemorate the Silver Jubilee of International Mother Language Day (February 21).

LDC membership discounts expire March 3 
Time is running out to save on 2025 membership fees. Renew your LDC membership, rejoin the Consortium, or become a new member by March 3 to receive a discount of up to 10%. For more information on membership benefits and options, visit Join LDC.

Spring 2025 data scholarship recipients 
Congratulations to the recipients of LDC’s Spring 2025 data scholarships:

Sair Buckle, Charles Sturt University (Australia); PhD student, AI and Cyber Futures Institute. Sair is awarded a copy of Avocado Research Email Corpus LDC2015T03 for her work in behavioral science.

Le Phuoc Thinh Tien, Vietnam National University Ho Chi Minh City (Vietnam); Bachelor’s student, Faculty of Information Technology. Le is awarded a copy of Penn Discourse Treebank Version 3.0 LDC2019T05 for his research in natural logical reasoning. 

The next round of applications will be accepted in September 2025. For information about the program, visit the Data Scholarships page.

New publications:

AIDA Scenario 3 Practice Topic Source Data and Annotation was developed by LDC and consists of English, Russian and Spanish web documents (text, video, image) and annotations. Each phase of the AIDA program centered on a specific scenario, or broad topic area, with related subtopics designated as either practice subtopics or evaluation subtopics. The Phase 3 scenario focused on the COVID-19 global pandemic. This corpus contains source documents and annotations for the Scenario 3 practice topics.

The corpus contains 1417 root documents; 279 documents were annotated. Annotations include:

Event, relation and entity annotation (64 documents)

Claim frame annotation: claims (true or not) relating to the COVID-19 pandemic (203 documents)

Practice topic query claim frames: example claim frames intended to be used by systems as queries to extract similar claims from additional documents (30 documents)

The DARPA AIDA (Active Interpretation of Disparate Alternatives) program aimed to develop a multi-hypothesis semantic engine to generate explicit alternative interpretations of events, situations and trends from a variety of unstructured sources. LDC supported AIDA by collecting, creating and annotating multimodal linguistic resources in multiple languages.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

*

MATERIAL Georgian-English Language Pack was developed by Appen for the IARPA MATERIAL program and contains 79 hours of Georgian conversational telephone speech, transcripts, English translations, annotations and queries. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments. Transcripts cover approximately half of the speech files, and approximately 3% of the speech data was translated into English. This release also includes English queries and their relevance annotations.

The MATERIAL program focused on underserved languages with the ultimate goal of building cross-language information retrieval systems to find speech and text content using English search queries.

2025 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.