Monday, February 17, 2025

LDC February 2025 Newsletter

LDC at LT4All 2025

LDC membership discounts expire March 3

Spring 2025 data scholarship recipients

New publications:

AIDA Scenario 3 Practice Topic Source Data and Annotation

MATERIAL Georgian-English Language Pack

______________________________________________________________________

LDC at LT4All 2025 
LDC is pleased to be a sponsor of The 2nd International Conference on Language Technologies for All (LT4All 2025), February 24-26, 2025, organized by ELRA and SIGUL, the ELRA/ISCA Special Interest Group on Under-resourced Languages, and in partnership with UNESCO as part of the International Decade of Indigenous Languages (2022-2032). The conference theme, "Advancing Humanism through Language Technologies," focuses on community empowerment within the larger discussion on the many ways technology impacts language communities. The conference will also commemorate the Silver Jubilee of International Mother Language Day (February 21).

LDC membership discounts expire March 3 
Time is running out to save on 2025 membership fees. Renew your LDC membership, rejoin the Consortium, or become a new member by March 3 to receive a discount of up to 10%. For more information on membership benefits and options, visit Join LDC.

Spring 2025 data scholarship recipients 
Congratulations to the recipients of LDC’s Spring 2025 data scholarships:

Sair Buckle, Charles Sturt University (Australia); PhD student, AI and Cyber Futures Institute. Sair is awarded a copy of Avocado Research Email Corpus LDC2015T03 for her work in behavioral science.

Le Phuoc Thinh Tien, Vietnam National University Ho Chi Minh City (Vietnam); Bachelor’s student, Faculty of Information Technology. Le is awarded a copy of Penn Discourse Treebank Version 3.0 LDC2019T05 for his research in natural logical reasoning. 

The next round of applications will be accepted in September 2025. For information about the program, visit the Data Scholarships page.

New publications:

AIDA Scenario 3 Practice Topic Source Data and Annotation was developed by LDC and consists of English, Russian and Spanish web documents (text, video, image) and annotations. Each phase of the AIDA program centered on a specific scenario, or broad topic area, with related subtopics designated as either practice subtopics or evaluation subtopics. The Phase 3 scenario focused on the COVID-19 global pandemic. This corpus contains source documents and annotations for the Scenario 3 practice topics.

The corpus contains 1417 root documents; 279 documents were annotated. Annotations include:

Event, relation and entity annotation (64 documents)

Claim frame annotation: claims (true or not) relating to the COVID-19 pandemic (203 documents)

Practice topic query claim frames: example claim frames intended to be used by systems as queries to extract similar claims from additional documents (30 documents)

The DARPA AIDA (Active Interpretation of Disparate Alternatives) program aimed to develop a multi-hypothesis semantic engine to generate explicit alternative interpretations of events, situations and trends from a variety of unstructured sources. LDC supported AIDA by collecting, creating and annotating multimodal linguistic resources in multiple languages.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

*

MATERIAL Georgian-English Language Pack was developed by Appen for the IARPA MATERIAL program and contains 79 hours of Georgian conversational telephone speech, transcripts, English translations, annotations and queries. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments. Transcripts cover approximately half of the speech files, and approximately 3% of the speech data was translated into English. This release also includes English queries and their relevance annotations. 

The MATERIAL program focused on underserved languages with the ultimate goal of building cross-language information retrieval systems that find speech and text content using English search queries.

2025 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.

Wednesday, January 15, 2025

LDC January 2025 Newsletter

Renew your LDC membership today 

New publications:

Iraqi Arabic - English Lexical Database

LORELEI Hungarian Representative Language Pack

__________________________________________________________________________

Renew your LDC membership today
The importance of curated resources for language-related education, research and technology development drives LDC’s mission to create them, to accept data contributions from researchers across the globe and to broadly share such resources through the LDC Catalog. LDC members enjoy no-cost access to new corpora released annually, as well as the ability to license legacy data sets from among our 960+ holdings at reduced fees. Ensure that your data needs continue to be met by renewing your LDC membership or by joining the Consortium today.

Now through March 3, 2025, current (2024) members receive a 10% discount on 2025 membership, and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for more details on membership options and benefits.

New publications:

Iraqi Arabic - English Lexical Database was developed by LDC. It has six interrelated tables presenting over 67,000 Iraqi Arabic words as orthographic forms in Arabic script and pronunciation forms in IPA format, along with more than 120,000 English tokens.

This release is the result of a collaboration with Georgetown University Press to enhance and update three dialectal Arabic dictionaries -- Iraqi, Moroccan and Syrian -- originally published in the 1960s. The Georgetown Dictionary of Iraqi Arabic was published in 2013. That work was based on, and expanded, two dictionaries, A Dictionary of Iraqi Arabic: English-Arabic (Clarity, Stowasser and Wolfe, eds., 2003) and A Dictionary of Iraqi Arabic: Arabic-English (Woodhead and Beene, eds., 2003).

LDC's enhancements to the updated dictionary and the lexical database were threefold: providing Arabic script spellings and IPA pronunciations for Iraqi words and phrases, which facilitates comparisons across Arabic dialects and Modern Standard Arabic; developing reasonable orthographic conventions for applying the Arabic alphabet to the dialect, which promotes ease of use by language learners and researchers; and adding information on the linguistic structures of Iraqi Arabic, which helps users understand morphological and lexical relations.

The documentation accompanying this release includes instructions for combining into one database the tables in this corpus with the tables in Moroccan Arabic - English Lexical Database LDC2023L01.

2025 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.

*

LORELEI Hungarian Representative Language Pack consists of over 686 million words of Hungarian monolingual text, 165,000 words of which were translated into English, 2.3 million words of found Hungarian-English parallel text, and 87,000 Hungarian words translated from English data. Approximately 72,500 words were annotated for named entities and over 25,000 words were annotated for full entity (including nominals and pronouns), entity linking and situation frames (identifying entities, needs and issues); over 17,000 words have simple semantic annotation; and close to 10,000 words were annotated for noun phrase chunking. Data was collected from discussion forums, news, reference sources, social networks, and weblogs.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Monday, December 16, 2024

LDC December 2024 Newsletter

LDC 2025 membership discounts now available 

Approaching deadline for Spring 2025 data scholarship applications

LDC closed for Winter Break December 25-January 1 

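New publications:

MATERIAL Farsi-English Language Pack

Abstract Meaning Representation 3.0 - Machine Translations

______________________________________________________________________
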
LDC 2025 membership discounts now available 
Now through March 3, 2025, current 2024 members receive a 10% discount for renewing their membership, and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for details on membership options and benefits. 

Approaching deadline for Spring 2025 data scholarship applications
Attention students: don’t miss out on the chance to receive no-cost access to LDC data for your research. Applications for Spring 2025 data scholarships are due January 15, 2025. For more information on requirements and program rules, see LDC Data Scholarships.

LDC closed for Winter Break December 25-January 1 
LDC will be closed from Wednesday, December 25, 2024, through Wednesday, January 1, 2025, in accordance with the University of Pennsylvania Winter Break Policy. Our offices will reopen on Thursday, January 2, 2025. Requests received by the Membership Office during Winter Break will be processed when the office reopens. 


New publications:

MATERIAL Farsi-English Language Pack was developed by Appen for the IARPA MATERIAL program and contains 61 hours of Farsi conversational telephone speech, transcripts, English translations, annotations, and queries. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments. Transcripts cover approximately 30% of the speech files, and approximately 3% of the speech files were translated into English. This release also includes English queries and their relevance annotations. 

The MATERIAL program focused on underserved languages with the ultimate goal of building cross-language information retrieval systems that find speech and text content using English search queries.

2024 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.


Abstract Meaning Representation 3.0 - Machine Translations was developed by the Center for Computational Linguistics at KU Leuven in the HORIZON2020 project SignON. It is an automatic translation of a subset of sentences from Abstract Meaning Representation (AMR) Annotation Release 3.0 (LDC2020T02) into Spanish, Irish Gaelic, and Dutch.

AMR 3.0 training, development, and test splits were translated using Google Translate. "Unsplit" directories were not translated and are not included in this release. Translations were not manually verified, but formal issues (such as unexpected line breaks) were corrected, and special tokens and encoding issues were fixed with ftfy.fix_text from the Python library ftfy.
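
As a rough illustration of the kind of cleanup described above, the following Python sketch collapses stray line breaks and then passes the text through ftfy.fix_text. It is not the pipeline used to build the corpus: the helper names collapse_line_breaks and clean_translation are hypothetical, and the sketch assumes the third-party ftfy package is installed.

```python
import re


def collapse_line_breaks(text: str) -> str:
    """Join a sentence that was split across lines into a single line."""
    # Replace each line break, plus any whitespace around it, with one space.
    return re.sub(r"\s*\n\s*", " ", text).strip()


def clean_translation(text: str) -> str:
    """Collapse line breaks, then repair encoding problems with ftfy."""
    # Imported lazily so the line-break helper works without ftfy installed.
    import ftfy  # third-party package (assumed installed): pip install ftfy

    return ftfy.fix_text(collapse_line_breaks(text))
```

Applying clean_translation to each machine-translated sentence would give one sentence per line with common mojibake repaired. Whether the corpus developers applied these steps in this order is an assumption.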


AMR 3.0 is a semantic treebank of over 59,000 English natural language sentences drawn from material collected by LDC, specifically, discussion forum text from the DARPA BOLT and DARPA DEFT programs, transcripts and English translations of Mandarin Chinese broadcast news programming, Wall Street Journal text, translated Xinhua news texts, various newswire texts from NIST OpenMT evaluations, and weblog data from the DARPA GALE program.

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Sunday, November 17, 2024

LDC November 2024 Newsletter

Join LDC for membership year 2025  

Spring 2025 data scholarship application deadline  

New publications:

LORELEI Yoruba Representative Language Pack

Samrómur Synthetic

____________________________________________________________________

Join LDC for membership year 2025 
It’s time to renew your LDC membership for 2025. Current (2024) members who renew their membership before March 3, 2025, will receive a 10% discount. New or returning organizations will receive a 5% discount if they join the Consortium by March 3.

In addition to receiving new publications, current LDC members enjoy the benefit of licensing older data from our Catalog of 950+ holdings at reduced fees. Current-year for-profit members may use most data for commercial applications.

Plans for next year’s publications are in progress. Among the expected releases are:  

  • AIDA topic source data and annotations: multimodal source data and annotations in multiple languages (Russian, English, Spanish) for information and entity extraction 
  • 2015 NIST Language Recognition Evaluation Test Set: 164,000+ segments of conversational telephone speech and broadcast narrow band speech in six linguistic varieties (Arabic, Spanish, English, Chinese, Slavic, French) representing 20 languages, used in NIST’s 2015 language recognition evaluation 
  • BOLT CALLFRIEND CALLHOME CTS audio, transcripts and translations: previously unpublished Chinese and Egyptian Arabic telephone conversations from the CALLFRIEND and CALLHOME collections, with transcripts and translations developed by LDC for the DARPA BOLT program 
  • Chinese Sentence Pattern Structure Treebank: 5,000+ sentences from ancient and modern Chinese texts with syntactic annotation based on sentence constituent analysis, developed by Beijing Normal University and Peking University  
  • IARPA MATERIAL language packs: conversational telephone speech, transcripts, English translations, annotations and queries in multiple languages (e.g., Georgian, Kazakh, Lithuanian)
  • LORELEI: representative and incident language packs containing monolingual text, bi-text, translations, annotations, supplemental resources and related tools in various languages (e.g., Hungarian, Hindi, Amharic, Somali) 

For full descriptions of all LDC data sets, browse our Catalog. Visit Join LDC for details on membership, user accounts and payment.

Spring 2025 data scholarship application deadline
Applications are now being accepted through January 15, 2025, for the Spring 2025 LDC data scholarship program, which provides university students with no-cost access to LDC data. Consult the LDC Data Scholarships page for more information about program rules and submission requirements.
 
New publications:

LORELEI Yoruba Representative Language Pack was developed by LDC and consists of approximately 7.2 million words of Yoruba monolingual text, 127,000 Yoruba words translated from English data, and 810,000 words of Yoruba-English parallel text. Approximately 77,000 words were annotated for named entities, over 25,000 words were annotated for full entity (including nominals and pronouns) and simple semantic annotation, and around 10,000 words were annotated for noun phrase chunking. Data was collected from discussion forums, news, reference sources, social networks, and weblogs.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.


Samrómur Synthetic was developed by the Language and Voice Lab, Reykjavik University and contains 72 hours of Icelandic synthetic speech, transcripts and metadata. Source sentences were extracted from the Samrómur platform, comprised of texts and transcripts covering various genres. Text was processed through a text-to-speech system developed by Reykjavik University's Language and Voice Lab to generate speech files. Synthesized speech was created with 44 voices (22 male, 22 female) at four different speed rates for a total of 220 speakers and 62,700 utterances (with 285 sentences/speaker).

2024 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.