Wednesday, January 15, 2025

LDC January 2025 Newsletter

Renew your LDC membership today 

New publications:

Iraqi Arabic – English Lexical Database

LORELEI Hungarian Representative Language Pack

__________________________________________________________________________

Renew your LDC membership today
The importance of curated resources for language-related education, research and technology development drives LDC’s mission to create them, to accept data contributions from researchers across the globe and to broadly share such resources through the LDC Catalog. LDC members enjoy no-cost access to new corpora released annually, as well as the ability to license legacy data sets from among our 960+ holdings at reduced fees. Ensure that your data needs continue to be met by renewing your LDC membership or by joining the Consortium today.

Now through March 3, 2025, current 2024 members receive a 10% discount on 2025 membership, and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for more details on membership options and benefits.

New publications

Iraqi Arabic - English Lexical Database was developed by LDC. It has six interrelated tables presenting over 67,000 Iraqi Arabic words as orthographic forms in Arabic script and pronunciation forms in IPA format, along with more than 120,000 English tokens.

This release is the result of a collaboration with Georgetown University Press to enhance and update three dialectal Arabic dictionaries -- Iraqi, Moroccan and Syrian -- originally published in the 1960s. The Georgetown Dictionary of Iraqi Arabic was published in 2013. That work was based on, and expanded, two dictionaries, A Dictionary of Iraqi Arabic: English-Arabic (Clarity, Stowasser and Wolfe, eds., 2003) and A Dictionary of Iraqi Arabic: Arabic-English (Woodhead and Beene, eds., 2003).

The enhancements LDC developed for the updated dictionary and the lexical database include facilitating comparisons across Arabic dialects and Modern Standard Arabic by providing Arabic script spellings and IPA pronunciations for Iraqi words and phrases; promoting ease of use for language learners and researchers by developing reasonable orthographic conventions for applying the Arabic alphabet to the dialect; and facilitating users' understanding of morphological and lexical relations by adding information on the linguistic structures of Iraqi Arabic.

The documentation accompanying this release includes instructions for combining into one database the tables in this corpus with the tables in Moroccan Arabic - English Lexical Database LDC2023L01.

2025 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.

*

LORELEI Hungarian Representative Language Pack consists of over 686 million words of Hungarian monolingual text, 165,000 words of which were translated into English, 2.3 million words of found Hungarian-English parallel text, and 87,000 Hungarian words translated from English data. Approximately 72,500 words were annotated for named entities and over 25,000 words were annotated for full entity (including nominals and pronouns), entity linking and situation frames (identifying entities, needs and issues); over 17,000 words have simple semantic annotation; and close to 10,000 words were annotated for noun phrase chunking. Data was collected from discussion forums, news, reference material, social networks, and weblogs.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Monday, December 16, 2024

LDC December 2024 Newsletter

LDC 2025 membership discounts now available 

Approaching deadline for Spring 2025 data scholarship applications

LDC closed for Winter Break December 25-January 1 

New publications:

MATERIAL Farsi-English Language Pack

Abstract Meaning Representation 3.0 - Machine Translations

LDC 2025 membership discounts now available 
Now through March 3, 2025, current 2024 members receive a 10% discount for renewing their membership, and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for details on membership options and benefits. 

Approaching deadline for Spring 2025 data scholarship applications
Attention students: don’t miss out on the chance to receive no-cost access to LDC data for your research. Applications for Spring 2025 data scholarships are due January 15, 2025. For more information on requirements and program rules, see LDC Data Scholarships.

LDC closed for Winter Break December 25-January 1 
LDC will be closed from Wednesday, December 25, 2024, through Wednesday, January 1, 2025, in accordance with the University of Pennsylvania Winter Break Policy. Our offices will reopen on Thursday, January 2, 2025. Requests received by the Membership Office during Winter Break will be processed when the office reopens. 


New publications:
MATERIAL Farsi-English Language Pack was developed by Appen for the IARPA MATERIAL program and contains 61 hours of Farsi conversational telephone speech, transcripts, English translations, annotations, and queries. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments. Transcripts cover approximately 30% of the speech files, and approximately 3% of the speech files were translated into English. This release also includes English queries and their relevance annotations. 

The MATERIAL program focused on underserved languages with the ultimate goal to build cross language information retrieval systems to find speech and text content using English search queries.

2024 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.


Abstract Meaning Representation 3.0 - Machine Translations was developed by the Center for Computational Linguistics at KU Leuven in the HORIZON2020 project SignON. It is an automatic translation of a subset of sentences from Abstract Meaning Representation (AMR) Annotation Release 3.0 (LDC2020T02) into Spanish, Irish Gaelic, and Dutch.

AMR 3.0 training, development, and test splits were translated using Google Translate. "Unsplit" directories were not translated and are not included in this release. Translations were not manually verified, but formal issues (such as unexpected new lines) were corrected, and special tokens and encoding issues were fixed with the Python tool ftfy.fix_text.
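As a rough illustration of the kind of encoding repair mentioned above, the sketch below reverses one common failure (UTF-8 bytes misread as Windows-1252) using only the Python standard library. It is not ftfy and not the project's actual pipeline; the function name repair_mojibake is hypothetical, and ftfy.fix_text covers many more cases than this.

```python
def repair_mojibake(text: str) -> str:
    """Undo one round of UTF-8 bytes that were misread as Windows-1252.

    Hypothetical helper for illustration only. If the text cannot be
    round-tripped (for example, it was never garbled), it is returned
    unchanged.
    """
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


if __name__ == "__main__":
    print(repair_mojibake("UniversitÃ¤t"))  # garbled input
    print(repair_mojibake("Universität"))   # already clean, left as-is
```

Returning the input unchanged on failure keeps the helper safe to apply to text that is already correctly encoded.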


AMR 3.0 is a semantic treebank of over 59,000 English natural language sentences drawn from material collected by LDC, specifically, discussion forum text from the DARPA BOLT and DARPA DEFT programs, transcripts and English translations of Mandarin Chinese broadcast news programming, Wall Street Journal text, translated Xinhua news texts, various newswire texts from NIST OpenMT evaluations, and weblog data from the DARPA GALE program.

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Sunday, November 17, 2024

LDC November 2024 Newsletter

Join LDC for membership year 2025  

Spring 2025 data scholarship application deadline  

New publications:

LORELEI Yoruba Representative Language Pack

Samrómur Synthetic

____________________________________________________________________

Join LDC for membership year 2025 
It’s time to renew your LDC membership for 2025. Current (2024) members who renew their membership before March 3, 2025 will receive a 10% discount. New or returning organizations will receive a 5% discount if they join the Consortium by March 3.

In addition to receiving new publications, current LDC members enjoy the benefit of licensing older data from our Catalog of 950+ holdings at reduced fees. Current-year for-profit members may use most data for commercial applications.

Plans for next year’s publications are in progress. Among the expected releases are:  

  • AIDA topic source data and annotations: multimodal source data and annotations in multiple languages (Russian, English, Spanish) for information and entity extraction 
  • 2015 NIST Language Recognition Evaluation Test Set: 164,000+ segments of conversational telephone speech and broadcast narrow band speech in six linguistic varieties (Arabic, Spanish, English, Chinese, Slavic, French) representing 20 languages, used in NIST’s 2015 language recognition evaluation 
  • BOLT CALLFRIEND CALLHOME CTS audio, transcripts and translations: previously unpublished Chinese and Egyptian Arabic telephone conversations from the CALLFRIEND and CALLHOME collections, with transcripts and translations developed by LDC for the DARPA BOLT program 
  • Chinese Sentence Pattern Structure Treebank: 5,000+ sentences from ancient and modern Chinese texts with syntactic annotation based on sentence constituent analysis, developed by Beijing Normal University and Peking University  
  • IARPA MATERIAL language packs: conversational telephone speech, transcripts, English translations, annotations and queries in multiple languages (e.g., Georgian, Kazakh, Lithuanian)
  • LORELEI: representative and incident language packs containing monolingual text, bi-text, translations, annotations, supplemental resources and related tools in various languages (e.g., Hungarian, Hindi, Amharic, Somali) 
For full descriptions of all LDC data sets, browse our Catalog. Visit Join LDC for details on membership, user accounts and payment.

Spring 2025 data scholarship application deadline
Applications are now being accepted through January 15, 2025, for the Spring 2025 LDC data scholarship program, which provides university students with no-cost access to LDC data. Consult the LDC Data Scholarships page for more information about program rules and submission requirements.
 
New publications:

LORELEI Yoruba Representative Language Pack was developed by LDC and consists of approximately 7.2 million words of Yoruba monolingual text, 127,000 Yoruba words translated from English data, and 810,000 words of Yoruba-English parallel text. Approximately 77,000 words were annotated for named entities, over 25,000 words were annotated for full entity (including nominals and pronouns) and simple semantic annotation, and around 10,000 words were annotated for noun phrase chunking. Data was collected from discussion forums, news, reference material, social networks, and weblogs.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.


Samrómur Synthetic was developed by the Language and Voice Lab, Reykjavik University and contains 72 hours of Icelandic synthetic speech, transcripts and metadata. Source sentences were extracted from the Samrómur platform, comprised of texts and transcripts covering various genres. Text was processed through a text-to-speech system developed by Reykjavik University's Language and Voice Lab to generate speech files. Synthesized speech was created with 44 voices (22 male, 22 female) at four different speed rates for a total of 220 speakers and 62,700 utterances (with 285 sentences/speaker).

2024 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.

Tuesday, October 15, 2024

LDC October 2024 Newsletter

LDC/Penn receives US Dept of Education research grant 

Membership year 2025 publication preview 

Fall 2024 data scholarship recipients 

New publications:

RST Continuity Corpus

MultiTACRED

__________________________________________________________________

LDC/Penn receives US Dept of Education research grant 
LDC and Penn’s Graduate School of Education and Department of Computer and Information Science are part of a team that was recently awarded a $10 million grant from the US Department of Education to develop the Using Generative Artificial Intelligence for Reading R&D Center (U-GAIN Reading), which will explore using generative AI to improve elementary school reading instruction for English learners. Led by the education nonprofit Digital Promise, U-GAIN Reading will build on an existing research-based tutoring platform, Amira Learning, that is used by more than 1 million students each year. The LDC/Penn team will contribute expertise in computational linguistics, computer science, and learning analytics. An evaluation team at MDRC will measure learner outcomes both to improve the R&D and to benchmark its eventual impacts. Additional experts in the science of reading, ethics, and strategies for national impact will support the project’s work. Data developed in the project will be shared with the community through the LDC Catalog.

Membership year 2025 publication preview 
The 2025 membership year is approaching and plans for next year’s publications are in progress. Among the expected releases are:  

Check your inbox for more information about membership renewal.

Fall 2024 data scholarship recipients 
Congratulations to the recipients of LDC's Fall 2024 data scholarships:

Yomma Gamaleldin: Alexandria University (Egypt): Master’s student, Computer and Systems Engineering Department. Yomma is awarded a copy of Qatari Corpus of Argumentative Writing LDC2022T04 for her work in Arabic automated essay scoring.

Ahrane Mahaganapathy: Jaffna University (Sri Lanka): Master’s student, Department of Computer Science. Ahrane is awarded copies of IARPA Babel Tamil Language Pack LDC2017S13 and Multi-Language Telephone Speech 2011 – South Asian LDC2017S14 for her work in Tamil speech-to-text systems.

Sivashanth Suthakar: Jaffna University (Sri Lanka): Master’s student, Department of Computer Science. Sivashanth is awarded copies of CAMIO Transcription Languages LDC2022T07 and LORELEI Tamil Representative Language Pack LDC2023T03 for his work in Tamil OCR systems.

Oshan Yalegama: University of Moratuwa (Sri Lanka): BSc, Electronic and Telecommunication Engineering. Oshan is awarded copies of CSR-I (WSJ0) Complete LDC93S6A and TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1 for his work in audio signal processing.

Samer Mohammed Yaseen: Sana’a University (Yemen): PhD candidate, Faculty of Computer and Information Technology. Samer is awarded a copy of Arabic Newswire Part 1 LDC2001T55 for his work in Arabic information retrieval. 

New publications:

RST Continuity Corpus was developed at Åbo Akademi University and Humboldt-Universität zu Berlin and contains annotations for continuity dimensions added to RST Discourse Treebank (LDC2002T07). RST Discourse Treebank is a collection of English news texts from the Penn Treebank annotated for rhetorical relations under the RST (Rhetorical Structure Theory) framework. In RST Continuity Corpus, the relations are annotated for the seven continuity dimensions: time, space, reference, action, perspective, modality, and speech act. The relations are also annotated for polarity, order of segments, nuclearity, and context.

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

MultiTACRED was developed by the German Research Center for Artificial Intelligence (DFKI) Speech and Language Technology Lab and is a machine translation of TAC Relation Extraction Dataset (LDC2018T24) (TACRED) into twelve languages with projected entity annotations. TACRED is a large-scale relation extraction dataset containing 106,264 examples built over English newswire and web text used in the NIST TAC KBP English slot filling evaluations during the period 2009-2014. The training and evaluation data for the TAC KBP slot filling tasks was developed by the Linguistic Data Consortium.

TACRED training, development and test splits were translated into Arabic, Chinese, Finnish, French, German, Hindi, Hungarian, Japanese, Polish, Russian, Spanish, and Turkish using DeepL or Google Translate. The test split was back-translated into English to generate machine-translated English test data.

TACRED annotations are specified by token offsets. For translation, tokens were concatenated with white space, and the entity offsets were converted into XML-style markers to denote the relation arguments.
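The conversion described above can be sketched roughly as follows. This is a minimal illustration, not the MultiTACRED authors' code: the function name mark_arguments, the tag names H and T, and the assumption that spans are non-overlapping (start, end) token indices with an inclusive end are all hypothetical choices made for the example.

```python
def mark_arguments(tokens, subj_span, obj_span, subj_tag="H", obj_tag="T"):
    """Join tokens with spaces, wrapping two token spans in XML-style markers.

    Assumptions (not taken from the release documentation): each span is a
    (start, end) pair of token indices with an inclusive end, the two spans
    do not overlap, and the tag names are placeholders.
    """
    # Map the first token of each span to its opening marker
    # and the last token of each span to its closing marker.
    opens = {subj_span[0]: f"<{subj_tag}>", obj_span[0]: f"<{obj_tag}>"}
    closes = {subj_span[1]: f"</{subj_tag}>", obj_span[1]: f"</{obj_tag}>"}

    marked = [
        opens.get(i, "") + token + closes.get(i, "")
        for i, token in enumerate(tokens)
    ]
    return " ".join(marked)


if __name__ == "__main__":
    sentence = ["Bill", "Gates", "founded", "Microsoft", "."]
    print(mark_arguments(sentence, (0, 1), (3, 3)))
```

Embedding the markers in the text lets the argument boundaries travel through a translation system along with the sentence, so they can be located again in the translated output.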

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.