Linguistic Data Consortium: 2024

Monday, December 16, 2024

LDC December 2024 Newsletter

LDC 2025 membership discounts now available

Approaching deadline for Spring 2025 data scholarship applications

LDC closed for Winter Break December 25-January 1

New publications:

MATERIAL Farsi-English Language Pack
Abstract Meaning Representation 3.0 - Machine Translations

LDC 2025 membership discounts now available
Now through March 3, 2025, current 2024 members receive a 10% discount for renewing their membership, and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for details on membership options and benefits.

Approaching deadline for Spring 2025 data scholarship applications
Attention students: don’t miss out on the chance to receive no-cost access to LDC data for your research. Applications for Spring 2025 data scholarships are due January 15, 2025. For more information on requirements and program rules, see LDC Data Scholarships.

LDC closed for Winter Break December 25-January 1
LDC will be closed from Wednesday, December 25, 2024, through Wednesday, January 1, 2025, in accordance with the University of Pennsylvania Winter Break Policy. Our offices will reopen on Thursday, January 2, 2025. Requests received by the Membership Office during Winter Break will be processed when the office reopens.

New publications:
MATERIAL Farsi-English Language Pack was developed by Appen for the IARPA MATERIAL program and contains 61 hours of Farsi conversational telephone speech, transcripts, English translations, annotations, and queries. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments. Transcripts cover approximately 30% of the speech files, and approximately 3% of the speech files were translated into English. This release also includes English queries and their relevance annotations.

The MATERIAL program focused on underserved languages, with the ultimate goal of building cross-language information retrieval systems that find speech and text content using English search queries.

2024 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.

Abstract Meaning Representation 3.0 - Machine Translations was developed by the Center for Computational Linguistics at KU Leuven in the HORIZON2020 project SignON. It is an automatic translation of a subset of sentences from Abstract Meaning Representation (AMR) Annotation Release 3.0 (LDC2020T02) into Spanish, Irish Gaelic, and Dutch.

AMR 3.0 training, development, and test splits were translated using Google Translate. "Unsplit" directories were not translated and are not included in this release. Translations were not manually verified, but formal issues (such as unexpected new lines) were corrected, and special tokens and encoding issues were fixed with the Python tool ftfy.fix_text.

AMR 3.0 is a semantic treebank of over 59,000 English natural language sentences drawn from material collected by LDC, specifically, discussion forum text from the DARPA BOLT and DARPA DEFT programs, transcripts and English translations of Mandarin Chinese broadcast news programming, Wall Street Journal text, translated Xinhua news texts, various newswire texts from NIST OpenMT evaluations, and weblog data from the DARPA GALE program.

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Sunday, November 17, 2024

LDC November 2024 Newsletter

Join LDC for membership year 2025

Spring 2025 data scholarship application deadline

New publications:

LORELEI Yoruba Representative Language Pack

Samrómur Synthetic

____________________________________________________________________

Join LDC for membership year 2025
It’s time to renew your LDC membership for 2025. Current (2024) members who renew their membership before March 3, 2025 will receive a 10% discount. New or returning organizations will receive a 5% discount if they join the Consortium by March 3.

In addition to receiving new publications, current LDC members enjoy the benefit of licensing older data from our Catalog of 950+ holdings at reduced fees. Current-year for-profit members may use most data for commercial applications.

Plans for next year’s publications are in progress. Among the expected releases are:

Iraqi Arabic – English Lexical Database: a set of six interrelated tables (roots, lemmas, wordforms, multi-word expressions, English definitions, example phrases) presenting each Iraqi Arabic word in Arabic script and IPA format, a result of LDC’s collaboration with Georgetown University Press to enhance and update three dialectal Arabic dictionaries

AIDA topic source data and annotations: multimodal source data and annotations in multiple languages (Russian, English, Spanish) for information and entity extraction

2015 NIST Language Recognition Evaluation Test Set: 164,000+ segments of conversational telephone speech and broadcast narrow band speech in six linguistic varieties (Arabic, Spanish, English, Chinese, Slavic, French) representing 20 languages, used in NIST’s 2015 language recognition evaluation

BOLT CALLFRIEND CALLHOME CTS audio, transcripts and translations: previously unpublished Chinese and Egyptian Arabic telephone conversations from the CALLFRIEND and CALLHOME collections, with transcripts and translations developed by LDC for the DARPA BOLT program

Chinese Sentence Pattern Structure Treebank: 5,000+ sentences from ancient and modern Chinese texts with syntactic annotation based on sentence constituent analysis, developed by Beijing Normal University and Peking University

IARPA MATERIAL language packs: conversational telephone speech, transcripts, English translations, annotations and queries in multiple languages (e.g., Georgian, Kazakh, Lithuanian)

LORELEI: representative and incident language packs containing monolingual text, bi-text, translations, annotations, supplemental resources and related tools in various languages (e.g., Hungarian, Hindi, Amharic, Somali)

For full descriptions of all LDC data sets, browse our Catalog. Visit Join LDC for details on membership, user accounts and payment.

Spring 2025 data scholarship application deadline
Applications for the Spring 2025 LDC data scholarship program, which provides university students with no-cost access to LDC data, are being accepted through January 15, 2025. Consult the LDC Data Scholarships page for more information about program rules and submission requirements.

New publications:

LORELEI Yoruba Representative Language Pack was developed by LDC and consists of approximately 7.2 million words of Yoruba monolingual text, 127,000 Yoruba words translated from English data, and 810,000 words of Yoruba-English parallel text. Approximately 77,000 words were annotated for named entities, over 25,000 words were annotated for full entity (including nominals and pronouns) and simple semantic annotation, and around 10,000 words were annotated for noun phrase chunking. Data was collected from discussion forums, news, reference material, social networks, and weblogs.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low-resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Samrómur Synthetic was developed by the Language and Voice Lab, Reykjavik University and contains 72 hours of Icelandic synthetic speech, transcripts and metadata. Source sentences were extracted from the Samrómur platform, comprised of texts and transcripts covering various genres. Text was processed through a text-to-speech system developed by Reykjavik University's Language and Voice Lab to generate speech files. Synthesized speech was created with 44 voices (22 male, 22 female) at five different speed rates for a total of 220 speakers and 62,700 utterances (with 285 sentences/speaker).

2024 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.

Tuesday, October 15, 2024

LDC October 2024 Newsletter

LDC/Penn receives US Dept of Education research grant

Membership year 2025 publication preview

Fall 2024 data scholarship recipients

New publications:

RST Continuity Corpus

MultiTACRED

__________________________________________________________________

LDC/Penn receives US Dept of Education research grant
LDC and Penn’s Graduate School of Education and Department of Computer and Information Science are part of a team that was recently awarded a $10 million grant from the US Department of Education to develop the Using Generative Artificial Intelligence for Reading R&D Center (U-GAIN Reading), which will explore using generative AI to improve elementary school reading instruction for English learners. Led by the education nonprofit Digital Promise, U-GAIN Reading will build on an existing research-based tutoring platform, Amira Learning, that is used by more than 1 million students each year. The LDC/Penn team will contribute expertise in computational linguistics, computer science, and learning analytics. An evaluation team at MDRC will measure learner outcomes both to improve the R&D and to benchmark its eventual impacts. Additional experts in the science of reading, ethics, and strategies for national impact will support the project’s work. Data developed in the project will be shared with the community through the LDC Catalog.

Membership year 2025 publication preview
The 2025 membership year is approaching and plans for next year’s publications are in progress. Among the expected releases are:

Iraqi Arabic – English Lexical Database: a set of six interrelated tables (roots, lemmas, wordforms, multi-word expressions, English definitions, example phrases) presenting each Iraqi Arabic word in Arabic script and IPA format, a result of LDC’s collaboration with Georgetown University Press to enhance and update three dialectal Arabic dictionaries
AIDA topic source data and annotations: multimodal source data and annotations in multiple languages (Russian, English, Spanish) for information and entity extraction
2015 NIST Language Recognition Evaluation Test Set: 164,000+ segments of conversational telephone speech and broadcast narrow band speech in six linguistic varieties (Arabic, Spanish, English, Chinese, Slavic, French) representing 20 languages, used in NIST’s 2015 language recognition evaluation
BOLT CALLFRIEND CALLHOME CTS Audio, Transcripts and Translations: previously unpublished Chinese and Egyptian Arabic telephone conversations from the CALLFRIEND and CALLHOME collections, with transcripts and translations developed by LDC for the DARPA BOLT program
Chinese Sentence Pattern Structure Treebank: 5,000+ sentences from ancient and modern Chinese texts with syntactic annotation based on sentence constituent analysis, developed by Beijing Normal University and Peking University
IARPA MATERIAL language packs: conversational telephone speech, transcripts, English translations, annotations and queries in multiple languages (e.g., Georgian, Kazakh, Lithuanian)
LORELEI: representative and incident language packs containing monolingual text, bi-text, translations, annotations, supplemental resources and related tools in various languages (e.g., Hungarian, Hindi, Amharic, Somali)

Check your inbox for more information about membership renewal.

Fall 2024 data scholarship recipients
Congratulations to the recipients of LDC's Fall 2024 data scholarships:

Yomma Gamaleldin: Alexandria University (Egypt): Master’s student, Computer and Systems Engineering Department. Yomma is awarded a copy of Qatari Corpus of Argumentative Writing LDC2022T04 for her work in Arabic automated essay scoring.

Ahrane Mahaganapathy: Jaffna University (Sri Lanka): Master’s student, Department of Computer Science. Ahrane is awarded copies of IARPA Babel Tamil Language Pack LDC2017S13 and Multi-Language Telephone Speech 2011 – South Asian LDC2017S14 for her work in Tamil speech-to-text systems.

Sivashanth Suthakar: Jaffna University (Sri Lanka): Master’s student, Department of Computer Science. Sivashanth is awarded copies of CAMIO Transcription Languages LDC2022T07 and LORELEI Tamil Representative Language Pack LDC2023T03 for his work in Tamil OCR systems.

Oshan Yalegama: University of Moratuwa (Sri Lanka): BSc, Electronic and Telecommunication Engineering. Oshan is awarded copies of CSR-I (WSJ0) Complete LDC93S6A and TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1 for his work in audio signal processing.

Samer Mohammed Yaseen: Sana’a University (Yemen): PhD candidate, Faculty of Computer and Information Technology. Samer is awarded a copy of Arabic Newswire Part 1 LDC2001T55 for his work in Arabic information retrieval.

New publications:

RST Continuity Corpus was developed at Åbo Akademi University and Humboldt-Universität zu Berlin and contains annotations for continuity dimensions added to RST Discourse Treebank (LDC2002T07). RST Discourse Treebank is a collection of English news texts from the Penn Treebank annotated for rhetorical relations under the RST (Rhetorical Structure Theory) framework. In RST Continuity Corpus, the relations are annotated for the seven continuity dimensions: time, space, reference, action, perspective, modality, and speech act. The relations are also annotated for polarity, order of segments, nuclearity, and context.

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

MultiTACRED was developed by the German Research Center for Artificial Intelligence (DFKI) Speech and Language Technology Lab and is a machine translation of TAC Relation Extraction Dataset (LDC2018T24) (TACRED) into twelve languages with projected entity annotations. TACRED is a large-scale relation extraction dataset containing 106,264 examples built over English newswire and web text used in the NIST TAC KBP English slot filling evaluations during the period 2009-2014. The training and evaluation data for the TAC KBP slot filling tasks was developed by the Linguistic Data Consortium.

TACRED training, development and test splits were translated into Arabic, Chinese, Finnish, French, German, Hindi, Hungarian, Japanese, Polish, Russian, Spanish, and Turkish using DeepL or Google Translate. The test split was back-translated into English to generate machine-translated English test data.

TACRED annotations are specified by token offsets. For translation, tokens were concatenated with white space, and the entity offsets were converted into XML-style markers to denote each argument.
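The offset-to-marker conversion described above can be illustrated with a minimal Python sketch. This is not the DFKI implementation: the function name `mark_arguments`, the `<H>`/`<T>` marker names, the inclusive-end span convention, and the example sentence are all assumptions chosen for illustration, and the two spans are assumed not to overlap.

```python
def mark_arguments(tokens, head_span, tail_span):
    """Join tokens with single spaces, wrapping two argument spans in XML-style markers.

    Spans are (start, end) token offsets with an inclusive end. That convention
    and the <H>/<T> marker names are assumptions made for this illustration.
    The two spans are assumed not to overlap.
    """
    opening = {head_span[0]: "<H>", tail_span[0]: "<T>"}
    closing = {head_span[1]: "</H>", tail_span[1]: "</T>"}
    pieces = []
    for index, token in enumerate(tokens):
        if index in opening:
            pieces.append(opening[index])  # marker goes before the first token of a span
        pieces.append(token)
        if index in closing:
            pieces.append(closing[index])  # marker goes after the last token of a span
    return " ".join(pieces)


# Hypothetical example sentence, not taken from TACRED.
tokens = ["Ada", "Lovelace", "worked", "with", "Charles", "Babbage", "."]
marked = mark_arguments(tokens, head_span=(0, 1), tail_span=(4, 5))
print(marked)
```

Keeping the markers inline means the argument boundaries travel with the sentence through a translation system, so they can be read back from the translated text rather than recomputed from the original token offsets.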

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Monday, September 16, 2024

LDC September 2024 Newsletter

LDC data and commercial technology development

New publications:

L2-KSU Native and Non-Native Arabic Speech

MATERIAL Somali-English Language Pack

_____________________________________________________________________

LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

L2-KSU Native and Non-Native Arabic Speech was developed by King Saud University (KSU) and contains approximately six hours of Modern Standard Arabic read speech from 80 subjects, along with transcripts and speaker metadata.

The speech data was collected in 2022 from 40 native and 40 non-native speakers. Native speakers were from Saudi Arabia, Egypt, and Palestine and provided audio recordings through the crowdsourcing platform Khamsat. Non-native speakers were Central and West African students enrolled in KSU's Arabic Linguistics Institute; they provided speech recordings on site. All subjects read a series of ten sentences, repeating each sentence multiple times.

2024 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.

MATERIAL Somali-English Language Pack was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) MATERIAL (Machine Translation for English Retrieval of Information in Any Language) program. It contains 80 hours of Somali conversational telephone speech, transcripts, English translations, annotations and queries.

Calls were made using different telephones (e.g., mobile, landline) from a variety of environments. Transcripts cover approximately 10% of the speech files, and approximately 4% of the speech files were translated into English. This release also includes domain annotations, English queries and their relevance annotations.

The MATERIAL program focused on underserved languages, with the ultimate goal of building cross-language information retrieval systems that find speech and text content using English search queries.

2024 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.