Sunday, November 17, 2024

LDC November 2024 Newsletter

Join LDC for membership year 2025  

Spring 2025 data scholarship application deadline  

New publications:

LORELEI Yoruba Representative Language Pack

Samrómur Synthetic

____________________________________________________________________

Join LDC for membership year 2025 
It’s time to renew your LDC membership for 2025. Current (2024) members who renew their membership before March 3, 2025 will receive a 10% discount. New or returning organizations will receive a 5% discount if they join the Consortium by March 3.

In addition to receiving new publications, current LDC members enjoy the benefit of licensing older data from our Catalog of 950+ holdings at reduced fees. Current-year for-profit members may use most data for commercial applications.

Plans for next year’s publications are in progress. Among the expected releases are:  

  • AIDA topic source data and annotations: multimodal source data and annotations in multiple languages (Russian, English, Spanish) for information and entity extraction 
  • 2015 NIST Language Recognition Evaluation Test Set: 164,000+ segments of conversational telephone speech and broadcast narrow band speech in six linguistic varieties (Arabic, Spanish, English, Chinese, Slavic, French) representing 20 languages, used in NIST’s 2015 language recognition evaluation 
  • BOLT CALLFRIEND CALLHOME CTS audio, transcripts and translations: previously unpublished Chinese and Egyptian Arabic telephone conversations from the CALLFRIEND and CALLHOME collections, with transcripts and translations developed by LDC for the DARPA BOLT program 
  • Chinese Sentence Pattern Structure Treebank: 5,000+ sentences from ancient and modern Chinese texts with syntactic annotation based on sentence constituent analysis, developed by Beijing Normal University and Peking University  
  • IARPA MATERIAL language packs: conversational telephone speech, transcripts, English translations, annotations and queries in multiple languages (e.g., Georgian, Kazakh, Lithuanian)
  • LORELEI: representative and incident language packs containing monolingual text, bi-text, translations, annotations, supplemental resources and related tools in various languages (e.g., Hungarian, Hindi, Amharic, Somali) 
For full descriptions of all LDC data sets, browse our Catalog. Visit Join LDC for details on membership, user accounts and payment.

Spring 2025 data scholarship application deadline
Applications are now being accepted through January 15, 2025 for the Spring 2025 LDC data scholarship program, which provides university students with no-cost access to LDC data. Consult the LDC Data Scholarships page for more information about program rules and submission requirements.
 
New publications:

LORELEI Yoruba Representative Language Pack was developed by LDC and is comprised of approximately 7.2 million words of Yoruba monolingual text, 127,000 Yoruba words translated from English data, and 810,000 words of Yoruba-English parallel text. Approximately 77,000 words were annotated for named entities, over 25,000 words were annotated for full entity (including nominals and pronouns) and simple semantic annotation, and around 10,000 words were annotated for noun phrase chunking. Data was collected from discussion forum, news, reference, social network, and weblogs.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.


Samrómur Synthetic was developed by the Language and Voice Lab, Reykjavik University and contains 72 hours of Icelandic synthetic speech, transcripts and metadata. Source sentences were extracted from the Samrómur platform, comprised of texts and transcripts covering various genres. Text was processed through a text-to-speech system developed by Reykjavik University's Language and Voice Lab to generate speech files. Synthesized speech was created with 44 voices (22 male, 22 female) at five different speed rates for a total of 220 speakers and 62,700 utterances (with 285 sentences/speaker).

2024 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.

Tuesday, October 15, 2024

LDC October 2024 Newsletter

LDC/Penn receives US Dept of Education research grant 

Membership year 2025 publication preview 

Fall 2024 data scholarship recipients 

New publications:

RST Continuity Corpus

MultiTACRED

__________________________________________________________________

LDC/Penn receives US Dept of Education research grant 
LDC and Penn’s Graduate School of Education and Department of Computer and Information Science are part of a team that was recently awarded a $10 million grant from the US Department of Education to develop the Using Generative Artificial Intelligence for Reading R&D Center (U-GAIN Reading) which will explore using generative AI to improve elementary school reading instruction for English learners. Led by the education nonprofit Digital Promise, U-GAIN Reading will build on an existing research-based tutoring platform, Amira Learning, that is used by more than 1 million students each year. The LDC/Penn team will contribute expertise in computational linguistics, computer science, and learning analytics. An evaluation team at MDRC will measure learner outcomes both to improve the R&D and to benchmark its eventual impacts. Additional experts in the science of reading, ethics, and strategies for national impact will support the project’s work. Data developed in the project will be shared with the community through the LDC Catalog.

Membership year 2025 publication preview 
The 2025 membership year is approaching and plans for next year’s publications are in progress. Among the expected releases are:  

Check your inbox for more information about membership renewal.

Fall 2024 data scholarship recipients 
Congratulations to the recipients of LDC's Fall 2024 data scholarships:

Yomma Gamaleldin: Alexandria University (Egypt): Master’s student, Computer and Systems Engineering Department. Yomma is awarded a copy of Qatari Corpus of Argumentative Writing LDC2022T04 for her work in Arabic automated essay scoring.

Ahrane Mahaganapathy: Jaffna University (Sri Lanka): Master’s student, Department of Computer Science. Ahrane is awarded copies of IARPA Babel Tamil Language Pack LDC2017S13 and Multi-Language Telephone Speech 2011 – South Asian LDC2017S14 for her work in Tamil speech-to-text systems.

Sivashanth Suthakar: Jaffna University (Sri Lanka): Master’s student, Department of Computer Science. Sivashanth is awarded copies of CAMIO Transcription Languages LDC2022T07 and LORELEI Tamil Representative Language Pack LDC2023T03 for his work in Tamil OCR systems.

Oshan Yalegama: University of Moratuwa (Sri Lanka): BSc, Electronic and Telecommunication Engineering. Oshan is awarded copies of CSR-I (WSJ0) Complete LDC93S6A and TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1 for his work in audio signal processing.

Samer Mohammed Yaseen: Sana’a University (Yemen): PhD candidate, Faculty of Computer and Information Technology. Samer is awarded a copy of Arabic Newswire Part 1 LDC2001T55 for his work in Arabic information retrieval. 

New publications:

RST Continuity Corpus was developed at Åbo Akademi University and Humboldt-Universität zu Berlin and contains annotations for continuity dimensions added to RST Discourse Treebank (LDC2002T07). RST Discourse Treebank is a collection of English news texts from the Penn Treebank annotated for rhetorical relations under the RST (Rhetorical Structure Theory) framework. In RST Continuity Corpus, the relations are annotated for the seven continuity dimensions: time, space, reference, action, perspective, modality, and speech act. The relations are also annotated for polarity, order of segments, nuclearity, and context.

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

MultiTACRED was developed by the German Research Center for Artificial Intelligence (DFKI) Speech and Language Technology Lab and is a machine translation of TAC Relation Extraction Dataset (LDC2018T24) (TACRED) into twelve languages with projected entity annotations. TACRED is a large-scale relation extraction dataset containing 106,264 examples built over English newswire and web text used in the NIST TAC KBP English slot filling evaluations during the period 2009-2014. The training and evaluation data for the TAC KBP slot filling tasks was developed by the Linguistic Data Consortium.

TACRED training, development and test splits were translated into Arabic, Chinese, Finnish, French, German, Hindi, Hungarian, Japanese, Polish, Russian, Spanish, and Turkish using DeepL or Google Translate. The test split was back-translated into English to generate machine-translated English test data.

TACRED annotations are specified by token offsets. For translation, tokens were concatenated with white space, and the entity offsets were converted into XML-style markers to denote the arguments.
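
The offset-to-marker conversion described above can be illustrated with a minimal sketch. This is not the corpus developers' code: the function name `mark_arguments`, the marker names `<H>` and `<T>`, and the example sentence are assumptions made for illustration, and the offsets are treated as inclusive start and end token indices.

```python
from typing import List, Tuple


def mark_arguments(tokens: List[str],
                   head: Tuple[int, int],
                   tail: Tuple[int, int]) -> str:
    """Join tokens with white space and wrap two argument spans in markers.

    `head` and `tail` are (start, end) token offsets, both inclusive.
    The marker names <H> and <T> are hypothetical placeholders.
    """
    out: List[str] = []
    for i, tok in enumerate(tokens):
        # Open a marker before the first token of a span.
        if i == head[0]:
            out.append("<H>")
        if i == tail[0]:
            out.append("<T>")
        out.append(tok)
        # Close the marker after the last token of a span.
        if i == head[1]:
            out.append("</H>")
        if i == tail[1]:
            out.append("</T>")
    return " ".join(out)


# Illustrative input, not taken from the corpus.
tokens = ["Bill", "Gates", "founded", "Microsoft", "."]
print(mark_arguments(tokens, head=(0, 1), tail=(3, 3)))
# <H> Bill Gates </H> founded <T> Microsoft </T> .
```

Because the markers travel with the text, a translation system can carry them into the target sentence, where the argument positions can be read back from the marker locations.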

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Monday, September 16, 2024

LDC September 2024 Newsletter

LDC data and commercial technology development


New publications:

L2-KSU Native and Non-Native Arabic Speech

MATERIAL Somali-English Language Pack

_____________________________________________________________________

LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications: 

L2-KSU Native and Non-Native Arabic Speech was developed by King Saud University (KSU) and contains approximately six hours of Modern Standard Arabic read speech from 80 subjects, along with transcripts and speaker metadata.

The speech data was collected in 2022 from 40 native and 40 non-native speakers. Native speakers were from Saudi Arabia, Egypt, and Palestine and provided audio recordings through the crowd sourcing platform Khamsat. Non-native speakers were Central and West African students enrolled in KSU's Arabic Linguistics Institute; they provided speech recordings on site. All subjects read a series of ten sentences, repeating each sentence multiple times.

2024 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.

*

MATERIAL Somali-English Language Pack was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) MATERIAL (Machine Translation for English Retrieval of Information in Any Language) program. It contains 80 hours of Somali conversational telephone speech, transcripts, English translations, annotations and queries.

Calls were made using different telephones (e.g., mobile, landline) from a variety of environments. Transcripts cover approximately 10% of the speech files, and approximately 4% of the speech files were translated into English. This release also includes domain annotations, English queries and their relevance annotations. 

The MATERIAL program focused on underserved languages with the ultimate goal to build cross language information retrieval systems to find speech and text content using English search queries.

2024 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.


Thursday, August 15, 2024

LDC August 2024 Newsletter

Fall 2024 LDC Data Scholarship Program  

New publications:

LORELEI Uyghur Incident Language Pack

Ravnursson Faroese Speech and Transcripts

_______________________________________________________________

Fall 2024 LDC Data Scholarship Program  

Student applications for the Fall 2024 LDC Data Scholarship program are being accepted now through September 15, 2024. This program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and letter of support from their advisor. For application requirements and program rules, visit the LDC Data Scholarships page.

New publications: 

LORELEI Uyghur Incident Language Pack was developed by LDC and is comprised of 28 million words of Uyghur monolingual text, 500,000 words of English monolingual text, 3.3 million words of parallel and comparable Uyghur-English text, and 200,000 words annotated for simple named entities and situation frames. It constitutes all of the text data, annotations, supplemental resources and related software tools for the Uyghur language that were used in the DARPA LORELEI / LoReHLT 2016 Evaluation.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. In the evaluation scenario, an unforeseen event triggered a need for humanitarian and logistical support in a region where the incident language had received little or no attention in NLP research. Evaluation participants provided NLP solutions, including information extraction and machine translation, with limited resources and limited development time.

Data was collected from news, social network, weblog, newsgroup, discussion forum, and reference material. Named entity annotation identified entities to be detected by systems for scoring purposes. Situation frame analysis was designed to extract basic information about needs and relevant issues for planning a disaster response effort.

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

*

Ravnursson Faroese Speech and Transcripts contains 109 hours of Faroese prompted speech from 433 speakers (249 female, 184 male), corresponding transcripts and speaker metadata. It is an extract from the Basic Language Resource Kit 1.0 (BLARK 1.0) developed by the Faroe Islands' Ravnur Project.
 
Speech data was collected in 2022. Speakers from all major dialect areas in the Faroe Islands in three age groups -- 15-35, 36-60, and 61+ years -- read texts that included a word list, a phrase list, closed vocabulary readings, and short texts. Recordings also contain spontaneous speech. Orthographic transcripts are included. 

2024 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data at no cost.
 

 

Monday, July 15, 2024

LDC July 2024 Newsletter

LDC at IC2S2

Fall 2024 LDC Data Scholarship Program 

New publications:

MATERIAL Bulgarian-English Language Pack

Dialogs Re-Enacted Across Languages

____________________________________________________________________

LDC at IC2S2 
LDC is delighted to be a bronze sponsor for the 10th International Conference on Computational Social Science (IC2S2) held this year on Penn’s campus July 17-20. The conference will feature research from around the world across a broad range of relevant fields to advance the many frontiers of computational social science. Be sure to visit LDC’s table during the poster sessions July 18 and 19 from 1:30-2:30 pm. 

Fall 2024 LDC Data Scholarship Program 
Student applications for the Fall 2024 LDC Data Scholarship program are being accepted now through September 15, 2024. This program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and letter of support from their advisor. For application requirements and program rules, visit the LDC Data Scholarships page.

New publications:

MATERIAL Bulgarian-English Language Pack was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) MATERIAL (Machine Translation for English Retrieval of Information in Any Language) program. It contains 80 hours of Bulgarian conversational telephone speech, transcripts, English translations, annotations and queries.

Calls were made using different telephones (e.g., mobile, landline) from a variety of environments. Transcripts cover approximately 40% of the speech files, and approximately 10% of the speech files were translated into English. This release also includes domain annotations, English queries and their relevance annotations. 

The MATERIAL program focused on underserved languages with the ultimate goal to build cross language information retrieval systems to find speech and text content using English search queries.

2024 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.

*

Dialogs Re-Enacted Across Languages was developed at the University of Texas at El Paso. It contains 17 hours of conversational speech in English and Spanish by 129 unique bilingual speakers, specifically, short fragments extracted from spontaneous conversations and close re-enactments in the other language by the original speakers, for 3,816 pairs of matching utterances. Data was collected in 2022-2023. Participants were recruited from among students at the University of Texas at El Paso; all were bilingual speakers of General American English and of Mexico-Texas Border Spanish.

Each speaker pair had a 10 minute conversation in one language. Various fragments from these conversations were chosen for re-enactment, and the original speakers produced equivalents in the other language. Each re-enactment was vetted for fidelity to the original and naturalness in the target language. Also included is metadata about conversations, participants, re-enactments and utterances.

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

 

 

Monday, June 17, 2024

LDC June 2024 Newsletter

LDC data and commercial technology development

New publications:

Diaspora Tibetan Speech

AIDA Scenario 2 Practice Topic Annotation

_________________________________________________________________

LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:
 
Diaspora Tibetan Speech was developed at Yale University. It contains 28 hours of Tibetan elicited speech by 73 speakers from the diaspora Tibetan community in Kathmandu, Nepal, along with transcripts, elicitation materials and speaker metadata.

Recordings were collected in 2016. All speakers were adults and varied in age as well as age of diaspora. A substantial number of speakers were born in Nepal. Each speaker contributed one recording comprising a series of elicitation tasks: some demographic information; a word list and numbers; some sentences in isolation; a scripted story; and free speech based on "frog story" type illustrations. Annotation and metadata formats include PDF and Word (some transcripts), Excel (some transcripts, speaker metadata) and Praat TextGrids (word and number lists).

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee. 

*

AIDA Scenario 2 Practice Topic Annotation was developed by LDC and is comprised of annotations for 29 English, Russian and Spanish documents (text, image and video) from AIDA Scenario 2 Practice Topic Source Data (LDC2024T04), specifically, the set of practice documents designated for annotation in Phase 2.

Annotations are presented as tab separated files in the following categories for each topic:

  • Mentions: single references in source data to a real-world entity or filler, event, or relation. 
  • Slots: pre-defined roles in an event or relation filled by an argument (entity mention).
  • Linking: entity mentions linked to entries in the knowledge base as a method of indicating the real-world entity to which an entity referred.

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

 

 

Thursday, May 16, 2024

LDC May 2024 Newsletter

LDC at LREC-COLING 2024

New publications:

Call My Net 1

Automatic Content Extraction for Portuguese

______________________________________________________________________

LDC at LREC-COLING 2024
LDC will be exhibiting at LREC-COLING 2024 hosted by the European Language Resources Association (ELRA) and the International Committee on Computational Linguistics (ICCL) May 20-25 in Turin, Italy. Stop by our table to learn more about recent developments at the Consortium and the latest publications. 

LDC staff members will also be presenting current work on topics including Spanless Event Annotation for Corpus-Wide Complex Event Understanding, Schema Learning Corpus: Data and Annotation Focused on Complex Events, and KoFREN: Comprehensive Korean Word Frequency Norms Derived from Large Scale Free Speech Corpora. 

LDC will post conference updates via social media. We look forward to seeing you in Italy!

New publications: 

Call My Net 1 was developed by LDC and contains 364 hours of conversational telephone speech in four languages (Tagalog, Cebuano, Cantonese and Mandarin) collected in 2015 from 221 native speakers located in the Philippines and China along with metadata and speaker demographic information. Recordings and data from this collection were used to support the NIST 2016 Speaker Recognition Evaluation.

Speakers made 10 telephone calls each to people within their existing social networks, using different handsets and under a variety of noise conditions. Speakers were connected through a robot operator to carry on casual conversations on topics of their choice. All recordings were manually audited to confirm language and speaker requirements. The documentation for this release includes metadata about phone type, noise conditions and call quality. Speaker demographic information on year of birth, sex and native language is also included.

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

*

Automatic Content Extraction for Portuguese was developed at INESC TEC - Instituto de Engenharia de Sistemas e Computadores, Tecnologia e Ciência and consists of automatic Brazilian Portuguese and European Portuguese translations of the English text and annotations in ACE 2005 Multilingual Training Corpus (LDC2006T06).

ACE 2005 Multilingual Training Corpus was developed by LDC to support the Automatic Content Extraction (ACE) program, specifically, by providing training data for the 2005 technology evaluation. It contains 1,800 files of mixed genre text in Arabic, English and Chinese annotated for entities, relations and events. The objective of the ACE program was to develop automatic content extraction technology to support automatic processing of human language in text form. Text genres included newswire, broadcast news, broadcast conversation, weblog, discussion forums, and conversational telephone speech.

For this translation, the English data was partitioned into training, development and test sets. The documents were split into sentences and each event mention was assigned to its sentence. Source sentences and their annotations were translated into Brazilian Portuguese using Google Translate and into European Portuguese using DeepL Translate. An alignment algorithm and a parallel corpus word aligner were used to handle mismatches between translated annotations and their translated sentences.
 
2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.