Linguistic Data Consortium: August 2024

Thursday, August 15, 2024

LDC August 2024 Newsletter

Fall 2024 LDC Data Scholarship Program

New publications:

LORELEI Uyghur Incident Language Pack

Ravnursson Faroese Speech and Transcripts

_______________________________________________________________

Fall 2024 LDC Data Scholarship Program

Student applications for the Fall 2024 LDC Data Scholarship program are being accepted now through September 15, 2024. This program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and a letter of support from their advisor. For application requirements and program rules, visit the LDC Data Scholarships page.

New publications:

LORELEI Uyghur Incident Language Pack was developed by LDC and comprises 28 million words of Uyghur monolingual text, 500,000 words of English monolingual text, 3.3 million words of parallel and comparable Uyghur-English text, and 200,000 words annotated for simple named entities and situation frames. It constitutes all of the text data, annotations, supplemental resources, and related software tools for the Uyghur language that were used in the DARPA LORELEI / LoReHLT 2016 Evaluation.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. In the evaluation scenario, an unforeseen event triggered a need for humanitarian and logistical support in a region where the incident language had received little or no attention in NLP research. Evaluation participants provided NLP solutions, including information extraction and machine translation, with limited resources and limited development time.

Data was collected from news, social networks, weblogs, newsgroups, discussion forums, and reference material. Named entity annotation identified entities to be detected by systems for scoring purposes. Situation frame analysis was designed to extract basic information about needs and relevant issues for planning a disaster response effort.

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Ravnursson Faroese Speech and Transcripts contains 109 hours of Faroese prompted speech from 433 speakers (249 female, 184 male), along with corresponding transcripts and speaker metadata. It is an extract from the Basic Language Resource Kit 1.0 (BLARK 1.0) developed by the Faroe Islands' Ravnur Project.

Speech data was collected in 2022. Speakers from all major dialect areas in the Faroe Islands in three age groups -- 15-35, 36-60, and 61+ years -- read texts that included a word list, a phrase list, closed vocabulary readings, and short texts. Recordings also contain spontaneous speech. Orthographic transcripts are included.

2024 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data at no cost.