Linguistic Data Consortium: July 2024

Monday, July 15, 2024

LDC July 2024 Newsletter

LDC at IC2S2

Fall 2024 LDC Data Scholarship Program

New publications:

MATERIAL Bulgarian-English Language Pack

Dialogs Re-Enacted Across Languages

____________________________________________________________________

LDC at IC2S2
LDC is delighted to be a bronze sponsor for the 10th International Conference on Computational Social Science (IC2S2) held this year on Penn’s campus July 17-20. The conference will feature research from around the world across a broad range of relevant fields to advance the many frontiers of computational social science. Be sure to visit LDC’s table during the poster sessions July 18 and 19 from 1:30-2:30 pm.

Fall 2024 LDC Data Scholarship Program
Student applications for the Fall 2024 LDC Data Scholarship program are being accepted now through September 15, 2024. This program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and letter of support from their advisor. For application requirements and program rules, visit the LDC Data Scholarships page.

New publications:

MATERIAL Bulgarian-English Language Pack was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) MATERIAL (Machine Translation for English Retrieval of Information in Any Language) program. It contains 80 hours of Bulgarian conversational telephone speech, transcripts, English translations, annotations and queries.

Calls were made using different telephones (e.g., mobile, landline) from a variety of environments. Transcripts cover approximately 40% of the speech files, and approximately 10% of the speech files were translated into English. This release also includes domain annotations, English queries and their relevance annotations.

The MATERIAL program focused on underserved languages, with the ultimate goal of building cross-language information retrieval systems that find speech and text content using English search queries.

2024 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.

Dialogs Re-Enacted Across Languages was developed at the University of Texas at El Paso. It contains 17 hours of conversational speech in English and Spanish from 129 unique bilingual speakers. The data consists of short fragments extracted from spontaneous conversations, paired with close re-enactments in the other language by the original speakers, for a total of 3,816 pairs of matching utterances. Data was collected in 2022-2023. Participants were recruited from among students at the University of Texas at El Paso; all were bilingual speakers of General American English and of Mexico-Texas Border Spanish.

Each speaker pair had a 10-minute conversation in one language. Various fragments from these conversations were chosen for re-enactment, and the original speakers produced equivalents in the other language. Each re-enactment was vetted for fidelity to the original and naturalness in the target language. Also included is metadata about conversations, participants, re-enactments and utterances.

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.