Linguistic Data Consortium: July 2023

Monday, July 17, 2023

LDC July 2023 Newsletter

LanguageArc featured in Babel magazine

Fall 2023 LDC Data Scholarship Program

New publications:

Mixer 7 Spanish Speech

LORELEI Thai Representative Language Pack

_________________________________________________________________

LanguageArc featured in Babel magazine

The May 2023 edition of Babel (The Language Magazine) features an article about LanguageArc (Language Analysis Research Community), LDC’s citizen science portal, and the diverse projects available there, which use a variety of novel incentives to supplement traditional methods of creating data resources. Consider LanguageArc for your next collection project. Note: a subscription is necessary to view the article.

Fall 2023 LDC Data Scholarship Program

Student applications for the Fall 2023 LDC Data Scholarship program are being accepted now through September 15, 2023. This program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and letter of support from their advisor. For application requirements and program rules, visit the LDC Data Scholarships page.

New publications:

Mixer 7 Spanish Speech was developed by LDC and contains 9,600 hours of audio recordings of interviews, transcript readings and conversational telephone speech involving 191 distinct native Spanish speakers. This material was collected by LDC in 2011-2012 as part of the Mixer project, and the recordings were used in the 2012 NIST SRE test set.

Recruited speakers were connected through a robot operator to carry on casual conversations on a pre-set topic lasting up to 10 minutes. Participants also visited LDC’s human subjects collection lab equipped with a 14-microphone array where they participated in interviews and transcript readings and conducted up to 3 telephone calls under varying conditions. Selected speaker metadata was also collected.

2023 members can access this corpus through their LDC accounts. This corpus is a members-only release and is not available for non-member licensing. Contact ldc@ldc.upenn.edu for information about membership.

LORELEI Thai Representative Language Pack comprises over 39 million words of Thai monolingual text, 2.85 million words of found Thai-English parallel text, and 141,000 Thai words translated from English data. Over 186,000 words were annotated for named entities, and more than 25,000 words were annotated for entity discovery and linking and for situation frames (identifying entities, needs and issues). Data was collected from discussion forums, news, reference material, social networks, and weblogs.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low-resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.