Linguistic Data Consortium: March 2023

Wednesday, March 15, 2023

LDC March 2023 Newsletter

LDC’s 30th anniversary year ends

LDC data and commercial technology development

New publications:

Mixer 3 Speech

LORELEI Tamil Representative Language Pack

________________________________________________________________

LDC’s 30th anniversary year ends

We hope you enjoyed the monthly data spotlights celebrating LDC’s 30th anniversary year, April 2022 through March 2023. We would not have achieved this milestone without the continued support and collaboration of our members, friends, and the community. We are grateful. As we enter our fourth decade, we pledge to continue to serve the community and our members by distributing high-quality, diverse data and by providing top-notch member services and research program support.

LDC data and commercial technology development

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

Mixer 3 Speech contains 3,200 hours of conversational telephone speech involving 3,875 speakers, 19,595 telephone recordings, and 26 distinct languages. This material was collected by LDC from 2005 to 2007 as part of the Mixer project, and recordings in this corpus were used in NIST Speaker Recognition Evaluation and NIST Language Recognition Evaluation corpora, including 2006 SRE and 2007 LRE.

Recordings were generated using LDC's computer telephony system. Recruited speakers were connected through a robot operator to carry on casual conversations lasting up to 10 minutes. Subjects fluent in languages other than English were asked to complete at least one non-English call. Metadata includes the number of calls per subject and language as well as speaker demographic information.

2023 members can access this corpus through their LDC accounts. This corpus is a members-only release and is not available for non-member licensing. Contact ldc@ldc.upenn.edu for information about membership.

LORELEI Tamil Representative Language Pack comprises over 41 million words of Tamil monolingual text, 680,000 words of found Tamil-English parallel text, and 226,000 Tamil words translated from English data. Approximately 78,000 words were annotated for named entities, and over 24,000 words were annotated for entity discovery and linking and situation frames (identifying entities, needs, and issues). Data was collected from discussion forums, news, reference, social network, and weblog sources.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low-resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.