Linguistic Data Consortium: May 2023

Monday, May 15, 2023

LDC May 2023 Newsletter

LDC at ICASSP 2023

New publications:

2019 NIST Speaker Recognition Evaluation Test Set – CTS Challenge

LORELEI Zulu Representative Language Pack
_____________________________________________________________
LDC at ICASSP 2023

LDC will be exhibiting at ICASSP 2023, held this year June 4-10 in Rhodes, Greece. Stop by booth 15 to learn more about recent developments at the Consortium and the latest publications.

LDC will post conference updates via Twitter and Facebook. We look forward to seeing you there!

New publications:

2019 NIST Speaker Recognition Evaluation Test Set – CTS Challenge, developed by LDC and NIST, contains 635 hours of Tunisian Arabic telephone recordings for development and test, along with answer keys, enrollment and trial files, and documentation from the CTS Challenge portion of the NIST-sponsored 2019 Speaker Recognition Evaluation. The 2019 evaluation was conducted in two parts: (1) a leaderboard-style challenge based on conversational telephone speech from LDC's Call My Net 2 (CMN2) corpus; and (2) a separate evaluation using audio-visual material collected by LDC for the VAST (Video Annotation for Speech Technology) project (released as LDC2023V01).

The telephone speech data for the CTS Challenge was drawn from the CMN2 collection, conducted by LDC in Tunisia, in which Tunisian Arabic speakers called friends or relatives who agreed to record their telephone conversations, each lasting 8-10 minutes. The speech segments include PSTN (public switched telephone network) and VOIP (voice over IP) data.

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

LORELEI Zulu Representative Language Pack comprises over 5 million words of Zulu monolingual text, 2.7 million words of found Zulu-English parallel text, and 71,000 Zulu words translated from English data. Approximately 100,000 words were annotated for named entities, and over 23,000 words were annotated for entity discovery and linking and for situation frames (identifying entities, needs and issues). Data was collected from discussion forums, news, reference material, social networks, and weblogs.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.