Renew your LDC membership today
30th Anniversary Highlight: CSR
New publications:
AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts
LORELEI Swahili Representative Language Pack
_______________________________________________________________________
Renew your LDC membership today
The importance of curated resources for language-related education, research and technology development drives LDC’s mission to create them, to accept data contributions from researchers across the globe and to broadly share such resources through the LDC Catalog. LDC members enjoy no-cost access to new corpora released annually, as well as the ability to license legacy data sets from among our 925+ holdings at reduced fees. Ensure that your data needs continue to be met by renewing your LDC membership or by joining the Consortium today.
Now through March 1, 2023, organizations that were members in 2022 receive a 10% discount on 2023 membership, and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for more details on membership options and benefits.
30th Anniversary Highlight: CSR
The CSR (continuous speech recognition) corpus series was developed in the early 1990s under DARPA’s Spoken Language Program to support research on large-vocabulary CSR systems.
CSR-I (WSJ0) Complete (LDC93S6A) and CSR-II (WSJ1) Complete (LDC94S13A) contain speech from a machine-readable corpus of Wall Street Journal news text. They also include journalists’ spontaneous dictation of hypothetical news articles, as well as transcripts.
WSJ0 Cambridge Read News (LDC95S24) was developed by Cambridge University and consists of native British English speakers reading CSR WSJ news text, specifically, sentences from the 5,000-word and 64,000-word subsets. All speakers also recorded a common set of 18 adaptation sentences.
The CSR corpora continue to have value for the research community. CSR-I (WSJ0) target utterances were used in the CHiME2 and CHiME3 challenges, which focused on distant-microphone automatic speech recognition in real-world environments. CHiME2 WSJ0 (LDC2017S10) and CHiME2 Grid (LDC2017S07) each contain over 120 hours of English speech from a noisy living room environment. CHiME3 (LDC2017S24) consists of 342 hours of English speech and transcripts from noisy environments and 50 hours of noisy environment audio.
CSR-I target utterances were also used in the Distant-Speech Interaction for Robust Home Applications (DIRHA) Project, which addressed natural spontaneous speech interaction with distant microphones in a domestic environment. DIRHA English WSJ Audio (LDC2018S01) comprises approximately 85 hours of real and simulated read speech from native American English speakers in an apartment setting with typical domestic background noises and inter/intra-room reverberation effects.
All CSR corpora and their related data sets are available for licensing by Consortium members and non-members. Visit Obtaining Data for more information.
New publications:
2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
*
LORELEI Swahili Representative Language Pack was developed by LDC and comprises approximately 4.3 million words of Swahili monolingual text, 90,000 Swahili words translated from English data, and 545,000 words of found Swahili-English parallel text. Approximately 100,000 words were annotated for named entities, and up to 26,000 words were annotated for entity discovery and linking and situation frames (identifying entities, needs and issues). Data was collected from discussion forums, news, reference material, social networks, and weblogs.
The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.
The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).
2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.