Wednesday, March 15, 2023

LDC March 2023 Newsletter

LDC’s 30th anniversary year ends 

LDC data and commercial technology development


New publications:

Mixer 3 Speech

LORELEI Tamil Representative Language Pack

________________________________________________________________

LDC’s 30th anniversary year ends   

We hope you enjoyed the monthly data spotlights in celebration of LDC’s 30th anniversary year, April 2022-March 2023. We would not have achieved this milestone without the continued support and collaboration of our members, friends, and the community. We are grateful. As we enter our fourth decade, we pledge to continue to serve the community and our members by distributing high quality, diverse data and by providing top-notch member services and research program support. 

LDC data and commercial technology development

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications: 

Mixer 3 Speech contains 3,200 hours of conversational telephone speech involving 3,875 speakers, 19,595 telephone recordings and 26 distinct languages. This material was collected by LDC from 2005-2007 as part of the Mixer project, and recordings in this corpus were used in NIST Speaker Recognition Evaluation and NIST Language Recognition Evaluation corpora, including 2006 SRE and 2007 LRE.

Recordings were generated using LDC's computer telephony system. Recruited speakers were connected through a robot operator to carry on casual conversations lasting up to 10 minutes. Subjects fluent in languages other than English were asked to complete at least one non-English call. Metadata includes the number of calls per subject and language as well as speaker demographic information.

2023 members can access this corpus through their LDC accounts. This corpus is a members-only release and is not available for non-member licensing. Contact ldc@ldc.upenn.edu for information about membership.

*

LORELEI Tamil Representative Language Pack is comprised of over 41 million words of Tamil monolingual text, 680,000 words of found Tamil-English parallel text, and 226,000 Tamil words translated from English data. Approximately 78,000 words were annotated for named entities and over 24,000 words were annotated for entity discovery and linking and situation frames (identifying entities, needs and issues). Data was collected from discussion forums, news, reference material, social networks, and weblogs.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Wednesday, February 15, 2023

LDC February 2023 Newsletter

LDC membership discounts expire March 1 

30th Anniversary Highlight: Arabic Treebank 

New publications:

2019 NIST Speaker Recognition Evaluation Test Set – Audio-Visual

LORELEI Tagalog Representative Language Pack

_________________________________________________________________________

LDC membership discounts expire March 1 

Time is running out to save on 2023 membership fees. Renew your LDC membership, rejoin the Consortium, or become a new member by March 1 to receive a discount of up to 10%. For more information on membership benefits and options, visit Join LDC.  

30th Anniversary Highlight: Arabic Treebank

The Penn/LDC Arabic Treebank (ATB) project began in 2001 with support from the DARPA TIDES program and, later, the DARPA GALE and BOLT programs. The original focus was on Modern Standard Arabic (MSA), which is not natively spoken and not homogeneously acquired across its writing and reading community. In addition to the expected issues associated with complex data annotation, LDC encountered several challenges unique to a highly inflected language with a rich history of traditional grammar. LDC relied on traditional Arabic grammar, as well as established and modern grammatical theories of MSA, in combination with the Penn Treebank approach to syntactic annotation, to design an annotation system for Arabic (Maamouri et al., 2004). LDC was innovative with respect to traditional grammar when necessary and when other syntactic approaches were found to account for the data. LDC also developed a wide-coverage MSA morphological analyzer, LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1 (LDC2010L01), which greatly benefited ATB development. Revisions to the annotation guidelines during the DARPA GALE program (principally related to tokenization and syntactic annotation) improved inter-annotator agreement and parsing scores.

ATB corpora were annotated for morphology, part-of-speech, gloss, and syntactic structure. Data sets based on MSA newswire developed under the revised annotation guidelines include Arabic Treebank: Part 1 v 4.1 (LDC2010T13), Arabic Treebank: Part 2 v 3.1 (LDC2011T09) and Arabic Treebank: Part 3 v 3.2 (LDC2010T08). Other genres are represented in Arabic Treebank – Broadcast News v 1.0 (LDC2012T07) and Arabic Treebank – Weblog (LDC2016T02).

LDC’s later work on Egyptian Arabic treebanks in the DARPA BOLT program benefited from the strides in its MSA treebank annotation pipeline. As for the challenges presented by informal, dialectal material, collaborator Columbia University provided a normalized Arabic orthography to account for instances of Romanized script (Arabizi) in the data and developed a morphological analyzer (CALIMA) in parallel, working in a tight feedback loop with LDC’s annotation team.  SAMA and CALIMA were synchronized in the Egyptian Arabic treebanks, the former used for MSA tokens and the latter used for Egyptian Arabic tokens. Resulting corpora include BOLT Egyptian Arabic Treebank – Discussion Forum (LDC2018T23), Conversational Telephone Speech (LDC2021T12), and SMS/Chat (LDC2021T17).

ATB corpora and their related releases are available for licensing to LDC members and nonmembers. For more information about licensing LDC data, visit Obtaining Data.

New publications:

2019 NIST Speaker Recognition Evaluation Test Set – Audio-Visual contains approximately 64 hours of English audio-visual data for development and test, answer keys, enrollment, trial files and documentation from the NIST-sponsored 2019 Speaker Recognition Evaluation (SRE).

The 2019 evaluation task was speaker detection, that is, to determine whether a specified target speaker was speaking during a segment of speech. The evaluation was conducted in two parts: (1) a leaderboard-style challenge based on conversational telephone speech and (2) a separate evaluation using audio-visual data. This release relates to the audio-visual evaluation. 

The source audio-visual data was collected by LDC for the VAST (Video Annotation for Speech Technology) project. That collection focused on amateur video recordings from various online media hosting services. The recordings vary in duration from 17.5 seconds to 13 minutes; most have two audio channels (stereo), but some are monophonic (one channel).

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

*

LORELEI Tagalog Representative Language Pack was developed by LDC and is comprised of approximately 4.8 million words of Tagalog monolingual text, 341,000 words of found Tagalog-English parallel text, and 124,000 Tagalog words translated from English data. Approximately 78,000 words were annotated for named entities and over 26,000 words were annotated for entity discovery and linking and situation frames (identifying entities, needs and issues). Data was collected from discussion forums, news, reference material, social networks, and weblogs.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Thursday, January 19, 2023

LDC January 2023 Newsletter

Renew your LDC membership today 

30th Anniversary Highlight: CSR  

New publications:

AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts

LORELEI Swahili Representative Language Pack

_______________________________________________________________________

Renew your LDC membership today

The importance of curated resources for language-related education, research and technology development drives LDC’s mission to create them, to accept data contributions from researchers across the globe and to broadly share such resources through the LDC Catalog. LDC members enjoy no-cost access to new corpora released annually, as well as the ability to license legacy data sets from among our 925+ holdings at reduced fees. Ensure that your data needs continue to be met by renewing your LDC membership or by joining the Consortium today.

Through March 1, 2023, organizations that were members in 2022 receive a 10% discount on 2023 membership, and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for more details on membership options and benefits.

30th Anniversary Highlight: CSR

The CSR (continuous speech recognition) corpus series was developed in the early 1990s under DARPA’s Spoken Language Program to support research on large-vocabulary CSR systems.

CSR-I (WSJ0) Complete (LDC93S6A) and CSR-II (WSJ1) Complete (LDC94S13A) contain speech from a machine-readable corpus of Wall Street Journal news text. They also include spontaneous dictation by journalists of hypothetical news articles as well as transcripts.

The text in CSR-I (WSJ0) was selected to fall within either a 5,000-word subset or a 20,000-word subset. Audio includes speaker-dependent and speaker-independent sections as well as sentences with verbalized and nonverbalized punctuation (Doddington, 1992). CSR-II features “Hub and Spoke” test sets that include a 5,000-word subset and a 64,000-word subset. Both data sets were collected using two microphones: a close-talking Sennheiser HMD414 and a second microphone of varying type.

WSJ0 Cambridge Read News (LDC95S24) was developed by Cambridge University and consists of native British English speakers reading CSR WSJ news text, specifically, sentences from the 5,000-word and 64,000-word subsets. All speakers also recorded a common set of 18 adaptation sentences.  

The CSR corpora continue to have value for the research community. CSR-I (WSJ0) target utterances were used in the CHiME2 and CHiME3 challenges, which focused on distant-microphone automatic speech recognition in real-world environments. CHiME2 WSJ0 (LDC2017S10) and CHiME2 Grid (LDC2017S07) each contain over 120 hours of English speech from a noisy living room environment. CHiME3 (LDC2017S24) consists of 342 hours of English speech and transcripts from noisy environments and 50 hours of noisy environment audio.

CSR-I target utterances were also used in the Distant-Speech Interaction for Robust Home Applications (DIRHA) Project, which addressed natural spontaneous speech interaction with distant microphones in a domestic environment. DIRHA English WSJ Audio (LDC2018S01) is comprised of approximately 85 hours of real and simulated read speech from native American English speakers in an apartment setting with typical domestic background noises and inter/intra-room reverberation effects.

Multi-Channel WSJ Audio (LDC2014S03), designed to address the challenges of speech recognition in meetings, contains 100 hours of audio from British English speakers reading sentences from WSJ0 Cambridge Read News. There were three recording scenarios: a single stationary speaker, two stationary overlapping speakers, and a single moving speaker.

All CSR corpora and their related data sets are available for licensing by Consortium members and non-members. Visit Obtaining Data for more information.

New publications:
 
AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts is comprised of approximately 156 hours of Ukrainian conversational telephone speech and broadcast news audio with 1.2 million words of corresponding orthographic transcripts.
 
The news audio data was taken from 87 recordings broadcast by various Ukrainian sources. The telephone speech was generated from telephone calls by native Ukrainian speakers to acquaintances in their social network. Native Ukrainian speakers manually segmented the data into sentence-level units as part of the transcription process.
 
The broadcast recordings and transcripts were produced by LDC to support the DARPA AIDA (Active Interpretation of Disparate Alternatives) program, which aimed to develop a multi-hypothesis semantic engine to generate explicit alternative interpretations of events, situations and trends from a variety of unstructured sources. The telephone speech audio recordings were collected by LDC to support the NIST 2011 Language Recognition Evaluation and are also contained in Multi-Language Conversational Telephone Speech 2011 – Slavic Group (LDC2016S11).

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

*

LORELEI Swahili Representative Language Pack was developed by LDC and is comprised of approximately 4.3 million words of Swahili monolingual text, 90,000 Swahili words translated from English data, and 545,000 words of found Swahili-English parallel text. Approximately 100,000 words were annotated for named entities and up to 26,000 words were annotated for entity discovery and linking and situation frames (identifying entities, needs and issues). Data was collected from discussion forums, news, reference material, social networks, and weblogs.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.