Linguistic Data Consortium

Thursday, June 15, 2023

LDC June 2023 Newsletter

LDC at ACL 2023

LDC data and commercial technology development

New publications:

Moroccan Arabic – English Lexical Database

LORELEI Indonesian Representative Language Pack

_________________________________________________________________

LDC at ACL 2023

LDC will be exhibiting at ACL 2023, held this year July 9-14 in Toronto, Canada. Stop by our booth to learn more about recent developments at the Consortium and the latest publications. LDC will post conference updates via Twitter and Facebook. We look forward to seeing you there!

LDC data and commercial technology development

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

Moroccan Arabic - English Lexical Database was developed by LDC. It contains a set of five interrelated tables presenting each Moroccan Arabic word as an orthographic form in Arabic script and a pronunciation form in International Phonetic Alphabet (IPA) format. This release contains over 21,000 Moroccan Arabic words in Arabic script and IPA notation and more than 33,000 English tokens.

This lexical database is the result of a collaboration with Georgetown University Press (GUP) to enhance and update three dialectal Arabic dictionaries -- Iraqi, Moroccan and Syrian -- originally published in paper form in the 1960s by GUP. LDC also undertook to develop a lexical database for each dialect. The Georgetown Dictionary of Moroccan Arabic was published in 2019; this work was based on, and expanded, A Dictionary of Moroccan Arabic.

The several enhancements developed by LDC included facilitating comparisons across Arabic dialects and Modern Standard Arabic by providing Arabic script spellings and IPA pronunciations to Moroccan words and phrases; promoting ease of use by language learners and researchers by developing reasonable orthographic conventions for applying the Arabic alphabet to the dialect; and facilitating a user's understanding of morphological and lexical relations by adding information on the linguistic structures of Moroccan Arabic.

2023 members can access this corpus through their LDC accounts provided they have submitted a signed copy of the special license agreement. Non-members may license this data for a fee.

LORELEI Indonesian Representative Language Pack is comprised of over 17 million words of Indonesian monolingual text, 950,000 words of found Indonesian-English parallel text, and 92,000 Indonesian words translated from English data. Over 113,000 words were annotated for named entities and more than 24,000 words were annotated for entity discovery and linking and situation frames (identifying entities, needs and issues). Data was collected from discussion forum, news, reference, social network, and weblogs.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Monday, May 15, 2023

LDC May 2023 Newsletter

LDC at ICASSP 2023

New publications:

2019 NIST Speaker Recognition Evaluation Test Set – CTS Challenge

LORELEI Zulu Representative Language Pack
_____________________________________________________________

LDC at ICASSP 2023

LDC will be exhibiting at ICASSP 2023, held this year June 4-10 in Rhodes, Greece. Stop by booth 15 to learn more about recent developments at the Consortium and the latest publications.

LDC will post conference updates via Twitter and Facebook. We look forward to seeing you there!

New publications:

2019 NIST Speaker Recognition Evaluation Test Set – CTS Challenge, developed by LDC and NIST, contains 635 hours of Tunisian Arabic telephone recordings for development and test, answer keys, enrollment, trial files and documentation from the CTS Challenge portion of the NIST-sponsored 2019 Speaker Recognition Evaluation. The 2019 evaluation was conducted in two parts: (1) a leaderboard-style challenge based on conversational telephone speech from LDC's Call My Net 2 (CMN2) corpus; and (2) a separate evaluation using audio-visual material collected by LDC for the VAST (Video Annotation for Speech Technology) project (released as LDC2023V01).

The telephone speech data for the CTS Challenge was drawn from the CMN2 collection conducted by LDC in Tunisia, in which Tunisian Arabic speakers called friends or relatives who agreed to record their telephone conversations, each lasting 8 to 10 minutes. The speech segments include PSTN (public switched telephone network) and VOIP (voice over IP) data.

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

LORELEI Zulu Representative Language Pack is comprised of over 5 million words of Zulu monolingual text, 2.7 million words of found Zulu-English parallel text, and 71,000 Zulu words translated from English data. Approximately 100,000 words were annotated for named entities and over 23,000 words were annotated for entity discovery and linking and situation frames (identifying entities, needs and issues). Data was collected from discussion forum, news, reference, social network, and weblogs.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Monday, April 17, 2023

LDC April 2023 Newsletter

In memoriam: Christopher Cieri 1963-2023

New publications:

Penn Korean Universal Dependency Treebank

DEFT English Light and Rich ERE Annotation

______________________________________________________________

In memoriam: Christopher Cieri 1963-2023

With deep sadness, LDC announces the passing of Christopher Cieri, our Executive Director. Chris led the Consortium for over 25 years, guiding its evolution from a small data repository and research hub to a prominent global data center.

An accomplished linguist and computer scientist and a well-read humanist, Chris embodied the best qualities for executing the wide range of duties demanded by his leadership role. He was a valued colleague and friend and will be sorely missed.

All are welcome to visit our remembrance page for Chris.

New publications:

Penn Korean Universal Dependency Treebank contains 5010 sentences and 132,041 tokens annotated in dependency format under the Universal Dependencies framework. It is a conversion of Korean Treebank Annotations Version 2.0 (LDC2006T09) which was produced in constituency format.

The source text is newswire stories from LDC’s Korean Press Agency collection contained in Korean Newswire (LDC2000T45). Sentences were automatically converted for dependency annotation; the output was manually checked. The corpus contains 112 files in CoNLL-U format, the Universal Dependencies standard, with a mapping to their counterpart in LDC2006T09.
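For readers who have not worked with CoNLL-U before, the following is a minimal Python sketch of reading the format. It assumes only the standard Universal Dependencies layout (ten tab-separated columns per token line, `#` comment lines, and a blank line between sentences). The function names and the English sample sentence are hypothetical illustrations and are not taken from the treebank or from any LDC tooling.

```python
from typing import Dict, List

# The ten CoNLL-U columns, in the order defined by Universal Dependencies.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (e.g. "# text = ...") are skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        current.append(dict(zip(FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


def count_words(sentences: List[List[Dict[str, str]]]) -> int:
    """Count syntactic words, skipping multiword ranges (1-2) and empty nodes (1.1)."""
    return sum(1 for sent in sentences for tok in sent if tok["id"].isdigit())


# Hypothetical sample for illustration only; not drawn from the corpus.
SAMPLE = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    sents = parse_conllu(SAMPLE)
    print(len(sents), "sentence(s),", count_words(sents), "word(s)")
```

A sketch like this could be pointed at a licensed file to check sentence and token totals against the figures in the corpus documentation, though the exact counting conventions used for the published totals are not stated here.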

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

DEFT English Light and Rich ERE Annotation was developed by LDC and consists of 1190 English discussion forum, newswire and proxy documents annotated for entities, relations, and events (ERE). Light ERE annotation labels entity mentions for the target set of entity, relation, and event types between and among those entities, including coreference. Rich ERE annotation expands types and tagging in the entities, relations, and events annotation tasks and replaces strict event coreference with a more loosely defined event hopper annotation.

902 documents were annotated following Light ERE annotation guidelines. 288 documents were labeled with Rich ERE annotation in a second pass after being annotated for Light ERE. The source data consists of English discussion forum web text collected by LDC for the DARPA BOLT program and contained in BOLT English Discussion Forums (LDC2017T11); newswire documents published in various data sets released in the TAC KBP project (Text Analysis Conference Knowledge Base Population); and proxy documents intended to mimic government analysis reports of newswire content published in DEFT Narrative Text (LDC2016T07).

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Wednesday, March 15, 2023

LDC March 2023 Newsletter

LDC’s 30th anniversary year ends

LDC data and commercial technology development

New publications:

Mixer 3 Speech

LORELEI Tamil Representative Language Pack

________________________________________________________________

LDC’s 30th anniversary year ends

We hope you enjoyed the monthly data spotlights in celebration of LDC’s 30th anniversary year, April 2022-March 2023. We would not have achieved this milestone without the continued support and collaboration of our members, friends, and the community. We are grateful. As we enter our fourth decade, we pledge to continue to serve the community and our members by distributing high quality, diverse data and by providing top-notch member services and research program support.

LDC data and commercial technology development

New publications:

Mixer 3 Speech contains 3,200 hours of conversational telephone speech involving 3,875 speakers, 19,595 telephone recordings and 26 distinct languages. This material was collected by LDC from 2005 to 2007 as part of the Mixer project, and recordings in this corpus were used in NIST Speaker Recognition Evaluation and NIST Language Recognition Evaluation corpora, including 2006 SRE and 2007 LRE.

Recordings were generated using LDC's computer telephony system. Recruited speakers were connected through a robot operator to carry on casual conversations lasting up to 10 minutes. Subjects fluent in languages other than English were asked to complete at least one non-English call. Metadata includes the number of calls per subject and language as well as speaker demographic information.

2023 members can access this corpus through their LDC accounts. This corpus is a members-only release and is not available for non-member licensing. Contact ldc@ldc.upenn.edu for information about membership.

LORELEI Tamil Representative Language Pack is comprised of over 41 million words of Tamil monolingual text, 680,000 words of found Tamil-English parallel text, and 226,000 Tamil words translated from English data. Approximately 78,000 words were annotated for named entities and over 24,000 words were annotated for entity discovery and linking and situation frames (identifying entities, needs and issues). Data was collected from discussion forum, news, reference, social network, and weblogs.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Wednesday, February 15, 2023

LDC February 2023 Newsletter

LDC membership discounts expire March 1

30th Anniversary Highlight: Arabic Treebank

New publications:

2019 NIST Speaker Recognition Evaluation Test Set – Audio-Visual

LORELEI Tagalog Representative Language Pack

_________________________________________________________________________

LDC membership discounts expire March 1

Time is running out to save on 2023 membership fees. Renew your LDC membership, rejoin the Consortium, or become a new member by March 1 to receive a discount of up to 10%. For more information on membership benefits and options, visit Join LDC.

30th Anniversary Highlight: Arabic Treebank

The Penn/LDC Arabic Treebank (ATB) project began in 2001 with support from the DARPA TIDES program and later, the DARPA GALE and BOLT programs. The original focus was on Modern Standard Arabic (MSA), not natively spoken and not homogeneously acquired across its writing and reading community. In addition to the expected issues associated with complex data annotation, LDC encountered several challenges unique to a highly inflected language with a rich history of traditional grammar. LDC relied on traditional Arabic grammar, as well as established and modern grammatical theories of MSA -- in combination with the Penn Treebank approach to syntactic annotation -- to design an annotation system for Arabic (Maamouri et al., 2004). LDC was innovative with respect to traditional grammar when necessary and when other syntactic approaches were found to account for the data. LDC also developed a wide-coverage MSA morphological analyzer, LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1 (LDC2010L01), which greatly benefited ATB development. Revisions to the annotation guidelines during the DARPA GALE program (principally related to tokenization and syntactic annotation) improved inter-annotator agreement and parsing scores.

ATB corpora were annotated for morphology, part-of-speech, gloss, and syntactic structure. Data sets based on MSA newswire developed under the revised annotation guidelines include Arabic Treebank: Part 1 v 4.1 (LDC2010T13), Arabic Treebank: Part 2 v 3.1 (LDC2011T09) and Arabic Treebank: Part 3 v 3.2 (LDC2010T08). Other genres are represented in Arabic Treebank – Broadcast News v 1.0 (LDC2012T07) and Arabic Treebank – Weblog (LDC2016T02).

LDC’s later work on Egyptian Arabic treebanks in the DARPA BOLT program benefited from the strides in its MSA treebank annotation pipeline. As for the challenges presented by informal, dialectal material, collaborator Columbia University provided a normalized Arabic orthography to account for instances of Romanized script (Arabizi) in the data and developed a morphological analyzer (CALIMA) in parallel, working in a tight feedback loop with LDC’s annotation team. SAMA and CALIMA were synchronized in the Egyptian Arabic treebanks, the former used for MSA tokens and the latter used for Egyptian Arabic tokens. Resulting corpora include BOLT Egyptian Arabic Treebank – Discussion Forum (LDC2018T23), Conversational Telephone Speech (LDC2021T12), and SMS/Chat (LDC2021T17).

ATB corpora and their related releases are available for licensing to LDC members and nonmembers. For more information about licensing LDC data, visit Obtaining Data.

New publications:

2019 NIST Speaker Recognition Evaluation Test Set – Audio-Visual contains approximately 64 hours of English audio-visual data for development and test, answer keys, enrollment, trial files and documentation from the NIST-sponsored 2019 Speaker Recognition Evaluation (SRE).

The 2019 evaluation task was speaker detection, that is, to determine whether a specified target speaker was speaking during a segment of speech. The evaluation was conducted in two parts: (1) a leaderboard-style challenge based on conversational telephone speech and (2) a separate evaluation using audio-visual data. This release relates to the audio-visual evaluation.

The source audio-visual data was collected by LDC for the VAST (Video Annotation for Speech Technology) project. That collection focused on amateur video recordings from various online media hosting services. The recordings vary in duration from 17.5 seconds to 13 minutes; most have two audio channels (stereo), but some are monophonic (one channel).

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

LORELEI Tagalog Representative Language Pack was developed by LDC and is comprised of approximately 4.8 million words of Tagalog monolingual text, 341,000 words of found Tagalog-English parallel text, and 124,000 Tagalog words translated from English data. Approximately 78,000 words were annotated for named entities and over 26,000 words were annotated for entity discovery and linking and situation frames (identifying entities, needs and issues). Data was collected from discussion forum, news, reference, social network, and weblogs.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Thursday, January 19, 2023

LDC January 2023 Newsletter

Renew your LDC membership today

30th Anniversary Highlight: CSR

New publications:

AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts

LORELEI Swahili Representative Language Pack

_______________________________________________________________________

Renew your LDC membership today

The importance of curated resources for language-related education, research and technology development drives LDC’s mission to create them, to accept data contributions from researchers across the globe and to broadly share such resources through the LDC Catalog. LDC members enjoy no-cost access to new corpora released annually, as well as the ability to license legacy data sets from among our 925+ holdings at reduced fees. Ensure that your data needs continue to be met by renewing your LDC membership or by joining the Consortium today.

Now through March 1, 2023, 2022 members receive a 10% discount on 2023 membership, and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for more details on membership options and benefits.

30th Anniversary Highlight: CSR

The CSR (continuous speech recognition) corpus series was developed in the early 1990s under DARPA’s Spoken Language Program to support research on large-vocabulary CSR systems.

CSR-I (WSJ0) Complete (LDC93S6A) and CSR-II (WSJ1) Complete (LDC94S13A) contain speech from a machine-readable corpus of Wall Street Journal news text. They also include spontaneous dictation by journalists of hypothetical news articles as well as transcripts.

The text in CSR-I (WSJ0) was selected to fall within either a 5,000-word subset or a 20,000-word subset. Audio includes speaker-dependent and speaker-independent sections as well as sentences with verbalized and nonverbalized punctuation (Doddington, 1992). CSR-II features “Hub and Spoke” test sets that include a 5,000-word subset and a 64,000-word subset. Both data sets were collected using two microphones – a close-talking Sennheiser HMD414 and a second microphone of varying type.

WSJ0 Cambridge Read News (LDC95S24) was developed by Cambridge University and consists of native British English speakers reading CSR WSJ news text, specifically, sentences from the 5,000-word and 64,000-word subsets. All speakers also recorded a common set of 18 adaptation sentences.

The CSR corpora continue to have value for the research community. CSR-I (WSJ0) target utterances were used in the CHiME2 and CHiME3 challenges which focused on distant-microphone automatic speech recognition in real-world environments. CHiME2 WSJ0 (LDC2017S10) and CHiME2 Grid (LDC2017S07) each contain over 120 hours of English speech from a noisy living room environment. CHiME3 (LDC2017S24) consists of 342 hours of English speech and transcripts from noisy environments and 50 hours of noisy environment audio.

CSR-I target utterances were also used in the Distant-Speech Interaction for Robust Home Applications (DIRHA) Project which addressed natural spontaneous speech interaction with distant microphones in a domestic environment. DIRHA English WSJ Audio (LDC2018S01) is comprised of approximately 85 hours of real and simulated read speech from native American English speakers in an apartment setting with typical domestic background noises and inter/intra-room reverberation effects.

Multi-Channel WSJ Audio (LDC2014S03), designed to address the challenges of speech recognition in meetings, contains 100 hours of audio from British English speakers reading sentences from WSJ0 Cambridge Read News. There were three recording scenarios: a single stationary speaker, two stationary overlapping speakers, and one single moving speaker.

All CSR corpora and their related data sets are available for licensing by Consortium members and non-members. Visit Obtaining Data for more information.

New publications:

AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts is comprised of approximately 156 hours of Ukrainian conversational telephone speech and broadcast news audio with 1.2 million words of corresponding orthographic transcripts.

The news audio data was taken from 87 recordings broadcast by various Ukrainian sources. The telephone speech was generated from telephone calls by native Ukrainian speakers to acquaintances in their social network. Native Ukrainian speakers manually segmented the data into sentence-level units as part of the transcription process.

The broadcast recordings and transcripts were produced by LDC to support the DARPA AIDA (Active Interpretation of Disparate Alternatives) program, which aimed to develop a multi-hypothesis semantic engine to generate explicit alternative interpretations of events, situations and trends from a variety of unstructured sources. The telephone speech audio recordings were collected by LDC to support the NIST 2011 Language Recognition Evaluation and are also contained in Multi-Language Conversational Telephone Speech 2011 – Slavic Group (LDC2016S11).

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

LORELEI Swahili Representative Language Pack was developed by LDC and is comprised of approximately 4.3 million words of Swahili monolingual text, 90,000 Swahili words translated from English data, and 545,000 words of found Swahili-English parallel text. Approximately 100,000 words were annotated for named entities and up to 26,000 words were annotated for entity discovery and linking and situation frames (identifying entities, needs and issues). Data was collected from discussion forum, news, reference, social network, and weblogs.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Thursday, December 15, 2022

LDC December 2022 Newsletter

LDC 2023 membership discounts now available

Approaching deadline for Spring 2023 data scholarship applications

30th Anniversary Highlight: AMR

New publications:

CAMIO Transcription Languages

Global TIMIT Thai

Third DIHARD Challenge Evaluation

________________________________________________________________

LDC 2023 membership discounts now available

Now through March 1, 2023, current 2022 members receive a 10% discount for renewing their membership, and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for details on membership options and benefits.

Approaching deadline for Spring 2023 data scholarship applications

Attention students: don’t miss out on the chance to receive no-cost access to LDC data for your research. Applications for Spring 2023 data scholarships are due January 15, 2023. For more information on requirements and program rules, see LDC Data Scholarships.

30th Anniversary Highlight: AMR

Abstract Meaning Representation (AMR) annotation was developed by LDC, SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It is a semantic representation language that captures "who is doing what to whom" in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree-structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independent of its syntax.

LDC’s Catalog contains three cumulative English AMR publications: Release 1.0 (LDC2014T12), Release 2.0 (LDC2017T10), and Release 3.0 (LDC2020T02). The combined result in AMR 3.0 is a semantic treebank of roughly 59,255 English natural language sentences from broadcast conversations, newswire, weblogs, web discussion forums, fiction and web text and includes multi-sentence annotations.

LDC has also published Chinese Abstract Meaning Representation 1.0 (LDC2019T07) and 2.0 (LDC2021T13) developed by Brandeis University and Nanjing Normal University. These corpora contain AMR annotations for approximately 20,000 sentences from Chinese Treebank 8.0 (LDC2013T21). Chinese AMR follows the basic principles developed for English, making adaptations where necessary to accommodate Chinese phenomena.

Abstract Meaning Representation 2.0 - Four Translations (LDC2020T07), developed by the University of Edinburgh, School of Informatics, consists of Spanish, German, Italian and Chinese Mandarin translations of a subset of sentences from AMR 2.0.

Visit LDC’s Catalog for more details about these publications.

New publications:

CAMIO Transcription Languages was developed by LDC and contains nearly 70,000 images of machine printed text with corresponding annotations and transcripts in 13 languages: Arabic, Chinese, English, Farsi, Hindi, Japanese, Kannada, Korean, Russian, Tamil, Thai, Urdu, and Vietnamese. This corpus is a subset of data created for a broader effort to support the development and evaluation of optical character recognition and related technologies for 35 languages across 24 unique script types.

Most images were annotated for text localization, resulting in over 2.3M line-level bounding boxes; 1250 images per language were also annotated with orthographic transcriptions of each line plus specification of reading order, yielding over 2.4M tokens of transcribed text. The resulting annotations are represented in an XML output format defined for this corpus. Data for each language is partitioned into test, train or validation sets.

2022 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Global TIMIT Thai consists of 12 hours of read speech and time-aligned transcripts in Standard Thai from 50 speakers (33 female, 17 male) reading 120 sentences selected from the Thai National Corpus, the Thai Junior Encyclopedia, and Thai Wikipedia, for a total of 6000 utterances. Data was collected in 2016. Speakers were recruited in the Bangkok metropolitan area; they were native Thais, fluent in Standard Thai, and literate.

This data set was developed as part of LDC’s Global TIMIT project which aims to create a series of corpora in a variety of languages with a similar set of key features as in the original TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1) which was designed for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems.

2022 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Third DIHARD Challenge Evaluation was developed by LDC and contains 33 hours of English and Chinese speech data along with corresponding annotations used in support of the Third DIHARD Challenge.

The Third DIHARD development and evaluation sets were drawn from diverse sources including monologues, map task dialogues, broadcast interviews, sociolinguistic interviews, meeting speech, speech in restaurants, clinical recordings, and amateur web videos. Annotations include diarization and segmentation.

2022 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.