Thursday, September 15, 2022

LDC September 2022 Newsletter

Upcoming Policy Change to LDC’s Open Memberships

LDC at Interspeech 2022

LanguageARC: Citizen Science for Language

30th Anniversary Highlight: Switchboard 

New publications:
Xi’an Guanzhong Object Naming
MASRI Synthetic
_____________________________________________________________

Upcoming Policy Change to LDC’s Open Memberships

LDC is changing its open membership year policy beginning January 1, 2023.  Only one membership year will be open for joining – the current membership year. The 2022 membership year will close for joining on December 31, 2022. We expect this change to have a minimal impact on members, while allowing us to streamline our processes to serve members better. LDC’s many membership benefits will remain the same and organizations choosing to join membership years in advance will still be able to do so. If you have any questions about this change, please don’t hesitate to contact our membership office.

LDC at Interspeech 2022
 
LDC is proud to sponsor the Workshop for Young Female Researchers in Speech (YFRSW) to be held in person as an Interspeech 2022 pre-conference satellite event on September 17. Also, be sure to check out the collaborative work of LDC’s Mark Liberman, “The mapping between syntactic and prosodic phrasing in English and Mandarin”, presented during the On-Site Oral Session: Phonetics and Phonology on Wednesday, September 21, 13:30-15:30 KST.

LanguageARC: Citizen Science for Language

LanguageARC is a citizen science web portal for language research developed by LDC with the support of the National Science Foundation (grant #1730377). 

LanguageARC brings together researchers and participants from the general public interested in language to form a community dedicated to supporting and advancing language-related research and development. Contributors to this online community can participate in a variety of language-related tasks and activities such as reading text, answering questions, describing images or video, creating or evaluating transcriptions for audio clips or developing translations into their native languages. LanguageARC includes projects in languages other than English, such as French, Sesotho and Swedish. Xi’an Guanzhong Object Naming LDC2022S09, released this month in LDC’s Catalog and described below, is an example of a data set developed using LanguageARC. New projects will be added on an ongoing basis.
 
Sign up for a LanguageARC account today to start making real contributions to language knowledge and research. Please share this information with colleagues, students and anyone who might be interested in participating in the language activities on this website. If you are a researcher interested in creating a project on LanguageARC, please reach out on the site’s Contact page.
 
Find LanguageARC on Facebook at: https://www.facebook.com/languagearc

30th Anniversary Highlight: Switchboard 

Switchboard-1 Release 2 (LDC97S62) is considered the first large collection of spontaneous conversational telephone speech (Graff & Bird, 2000). It consists of approximately 260 hours of recordings collected by Texas Instruments in 1990-1991 (Godfrey et al., 1992). The first release of the corpus (later superseded) was published by NIST and distributed by LDC in 1993.

Participants were 543 speakers (302 male, 241 female) from across the United States who accounted for around 2,400 two-sided telephone conversations. A robot operator handled the calls, giving the caller appropriate recorded prompts, selecting and dialing another person (the callee) to take part in a conversation, introducing a topic for discussion and recording the speech from the two subjects into separate channels until the conversation was finished. Roughly 70 topics were provided, of which about 50 were used frequently. Selection of topics and callees was constrained so that: (1) no two speakers would converse together more than once and (2) no one spoke more than once on a given topic. 
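The two selection constraints can be illustrated with a minimal Python sketch. This is not the original Texas Instruments robot operator software; the function names, data structures and the simple first-match strategy are all assumptions made for illustration only.

```python
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

# Hypothetical bookkeeping structures (not from the original system):
#   past_pairs:  set of unordered speaker pairs that have already conversed
#   past_topics: maps each speaker to the topics they have already discussed
PastPairs = Set[FrozenSet[str]]
PastTopics = Dict[str, Set[str]]


def pick_callee_and_topic(
    caller: str,
    available: List[str],
    topics: List[str],
    past_pairs: PastPairs,
    past_topics: PastTopics,
) -> Optional[Tuple[str, str]]:
    """Return the first (callee, topic) that satisfies both constraints, or None."""
    for callee in available:
        if callee == caller:
            continue
        # Constraint 1: no two speakers converse together more than once.
        if frozenset((caller, callee)) in past_pairs:
            continue
        for topic in topics:
            # Constraint 2: no one speaks more than once on a given topic.
            if topic in past_topics.get(caller, set()):
                continue
            if topic in past_topics.get(callee, set()):
                continue
            return callee, topic
    return None


def record_call(
    caller: str,
    callee: str,
    topic: str,
    past_pairs: PastPairs,
    past_topics: PastTopics,
) -> None:
    """Update the bookkeeping after a conversation has taken place."""
    past_pairs.add(frozenset((caller, callee)))
    past_topics.setdefault(caller, set()).add(topic)
    past_topics.setdefault(callee, set()).add(topic)
```

In this sketch, a caller who has already spoken with every available callee, or who shares no unused topic with any of them, gets no match (`None`). How the actual system handled that case is not described here.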

This gold standard data set has been used for many HLT applications, including speaker identification, speaker authentication, and speech recognition. It is considered one of the most important benchmarks for recognition tasks involving large vocabulary conversational speech (Deshmukh et al., 1998) as well as a key resource for studying the phonetic properties of spontaneous speech (Greenberg et al., 1996). Annotation tasks based on Switchboard include discourse tags/speech acts, part-of-speech tagging and parsing, and sentiment analysis.  

The Switchboard series includes Switchboard Credit Card, Phase II, Phase III, the Switchboard Cellular collection, and new recordings from 18 Switchboard participants in the 2013 Greybeard corpus.

All Switchboard corpora are available in the Catalog for licensing by Consortium members and non-members. Visit Obtaining Data for more information.

New publications:

Xi’an Guanzhong Object Naming is comprised of 15 hours of audio recordings from speakers of the Guanzhong dialect of Mandarin Chinese living in or near Xi’an in Shaanxi Province (China) naming objects that appeared in colored line drawings. The corpus was developed to support traditional and computer-aided language documentation.
 
Data was collected from a closed volunteer community between February and May 2021 using LanguageARC, a citizen science portal developed by LDC. Speakers were presented with images selected from the MultiPic dataset and were asked to record themselves naming the objects in the images.
 
Xi’an Guanzhong Object Naming is distributed via web download. 

2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

MASRI (Maltese Automatic Speech Recognition I) Synthetic was developed by the MASRI team at the University of Malta and contains 99 hours of synthesized Maltese speech.

Source sentences were extracted from the Maltese Language Resource Server (MLRS) corpus, comprised of written or transcribed Maltese covering various genres, including parliamentary debates, news, law, opinion, sports, culture, academic, literature and religious texts. Text was processed through the CrimsonWing text-to-speech system to generate speech files. Synthesized speech was created with 210 voices (105 female, 105 male).

MASRI Synthetic is distributed via web download. 

2022 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.


Thursday, August 18, 2022

LDC August 2022 Newsletter

Fall 2022 LDC Data Scholarship Program

30th Anniversary Highlight: The LDC Gigawords 

New publication:

HAVIC MED Novel 2 Test – Videos, Metadata and Annotation

 

 

Fall 2022 LDC Data Scholarship Program 

Student applications for the Fall 2022 LDC Data Scholarship program are being accepted now through September 15, 2022. This program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and letter of support from their advisor. For application requirements and program rules, visit the LDC Data Scholarships page.

30th Anniversary Highlight: The LDC Gigawords 

Giga: a combining form meaning “billion,” used in the formation of compound words (Source: https://www.dictionary.com/browse/giga-)

LDC’s Gigaword corpora are a natural outgrowth of its vast decades-long multi-language newswire collection. Newswire data was originally collected, annotated, and distributed for use in many sponsored projects and was also released through the LDC catalog in tailored data sets. Then came the idea of making LDC’s entire newswire collection available by language with a simple, minimal markup to support a broad range of NLP/HLT tasks. The first Arabic, Chinese and English gigaword editions were released in 2003; subsequent cumulative releases through fifth editions in 2011 represent LDC’s newswire collection spanning 1994-2010 in those languages. French and Spanish gigawords were first published in 2006, culminating in the release of third editions in 2011, likewise covering newswire collected by LDC through 2010.

The community has used, and continues to use, these data sets in numerous ways. Automatic text summarization is a favorite, and current work in this area applies deep learning principles (see, e.g., Gao et al. 2020, English). Gigawords are also useful for text source classification (Huang et al. 2003, Chinese), information extraction (Lan et al. 2020, Arabic), knowledge extraction and distributional semantics (Napoles et al. 2012, English) and natural language understanding (Ganitkevitch 2013, English), among other fields. Recent variations like the annotated and concretely annotated English gigawords add syntactic, semantic, and coreference annotations to this billion-word text collection.

All Gigaword corpora are available for licensing by Consortium members and non-members. Visit Obtaining Data for more information.

New publication: 

HAVIC MED Novel 2 Test – Videos, Metadata and Annotation is comprised of 6,200 hours of user-generated videos with annotation and metadata developed by LDC for the 2015 NIST Multimedia Event Detection tasks. The data consists of videos of various events (event videos) and videos completely unrelated to events (background videos). Each event video was manually annotated with judgments describing its event properties and other salient features. Background videos were labeled with topic and genre categories.
 
HAVIC MED Novel 2 Test – Videos, Metadata and Annotation is distributed via web download.

2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. This corpus is a members-only release and is not available for non-member licensing. Contact ldc@ldc.upenn.edu for information about membership.

Sunday, July 17, 2022

LDC July 2022 Newsletter

Fall 2022 LDC Data Scholarship Program

30th Anniversary Highlight: ATIS0 Complete 

New publications:

Qatari Corpus of Argumentative Writing

Second DIHARD Challenge Evaluation - SEEDLingS

 

 

Fall 2022 LDC Data Scholarship Program 

Student applications for the Fall 2022 LDC Data Scholarship program are being accepted now through September 15, 2022. This program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and letter of support from their advisor. For application requirements and program rules, visit the LDC Data Scholarships page.

30th Anniversary Highlight: ATIS0 Complete 

The ATIS corpora were among the first publications that appeared with the launch of LDC’s catalog in 1993. ATIS0 Complete (LDC93S4A) is comprised of spontaneous speech, read speech and other material from participants in the ATIS collection that is contained in ATIS0 Pilot (LDC93S4B), ATIS0 Read (LDC93S4B-2) and ATIS0 SD-Read (LDC93S4B-3).

The ATIS (Air Travel Information Services) collection was developed to support the research and development of speech understanding systems. Participants were presented with various hypothetical travel planning scenarios and asked to solve them by interacting with partially or completely automated ATIS systems. The resulting utterances were recorded and transcribed. Data was collected in the early 1990s at five US sites: Raytheon BBN, Carnegie Mellon University, MIT Laboratory for Computer Science, National Institute of Standards and Technology and SRI International.

The ATIS collection has been widely used to further research in spoken language understanding and slot filling (Kuo et al., 2020). Other data sets published from the collection include ATIS2 (LDC93S5), ATIS3 Training and Test Data (LDC94S19, LDC95S26) and, more recently, Multilingual ATIS (LDC2019T04) and ATIS - Seven Languages (LDC2021T04).

All ATIS corpora are available for licensing by Consortium members and non-members. Visit Obtaining Data for more information.

New publications:

(1) Qatari Corpus of Argumentative Writing was developed by Qatar University, University of Exeter and Hamad Bin Khalifa University and is comprised of approximately 200,000 tokens of Arabic and English writing by undergraduate students (159 female, 36 male) along with annotations and related metadata. Students were native Arabic speakers and fluent in English; each student wrote one Arabic and one English essay in response to specific argumentative prompts. They were instructed to include in their essays a clear thesis statement supported by relevant evidence.
 
The corpus is divided into Arabic and English parts, each of which contains 195 essays. Metadata includes information about the students (gender, major, first language, second language) and information about the essay texts (serial numbers of texts, word limits, genre, date of writing, time spent on writing, place of writing).

Qatari Corpus of Argumentative Writing is distributed via web download.  

2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

* 

(2) Second DIHARD Challenge Evaluation - SEEDLingS was developed by Duke University and LDC and contains approximately two hours of English child language recordings along with corresponding annotations used in support of the Second DIHARD Challenge.
 
Source data is from the SEEDLingS (The Study of Environmental Effects on Developing Linguistic Skills) corpus, designed to investigate how infants' early linguistic and environmental input plays a role in their learning. Recordings were generated in the home environment of infants in the Rochester, New York area. A subset of that data was annotated by LDC for use in the First and Second DIHARD Challenges.

Second DIHARD Challenge Evaluation - SEEDLingS is distributed via web download.

2022 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee. 

Wednesday, June 15, 2022

LDC June 2022 Newsletter

LDC at LREC 2022

LDC data and commercial technology development

30th Anniversary Highlight: TIMIT 

New publication:

Second DIHARD Challenge Evaluation - Eleven Sources

 


LDC at LREC 2022
LDC will attend the 13th Language Resources and Evaluation Conference (LREC2022), hosted by ELRA, the European Language Resources Association, in Marseille, France, June 20-25, 2022. Several LDC staff members will be presenting current work on topics including WeCanTalk: A New Multi-language, Multi-modal Resource for Speaker Recognition; Reflections on 30 Years of Language Resource Development and Sharing; A Study in Contradiction: Data and Annotation for AIDA Focusing on Informational Conflict in Russia-Ukraine Relations; Data Protection, Privacy and US Regulation; BeSt: The Belief and Sentiment Corpus; and more.

Stay tuned for specific announcements on LDC’s social media pages regarding presentation times and locations. Following the conference, LDC’s presented papers and posters will be available on the Papers Page.

LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

30th Anniversary Highlight: TIMIT 
The TIMIT Acoustic-Phonetic Continuous Speech Corpus is another of the classic releases in LDC’s Catalog. Designed for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems, it contains recordings of 630 American English speakers each reading 10 phonetically rich sentences, for a total of 6300 utterances comprising 2342 distinct sentences. Data collection and annotation were a joint effort by Texas Instruments, the Massachusetts Institute of Technology and SRI International, and the data release was prepared by NIST (National Institute of Standards and Technology).  

TIMIT was among the first publications that appeared with the launch of LDC’s catalog in 1993. It remains one of the Consortium’s top ten distributed corpora and may be the single most widely used speech database. Despite its age and small size relative to modern data sets, TIMIT’s wide range of phonetically representative inputs, its time-aligned lexical and phonemic transcripts, and its easy availability through the LDC Catalog have contributed to its widespread use and continued popularity. Thousands of researchers remember its famous first sentence: “she had your dark suit in greasy wash water all year”.

LDC continues the TIMIT series with its Global TIMIT project, which aims to create a series of corpora in a variety of languages with TIMIT-like features (Chanchaochai et al., 2018). Data sets published from that project include: Global TIMIT Learner Treebank English, Global TIMIT Learner Simple English, Global TIMIT Mandarin Chinese – Guanzhong Dialect, and Global TIMIT Mandarin Chinese.

The LDC Catalog features over 900 holdings in more than 90 languages and more data is added each year. All TIMIT corpora are available for licensing by Consortium members and non-members. Visit Obtaining Data for more information. 

New publication:
 
Second DIHARD Challenge Evaluation - Eleven Sources was developed by LDC and contains approximately 20 hours of English and Chinese speech data along with corresponding annotations used in support of the Second DIHARD Challenge.

The Second DIHARD development and evaluation sets were drawn from diverse sources including monologues, map task dialogues, broadcast interviews, sociolinguistic interviews, meeting speech, speech in restaurants, clinical recordings, extended child language acquisition recordings, and web videos. Annotations include diarization and segmentation.

Second DIHARD Challenge Evaluation - Eleven Sources is distributed via web download.  

2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Monday, May 16, 2022

LDC May 2022 Newsletter

30th Anniversary Highlight: Penn Treebank 

New publications:
NUBUC
Samrómur Icelandic Speech 1.0
_______________________________________________________________

30th Anniversary Highlight: Penn Treebank 
LDC’s Catalog features classic corpora responsible for critical advances in human language technology that continue to influence researchers. Among them are the Penn Treebank releases, Treebank-2 (LDC96T7) and Treebank-3 (LDC99T42).

The Penn Treebank project (1989-1996) produced seven million words tagged for part-of-speech, three million words of parsed text, over two million words annotated for predicate-argument structure and 1.6 million words of transcribed speech annotated for speech disfluencies (Taylor et al., 2003). Source material represents a diverse range of data, including Wall Street Journal (WSJ) articles, the Brown Corpus and Switchboard telephone conversations. 

Penn Treebanks are used for a wide range of purposes, including the creation and training of parsers and taggers, work on machine translation and speech recognition, and research concerning joint syntactic and semantic role labeling. Their ongoing influence is evidenced by the popularity of Treebank-3 (LDC99T42), which continues to be one of LDC’s top ten most distributed corpora in the Catalog. In addition, the WSJ section has served as a model for treebanks across many languages (Nivre, 2008).

The Penn Treebank has inspired related annotation schemes, such as Proposition Bank, the Penn Discourse Treebank project, and word alignment annotation. In addition, LDC has developed revised English treebank guidelines resulting in the re-issue of the WSJ section (English News Text Treebank: Penn Treebank Revised (LDC2015T13)) and treebanked web text (e.g., English Web Treebank (LDC2012T13) and BOLT English Translation Treebank – Chinese Discussion Forum (LDC2020T09)).   

Penn Treebank corpora and their related releases are available for licensing to LDC members and non-members. For more information about licensing LDC data, visit Obtaining Data.

New publications:
(1) NUBUC (NyU-BU contextually controlled stories Corpus) was developed by New York University, Max Planck Institute for Empirical Aesthetics and Boston University. It contains approximately three hours of English read speech from eight stories focused on linguistic keywords that were created specifically for this corpus, along with transcripts, syntactic annotations and corpus metadata.

Stories are centered on a protagonist and bear a similarity to a modern fairy tale. Each story consists of approximately 2,000 words organized around critical keywords matched along multiple linguistic dimensions. The story texts comprise a total of 1024 sentences and 16,472 words. Each story was read by two different voice actors, one male and one female, in a neutral American English accent. 

Recordings are 11-12 minutes in duration, for a total of about 90 minutes of continuous speech per speaker.

NUBUC is distributed via web download.  

2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data at no cost.

*
(2) Samrómur Icelandic Speech 1.0 was developed by the Language and Voice Lab, Reykjavik University in cooperation with Almannarómur, Center for Language Technology. The corpus contains 145 hours of Icelandic prompted speech from 8,392 speakers representing 100,000 utterances.

Speech data was collected between October 2019 and May 2021 using the Samrómur website which displayed prompts to participants. The prompts were mainly from The Icelandic Gigaword Corpus, which includes text from novels, news, plays, and from a list of location names in Iceland. Additional prompts were taken from the Icelandic Web of Science and others were created by combining a name followed by a question or a demand. Prompts and speaker metadata are included in the corpus.

Samrómur Icelandic Speech 1.0 is distributed via web download.  

2022 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.


Friday, April 15, 2022

LDC April 2022 Newsletter

LDC Celebrates 30 Years

LDC Releases Ukrainian Data for Disaster and Refugee Relief Research

New publication:
LORELEI Wolof Representative Language Pack
_______________________________________________________________

LDC Celebrates 30 Years
April 2022 marks the beginning of LDC’s 30th year as the leader in language resource development and distribution. Founded in 1992, the Consortium has grown from a data repository to a vibrant data center that creates, shares and preserves language resources for research, education and technology development. The Catalog continues to grow, housing over 900 titles in more than 90 languages. With the support of members, licensees, sponsors and collaborators, LDC has distributed over 200,000 copies of data to more than 6,000 organizations worldwide. We are sincerely grateful to the community, and we pledge to continue the mission to provide diverse data, high-quality member services and research program support. 

Stay tuned for upcoming newsletter highlights from the last three decades! 

LDC Releases Ukrainian Data for Disaster and Refugee Relief Research
LDC is releasing Ukrainian data it developed in the DARPA AIDA program, the NIST Language Recognition Evaluation series and the DARPA LORELEI program under a special no-cost, limited license for disaster and refugee relief research. 

These resources are available in three corpora:

LDC2022E06     AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts
LDC2020T24     LORELEI Ukrainian Representative Language Pack
LDC2020T10     LORELEI Entity Detection and Linking Knowledge Base

For further information about these data sets and licensing terms, see Disaster and Refugee Relief Research.

New publication:

LORELEI Wolof Representative Language Pack was developed by LDC and is comprised of approximately 225,000 words of Wolof monolingual text, 115,000 Wolof words translated from English data, 15,000 words annotated for named entities and 5,000-8,000 words annotated for entity discovery and linking and situation frames. 

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.

Data was collected from news, social network, weblog, discussion forum, and reference material. Entity detection and linking annotation identified entities to be detected by systems for scoring purposes. Situation frame analysis was designed to extract basic information about needs and relevant issues for planning a disaster response effort.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

LORELEI Wolof Representative Language Pack is distributed via web download.  

2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Wednesday, March 16, 2022

LDC March 2022 Newsletter

LDC data and commercial technology development 

New publications:
AttImam
HAVIC MED Novel 1 Test – Videos, Metadata and Annotation
_______________________________________________________________
 
LDC data and commercial technology development

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

(1)  AttImam was developed by Al-Imam Mohammad Ibn Saud Islamic University and consists of approximately 2,000 attribution relations applied to Arabic newswire text from Arabic Treebank: Part 1 v 4.1 (LDC2010T13). Attribution refers to the process of reporting or assigning an utterance to the correct speaker.  

The source Arabic newswire was collected by LDC from Agence France Presse articles published in 2000. Files were annotated by native Arabic speakers and contain the following elements:
  • Cue: the lexical anchor that connects the source with the content.
  • Source: the entity or the agent that owns the content.
  • Content: the basic element expressing the claim or the reported news.
  • General Features: these can include such features as attribution style (direct or indirect), determinacy (factual or non-factual), and purpose (e.g., assertion, expression).
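
As a rough illustration of how the four elements above fit together, one attribution relation could be modeled as a small Python record. This is only a sketch: the class name, field names and the English example sentence are hypothetical and do not reflect the corpus's actual file format or annotation labels.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AttributionRelation:
    """One attribution relation; all field names are illustrative assumptions."""

    cue: str      # lexical anchor connecting the source with the content
    source: str   # entity or agent that owns the content
    content: str  # the claim or reported news
    # General features (optional here, since the description says they "can include" these)
    style: Optional[str] = None        # e.g. "direct" or "indirect"
    determinacy: Optional[str] = None  # e.g. "factual" or "non-factual"
    purpose: Optional[str] = None      # e.g. "assertion" or "expression"

    def is_direct(self) -> bool:
        """True when the relation is marked as direct attribution."""
        return self.style == "direct"


# Invented English example, for illustration only (the corpus itself is Arabic):
# "The minister said that talks would resume next week."
example = AttributionRelation(
    cue="said",
    source="The minister",
    content="talks would resume next week",
    style="indirect",
    determinacy="factual",
    purpose="assertion",
)
```
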
AttImam is distributed via web download.  

2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) HAVIC MED Novel 1 Test – Videos, Metadata and Annotation is comprised of 3,800 hours of user-generated videos with annotation and metadata developed by LDC for the 2015 NIST Multimedia Event Detection tasks. The data consists of videos of various events (event videos) and videos completely unrelated to events (background videos). Each event video was manually annotated with judgments describing its event properties and other salient features. Background videos were labeled with topic and genre categories.

HAVIC MED Novel 1 Test – Videos, Metadata and Annotation is distributed via web download.

2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. This corpus is a members-only release and is not available for non-member licensing. Contact ldc@ldc.upenn.edu for information about membership.