Thursday, December 15, 2022

LDC December 2022 Newsletter

LDC 2023 membership discounts now available 

Approaching deadline for Spring 2023 data scholarship applications

30th Anniversary Highlight: AMR  

New publications:

CAMIO Transcription Languages

Global TIMIT Thai

Third DIHARD Challenge Evaluation

________________________________________________________________

LDC 2023 membership discounts now available 

Now through March 1, 2023, current 2022 members receive a 10% discount for renewing their membership, and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for details on membership options and benefits. 

Approaching deadline for Spring 2023 data scholarship applications

Attention students: don’t miss out on the chance to receive no-cost access to LDC data for your research. Applications for Spring 2023 data scholarships are due January 15, 2023. For more information on requirements and program rules, see LDC Data Scholarships.

30th Anniversary Highlight: AMR  

Abstract Meaning Representation (AMR) annotation was developed by LDC, SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It is a semantic representation language that captures "who is doing what to whom" in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree-structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independent of its syntax. 

LDC’s Catalog contains three cumulative English AMR publications: Release 1.0 (LDC2014T12), Release 2.0 (LDC2017T10), and Release 3.0 (LDC2020T02). The combined result in AMR 3.0 is a semantic treebank of 59,255 English natural language sentences from broadcast conversations, newswire, weblogs, web discussion forums, fiction and web text; it also includes multi-sentence annotations. 

LDC has also published Chinese Abstract Meaning Representation 1.0 (LDC2019T07) and 2.0 (LDC2021T13) developed by Brandeis University and Nanjing Normal University. These corpora contain AMR annotations for approximately 20,000 sentences from Chinese Treebank 8.0 (LDC2013T21). Chinese AMR follows the basic principles developed for English, making adaptations where necessary to accommodate Chinese phenomena.

Abstract Meaning Representation 2.0 - Four Translations (LDC2020T07), developed by the University of Edinburgh, School of Informatics, consists of Spanish, German, Italian and Chinese Mandarin translations of a subset of sentences from AMR 2.0.
Visit LDC’s Catalog for more details about these publications.   

 

New publications:

CAMIO Transcription Languages was developed by LDC and contains nearly 70,000 images of machine printed text with corresponding annotations and transcripts in 13 languages: Arabic, Chinese, English, Farsi, Hindi, Japanese, Kannada, Korean, Russian, Tamil, Thai, Urdu, and Vietnamese. This corpus is a subset of data created for a broader effort to support the development and evaluation of optical character recognition and related technologies for 35 languages across 24 unique script types. 

Most images were annotated for text localization, resulting in over 2.3M line-level bounding boxes; 1250 images per language were also annotated with orthographic transcriptions of each line plus specification of reading order, yielding over 2.4M tokens of transcribed text. The resulting annotations are represented in an XML output format defined for this corpus. Data for each language is partitioned into test, train or validation sets.

2022 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

*

Global TIMIT Thai consists of 12 hours of read speech and time-aligned transcripts in Standard Thai from 50 speakers (33 female, 17 male) reading 120 sentences selected from the Thai National Corpus, the Thai Junior Encyclopedia, and Thai Wikipedia, for a total of 6000 utterances. Data was collected in 2016. Speakers were recruited in the Bangkok metropolitan area; they were native Thais, fluent in Standard Thai, and literate.
 
This data set was developed as part of LDC’s Global TIMIT project, which aims to create a series of corpora in a variety of languages with key features similar to those of the original TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1). TIMIT was designed for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems.

2022 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

*

Third DIHARD Challenge Evaluation was developed by LDC and contains 33 hours of English and Chinese speech data along with corresponding annotations used in support of the Third DIHARD Challenge.

The Third DIHARD development and evaluation sets were drawn from diverse sources including monologues, map task dialogues, broadcast interviews, sociolinguistic interviews, meeting speech, speech in restaurants, clinical recordings, and amateur web videos. Annotations include diarization and segmentation.

2022 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
 

 

Tuesday, November 15, 2022

LDC November 2022 Newsletter

Join LDC for membership year 2023 


Fall 2022 data scholarship recipients


Spring 2023 data scholarship application deadline


30th Anniversary Highlight: CALLFRIEND  


New publications:

BOLT English Translation Treebank – Egyptian Arabic SMS/Chat

Samrómur Children Icelandic Speech 1.0

Third DIHARD Challenge Development

_____________________________________________________________

 

Join LDC for membership year 2023 


It’s time to renew your LDC membership for 2023. Current (2022) members who renew their membership before March 1, 2023 will receive a 10% discount. New or returning organizations will receive a 5% discount if they join the Consortium by March 1.


In addition to receiving new publications, current LDC members enjoy the benefit of licensing older data from our Catalog of 900+ holdings at reduced fees. Current-year for-profit members may use most data for commercial applications.


Plans for 2023 publications are in progress. Among the expected releases are: 

  • AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts: 156 hours of Ukrainian conversational telephone speech and broadcast news with 1.2 million words of corresponding orthographic transcripts 
  • 2019 NIST SRE: audiovisual and leaderboard challenge sets based on amateur videos and Tunisian Arabic telephone speech, respectively   
  • DEFT English ERE: English text from assorted genres annotated for entities, relations and events  
  • Mixer 3 and Mixer 7 speech collections: thousands of hours of telephone speech and metadata from Mixer 3 (multiple languages) and Mixer 7 (Spanish, plus interviews and transcript readings)  
  • CALLFRIEND Russian: 100 telephone conversations among native speakers, transcripts and a lexicon, released in separate speech and text data sets 
  • REMIX Telephone Collection: English telephone speech from 385 participants in previous Mixer studies 
  • LORELEI: representative and incident language packs containing monolingual text, bi-text, translations, annotations, supplemental resources and related tools in various languages (e.g., Indonesian, Swahili, Tagalog, Tamil, Zulu) 

For full descriptions of all LDC data sets, browse our Catalog.  Visit Join LDC for details on membership, user accounts and payment.


Fall 2022 LDC data scholarship recipients

LDC congratulates the following Fall 2022 data scholarship recipients: 
 

  • Nelson Filipe Costa: Concordia University (Canada); PhD, Machine Learning. Nelson is awarded a copy of Penn Discourse Treebank Version 3.0 (LDC2019T05) for his work in discourse relationships and mapping.
  • Paul Pope: University of Eastern Finland (Finland); MA, Linguistic Data Sciences. Paul is awarded a copy of ETS Corpus of Non-Native Written English (LDC2014T06) for his research on text classification. 
  • Abhinav Singh: Sharda University (India); PhD, Forensic Science. Abhinav is awarded a copy of TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1) for his research on forensic speech recognition.
  • Lucas Zheng: Deerfield Academy (USA); High School Scholar. Lucas is awarded copies of Arabic Treebank Part 1 v. 4.1 (LDC2010T13) and Arabic Treebank Part 2 v. 3.1 (LDC2011T09)  for his work on analyzing syntactic and lexical similarities across MSA genres and POS-tagging for MSA.

Students can learn more about the LDC data scholarship program on the Data Scholarships page.

Spring 2023 data scholarship application deadline


Applications are now being accepted through January 15, 2023 for the Spring 2023 LDC data scholarship program which provides university students with no-cost access to LDC data. Consult the LDC Data Scholarships page for more information about program rules and submission requirements.


30th Anniversary Highlight: CALLFRIEND  


The CALLFRIEND series is a multi-language collection of unscripted telephone conversations conducted by LDC in the 1990s to support language identification technology development (Liberman & Cieri, 1998). Covered languages are American English, Canadian French, Egyptian Arabic, Farsi, German, Hindi, Japanese, Korean, Mandarin Chinese, Spanish, Tamil and Vietnamese. For English, Mandarin and Spanish, the collection includes two distinct dialects. Participants could speak with a person of their choice on any topic; most called family members and friends. All calls originated in North America. 


This speech data was the foundation for NIST’s Language Recognition Evaluations conducted from 1996-2007. The first editions of the CALLFRIEND series published in LDC’s Catalog in 1996 contain 60 calls evenly split into 20 calls each for a training  partition to develop language models, a development partition for parameter tuning, and an evaluation partition to test performance (Torres-Carrasquillo, et al., 2004). 


Beginning in 2014, LDC released second editions for American English (LDC2019S21, LDC2020S08), Canadian French (LDC2019S18), Egyptian Arabic (LDC2019S04), Farsi (LDC2014S01), and Mandarin Chinese (LDC2018S09, LDC2020S06). The goal of the second editions is to facilitate continued widespread use of the data, specifically, by updating the audio files to .wav format, simplifying the directory structure, adding documentation and metadata, and combining the training, development and evaluation splits. CALLFRIEND Farsi Second Edition also includes additional telephone recordings and a separate transcripts release (LDC2014T01). 


In addition to work on language identification, CALLFRIEND corpora have been used in a variety of research tasks, including subject omission in Korean (Lee, 2012), contemporary Persian vowels in casual speech (Jones, 2019), Mandarin telephone closings among familiars (Huang, 2020), and adjective constructions in English conversation (Bybee & Thompson, 2021), among many others. 


To learn more about the CALLFRIEND collection or about other LDC corpora used for language identification research, search the Catalog by “recommended application” and select “language identification” from the list. 


New publications:

 

BOLT English Translation Treebank – Egyptian Arabic SMS/Chat was developed by LDC and consists of SMS and chat text data (472 files representing 98,206 tokens) translated from Egyptian Arabic to English and annotated for part-of-speech and syntactic structure. Only the translated English text is included in the source data for this release. Part-of-speech and treebank annotation conformed to Penn Treebank II style, incorporating changes to those guidelines that were developed under the GALE (Global Autonomous Language Exploitation) program. Supplementary guidelines for English treebanks and web text are included in the corpus documentation. 

 

2022 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.


*


Samrómur Children Icelandic Speech 1.0 was developed by the Language and Voice Lab, Reykjavik University in cooperation with Almannarómur, Center for Language Technology. The corpus contains 131 hours of Icelandic prompted speech from 3,175 speakers (children, aged 4-17 years) representing 137,597 utterances.

 

Speech data was collected between October 2019 and September 2021 using the Samrómur website which displayed prompts to participants. The prompts were mainly from The Icelandic Gigaword Corpus, which includes text from novels, news, plays, and from a list of location names in Iceland. Additional prompts were taken from the Icelandic Web of Science and others were created by combining a name followed by a question or a demand. Prompts and speaker metadata are included in the corpus.


2022 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.


*


Third DIHARD Challenge Development was developed by LDC and contains approximately 34 hours of English and Chinese speech data along with corresponding annotations used in support of the Third DIHARD Challenge.


The Third DIHARD development and evaluation sets were drawn from diverse sources including monologues, map task dialogues, broadcast interviews, sociolinguistic interviews, meeting speech, speech in restaurants, clinical recordings, and amateur web videos. Annotations include diarization and segmentation.


2022 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

 

 

Monday, October 17, 2022

LDC October 2022 Newsletter

Membership Year 2023 publication preview 

LDC data and commercial technology development 

30th Anniversary Highlight: ACE 

New publications:
Rime-Cantonese: A Normalized Cantonese Jyutping Lexicon
2017 NIST Language Recognition Evaluation Training and Development Sets
LORELEI Bengali Representative Language Pack


Membership Year 2023 publication preview

The 2023 membership year is approaching and plans for next year’s publications are in progress. Among the expected releases are: 

  • AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts: 156 hours of Ukrainian conversational telephone speech and broadcast news with 1.2 million words of corresponding orthographic transcripts 
  • 2019 NIST SRE: audiovisual and leaderboard challenge sets based on amateur videos and Tunisian Arabic telephone speech, respectively   
  • DEFT English ERE: English text from assorted genres annotated for entities, relations, and events  
  • Mixer 3 and Mixer 7 speech collections: thousands of hours of telephone speech and metadata from Mixer 3 (multiple languages) and Mixer 7 (Spanish, plus interviews and transcript readings)  
  • CALLFRIEND Russian: 100 telephone conversations among native speakers, transcripts, and a lexicon, released in separate speech and text data sets 
  • REMIX Telephone Collection: English telephone speech from 385 participants in previous Mixer studies 
  • LORELEI: representative and incident language packs containing monolingual text, bi-text, translations, annotations, supplemental resources, and related tools in various languages (e.g., Indonesian, Swahili, Tagalog, Tamil, Zulu) 


Check your inbox in the coming weeks for more information about membership renewal.  

LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

30th Anniversary Highlight: ACE 
The objective of the Automatic Content Extraction (ACE) program was to develop the capability to extract meaning (entities, relations and events) from multimedia sources (Doddington, et al., 2004). LDC supported ACE by creating annotation guidelines, corpora and other linguistic resources, including training and test data for the common task research evaluations (Strassel, et al., 2003; Huang, et al., 2004).
There are multiple data sets in LDC’s Catalog from the program. One that regularly makes the list of LDC’s top ten most licensed corpora is ACE 2005 Multilingual Training Corpus (LDC2006T06). This data set contains 1,800 files of mixed genre text in English, Arabic, and Chinese annotated for entities, relations, and events. The genres include newswire, broadcast news, broadcast conversation, weblog, discussion forums, and conversational telephone speech. 
Another popular data set, ACE 2004 Multilingual Training Corpus (LDC2005T09), consists of varied genre text in English (158,000 words), Chinese (307,000 characters, 154,000 words), and Arabic (151,000 words) annotated for entities and relations.
ACE 2007 Multilingual Training Corpus (LDC2014T18) has the complete set of Arabic and Spanish training data for the 2007 ACE technology evaluation, specifically, Arabic and Spanish newswire data and Arabic weblogs annotated for entities and temporal expressions.
Other ACE corpora in the Catalog include ACE 2005 SpatialML Annotations in English and Mandarin (LDC2008T03, LDC2010T09, LDC2011T02), Datasets for Generic Relation Extraction (reACE), TIDES Extraction (ACE) 2003 Multilingual Training Data, ACE-2 Version 1.0, ACE Time Normalization (TERN) 2004 English Training Data v 1.0, and more. 

For the full list of available ACE data, visit LDC’s Catalog and select the ACE research project in the search menu. For more information about linguistic resources for the ACE Program, including annotation guidelines, task definitions and other documentation, visit LDC's ACE webpage.


New publications:

Rime-Cantonese: A Normalized Cantonese Jyutping Lexicon was developed by the Cantonese Computational Linguistics Infrastructure Working Group. It contains approximately 130,000 Cantonese character, word, and phrase entries paired with their corresponding romanized pronunciations in Jyutping, a scheme created by The Linguistic Society of Hong Kong.

Data was collected from a variety of physical and online sources. The character collection was subjected to a normalization process for differences between traditional and simplified Chinese, regional differences and other variants in Chinese characters, and differences in orthography.

2022 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.


*

2017 NIST Language Recognition Evaluation Training and Development Sets contains training and development material for the 2017 NIST Language Recognition Evaluation. It consists of 2,100 hours of conversational telephone speech, broadcast conversation, broadcast narrow band speech, and speech from video in the following 14 languages, dialects, and varieties: Arabic (Iraqi, Levantine, Maghrebi, Egyptian), English (British, American), Polish, Russian, Portuguese (Brazilian), Spanish (Caribbean, European, Latin American Continental), and Chinese (Mandarin, Min Nan). The 2017 evaluation focused on differentiating closely related language pairs. Source data is from LDC's CALLFRIEND and Fisher telephone collections, the VAST video collection, various broadcast sources, and earlier NIST LRE test sets.
 
2022 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

*

LORELEI Bengali Representative Language Pack was developed by LDC and is comprised of approximately 144 million words of Bengali monolingual text, 96,000 Bengali words translated from English data, and over 2 million words of found Bengali-English parallel text. Approximately 86,000 words were annotated for named entities and up to 25,000 words were annotated for entity discovery and linking and situation frames (identifying entities, needs and issues). Data was collected from news, social network, and weblogs.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

2022 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Thursday, September 15, 2022

LDC September 2022 Newsletter

Upcoming Policy Change to LDC’s Open Memberships

LDC at Interspeech 2022

LanguageARC: Citizen Science for Language

30th Anniversary Highlight: Switchboard 

New publications:
Xi’an Guanzhong Object Naming
MASRI Synthetic
_____________________________________________________________

Upcoming Policy Change to LDC’s Open Memberships

LDC is changing its open membership year policy beginning January 1, 2023.  Only one membership year will be open for joining – the current membership year. The 2022 membership year will close for joining on December 31, 2022. We expect this change to have a minimal impact on members, while allowing us to streamline our processes to serve members better. LDC’s many membership benefits will remain the same and organizations choosing to join membership years in advance will still be able to do so. If you have any questions about this change, please don’t hesitate to contact our membership office.

LDC at Interspeech 2022
 
LDC is proud to sponsor the Workshop for Young Female Researchers in Speech (YFRSW) to be held in person as an Interspeech 2022 pre-conference satellite event on September 17. Also, be sure to check out the collaborative work of LDC’s Mark Liberman, “The mapping between syntactic and prosodic phrasing in English and Mandarin”, presented during the On-Site Oral Session: Phonetics and Phonology on Wednesday, September 21, 13:30-15:30 KST. 

LanguageARC: Citizen Science for Language 

LanguageARC is a citizen science web portal for language research developed by LDC with the support of the National Science Foundation (grant #1730377). 

LanguageARC brings together researchers and participants from the general public interested in language to form a community dedicated to supporting and advancing language-related research and development. Contributors to this online community can participate in a variety of language-related tasks and activities such as reading text, answering questions, describing images or video, creating or evaluating transcriptions for audio clips or developing translations into their native languages. LanguageARC includes projects in languages other than English, such as French, Sesotho and Swedish. Xi’an Guanzhong Object Naming (LDC2022S09), released this month in LDC’s Catalog and described below, is an example of a data set developed using LanguageARC. New projects will be added on an ongoing basis.
 
Sign up for a LanguageARC account today to start making real contributions to language knowledge and research. Please share this information with colleagues, students and anyone who might be interested in participating in the language activities on this website. If you are a researcher interested in creating a project on LanguageARC, please reach out on the site’s Contact page.
 
Find LanguageARC on Facebook at: https://www.facebook.com/languagearc

30th Anniversary Highlight: Switchboard 

Switchboard-1 Release 2 (LDC97S62) is considered the first large collection of spontaneous conversational telephone speech (Graff & Bird, 2000). It consists of approximately 260 hours of recordings collected by Texas Instruments in 1990-1991 (Godfrey et al., 1992). The first release of the corpus (later superseded) was published by NIST and distributed by LDC in 1993.

Participants were 543 speakers (302 male, 241 female) from across the United States who accounted for around 2,400 two-sided telephone conversations. A robot operator handled the calls, giving the caller appropriate recorded prompts, selecting and dialing another person (the callee) to take part in a conversation, introducing a topic for discussion and recording the speech from the two subjects into separate channels until the conversation was finished. Roughly 70 topics were provided, of which about 50 were used frequently. Selection of topics and callees was constrained so that: (1) no two speakers would converse together more than once and (2) no one spoke more than once on a given topic. 
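The two pairing constraints described above can be illustrated with a short sketch. This is only an illustration under stated assumptions, not a description of how the original robot operator was implemented: the `Call` record, the `is_allowed` function, and the speaker and topic labels are all hypothetical names introduced here.

```python
from typing import NamedTuple


class Call(NamedTuple):
    """One completed conversation (hypothetical record layout)."""
    caller: str
    callee: str
    topic: str


def is_allowed(history: list[Call], caller: str, callee: str, topic: str) -> bool:
    """Return True if the proposed call breaks neither constraint."""
    proposed = {caller, callee}
    for past in history:
        participants = {past.caller, past.callee}
        # Constraint 1: the same two speakers never converse twice.
        if participants == proposed:
            return False
        # Constraint 2: no speaker talks about the same topic twice.
        if past.topic == topic and participants & proposed:
            return False
    return True


# Hypothetical history: speakers A and B have already discussed "pets".
history = [Call("A", "B", "pets")]
```

With this history, a new call between A and C on "music" would be accepted, while A and B on any topic, or A and C on "pets", would be rejected.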

This gold standard data set has been used for many HLT applications, including speaker identification, speaker authentication, and speech recognition. It is considered one of the most important benchmarks for recognition tasks involving large vocabulary conversational speech (Deshmukh et al., 1998) as well as a key resource for studying the phonetic properties of spontaneous speech (Greenberg et al., 1996). Annotation tasks based on Switchboard include discourse tags/speech acts, part-of-speech tagging and parsing, and sentiment analysis.  

The Switchboard series includes Switchboard Credit Card, Phase II, Phase III, the Switchboard Cellular collection, and new recordings from 18 Switchboard participants in the 2013 Greybeard corpus.

All Switchboard corpora are available in the Catalog for licensing by Consortium members and non-members. Visit Obtaining Data for more information.

New publications:

Xi’an Guanzhong Object Naming is comprised of 15 hours of audio recordings from speakers of the Guanzhong dialect of Mandarin Chinese living in or near Xi'an in Shaanxi Province (China) naming objects that appeared in colored line drawings. The corpus was developed to support traditional and computer-aided language documentation.
 
The collection was conducted from February to May 2021 among a closed volunteer community using LanguageARC, a citizen science portal developed by LDC. Speakers were presented with images selected from the MultiPic dataset and were asked to record themselves naming the objects in the images.
 
Xi’an Guanzhong Object Naming is distributed via web download. 

2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

MASRI (Maltese Automatic Speech Recognition I) Synthetic was developed by the MASRI team at the University of Malta and contains 99 hours of synthesized Maltese speech. 

Source sentences were extracted from the Maltese Language Resource Server (MLRS) corpus, comprised of written or transcribed Maltese covering various genres, including parliamentary debates, news, law, opinion, sports, culture, academic, literature and religious texts. Text was processed through the CrimsonWing text-to-speech system to generate speech files. Synthesized speech was created with 210 voices (105 female, 105 male).

MASRI Synthetic is distributed via web download. 

2022 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.


Thursday, August 18, 2022

LDC August 2022 Newsletter

Fall 2022 LDC Data Scholarship Program

30th Anniversary Highlight: The LDC Gigawords 

New publication:

HAVIC MED Novel 2 Test – Videos, Metadata and Annotation

 

 

Fall 2022 LDC Data Scholarship Program 

Student applications for the Fall 2022 LDC Data Scholarship program are being accepted now through September 15, 2022. This program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and letter of support from their advisor. For application requirements and program rules, visit the LDC Data Scholarships page.

30th Anniversary Highlight: The LDC Gigawords 

Giga: a combining form meaning “billion,” used in the formation of compound words (Source: https://www.dictionary.com/browse/giga-)

LDC’s Gigaword corpora are a natural outgrowth of its vast decades-long multi-language newswire collection. Newswire data was originally collected, annotated, and distributed for use in many sponsored projects and was also released through the LDC catalog in tailored data sets. Then came the idea of making LDC’s entire newswire collection available by language with a simple, minimal markup to support a broad range of NLP/HLT tasks. The first Arabic, Chinese and English gigaword editions were released in 2003; subsequent cumulative releases through fifth editions in 2011 represent LDC’s newswire collection spanning 1994-2010 in those languages. French and Spanish gigawords were first published in 2006, culminating in the release of third editions in 2011, likewise covering newswire collected by LDC through 2010.

The community has used, and continues to use, these data sets in numerous ways. Automatic text summarization is a favorite, and current work in this area applies deep learning principles (see, e.g., Gao et al. 2020, English). Gigawords are also useful for text source classification (Huang et al. 2003, Chinese), information extraction (Lan et al. 2020, Arabic), knowledge extraction and distributional semantics (Napoles et al. 2012, English) and natural language understanding (Ganitkevitch 2013, English), among other fields. Recent variations like the annotated and concretely annotated English gigawords add syntactic, semantic, and coreference annotations to this billion-word text collection. 

All Gigaword corpora are available for licensing by Consortium members and non-members. Visit Obtaining Data for more information.

New publication: 

HAVIC MED Novel 2 Test – Videos, Metadata and Annotation is comprised of 6,200 hours of user-generated videos with annotation and metadata developed by LDC for the 2015 NIST Multimedia Event Detection tasks. The data consists of videos of various events (event videos) and videos completely unrelated to events (background videos). Each event video was manually annotated with judgments describing its event properties and other salient features. Background videos were labeled with topic and genre categories.
 
HAVIC MED Novel 2 Test – Videos, Metadata and Annotation is distributed via web download. 

2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. This corpus is a members-only release and is not available for non-member licensing. Contact ldc@ldc.upenn.edu for information about membership.

Sunday, July 17, 2022

LDC July 2022 Newsletter

Fall 2022 LDC Data Scholarship Program

30th Anniversary Highlight: ATIS0 Complete 

New publications:

Qatari Corpus of Argumentative Writing

Second DIHARD Challenge Evaluation - SEEDLingS

 

 

Fall 2022 LDC Data Scholarship Program 

Student applications for the Fall 2022 LDC Data Scholarship program are being accepted now through September 15, 2022. This program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and letter of support from their advisor. For application requirements and program rules, visit the LDC Data Scholarships page.

30th Anniversary Highlight: ATIS0 Complete 

The ATIS corpora were among the first publications that appeared with the launch of LDC’s catalog in 1993. ATIS0 Complete (LDC93S4A) is comprised of spontaneous speech, read speech and other material from participants in the ATIS collection that is contained in ATIS0 Pilot (LDC93S4B), ATIS0 Read (LDC93S4B-2) and ATIS0 SD-Read (LDC93S4B-3).

The ATIS (Air Travel Information Services) collection was developed to support the research and development of speech understanding systems. Participants were presented with various hypothetical travel planning scenarios and asked to solve them by interacting with partially or completely automated ATIS systems. The resulting utterances were recorded and transcribed. Data was collected in the early 1990s at five US sites: Raytheon BBN, Carnegie Mellon University, MIT Laboratory for Computer Science, National Institute of Standards and Technology and SRI International.

The ATIS collection has been widely used to further research in spoken language understanding and slot filling (Kuo et al., 2020). Other data sets published from the collection include ATIS2 (LDC93S5), ATIS3 Training and Test Data (LDC94S19, LDC95S26) and, more recently, Multilingual ATIS (LDC2019T04) and ATIS - Seven Languages (LDC2021T04).

All ATIS corpora are available for licensing by Consortium members and non-members. Visit Obtaining Data for more information.

New publications:

(1) Qatari Corpus of Argumentative Writing was developed by Qatar University, University of Exeter and Hamad Bin Khalifa University and is comprised of approximately 200,000 tokens of Arabic and English writing by undergraduate students (159 female, 36 male) along with annotations and related metadata. Students were native Arabic speakers and fluent in English; each student wrote one Arabic and one English essay in response to specific argumentative prompts. They were instructed to include in their essays a clear thesis statement supported by relevant evidence.
 
The corpus is divided into Arabic and English parts, each of which contains 195 essays. Metadata includes information about the students (gender, major, first language, second language) and information about the essay texts (serial numbers of texts, word limits, genre, date of writing, time spent on writing, place of writing).

Qatari Corpus of Argumentative Writing is distributed via web download.  

2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

* 

(2) Second DIHARD Challenge Evaluation - SEEDLingS was developed by Duke University and LDC and contains approximately two hours of English child language recordings along with corresponding annotations used in support of the Second DIHARD Challenge.
 
Source data is from the SEEDLingS (The Study of Environmental Effects on Developing Linguistic Skills) corpus, designed to investigate how infants' early linguistic and environmental input plays a role in their learning. Recordings were generated in the home environment of infants in the Rochester, New York area. A subset of that data was annotated by LDC for use in the First and Second DIHARD Challenges.

Second DIHARD Challenge Evaluation - SEEDLingS is distributed via web download.

2022 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.