Monday, May 16, 2022

LDC May 2022 Newsletter

30th Anniversary Highlight: Penn Treebank 

New publications:
NUBUC
Samrómur Icelandic Speech 1.0
_______________________________________________________________

30th Anniversary Highlight: Penn Treebank 
LDC’s Catalog features classic corpora responsible for critical advances in human language technology that continue to influence researchers. Among them are the Penn Treebank releases, Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42).

The Penn Treebank project (1989-1996) produced seven million words tagged for part-of-speech, three million words of parsed text, over two million words annotated for predicate-argument structure and 1.6 million words of transcribed speech annotated for speech disfluencies (Taylor et al., 2003). Source material represents a diverse range of data, including Wall Street Journal (WSJ) articles, the Brown Corpus and Switchboard telephone conversations. 

Penn Treebanks are used for a wide range of purposes, including the creation and training of parsers and taggers, work on machine translation and speech recognition, and research concerning joint syntactic and semantic role labeling. Their ongoing influence is evidenced by the popularity of Treebank-3 (LDC99T42), which continues to be one of LDC’s top ten most distributed corpora in the Catalog. In addition, the WSJ section has served as a model for treebanks across many languages (Nivre, 2008).

The Penn Treebank has inspired related annotation schemes, such as Proposition Bank, the Penn Discourse Treebank project, and word alignment annotation. In addition, LDC has developed revised English treebank guidelines resulting in the re-issue of the WSJ section (English News Text Treebank: Penn Treebank Revised (LDC2015T13)) and treebanked web text (e.g., English Web Treebank (LDC2012T13) and BOLT English Translation Treebank – Chinese Discussion Forum (LDC2020T09)).   

Penn Treebank corpora and their related releases are available for licensing to LDC members and nonmembers. For more information about licensing LDC data, visit Obtaining Data.

New publications:
(1) NUBUC (NyU-BU contextually controlled stories Corpus) was developed by New York University, Max Planck Institute for Empirical Aesthetics and Boston University. It contains approximately three hours of English read speech from eight stories focused on linguistic keywords that were created specifically for this corpus, along with transcripts, syntactic annotations and corpus metadata.

Stories are centered on a protagonist and bear a similarity to a modern fairy tale. Each story consists of approximately 2,000 words organized around critical keywords matched along multiple linguistic dimensions. The story texts comprise a total of 1024 sentences and 16,472 words. Each story was read by two different voice actors, one male and one female, in a neutral American English accent. 

Recordings are 11-12 minutes in duration, for a total of about 90 minutes of continuous speech per speaker.

NUBUC is distributed via web download.  

2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data at no cost.

*
(2) Samrómur Icelandic Speech 1.0 was developed by the Language and Voice Lab, Reykjavik University in cooperation with Almannarómur, Center for Language Technology. The corpus contains 145 hours of Icelandic prompted speech from 8,392 speakers representing 100,000 utterances.

Speech data was collected between October 2019 and May 2021 using the Samrómur website, which displayed prompts to participants. The prompts were mainly from The Icelandic Gigaword Corpus, which includes text from novels, news and plays, and from a list of location names in Iceland. Additional prompts were taken from the Icelandic Web of Science, and others were created by combining a name followed by a question or a demand. Prompts and speaker metadata are included in the corpus.

Samrómur Icelandic Speech 1.0 is distributed via web download.  

2022 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.


Friday, April 15, 2022

LDC April 2022 Newsletter

LDC Celebrates 30 Years

LDC Releases Ukrainian Data for Disaster and Refugee Relief Research

New publication:
LORELEI Wolof Representative Language Pack
_______________________________________________________________

LDC Celebrates 30 Years
April 2022 marks the beginning of LDC’s 30th year as the leader in language resource development and distribution. Founded in 1992, the Consortium has grown from a data repository to a vibrant data center that creates, shares and preserves language resources for research, education and technology development. The Catalog continues to grow, housing over 900 titles in more than 90 languages. With the support of members, licensees, sponsors and collaborators, LDC has distributed over 200,000 copies of data to more than 6,000 organizations worldwide. We are sincerely grateful to the community, and we pledge to continue the mission to provide diverse data, high-quality member services and research program support. 

Stay tuned for upcoming newsletter highlights from the last three decades! 

LDC Releases Ukrainian Data for Disaster and Refugee Relief Research
LDC is releasing Ukrainian data it developed in the DARPA AIDA program, the NIST Language Recognition Evaluation series and the DARPA LORELEI program under a special no-cost, limited license for disaster and refugee relief research. 

These resources are available in three corpora:

LDC2022E06     AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts
LDC2020T24     LORELEI Ukrainian Representative Language Pack
LDC2020T10     LORELEI Entity Detection and Linking Knowledge Base

For further information about these data sets and licensing terms, see Disaster and Refugee Relief Research.

New publication:

LORELEI Wolof Representative Language Pack was developed by LDC and is comprised of approximately 225,000 words of Wolof monolingual text, 115,000 Wolof words translated from English data, 15,000 words annotated for named entities and 5,000-8,000 words annotated for entity discovery and linking and situation frames. 

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.

Data was collected from news, social network, weblog, discussion forum, and reference material. Entity detection and linking annotation identified entities to be detected by systems for scoring purposes. Situation frame analysis was designed to extract basic information about needs and relevant issues for planning a disaster response effort.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

LORELEI Wolof Representative Language Pack is distributed via web download.  

2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Wednesday, March 16, 2022

LDC March 2022 Newsletter

LDC data and commercial technology development 

New Publications:
AttImam
HAVIC MED Novel 1 Test – Videos, Metadata and Annotation
_______________________________________________________________
 
LDC data and commercial technology development

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

(1)  AttImam was developed by Al-Imam Mohammad Ibn Saud Islamic University and consists of approximately 2,000 attribution relations applied to Arabic newswire text from Arabic Treebank: Part 1 v 4.1 (LDC2010T13). Attribution refers to the process of reporting or assigning an utterance to the correct speaker.  

The source Arabic newswire was collected by LDC from Agence France Presse articles published in 2000. Files were annotated by native Arabic speakers and contain the following elements:
  • Cue: the lexical anchor that connects the source with the content.
  • Source: the entity or the agent that owns the content.
  • Content: the basic element expressing the claim or the reported news.
  • General Features: these can include such features as attribution style (direct or indirect), determinacy (factual or non-factual), and purpose (e.g., assertion, expression).

AttImam is distributed via web download.

2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2)  HAVIC MED Novel 1 Test – Videos, Metadata and Annotation is comprised of 3,800 hours of user-generated videos with annotation and metadata developed by LDC for the 2015 NIST Multimedia Event Detection tasks. The data consists of videos of various events (event videos) and videos completely unrelated to events (background videos). Each event video was manually annotated with judgments describing its event properties and other salient features. Background videos were labeled with topic and genre categories.

HAVIC MED Novel 1 Test – Videos, Metadata and Annotation is distributed via web download.

2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. This corpus is a members-only release and is not available for non-member licensing. Contact ldc@ldc.upenn.edu for information about membership.
 

Tuesday, February 15, 2022

LDC February 2022 Newsletter

LDC Membership Discounts Expire March 1 

New Publications:

The Child Subglottal Resonances Database

Spoken Digits in Hindi and Indian English
_______________________________________________________________

 

LDC Membership Discounts Expire March 1 

There is still time to save on 2022 membership fees. Renew your LDC membership, rejoin the Consortium, or become a new member by March 1 to receive a discount of up to 10%. For more information on membership benefits and options, visit Join LDC.  

New publications:

(1) The Child Subglottal Resonances Database was developed by Washington University and University of California, Los Angeles and consists of 15.5 hours of simultaneous microphone and subglottal accelerometer recordings from 19 male and 9 female child speakers of American English aged 7-17.

The subglottal system is composed of the airways of the tracheobronchial tree and the surrounding tissues. It powers airflow through the larynx and vocal tract, allowing for the generation of most of the sound sources used in languages around the world. The subglottal resonances (SGRs) are the natural frequencies of the subglottal system. During speech, the subglottal system is acoustically coupled to the vocal tract via the larynx. SGRs can be measured from accelerometer recordings of the vibration of the skin of the neck during phonation, much as speech formants are measured from microphone recordings.

The corpus consists of 34 monosyllables in a phonetically neutral carrier phrase (“I said a ____ again”), with a median of 6 repetitions of each word by each speaker, resulting in 5,247 individual microphone (and accelerometer) waveforms. Speaker metadata is included. 

The Child Subglottal Resonances Database is distributed via web download.  

2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) Spoken Digits in Hindi and Indian English was developed by the Birla Institute of Technology and Science Pilani and contains two hours of speech from Hindi and English speakers with regional accents from across India saying the digits 1-10. The data was collected in person on a mobile handset recorder app, by one-to-one online communications over social apps, and from social media sites. Each audio file represents a single spoken digit in either Hindi or Indian English. Background noise was mostly retained. Some data was recorded in a noise-free environment or cleaned after recording to avoid abrupt noises such as car horns. Speaker metadata is included.

Spoken Digits in Hindi and Indian English is distributed via web download. 

2022 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Tuesday, January 18, 2022

LDC January 2022 Newsletter

Renew your LDC Membership today 

New Publications:

2017 NIST OpenSAT Pilot - SSSF

LORELEI Kinyarwanda Incident Language Pack
_____________________________________________________________

Renew your LDC Membership today 

The importance of curated resources for language-related education, research and technology development drives LDC’s mission to create them, to accept data contributions from researchers across the globe and to broadly share such resources through the LDC Catalog. LDC members enjoy no-cost access to new corpora released annually, as well as the ability to license legacy data sets from among our 900 holdings at reduced fees. Ensure that your data needs continue to be met by renewing your LDC membership or by joining the Consortium today.

Now through March 1, 2022, 2021 members receive a 10% discount on 2022 membership, and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for more details on membership options and benefits. 

New publications:

(1) 2017 NIST OpenSAT Pilot - SSSF was developed by NIST (National Institute of Standards and Technology) and contains approximately one hour of operational speech data, transcripts and annotation files used in the speech activity detection, automatic speech recognition and keyword search tasks of the 2017 OpenSAT Pilot evaluation. The source audio consists of radio and telephone dispatches during the Sofa Super Store fire (Charleston, South Carolina) in June 2007 (SSSF). 

The OpenSAT evaluation series was designed to bring together researchers developing different types of technologies to address speech analytic challenges present in some of the most difficult acoustic conditions. The 2017 pilot focused on the public safety communications domain. The SSSF audio represents real-world, fire response, operational data with multiple challenges for system analytics, such as land-mobile-radio transmission effects, significant background noise, speech under stress and variable decibel levels.

2017 NIST OpenSAT Pilot - SSSF is distributed via web download.  

2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee. 

*

(2) LORELEI Kinyarwanda Incident Language Pack was developed by LDC and is comprised of approximately 11.9 million words of Kinyarwanda monolingual text, 35,000 words of English monolingual text, 3.4 million words of parallel and comparable Kinyarwanda-English text, and 50,000 words each of English and Kinyarwanda data annotated for Entity Discovery and Linking and Situation Frames. It constitutes all of the text data, annotations, supplemental resources and related software tools for the Kinyarwanda language that were used in the DARPA LORELEI / LoReHLT 2018 Evaluation.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. In the evaluation scenario, an unforeseen event triggered a need for humanitarian and logistical support in a region where the incident language had received little or no attention in NLP research. Evaluation participants provided NLP solutions, including information extraction and machine translation, with limited resources and limited development time.

Data was collected from news, social network, weblog, newsgroup, discussion forum, and reference material. Entity detection and linking annotation identified entities to be detected by systems for scoring purposes. Situation frame analysis was designed to extract basic information about needs and relevant issues for planning a disaster response effort.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

LORELEI Kinyarwanda Incident Language Pack is distributed via web download. 

2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.