Monday, May 16, 2022

LDC May 2022 Newsletter

30th Anniversary Highlight: Penn Treebank 

New publications:
Samrómur Icelandic Speech 1.0
_______________________________________________________________

30th Anniversary Highlight: Penn Treebank 
LDC’s Catalog features classic corpora responsible for critical advances in human language technology that continue to influence researchers. Among them are the Penn Treebank releases, Treebank-2 (LDC96T7) and Treebank-3 (LDC99T42).

The Penn Treebank project (1989-1996) produced seven million words tagged for part-of-speech, three million words of parsed text, over two million words annotated for predicate-argument structure and 1.6 million words of transcribed speech annotated for speech disfluencies (Taylor et al., 2003). Source material represents a diverse range of data, including Wall Street Journal (WSJ) articles, the Brown Corpus and Switchboard telephone conversations. 

Penn Treebanks are used for a wide range of purposes, including the creation and training of parsers and taggers, work on machine translation and speech recognition, and research concerning joint syntactic and semantic role labeling. Their ongoing influence is evidenced by the popularity of Treebank-3 (LDC99T42), which continues to be one of LDC’s top ten most distributed corpora in the Catalog. In addition, the WSJ section has served as a model for treebanks across many languages (Nivre, 2008).

The Penn Treebank has inspired related annotation schemes, such as Proposition Bank, the Penn Discourse Treebank project, and word alignment annotation. In addition, LDC has developed revised English treebank guidelines resulting in the re-issue of the WSJ section (English News Text Treebank: Penn Treebank Revised (LDC2015T13)) and treebanked web text (e.g., English Web Treebank (LDC2012T13) and BOLT English Translation Treebank – Chinese Discussion Forum (LDC2020T09)).   

Penn Treebank corpora and their related releases are available for licensing to LDC members and nonmembers. For more information about licensing LDC data, visit Obtaining Data.

New publications:
(1) NUBUC (NyU-BU contextually controlled stories Corpus) was developed by New York University, Max Planck Institute for Empirical Aesthetics and Boston University. It contains approximately three hours of English read speech from eight stories focused on linguistic keywords that were created specifically for this corpus, along with transcripts, syntactic annotations and corpus metadata.

Stories are centered on a protagonist and bear a similarity to a modern fairy tale. Each story consists of approximately 2,000 words organized around critical keywords matched along multiple linguistic dimensions. The story texts comprise a total of 1,024 sentences and 16,472 words. Each story was read by two different voice actors, one male and one female, in a neutral American English accent.

Recordings are 11-12 minutes in duration, for a total of about 90 minutes of continuous speech per speaker.

NUBUC is distributed via web download.  

2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data at no cost.

*
(2) Samrómur Icelandic Speech 1.0 was developed by the Language and Voice Lab, Reykjavik University in cooperation with Almannarómur, Center for Language Technology. The corpus contains 145 hours of Icelandic prompted speech from 8,392 speakers representing 100,000 utterances.

Speech data was collected between October 2019 and May 2021 using the Samrómur website, which displayed prompts to participants. The prompts were mainly from The Icelandic Gigaword Corpus, which includes text from novels, news, plays, and from a list of location names in Iceland. Additional prompts were taken from the Icelandic Web of Science and others were created by combining a name followed by a question or a demand. Prompts and speaker metadata are included in the corpus.

Samrómur Icelandic Speech 1.0 is distributed via web download.  

2022 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.


Friday, April 15, 2022

LDC April 2022 Newsletter

LDC Celebrates 30 Years

LDC Releases Ukrainian Data for Disaster and Refugee Relief Research

New publication:
LORELEI Wolof Representative Language Pack
_______________________________________________________________

LDC Celebrates 30 Years
April 2022 marks the beginning of LDC’s 30th year as the leader in language resource development and distribution. Founded in 1992, the Consortium has grown from a data repository to a vibrant data center that creates, shares and preserves language resources for research, education and technology development. The Catalog continues to grow, housing over 900 titles in more than 90 languages. With the support of members, licensees, sponsors and collaborators, LDC has distributed over 200,000 copies of data to more than 6,000 organizations worldwide. We are sincerely grateful to the community, and we pledge to continue the mission to provide diverse data, high-quality member services and research program support. 

Stay tuned for upcoming newsletter highlights from the last three decades! 

LDC Releases Ukrainian Data for Disaster and Refugee Relief Research
LDC is releasing Ukrainian data it developed in the DARPA AIDA program, the NIST Language Recognition Evaluation series and the DARPA LORELEI program under a special no-cost, limited license for disaster and refugee relief research. 

These resources are available in three corpora:

LDC2022E06     AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts
LDC2020T24     LORELEI Ukrainian Representative Language Pack
LDC2020T10     LORELEI Entity Detection and Linking Knowledge Base

For further information about these data sets and licensing terms, see Disaster and Refugee Relief Research.

New publication:

LORELEI Wolof Representative Language Pack was developed by LDC and is comprised of approximately 225,000 words of Wolof monolingual text, 115,000 Wolof words translated from English data, 15,000 words annotated for named entities and 5,000-8,000 words annotated for entity discovery and linking and situation frames. 

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.

Data was collected from news, social network, weblog, discussion forum, and reference material. Entity detection and linking annotation identified entities to be detected by systems for scoring purposes. Situation frame analysis was designed to extract basic information about needs and relevant issues for planning a disaster response effort.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

LORELEI Wolof Representative Language Pack is distributed via web download.  

2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Wednesday, March 16, 2022

LDC March 2022 Newsletter

LDC data and commercial technology development 

New Publications:
AttImam
_______________________________________________________________
 
LDC data and commercial technology development

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

(1)  AttImam was developed by Al-Imam Mohammad Ibn Saud Islamic University and consists of approximately 2,000 attribution relations applied to Arabic newswire text from Arabic Treebank: Part 1 v 4.1 (LDC2010T13). Attribution refers to the process of reporting or assigning an utterance to the correct speaker.  

The source Arabic newswire was collected by LDC from Agence France Presse articles published in 2000. Files were annotated by native Arabic speakers and contain the following elements:
  • Cue: the lexical anchor that connects the source with the content.
  • Source: the entity or the agent that owns the content.
  • Content: the basic element expressing the claim or the reported news.
  • General Features: these can include such features as attribution style (direct or indirect), determinacy (factual or non-factual), and purpose (e.g., assertion, expression).

AttImam is distributed via web download.

2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2)  HAVIC MED Novel 1 Test – Videos, Metadata and Annotation is comprised of 3,800 hours of user-generated videos with annotation and metadata developed by LDC for the 2015 NIST Multimedia Event Detection tasks. The data consists of videos of various events (event videos) and videos completely unrelated to events (background videos). Each event video was manually annotated with judgments describing its event properties and other salient features. 

Background videos were labeled with topic and genre categories.

HAVIC MED Novel 1 Test – Videos, Metadata and Annotation is distributed via web download.

2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. This corpus is a members-only release and is not available for non-member licensing. Contact ldc@ldc.upenn.edu for information about membership.
 

Tuesday, February 15, 2022

LDC February 2022 Newsletter

LDC Membership Discounts Expire March 1 

New Publications:

The Child Subglottal Resonances Database

Spoken Digits in Hindi and Indian English

 

LDC Membership Discounts Expire March 1 

There is still time to save on 2022 membership fees. Renew your LDC membership, rejoin the Consortium, or become a new member by March 1 to receive a discount of up to 10%. For more information on membership benefits and options, visit Join LDC.  

New publications:

(1) The Child Subglottal Resonances Database was developed by Washington University and University of California Los Angeles and consists of 15.5 hours of simultaneous microphone and subglottal accelerometer recordings from 19 male and 9 female child speakers of American English aged 7-17.

The subglottal system is composed of the airways of the tracheobronchial tree and the surrounding tissues. It powers airflow through the larynx and vocal tract, allowing for the generation of most of the sound sources used in languages around the world. The subglottal resonances (SGRs) are the natural frequencies of the subglottal system. During speech, the subglottal system is acoustically coupled to the vocal tract via the larynx. SGRs can be measured from recordings of the vibration of the skin of the neck during phonation by an accelerometer, much like speech formants are measured through microphone recordings.

The corpus consists of 34 monosyllables in a phonetically neutral carrier phrase (“I said a ____ again”), with a median of 6 repetitions of each word by each speaker, resulting in 5,247 individual microphone (and accelerometer) waveforms. Speaker metadata is included. 

The Child Subglottal Resonances Database is distributed via web download.  

2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) Spoken Digits in Hindi and Indian English was developed by the Birla Institute of Technology and Science Pilani and contains two hours of speech from Hindi and English speakers with regional accents from across India saying the digits 1-10. The data was collected in person on a mobile handset recorder app, by one-to-one online communications over social apps, and from social media sites. Each audio file represents a single spoken digit in either Hindi or Indian English. Background noise was mostly retained. Some data was recorded in a noise-free environment or cleaned after recording to avoid abrupt noises such as car horns. Speaker metadata is included.

Spoken Digits in Hindi and Indian English is distributed via web download. 

2022 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Tuesday, January 18, 2022

LDC January 2022 Newsletter

Renew your LDC Membership today 

New Publications:

2017 NIST OpenSAT Pilot - SSSF

LORELEI Kinyarwanda Incident Language Pack
_____________________________________________________________

Renew your LDC Membership today 

The importance of curated resources for language-related education, research and technology development drives LDC’s mission to create them, to accept data contributions from researchers across the globe and to broadly share such resources through the LDC Catalog. LDC members enjoy no-cost access to new corpora released annually, as well as the ability to license legacy data sets from among our 900 holdings at reduced fees. Ensure that your data needs continue to be met by renewing your LDC membership or by joining the Consortium today.

Now through March 1, 2022, current 2021 members receive a 10% discount on 2022 membership, and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for more details on membership options and benefits.

New publications:

(1) 2017 NIST OpenSAT Pilot - SSSF was developed by NIST (National Institute of Standards and Technology) and contains approximately one hour of operational speech data, transcripts and annotation files used in the speech activity detection, automatic speech recognition and keyword search tasks of the 2017 OpenSAT Pilot evaluation. The source audio consists of radio and telephone dispatches during the Sofa Super Store fire (Charleston, South Carolina) in June 2007 (SSSF). 

The OpenSAT evaluation series was designed to bring together researchers developing different types of technologies to address speech analytic challenges present in some of the most difficult acoustic conditions. The 2017 pilot focused on the public safety communications domain. The SSSF audio represents real-world, fire response, operational data with multiple challenges for system analytics, such as land-mobile-radio transmission effects, significant background noise, speech under stress and variable decibel levels.

2017 NIST OpenSAT Pilot - SSSF is distributed via web download.  

2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee. 

*

(2) LORELEI Kinyarwanda Incident Language Pack was developed by LDC and is comprised of approximately 11.9 million words of Kinyarwanda monolingual text, 35,000 words of English monolingual text, 3.4 million words of parallel and comparable Kinyarwanda-English text, and 50,000 words each of English and Kinyarwanda data annotated for Entity Discovery and Linking and Situation Frames. It constitutes all of the text data, annotations, supplemental resources and related software tools for the Kinyarwanda language that were used in the DARPA LORELEI / LoReHLT 2018 Evaluation.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. In the evaluation scenario, an unforeseen event triggered a need for humanitarian and logistical support in a region where the incident language had received little or no attention in NLP research. Evaluation participants provided NLP solutions, including information extraction and machine translation, with limited resources and limited development time.

Data was collected from news, social network, weblog, newsgroup, discussion forum, and reference material. Entity detection and linking annotation identified entities to be detected by systems for scoring purposes. Situation frame analysis was designed to extract basic information about needs and relevant issues for planning a disaster response effort.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

LORELEI Kinyarwanda Incident Language Pack is distributed via web download. 

2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Wednesday, December 15, 2021

LDC December 2021 Newsletter

LDC 2022 Membership Discounts Now Available 

Approaching Deadline for Spring 2022 Data Scholarship Applications 

Citizen Linguistics

LDC Closed for Winter Break Dec. 24-Jan. 4


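New Publications:

BOLT English Translation Treebank – Chinese SMS/Chat

HAVIC MED Training Data – Videos, Metadata and Annotation
_______________________________________________________________
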
LDC 2022 Membership Discounts Now Available  
Now through March 1, 2022, current 2021 members receive a 10% discount for renewing their membership, and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for details on membership options and benefits. 

Approaching Deadline for Spring 2022 Data Scholarship Applications
Attention students: don’t miss out on the chance to receive no-cost access to LDC data for your research. Applications for Spring 2022 data scholarships are due January 15, 2022. For more information on requirements and program rules, see LDC Data Scholarships.

Citizen Linguistics
LanguageARC (https://languagearc.com), a citizen science web portal for linguistics, continues to grow with 12 language research projects currently available to the community. Two new projects seeking contributions from citizen linguists have recently been added. The Fearless Steps project will make thousands of hours of Apollo space mission communications accessible to researchers and to the public. Contributors can listen to and annotate actual audio recordings from the Apollo 11 space mission. A second new project, Les stéréotypes en français, asks contributors to identify and classify stereotypes that can be expressed in the French language. In addition to these publicly available projects, LanguageARC also enables researchers to create research projects restricted to defined private groups, such as the recent object naming task to document the Guanzhong dialect of Mandarin. Here a private, invited group of about 60 contributors yielded over 34,000 speech recordings.

Please consider becoming an active participant in the LanguageARC community by contributing to research projects. If you are a researcher interested in creating your own project on LanguageARC, please reach out via the “Contact” page on the website.

LDC Closed for Winter Break Dec. 24-Jan. 4
LDC will be closed from Friday, December 24, 2021 through Tuesday, January 4, 2022 in accordance with the University of Pennsylvania Winter Break Policy. Our offices will reopen on Wednesday, January 5, 2022. Requests received by the Membership Office during Winter Break will be processed when the office reopens.

New publications:

(1) BOLT English Translation Treebank – Chinese SMS/Chat was developed by LDC and consists of SMS/Chat text data translated from Chinese to English and annotated for part-of-speech and syntactic structure.  

The source data is Chinese SMS and chat text collected by LDC between 2010 and 2013. A subset of the translated text -- 194 files representing 108,385 tokens -- was selected for treebanking. Part-of-speech and treebank annotation conform to Penn Treebank II style. Supplementary guidelines for English treebanks and web text are included with this release.

BOLT English Translation Treebank – Chinese SMS/Chat is distributed via web download.  

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) HAVIC MED Training Data – Videos, Metadata and Annotation was developed by LDC and is comprised of approximately 2,100 hours of user-generated videos with annotation and metadata developed for the 2011-2015 NIST-sponsored MED (Multimedia Event Detection) tasks.

The data consists of videos of various events (event videos) and videos completely unrelated to events (background videos) harvested by a large team of human annotators. Each event video was manually annotated with judgments describing its event properties and other salient features. Background videos were labeled with topic and genre categories.

HAVIC MED Training Data – Videos, Metadata and Annotation is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. This corpus is a members-only release and is not available for non-member licensing. Contact ldc@ldc.upenn.edu for information about membership.

Monday, November 15, 2021

LDC November 2021 Newsletter

Join LDC for Membership Year 2022 

Spring 2022 Data Scholarship Application Deadline 


New Publications:

BOLT Egyptian Arabic PropBank and Sense – Discussion Forum, SMS/Chat, and Conversational Telephone Speech

Second DIHARD Challenge Development – Eleven Sources

Second DIHARD Challenge Development - SEEDLingS

________________________________________________________________

Join LDC for Membership Year 2022

Membership Year 2022 (MY2022) is open and discounts are available for those who keep their membership current and join early. Current MY2021 members who renew their LDC membership before March 1, 2022 will receive a 10% discount off the membership fee. New or returning organizations will receive a 5% discount when joining by March 1.

In addition to receiving new publications, current LDC members also enjoy the benefit of licensing older data from our Catalog of 900 holdings at reduced fees. Current-year for-profit members may use most data for commercial applications.

Plans for MY2022 publications are in progress. Among the expected releases are:

  • 2017 NIST OpenSAT Pilot – SSSF: real world operational English speech data, transcripts and annotation files used in the speech activity detection, automatic speech recognition, and keyword search tasks of the 2017 OpenSAT Pilot evaluation
  • AttImam: 2,000 attribution relations applied to Arabic newswire text from Arabic Treebank: Part 1 v 4.1 (LDC2010T13)
  • Samrómur Icelandic Speech: 145 hours of Icelandic prompted speech from 8,000 speakers covering text from novels, news, plays, and location names
  • MASRI Synthetic: 99 hours of synthesized Maltese speech from various genres with transcripts 
  • HAVIC MED Novel Tests: thousands of hours of event and background user-generated videos with annotation and metadata used for the 2015 Multimedia Event Detection task
  • DIHARD Challenges: development and evaluation data from the second and third DIHARD evaluations, a set of shared tasks focusing on speech diarization for challenging data
  • LORELEI: representative and incident language packs containing monolingual text, bi-text, translations, annotations, supplemental resources and related tools (Kinyarwanda, Wolof)

It’s not too late to join LDC for MY2020 (through December 31, 2021) and MY2021 (through December 31, 2022). Data sets from those years include 2018 NIST Speaker Recognition Evaluation Test Set, Mixer 4 and 5 Speech, AMR Annotation Release 3.0, Penn Parsed Corpora of Historical English, RATS Speaker Identification, BOLT Egyptian Arabic and Chinese resources (treebanks, propbanks, co-reference), Global TIMIT Mandarin Chinese, and MyST Children’s Conversational Speech.

For full descriptions of all LDC data sets, browse our Catalog.  

Visit Join LDC for details on membership, user accounts and payment.


Spring 2022 Data Scholarship Application Deadline

Applications are now being accepted through January 15, 2022 for the Spring 2022 LDC Data Scholarship program which provides university students with no-cost access to LDC data. Consult the LDC Data Scholarship page for more information about program rules and submission requirements.


New publications:

(1) BOLT Egyptian Arabic PropBank and Sense – Discussion Forum, SMS/Chat, and Conversational Telephone Speech was developed by the University of Colorado Boulder - CLEAR (Computational Language and Education Research) for the DARPA BOLT program and consists of propbank annotation on Egyptian Arabic informal text and telephone speech. 

Propbank annotation provides a layer of semantic annotation over treebank. In this release, it was applied to BOLT phrase structure treebank annotation and was carried out in two phases: (1) a frame file for each predicate was created, and (2) the predicate argument structure was annotated using the frame file as a reference. 

BOLT Egyptian Arabic PropBank and Sense – Discussion Forum, SMS/Chat, and Conversational Telephone Speech is distributed via web download.  

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) Second DIHARD Challenge Development - Eleven Sources was developed by LDC and contains approximately 22 hours of English and Chinese speech data along with corresponding annotations used in support of the Second DIHARD Challenge.

The DIHARD Challenges are a set of shared tasks on diarization focusing on "hard" diarization; that is, speech diarization for challenging corpora where there was an expectation that existing state-of-the-art systems would fare poorly. As with the first challenge, the second development and evaluation sets were drawn from a diverse sampling of sources including monologues, map task dialogues, broadcast interviews, sociolinguistic interviews, meeting speech, speech in restaurants, clinical recordings, extended child language acquisition recordings, and amateur web videos.

Second DIHARD Challenge Development – Eleven Sources is distributed via web download. 

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee. 

*

(3) Second DIHARD Challenge Development - SEEDLingS was developed by Duke University and LDC and contains approximately two hours of English child language recordings along with corresponding annotations used in support of the Second DIHARD Challenge. The DIHARD Challenges are a set of shared tasks on diarization focusing on "hard" diarization; that is, speech diarization for challenging corpora where there was an expectation that existing state-of-the-art systems would fare poorly.

Source data is from the SEEDLingS (The Study of Environmental Effects on Developing Linguistic Skills) corpus, designed to investigate how infants' early linguistic and environmental input plays a role in their learning. Recordings were generated in the home environment of infants in the Rochester, New York area. A subset of that data was annotated by LDC for use in the first and second DIHARD Challenges.

The data in this release consists of files provided in the Second DIHARD Challenge as well as subsequently updated annotated files not provided to second challenge participants.

Second DIHARD Challenge Development – SEEDLingS is distributed via web download. 

2021 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.