Showing posts with label LDC 2022 Membership. Show all posts

Tuesday, February 15, 2022

LDC February 2022 Newsletter

LDC Membership Discounts Expire March 1 

New Publications:

The Child Subglottal Resonances Database

Spoken Digits in Hindi and Indian English

 

LDC Membership Discounts Expire March 1 

There is still time to save on 2022 membership fees. Renew your LDC membership, rejoin the Consortium, or become a new member by March 1 to receive a discount of up to 10%. For more information on membership benefits and options, visit Join LDC.  

New publications:

(1) The Child Subglottal Resonances Database was developed by Washington University and the University of California, Los Angeles, and consists of 15.5 hours of simultaneous microphone and subglottal accelerometer recordings from 19 male and 9 female child speakers of American English aged 7-17.

The subglottal system is composed of the airways of the tracheobronchial tree and the surrounding tissues. It powers airflow through the larynx and vocal tract, allowing for the generation of most of the sound sources used in languages around the world. The subglottal resonances (SGRs) are the natural frequencies of the subglottal system. During speech, the subglottal system is acoustically coupled to the vocal tract via the larynx. SGRs can be measured from accelerometer recordings of the vibration of the skin of the neck during phonation, much as speech formants are measured from microphone recordings.

The corpus consists of 34 monosyllables in a phonetically neutral carrier phrase (“I said a ____ again”), with a median of 6 repetitions of each word by each speaker, resulting in 5,247 individual microphone (and accelerometer) waveforms. Speaker metadata is included. 

The Child Subglottal Resonances Database is distributed via web download.  

2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) Spoken Digits in Hindi and Indian English was developed by the Birla Institute of Technology and Science Pilani and contains two hours of speech from Hindi and English speakers with regional accents from across India saying the digits 1-10. The data was collected in person on a mobile handset recorder app, by one-to-one online communications over social apps, and from social media sites. Each audio file represents a single spoken digit in either Hindi or Indian English. Background noise was mostly retained. Some data was recorded in a noise-free environment or cleaned after recording to avoid abrupt noises such as car horns. Speaker metadata is included.

Spoken Digits in Hindi and Indian English is distributed via web download. 

2022 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Tuesday, January 18, 2022

LDC January 2022 Newsletter

Renew your LDC Membership today 

New Publications:

2017 NIST OpenSAT Pilot - SSSF

LORELEI Kinyarwanda Incident Language Pack
_____________________________________________________________

Renew your LDC Membership today 

The importance of curated resources for language-related education, research and technology development drives LDC’s mission to create them, to accept data contributions from researchers across the globe and to broadly share such resources through the LDC Catalog. LDC members enjoy no-cost access to new corpora released annually, as well as the ability to license legacy data sets from among our 900 holdings at reduced fees. Ensure that your data needs continue to be met by renewing your LDC membership or by joining the Consortium today.

Through March 1, 2022, organizations that were members in 2021 receive a 10% discount on 2022 membership, and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for more details on membership options and benefits.

New publications:

(1) 2017 NIST OpenSAT Pilot - SSSF was developed by NIST (National Institute of Standards and Technology) and contains approximately one hour of operational speech data, transcripts and annotation files used in the speech activity detection, automatic speech recognition and keyword search tasks of the 2017 OpenSAT Pilot evaluation. The source audio consists of radio and telephone dispatches during the Sofa Super Store fire (Charleston, South Carolina) in June 2007 (SSSF). 

The OpenSAT evaluation series was designed to bring together researchers developing different types of technologies to address speech analytic challenges present in some of the most difficult acoustic conditions. The 2017 pilot focused on the public safety communications domain. The SSSF audio represents real-world, fire response, operational data with multiple challenges for system analytics, such as land-mobile-radio transmission effects, significant background noise, speech under stress and variable decibel levels.

2017 NIST OpenSAT Pilot - SSSF is distributed via web download.  

2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee. 

*

(2) LORELEI Kinyarwanda Incident Language Pack was developed by LDC and comprises approximately 11.9 million words of Kinyarwanda monolingual text, 35,000 words of English monolingual text, 3.4 million words of parallel and comparable Kinyarwanda-English text, and 50,000 words each of English and Kinyarwanda data annotated for Entity Discovery and Linking and Situation Frames. It constitutes all of the text data, annotations, supplemental resources and related software tools for the Kinyarwanda language that were used in the DARPA LORELEI / LoReHLT 2018 Evaluation.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. In the evaluation scenario, an unforeseen event triggered a need for humanitarian and logistical support in a region where the incident language had received little or no attention in NLP research. Evaluation participants provided NLP solutions, including information extraction and machine translation, with limited resources and limited development time.

Data was collected from news, social networks, weblogs, newsgroups, discussion forums, and reference material. Entity detection and linking annotation identified entities to be detected by systems for scoring purposes. Situation frame analysis was designed to extract basic information about needs and relevant issues for planning a disaster response effort.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

LORELEI Kinyarwanda Incident Language Pack is distributed via web download. 

2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.