Linguistic Data Consortium: 2021

Wednesday, December 15, 2021

LDC December 2021 Newsletter

LDC 2022 Membership Discounts Now Available

Approaching Deadline for Spring 2022 Data Scholarship Applications

Citizen Linguistics

LDC Closed for Winter Break Dec. 24-Jan. 4

New Publications:

BOLT English Translation Treebank – Chinese SMS/Chat

HAVIC MED Training Data – Videos, Metadata and Annotation

LDC 2022 Membership Discounts Now Available
Now through March 1, 2022, current 2021 members receive a 10% discount for renewing their membership, and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for details on membership options and benefits.

Approaching Deadline for Spring 2022 Data Scholarship Applications
Attention students: don’t miss out on the chance to receive no-cost access to LDC data for your research. Applications for Spring 2022 data scholarships are due January 15, 2022. For more information on requirements and program rules, see LDC Data Scholarships.

Citizen Linguistics
LanguageARC (https://languagearc.com), a citizen science web portal for linguistics, continues to grow with 12 language research projects currently available to the community. Two new projects seeking contributions from citizen linguists have recently been added. The Fearless Steps project will make thousands of hours of Apollo space mission communications accessible to researchers and to the public. Contributors can listen to and annotate actual audio recordings from the Apollo 11 space mission. A second new project, Les stéréotypes en français, asks contributors to identify and classify stereotypes that can be expressed in the French language. In addition to these publicly available projects, LanguageARC also enables researchers to create research projects restricted to defined private groups, such as the recent object naming task to document the Guanzhong dialect of Mandarin. Here a private, invited group of about 60 contributors yielded over 34,000 speech recordings.

Please consider becoming an active participant in the LanguageARC community by contributing to research projects. If you are a researcher interested in creating your own project on LanguageARC, please reach out via the “Contact” page on the website.

LDC Closed for Winter Break Dec. 24-Jan. 4
LDC will be closed from Friday, December 24, 2021 through Tuesday, January 4, 2022 in accordance with the University of Pennsylvania Winter Break Policy. Our offices will reopen on Wednesday, January 5, 2022. Requests received by the Membership Office during Winter Break will be processed when the office reopens.

New publications:

(1) BOLT English Translation Treebank – Chinese SMS/Chat was developed by LDC and consists of SMS/Chat text data translated from Chinese to English and annotated for part-of-speech and syntactic structure.

The source data is Chinese SMS and chat text collected by LDC between 2010 and 2013. A subset of the translated text -- 194 files representing 108,385 tokens -- was selected for treebanking. Part-of-speech and treebank annotation conform to Penn Treebank II style. Supplementary guidelines for English treebanks and web text are included with this release.

BOLT English Translation Treebank – Chinese SMS/Chat is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) HAVIC MED Training Data – Videos, Metadata and Annotation was developed by LDC and comprises approximately 2,100 hours of user-generated videos with annotation and metadata developed for the 2011-2015 NIST-sponsored MED (Multimedia Event Detection) tasks.

The data consists of videos of various events (event videos) and videos completely unrelated to events (background videos) harvested by a large team of human annotators. Each event video was manually annotated with judgments describing its event properties and other salient features. Background videos were labeled with topic and genre categories.

HAVIC MED Training Data – Videos, Metadata and Annotation is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. This corpus is a members-only release and is not available for non-member licensing. Contact ldc@ldc.upenn.edu for information about membership.

Monday, November 15, 2021

LDC November 2021 Newsletter

Join LDC for Membership Year 2022

Spring 2022 Data Scholarship Application Deadline

New Publications:

BOLT Egyptian Arabic PropBank and Sense – Discussion Forum, SMS/Chat, and Conversational Telephone Speech

Second DIHARD Challenge Development – Eleven Sources

Second DIHARD Challenge Development – SEEDLingS

________________________________________________________________

Join LDC for Membership Year 2022

Membership Year 2022 (MY2022) is open and discounts are available for those who keep their membership current and join early. Current MY2021 members who renew their LDC membership before March 1, 2022 will receive a 10% discount off the membership fee. New or returning organizations will receive a 5% discount when joining by March 1.

In addition to receiving new publications, current LDC members also enjoy the benefit of licensing older data from our Catalog of 900 holdings at reduced fees. Current-year for-profit members may use most data for commercial applications.

Plans for MY2022 publications are in progress. Among the expected releases are:

2017 NIST OpenSAT Pilot – SSSF: real-world operational English speech data, transcripts and annotation files used in the speech activity detection, automatic speech recognition, and keyword search tasks of the 2017 OpenSAT Pilot evaluation

AttImam: 2000 attribution relations applied to Arabic newswire text from Arabic Treebank: Part 1 v 4.1 LDC2010T13

Samrómur Icelandic Speech: 145 hours of Icelandic prompted speech from 8000 speakers covering text from novels, news, plays, and location names

MASRI Synthetic: 99 hours of synthesized Maltese speech from various genres with transcripts

HAVIC MED Novel Tests: thousands of hours of event and background user-generated videos with annotation and metadata used for the 2015 Multimedia Event Detection task

DIHARD Challenges: development and evaluation data from the second and third DIHARD evaluations, a set of shared tasks focusing on speech diarization for challenging data

LORELEI: representative and incident language packs containing monolingual text, bi-text, translations, annotations, supplemental resources and related tools (Kinyarwanda, Wolof)

It’s not too late to join LDC for MY2020 (through December 31, 2021) and MY2021 (through December 31, 2022). Data sets from those years include 2018 NIST Speaker Recognition Evaluation Test Set, Mixer 4 and 5 Speech, AMR Annotation Release 3.0, Penn Parsed Corpora of Historical English, RATS Speaker Identification, BOLT Egyptian Arabic and Chinese resources (treebanks, propbanks, co-reference), Global TIMIT Mandarin Chinese, and MyST Children’s Conversational Speech.

For full descriptions of all LDC data sets, browse our Catalog.

Visit Join LDC for details on membership, user accounts and payment.

Spring 2022 Data Scholarship Application Deadline

Applications are now being accepted through January 15, 2022 for the Spring 2022 LDC Data Scholarship program, which provides university students with no-cost access to LDC data. Consult the LDC Data Scholarship page for more information about program rules and submission requirements.

New publications:

(1) BOLT Egyptian Arabic PropBank and Sense – Discussion Forum, SMS/Chat, and Conversational Telephone Speech was developed by the University of Colorado Boulder - CLEAR (Computational Language and Education Research) for the DARPA BOLT program and consists of propbank annotation on Egyptian Arabic informal text and telephone speech.

Propbank annotation provides a layer of semantic annotation over treebank. In this release, it was applied to BOLT phrase structure treebank annotation and was carried out in two phases: (1) a frame file for each predicate was created, and (2) the predicate argument structure was annotated using the frame file as a reference.

BOLT Egyptian Arabic PropBank and Sense – Discussion Forum, SMS/Chat, and Conversational Telephone Speech is distributed via web download.

(2) Second DIHARD Challenge Development – Eleven Sources was developed by LDC and contains approximately 22 hours of English and Chinese speech data along with corresponding annotations used in support of the Second DIHARD Challenge.

The DIHARD Challenges are a set of shared tasks on diarization focusing on "hard" diarization; that is, speech diarization for challenging corpora where there was an expectation that existing state-of-the-art systems would fare poorly. As with the first challenge, the second development and evaluation sets were drawn from a diverse sampling of sources including monologues, map task dialogues, broadcast interviews, sociolinguistic interviews, meeting speech, speech in restaurants, clinical recordings, extended child language acquisition recordings, and amateur web videos.

Second DIHARD Challenge Development – Eleven Sources is distributed via web download.

(3) Second DIHARD Challenge Development – SEEDLingS was developed by Duke University and LDC and contains approximately two hours of English child language recordings along with corresponding annotations used in support of the Second DIHARD Challenge. The DIHARD Challenges are a set of shared tasks on diarization focusing on "hard" diarization; that is, speech diarization for challenging corpora where there was an expectation that existing state-of-the-art systems would fare poorly.

Source data is from the SEEDLingS (The Study of Environmental Effects on Developing Linguistic Skills) corpus, designed to investigate how infants' early linguistic and environmental input plays a role in their learning. Recordings were generated in the home environment of infants in the Rochester, New York area. A subset of that data was annotated by LDC for use in the first and second DIHARD Challenges.

The data in this release consists of files provided in the Second DIHARD Challenge as well as subsequently updated annotated files not provided to second challenge participants.

Second DIHARD Challenge Development – SEEDLingS is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Thursday, October 14, 2021

LDC October 2021 Newsletter

Fall 2021 data scholarship recipients

Membership Year 2022 publication preview

LDC data and commercial technology development

New Publications:

UCLA Variability Speaker Database

BOLT Egyptian Arabic Treebank – SMS/Chat

_______________________________________________________

Fall 2021 data scholarship recipients

Congratulations to the recipients of LDC's Fall 2021 data scholarships:

Sophia Minnillo: University of California, Davis (USA); PhD, Linguistics. Sophia is awarded a copy of ETS Corpus of Non-Native Written English LDC2014T06 for her research on the use of transition markers by Chinese L1 speakers.

Jagabandhu Mishra: Indian Institute of Technology Dharwad (India); Research Scholar, Electrical Engineering. Jagabandhu is awarded a copy of Mandarin-English Code-Switching in South-East Asia LDC2015S04 for his work in spoken language diarization.

Kashyap Patel: University of Texas at Dallas (USA); PhD, Electrical Engineering. Kashyap is awarded copies of CSR-I (WSJ0) Sennheiser LDC93S6B and CSR-II (WSJ1) Sennheiser LDC94S13B for his research in audio, acoustic and speech signal processing.

Yoshani Ranaweera, D. Dissanayaka, S. Sudasinghe: University of Moratuwa (Sri Lanka); Bachelor's, Computer Science and Engineering. This group is awarded a copy of CALLHOME American English Speech LDC97S4 for their work in speaker diarization.

Winie Wong: University of Illinois at Chicago (USA); PhD, Electrical and Computer Engineering. Winie is awarded copies of ISI Chinese-English Automatically Extracted Parallel Text LDC2007T09 and GALE Phase 3 and 4 Chinese Broadcast News Parallel Text LDC2016T15 for her research in machine translation.

For information about the program, visit the Data Scholarships page.

Membership Year 2022 publication preview

The 2022 Membership Year is approaching and plans for next year’s publications are in progress. Among the expected releases are:

2017 NIST OpenSAT Pilot – SSSF: real-world operational English speech data, transcripts and annotation files used in the speech activity detection, automatic speech recognition, and keyword search tasks of the 2017 OpenSAT Pilot evaluation
AttImam: 2000 attribution relations applied to Arabic newswire text from Arabic Treebank: Part 1 v 4.1 LDC2010T13
Samrómur Icelandic Speech: 145 hours of Icelandic prompted speech from 8000 speakers covering text from novels, news, plays, and location names
MASRI Synthetic: 99 hours of synthesized Maltese speech from various genres with transcripts
HAVIC MED Novel Tests: thousands of hours of event and background user-generated videos with annotation and metadata used for the 2015 Multimedia Event Detection task
DIHARD Challenges: development and evaluation data from the second and third DIHARD evaluations, a set of shared tasks focusing on speech diarization for challenging data
LORELEI: representative and incident language packs containing monolingual text, bi-text, translations, annotations, supplemental resources and related tools (Kinyarwanda, Wolof)

Check your inbox in the coming weeks for more information about membership renewal. 

LDC data and commercial technology development

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

(1) UCLA Variability Speaker Database was developed by UCLA Speech Processing and Auditory Perception Laboratory and consists of approximately 34 hours of English speech and orthographic transcripts. Speakers (101 female, 101 male) took part in the following tasks: vowel sounds, reading sentences, giving instructions, neutral conversation, happy conversation, a phone conversation, annoyed conversation, and responding to a video. This corpus was designed to sample variability in speaking within individual speakers and across a large number of speakers.

UCLA Variability Speaker Database is distributed via web download.

(2) BOLT Egyptian Arabic Treebank – SMS/Chat was developed by LDC and consists of Egyptian Arabic SMS/Chat data with part-of-speech annotation, morphology, and syntactic tree annotation. This release contains 349,414 tokens before clitics were split and 435,677 tree tokens after clitics were split for treebank annotation. The source data was collected by LDC from its collection platform or by donation and was manually reviewed to exclude material not in the target language or with sensitive content. Originally written in Arabizi (Romanized/Latin characters) script, the source SMS/chat text was transliterated to Arabic script and manually corrected prior to treebank annotation. Annotations followed Penn Arabic Treebank guidelines.

BOLT Egyptian Arabic Treebank – SMS/Chat is distributed via web download.