Linguistic Data Consortium: citizen linguistics

Wednesday, December 15, 2021

LDC December 2021 Newsletter

LDC 2022 Membership Discounts Now Available

Approaching Deadline for Spring 2022 Data Scholarship Applications

Citizen Linguistics

LDC Closed for Winter Break Dec. 24-Jan. 4

New Publications:

BOLT English Translation Treebank – Chinese SMS/Chat

HAVIC MED Training Data – Videos, Metadata and Annotation

LDC 2022 Membership Discounts Now Available
Now through March 1, 2022, current 2021 members receive a 10% discount for renewing their membership, and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for details on membership options and benefits.

Approaching Deadline for Spring 2022 Data Scholarship Applications
Attention students: don’t miss out on the chance to receive no-cost access to LDC data for your research. Applications for Spring 2022 data scholarships are due January 15, 2022. For more information on requirements and program rules, see LDC Data Scholarships.

Citizen Linguistics
LanguageARC (https://languagearc.com), a citizen science web portal for linguistics, continues to grow with 12 language research projects currently available to the community. Two new projects seeking contributions from citizen linguists have recently been added. The Fearless Steps project will make thousands of hours of Apollo space mission communications accessible to researchers and to the public. Contributors can listen to and annotate actual audio recordings from the Apollo 11 space mission. A second new project, Les stéréotypes en français, asks contributors to identify and classify stereotypes that can be expressed in the French language. In addition to these publicly available projects, LanguageARC also enables researchers to create research projects restricted to defined private groups, such as the recent object naming task to document the Guanzhong dialect of Mandarin. Here a private, invited group of about 60 contributors yielded over 34,000 speech recordings.

Please consider becoming an active participant in the LanguageARC community by contributing to research projects. If you are a researcher interested in creating your own project on LanguageARC, please reach out via the “Contact” page on the website.

LDC Closed for Winter Break Dec. 24-Jan. 4
LDC will be closed from Friday, December 24, 2021 through Tuesday, January 4, 2022 in accordance with the University of Pennsylvania Winter Break Policy. Our offices will reopen on Wednesday, January 5, 2022. Requests received by the Membership Office during Winter Break will be processed when the office reopens.

New publications:

(1) BOLT English Translation Treebank – Chinese SMS/Chat was developed by LDC and consists of SMS/Chat text data translated from Chinese to English and annotated for part-of-speech and syntactic structure.

The source data is Chinese SMS and chat text collected by LDC between 2010 and 2013. A subset of the translated text (194 files representing 108,385 tokens) was selected for treebanking. Part-of-speech and treebank annotation conform to Penn Treebank II style. Supplementary guidelines for English treebanks and web text are included with this release.

BOLT English Translation Treebank – Chinese SMS/Chat is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) HAVIC MED Training Data – Videos, Metadata and Annotation was developed by LDC and comprises approximately 2,100 hours of user-generated videos with annotation and metadata developed for the 2011-2015 NIST-sponsored MED (Multimedia Event Detection) tasks.

The data consists of videos of various events (event videos) and videos completely unrelated to events (background videos) harvested by a large team of human annotators. Each event video was manually annotated with judgments describing its event properties and other salient features. Background videos were labeled with topic and genre categories.

HAVIC MED Training Data – Videos, Metadata and Annotation is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. This corpus is a members-only release and is not available for non-member licensing. Contact ldc@ldc.upenn.edu for information about membership.

Wednesday, January 15, 2020

LDC 2020 January Newsletter

Renew Your LDC Membership Today
LREC Workshop for Citizen Linguistics – Call for Papers

New Publications:
Abstract Meaning Representation (AMR) Annotation Release 3.0
Database of Word Level Statistics – Mandarin
LibriVox Spanish
________________________________________________________________________

Renew Your LDC Membership Today

Join LDC for MY2020 while membership savings are still available. Now through March 2, 2020, renewing MY2019 members receive a 10% discount off the 2020 membership fee. New or returning member organizations receive a 5% discount. This year’s planned publications include Mixer 4 and 5 Speech (English telephone speech and interviews), IARPA Babel Language Packs (telephone speech and transcripts in underserved languages), and data from BOLT, DEFT, RATS, TAC KBP and more. Membership remains the most economical way to access LDC releases. Visit Join LDC for details on membership options and benefits.

LREC Workshop on Citizen Linguistics

LDC researchers and their colleagues are organizing a workshop on Citizen Linguistics and Language Resource Development at LREC 2020 (Language Resources and Evaluation Conference), to take place on May 16, 2020. The workshop includes an open call for papers in language-related citizen science, a tutorial on using the new LanguageARC.org citizen linguistics portal and a special session on best papers using LanguageARC.

________________________________________________________________________

New publications:

(1) Abstract Meaning Representation (AMR) Annotation Release 3.0 was developed by LDC, SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of 59,255 English natural language sentences from broadcast conversations, newswire, weblogs, web discussion forums, fiction and web text. This release updates Abstract Meaning Representation 2.0 (LDC2017T10) with new data, more annotations on new and prior data, new or improved PropBank-style frames, enhanced quality control, and multi-sentence annotations.

AMR captures "who is doing what to whom" in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independent of its syntax.
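
To make the "who is doing what to whom" idea concrete, here is a minimal Python sketch of a commonly cited illustrative AMR for the sentence "The boy wants to go". The sentence, the variable names (w, b, g), the frame sense numbers and the describe helper are all assumptions chosen for illustration; they are not taken from this release, and the corpus itself may store annotations in a different form.

```python
# Illustrative AMR for "The boy wants to go", stored as
# (source variable, role, target) triples. Everything here is a
# hypothetical example, not data or tooling from the LDC release.
AMR_TRIPLES = [
    ("w", "instance", "want-01"),  # w is an instance of the frame want-01
    ("b", "instance", "boy"),      # b is an instance of the concept boy
    ("g", "instance", "go-02"),    # g is an instance of the frame go-02
    ("w", "ARG0", "b"),            # the boy is the one who wants
    ("w", "ARG1", "g"),            # the going is the thing wanted
    ("g", "ARG0", "b"),            # the same boy is the one who goes
]


def describe(triples):
    """Return readable 'concept :ROLE concept' strings for each relation."""
    # Map each variable to the concept it instantiates.
    concepts = {src: tgt for src, role, tgt in triples if role == "instance"}
    # Render every non-instance edge using concept names instead of variables.
    return [
        f"{concepts[src]} :{role} {concepts[tgt]}"
        for src, role, tgt in triples
        if role != "instance"
    ]


if __name__ == "__main__":
    for line in describe(AMR_TRIPLES):
        print(line)
```

In this sketch the variable b fills a role in both frames, which is one way the representation can record within-sentence coreference without repeating the concept.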

Abstract Meaning Representation (AMR) Annotation Release 3.0 is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) Database of Word Level Statistics – Mandarin was developed by The Hong Kong Polytechnic University. It provides lexical characteristics of a descriptive and statistical nature for words and nonwords of Mandarin Chinese. It is designed for researchers particularly concerned with language processing of isolated words. Invariant characteristics include each item's lexicality, SAMPA, pinyin, IPA transcription, lexical tone, syllable structure, syllable length, pinyin length, segment length, dominant PoS, lexical frequency of the dominant PoS, percent of that dominant PoS, and other PoSes associated with the given item.

Database of Word Level Statistics – Mandarin is distributed via web download.

2020 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) LibriVox Spanish consists of approximately 73 hours of Spanish read speech and transcripts. The audio data was taken from Spanish audiobooks developed by LibriVox, a non-profit project that creates audiobooks from public domain works. The transcripts were developed for this release.

The audio consists of sentences from 300 books read by 154 speakers (77 men and 77 women), representing native and non-native Spanish read speech. Audio files were manually segmented and are between three and ten seconds in length. Native Spanish speakers transcribed the audio data.

LibriVox Spanish is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.