Linguistic Data Consortium: BOLT English Treebank

Wednesday, December 15, 2021

LDC December 2021 Newsletter

LDC 2022 Membership Discounts Now Available

Approaching Deadline for Spring 2022 Data Scholarship Applications

Citizen Linguistics

LDC Closed for Winter Break Dec. 24-Jan. 4

New Publications:

BOLT English Translation Treebank – Chinese SMS/Chat

HAVIC MED Training Data – Videos, Metadata and Annotation

LDC 2022 Membership Discounts Now Available
Now through March 1, 2022, current 2021 members receive a 10% discount for renewing their membership, and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for details on membership options and benefits.

Approaching Deadline for Spring 2022 Data Scholarship Applications
Attention students: don’t miss out on the chance to receive no-cost access to LDC data for your research. Applications for Spring 2022 data scholarships are due January 15, 2022. For more information on requirements and program rules, see LDC Data Scholarships.

Citizen Linguistics
LanguageARC (https://languagearc.com), a citizen science web portal for linguistics, continues to grow with 12 language research projects currently available to the community. Two new projects seeking contributions from citizen linguists have recently been added. The Fearless Steps project will make thousands of hours of Apollo space mission communications accessible to researchers and to the public. Contributors can listen to and annotate actual audio recordings from the Apollo 11 space mission. A second new project, Les stéréotypes en français, asks contributors to identify and classify stereotypes that can be expressed in the French language. In addition to these publicly available projects, LanguageARC also enables researchers to create research projects restricted to defined private groups, such as the recent object naming task to document the Guanzhong dialect of Mandarin. Here a private, invited group of about 60 contributors yielded over 34,000 speech recordings.

Please consider becoming an active participant in the LanguageARC community by contributing to research projects. If you are a researcher interested in creating your own project on LanguageARC, please reach out via the “Contact” page on the website.

LDC Closed for Winter Break Dec. 24-Jan. 4
LDC will be closed from Friday, December 24, 2021 through Tuesday, January 4, 2022 in accordance with the University of Pennsylvania Winter Break Policy. Our offices will reopen on Wednesday, January 5, 2022. Requests received by the Membership Office during Winter Break will be processed when the office reopens.

New publications:

(1) BOLT English Translation Treebank – Chinese SMS/Chat was developed by LDC and consists of SMS/Chat text data translated from Chinese to English and annotated for part-of-speech and syntactic structure.

The source data is Chinese SMS and chat text collected by LDC between 2010 and 2013. A subset of the translated text -- 194 files representing 108,385 tokens -- was selected for treebanking. Part-of-speech and treebank annotation conform to Penn Treebank II style. Supplementary guidelines for English treebanks and web text are included with this release.

BOLT English Translation Treebank – Chinese SMS/Chat is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) HAVIC MED Training Data – Videos, Metadata and Annotation was developed by LDC and comprises approximately 2,100 hours of user-generated videos with annotation and metadata developed for the 2011-2015 NIST-sponsored MED (Multimedia Event Detection) tasks.

The data consists of videos of various events (event videos) and videos completely unrelated to events (background videos) harvested by a large team of human annotators. Each event video was manually annotated with judgments describing its event properties and other salient features. Background videos were labeled with topic and genre categories.

HAVIC MED Training Data – Videos, Metadata and Annotation is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. This corpus is a members-only release and is not available for non-member licensing. Contact ldc@ldc.upenn.edu for information about membership.

Friday, January 15, 2021

LDC 2021 January Newsletter

Renew Your LDC Membership Today

New Publications:
LORELEI Akan Representative Language Pack
ATIS – Seven Languages
BOLT English Treebank – SMS/Chat

_____________________________________________________________________

Renew Your LDC Membership Today
Curated language resources are more important than ever to support research and language technology development, including the expanding fields around remote work, pandemic-related technologies, and non-contact interactions. LDC members enjoy no-cost access to 30+ new corpora released annually, as well as the ability to license legacy data sets at reduced fees. Ensure that your data needs continue to be met by renewing your LDC membership or by joining the Consortium today.

Now through March 1, 2021, 2020 members receive a 10% discount on 2021 membership, and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for more details on membership options and benefits.

New publications:

(1) LORELEI Akan Representative Language Pack consists of Akan monolingual text, Akan-English parallel text, annotations, supplemental resources, and related software tools developed by LDC for the DARPA LORELEI program.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations like natural disasters or disease outbreaks. Linguistic resources for LORELEI include Representative Language Packs and Incident Language Packs for over two dozen low resource languages, comprising data, annotations, basic natural language processing tools, lexicons, and grammatical resources. Representative languages were selected to provide broad typological coverage, while incident languages were selected to evaluate system performance on a language whose identity was disclosed at the start of the evaluation.

Data was collected from discussion forums, news, reference sources, social networks, and weblogs. Data volumes are as follows:

Over 3.3 million words of Akan monolingual text, all of which were translated into English
115,000 Akan words translated from English data

Approximately 2,300 words were annotated for named entities, full entity including nominals and pronouns, entity linking, simple semantic annotation, and situation frame annotation (identifying entities, needs, and issues). Around 2,000 words have morphological segmentation annotation.

LORELEI Akan Representative Language Pack is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) ATIS – Seven Languages was developed by Amazon Web Services, Inc. and consists of 5,871 English utterances from ATIS (Air Travel Information Services) corpora, specifically ATIS2 (LDC93S5), ATIS3 Training Data (LDC94S19), and ATIS3 Test Data (LDC95S26), translated into six languages: Spanish, German, French, Portuguese, Chinese, and Japanese.

The ATIS collection was developed to support the research and development of speech understanding systems. Participants were presented with various hypothetical travel planning scenarios and asked to solve them by interacting with partially or completely automated ATIS systems. The resulting utterances were recorded and transcribed. Data was collected in the early 1990s at five US sites: Raytheon BBN, Carnegie Mellon University, MIT Laboratory for Computer Science, National Institute of Standards and Technology, and SRI International.

The data is separated into 4,978 utterances for training and 893 utterances for testing following the original ATIS division. The source English utterances were manually translated into the six languages and are included in this release. Each utterance was annotated with named entities via table lookup; markers include city, airline, airport names, and dates.

ATIS – Seven Languages is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data at no cost.

(3) BOLT English Treebank – SMS/Chat was developed by LDC and consists of English SMS and text chat data with part-of-speech and syntactic structure annotation.

The source data consists of 115,667 tokens/words in 484 files of English SMS and text chat collected by LDC using two methods: new collection via LDC's collection platform and donation of SMS or chat archives from BOLT collection participants.

All data was annotated for word-level tokenization, part-of-speech, and syntactic structure. Annotation conformed to Penn Treebank II style, incorporating changes to those guidelines that were developed under the GALE (Global Autonomous Language Exploitation) program. Those changes primarily concerned the tokenization of hyphenated words, the part-of-speech and tree changes necessitated by that tokenization, and updates to the syntactic annotation to comply with updated annotation guidelines. Supplementary guidelines for English treebanks and web text are included with this release.

The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging, and chat -- in Chinese, Egyptian Arabic, and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.

BOLT English Treebank – SMS/Chat is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.