Linguistic Data Consortium: English

Monday, February 17, 2025

LDC February 2025 Newsletter

LDC at LT4All 2025

LDC membership discounts expire March 3

Spring 2025 data scholarship recipients

New publications:

AIDA Scenario 3 Practice Topic Source Data and Annotation

MATERIAL Georgian-English Language Pack

______________________________________________________________________

LDC at LT4All 2025
LDC is pleased to be a sponsor of The 2nd International Conference on Language Technologies for All (LT4All 2025), February 24-26, 2025, organized by ELRA and SIGUL, the ELRA/ISCA Special Interest Group on Under-resourced Languages, and in partnership with UNESCO as part of the International Decade of Indigenous Languages (2022-2032). The conference theme, "Advancing Humanism through Language Technologies," focuses on community empowerment within the larger discussion on the many ways technology impacts language communities. The conference will also commemorate the Silver Jubilee of International Mother Language Day (February 21).

LDC membership discounts expire March 3
Time is running out to save on 2025 membership fees. Renew your LDC membership, rejoin the Consortium, or become a new member by March 3 to receive a discount of up to 10%. For more information on membership benefits and options, visit Join LDC.

Spring 2025 data scholarship recipients
Congratulations to the recipients of LDC’s Spring 2025 data scholarships:

Sair Buckle: Charles Sturt University (Australia): PhD student, AI and Cyber Futures Institute. Sair is awarded a copy of Avocado Research Email Corpus LDC2015T03 for her work in behavioral science.

Le Phuoc Thinh Tien: Vietnam National University Ho Chi Minh City (Vietnam): Bachelor’s student, Faculty of Information Technology. Le is awarded a copy of Penn Discourse Treebank Version 3.0 LDC2019T05 for his research in natural logical reasoning.

The next round of applications will be accepted in September 2025. For information about the program, visit the Data Scholarships page.

New publications:

AIDA Scenario 3 Practice Topic Source Data and Annotation was developed by LDC and is comprised of English, Russian and Spanish web documents (text, video, image) and annotations. Each phase of the AIDA program centered on a specific scenario, or broad topic area, with related subtopics designated as either practice subtopics or evaluation subtopics. The Phase 3 scenario focused on the COVID-19 global pandemic. This corpus contains source documents and annotations for the Scenario 3 practice topics.

The corpus contains 1417 root documents; 279 documents were annotated. Annotations include:

Event, relation and entity annotation (64 documents)

Claim frame annotation: claims (true or not) relating to the COVID-19 pandemic (203 documents)

Practice topic query claim frames: example claim frames intended to be used by systems as queries to extract similar claims from additional documents (30 documents)

The DARPA AIDA (Active Interpretation of Disparate Alternatives) program aimed to develop a multi-hypothesis semantic engine to generate explicit alternative interpretations of events, situations and trends from a variety of unstructured sources. LDC supported AIDA by collecting, creating and annotating multimodal linguistic resources in multiple languages.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

MATERIAL Georgian-English Language Pack was developed by Appen for the IARPA MATERIAL program and contains 79 hours of Georgian conversational telephone speech, transcripts, English translations, annotations and queries. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments. Transcripts cover approximately half of the speech files, and approximately 3% of the speech data was translated into English. This release also includes English queries and their relevance annotations.

The MATERIAL program focused on underserved languages, with the ultimate goal of building cross-language information retrieval systems that find speech and text content using English search queries.

2025 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.

Thursday, February 15, 2024

LDC February 2024 Newsletter

LDC membership discounts expire March 1

Spring 2024 data scholarship recipients

Four corpora withdrawn from the LDC Catalog

New publications:

Second Language University Speech Intelligibility Corpus

AIDA Scenario 1 Practice Topic Annotation

_________________________________________________________________

LDC membership discounts expire March 1

Time is running out to save on 2024 membership fees. Renew your LDC membership, rejoin the Consortium, or become a new member by March 1 to receive a discount of up to 10%. For more information on membership benefits and options, visit Join LDC.

Spring 2024 data scholarship recipients

Congratulations to the recipients of LDC’s Spring 2024 data scholarships:

Jordan Chandler: Université Rennes 2 (France): Master’s student, English Studies. Jordan is awarded a copy of Penn Parsed Corpora of Historical English LDC2020T16 to continue his research on the historical development of adjective, quantifier and article indefiniteness in the English language.

Nikhil Raghav: TCG Crest (India): PhD candidate, Institute for Advancing Intelligence. Nikhil is awarded copies of Third DIHARD Challenge Development LDC2022S12 and Third DIHARD Challenge Evaluation LDC2022S14 for his work in speaker diarization.

Abraham Sanders: Rensselaer Polytechnic Institute (USA): PhD candidate, Cognitive Science. Abraham is awarded copies of Fisher English Training Speech Part 1 Speech LDC2004S13, Fisher English Training Speech Part 1 Transcripts LDC2004T19, Fisher English Training Part 2 Speech LDC2005S13 and Fisher English Training Part 2 Transcripts LDC2005T19 for his work in spoken dialogue systems.

The next round of applications will be accepted in September 2024. For information about the program, visit the Data Scholarships page.

Four corpora withdrawn from the LDC Catalog

We regret to announce that The New York Times Annotated Corpus LDC2008T19 has been withdrawn from the LDC Catalog by the data provider. Because they contain data from LDC2008T19, the following three corpora are also withdrawn from the Catalog: Benchmarks for Open Relation Extraction LDC2014T27, Concretely Annotated New York Times LDC2018T12, and News Sub-domain Named Entity Recognition LDC2023T12. Organizations and individuals who have previously licensed any of these data sets can continue to use them under the terms of their respective special license agreements.

New publications:

Second Language University Speech Intelligibility Corpus was developed by Northern Arizona University, The Pennsylvania State University, and The University of Texas at Dallas. It contains 10.5 hours of English speech collected from 66 international faculty and university students representing 15 language backgrounds at 10 North American universities. This release also includes orthographic transcriptions for all recordings, intelligibility scores for 73% of the files, speaker metadata, and aligned Praat textgrids.

The speech data is comprised of presentations, descriptions, reflections, and microteaching tasks. Speakers were recruited from courses at intensive English programs and oral skills courses for international graduate students seeking to become international teaching assistants.

2024 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.

AIDA Scenario 1 Practice Topic Annotation was developed by LDC and is comprised of annotations for 212 English, Russian and Ukrainian web documents (text, image and video) from AIDA Scenario 1 Practice Topic Source Data (LDC2023T11), specifically, the set of practice documents designated for annotation in Phase 1.

Annotations are presented as tab-separated files in the following categories for each topic:

Mentions: single references in source data to a real-world entity or filler, event, or relation.

Slots: pre-defined roles in an event or relation filled by an argument (entity mention).

Linking: entity mentions linked to entries in the knowledge base as a method of indicating the real-world entity to which an entity referred.
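For readers planning to work with tab-separated annotation files of this kind, the following is a minimal Python sketch of loading a mentions table and grouping mentions by the entity they refer to. The column names (mention_id, entity_id, type, text) and the sample rows are hypothetical and chosen only for illustration; the actual file layout is defined in the corpus documentation.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data: the real column names and identifiers are
# defined in the corpus documentation, not in this newsletter.
SAMPLE_MENTIONS = (
    "mention_id\tentity_id\ttype\ttext\n"
    "m1\te1\tentity\tKyiv\n"
    "m2\te1\tentity\tthe capital\n"
    "m3\tev1\tevent\tprotest\n"
)


def read_tsv(handle):
    """Return the rows of a tab-separated file as dicts keyed by the header row."""
    return list(csv.DictReader(handle, delimiter="\t"))


def group_by(rows, key):
    """Group rows by the value found in column `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


# io.StringIO stands in for an opened annotation file.
mentions = read_tsv(io.StringIO(SAMPLE_MENTIONS))
by_entity = group_by(mentions, "entity_id")

for entity_id, rows in by_entity.items():
    print(entity_id, [row["text"] for row in rows])
```

The same two helpers could be pointed at slot or linking tables, assuming those files also carry a header row.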

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Friday, December 15, 2023

LDC December 2023 Newsletter

LDC 2024 membership discounts now available

Approaching deadline for Spring 2024 data scholarship applications

LDC closed for Winter Break Dec. 25-Jan. 1

New publications:

Kasdi-Merbah (University) Emotional Database in Arabic Speech

TAC-KBP Belief and Sentiment – Comprehensive Training and Evaluation Data 2016-2017
______________________________________________________________

LDC 2024 membership discounts now available

Now through March 1, 2024, current 2023 members receive a 10% discount for renewing their membership, and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for details on membership options and benefits.

Approaching deadline for Spring 2024 data scholarship applications

Attention students: don’t miss out on the chance to receive no-cost access to LDC data for your research. Applications for Spring 2024 data scholarships are due January 15, 2024. For more information on requirements and program rules, see LDC Data Scholarships.

LDC closed for Winter Break Dec. 25-Jan. 1

LDC will be closed from Monday, December 25, 2023 through Monday, January 1, 2024 in accordance with the University of Pennsylvania Winter Break Policy. Our offices will reopen on Tuesday, January 2, 2024. Requests received by the Membership Office during Winter Break will be processed when the office reopens.

New publications:

Kasdi-Merbah (University) Emotional Database in Arabic Speech was developed by the University of Kasdi Merbah Ouargla and contains two hours of Modern Standard Arabic prompted speech from 500 speakers (254 female, 246 male) representing 5,000 utterances. Each speaker read ten sentences, with two sentences each for five different emotions (sadness, fear, anger, happiness, neutral).

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

TAC-KBP Belief and Sentiment – Comprehensive Training and Evaluation Data 2016-2017 includes all training and evaluation data developed by LDC for the Belief and Sentiment tracks: source documents (Chinese, English, and Spanish newswire and discussion forums); gold standard entity, relation, and event annotation; and belief and sentiment annotation.

The goal of the TAC-KBP Belief and Sentiment track was to provide information about beliefs and sentiments held by entities toward other entities, as well as toward events and relations. The gold standard set of labeled entities, relations, and events was used to create a system for automatically labeling belief and sentiment about each possible target (entity, relation or event) and for identifying the entity holding the belief or sentiment.

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Thursday, April 15, 2021

LDC April 2021 Newsletter

New Publications:

X-SRL: Parallel Cross-lingual Semantic Role Labeling
TAC KBP English Sentiment Slot Filling – Comprehensive Training and Evaluation Data 2013-2014
_____________________________________________________________________________

New Publications:

(1) X-SRL: Parallel Cross-lingual Semantic Role Labeling was developed by Heidelberg University, Department of Computational Linguistics and the Leibniz Institute for the German Language (IDS). It consists of approximately three million words of German, French and Spanish annotated for semantic role labeling. The texts are translations of the English portion of 2009 CoNLL Shared Task Part 2 (LDC2012T04). All sentences have annotations for verbal predicates and share the original English Propbank label set across the four languages.

The 2009 CoNLL Shared Task developed syntactic dependency annotations, including the semantic dependency model roles of both verbal and nominal predicates. The following English data was used in the shared task:

Treebank-2 (LDC95T7): over one million words of annotated English newswire and other text developed by the University of Pennsylvania
Proposition Bank I (LDC2004T14): semantic annotation of newswire text from Treebank-2 developed by the University of Pennsylvania
NomBank v 1.0 (LDC2008T23): argument structure for instances of common nouns in Treebank-2 and Treebank-3 (LDC99T42), developed by New York University

For X-SRL, the English source data was automatically translated using DeepL. Automatic tokenization, lemmatization, part-of-speech tagging and syntactic parsing were then applied to the text. The data was divided into train, development and test partitions. Semantic labels were transferred for the train and development sections, and the test sentences were validated for translation quality, alignment, label transfer, and filtering.

X-SRL: Parallel Cross-lingual Semantic Role Labeling is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) TAC KBP English Sentiment Slot Filling – Comprehensive Training and Evaluation Data 2013-2014 was developed by LDC and contains training and evaluation data produced in support of the 2013 and 2014 TAC KBP Sentiment Slot Filling tracks. The data in this release includes queries, manual runs (human-produced query responses), and assessment results for human- and system-produced query responses. Source data was English news and web text.

The regular English Slot Filling track involved mining information about entities from text using a specified set of "slots", or attributes. The goal of the Sentiment Slot Filling task was to evaluate the quality of detectors for positive and negative sentiment.

TAC KBP English Sentiment Slot Filling – Comprehensive Training and Evaluation Data 2013-2014 is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Wednesday, April 15, 2020

LDC 2020 April Newsletter

New Publications:

2018 NIST Speaker Recognition Evaluation Test Set
Abstract Meaning Representation 2.0 - Four Translations
TAC KBP English Temporal Slot Filling - Comprehensive Training and Evaluation Data 2011 and 2013
________________________________________________________________

New publications:

(1) 2018 NIST Speaker Recognition Evaluation Test Set was developed by LDC and NIST (National Institute of Standards and Technology) and contains approximately 396 hours of Tunisian Arabic telephone recordings and English web video speech used as development and test data in the NIST-sponsored 2018 Speaker Recognition Evaluation (SRE). This release also contains answer keys, trial and train files, development data and evaluation documentation.

The SRE task is speaker detection, that is, to determine whether a specified target speaker is speaking during a segment of speech. In addition to the traditional focus on conversational telephone speech recorded over a variety of handset types for the training and test conditions, SRE18 added VOIP (voice over IP) data and audio from video.

The telephone speech data was drawn from the Call My Net 2 (CMN2) collection conducted by LDC in Tunisia, in which recruited Tunisian Arabic speakers made multiple calls to friends or relatives for conversations lasting 8-10 minutes. The speech segments include PSTN (public switched telephone network) and VOIP data.

The English audio was sampled from amateur web videos collected by LDC as part of the Video Annotation for Speech Technology (VAST) project.

2018 NIST Speaker Recognition Evaluation Test Set is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) Abstract Meaning Representation 2.0 - Four Translations was developed by researchers at the University of Edinburgh, School of Informatics and consists of Spanish, German, Italian and Mandarin Chinese translations of 5,484 test split sentences (1,371 sentences per language) from Abstract Meaning Representation (AMR) Annotation Release 2.0 (LDC2017T10).

AMR Annotation Release 2.0 is a semantic treebank of over 39,000 English natural language sentences from broadcast conversations, newswire and web text. The translated data in this release was designed for use in cross-lingual parsing.

The source sentences were drawn from material collected by LDC, specifically, discussion forum text from the DARPA BOLT and DARPA DEFT programs, transcripts and English translations of Mandarin Chinese broadcast news programming, Wall Street Journal text, translated Xinhua news texts, various newswire texts from NIST OpenMT evaluations and weblog data from the DARPA GALE program.

Abstract Meaning Representation 2.0 - Four Translations is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) TAC KBP English Temporal Slot Filling - Comprehensive Training and Evaluation Data 2011 and 2013 was developed by LDC and contains training and evaluation data produced in support of the TAC KBP English Temporal Slot Filling tasks in 2011 and 2013. This release includes queries, manual runs produced by LDC annotators, and the final rounds of assessment results.

The goal of the Temporal Slot Filling task was to identify and capture temporal information in text indicating when a given relation between a slot filling query entity and filler held true. This built upon the technology developed for regular Slot Filling which involved mining information about entities from text.

The corresponding source data collections of English newswire, broadcast material and web text are included in TAC KBP Comprehensive English Source Corpora 2009-2014 (LDC2018T03). The corresponding Knowledge Base (KB) for much of the data - a 2008 snapshot of Wikipedia - is available in TAC KBP Reference Knowledge Base (LDC2014T16).

TAC KBP English Temporal Slot Filling - Comprehensive Training and Evaluation Data 2011 and 2013 is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Friday, March 13, 2020

LDC 2020 March Newsletter

Spring 2020 LDC Data Scholarship recipients
LDC data and commercial technology development

New Publications:
BOLT Egyptian Arabic-English Word Alignment -- Conversational Telephone Speech Training
EVALution
Mixer 4 and 5 Speech

__________________________________________________________________

Spring 2020 LDC Data Scholarship recipients

LDC congratulates the following Spring 2020 Data Scholarship recipients:

Zahra Azin (Istanbul Technical University, Turkey) is awarded a copy of Abstract Meaning Representation (AMR) Annotation Release 3.0 (LDC2020T02) for her work in Turkish AMR.
Spandan Dey (IIT Kharagpur, India) is awarded a copy of Multi-Language Conversational Telephone Speech – South Asian (LDC2017S14) for his research on automatic language recognition.
Jonathan Downey (University of California, Santa Barbara, US) is awarded a copy of the ETS Corpus of Non-Native Written English (LDC2014T06) for his research on second language acquisition and quantitative methodologies for educational measurements.
Nathaniel Fackler (University of Georgia, US) is awarded a copy of the ETS Corpus of Non-Native Written English (LDC2014T06) for his work on adult second language acquisition.
B. Senthil Kumar (SSN College of Engineering & Anna University, India) is awarded a copy of 2009 CoNLL Shared Task Part 2 (LDC2012T04) for his research on semantic role labeling.
Ming Li (Colorado School of Mines, US) is awarded a copy of TIDIGITS (LDC93S10) for her research on inferring speech signals from motion data in Internet of Things (IoT) security.
Jialiang Lin (Xiamen University, China) is awarded a copy of the ETS Corpus of Non-Native Written English (LDC2014T06) for his project to train and test an automated essay scoring model.

Students can learn more about the LDC Data Scholarship program and the next application cycle on the Data Scholarships page.

LDC data and commercial technology development

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

__________________________________________________________________

New publications:

(1) BOLT Egyptian Arabic-English Word Alignment -- Conversational Telephone Speech Training was developed by LDC and consists of 153,171 words of Egyptian Arabic and English parallel text enhanced with linguistic tags to indicate word relations.

The source data in this release consists of transcripts of Egyptian Arabic conversational telephone speech (CTS) from LDC's CALLHOME and CALLFRIEND collections (LDC97S45, LDC97T19, LDC2002S37, LDC2002T38, LDC96S49) that were translated into English by professional translation agencies and annotated for the word alignment task.

The BOLT word alignment task was built on treebank annotation. Egyptian Arabic source tree tokens were automatically extracted from tree files in LDC’s BOLT Egyptian Arabic Treebank, which had been tagged for part-of-speech and syntactically annotated. That data was then aligned and annotated for the word alignment task.

BOLT Egyptian Arabic-English Word Alignment -- Conversational Telephone Speech Training is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) EVALution was developed by The Hong Kong Polytechnic University. It is comprised of English and Mandarin Chinese data sets -- EVALution 1.0 and EVALution-MAN, respectively -- that contain semantic relations and metadata for training and evaluating distributional semantic models.

EVALution 1.0 consists of approximately 7,500 English tuples extracted from ConceptNet 5.0 and WordNet 4.0 and filtered through automatic methods and crowd-sourcing. Several semantic relations between word pairs were instantiated, including hypernymy, synonymy, antonymy and meronymy. The corpus also includes additional information that can be used to filter the pairs or to analyze the results, such as relation domain, word frequency, word part-of-speech and word semantic field.

EVALution-MAN consists of Chinese word pairs from two sources: Chinese Wordnet and humans who completed an elicitation task by supplying missing words to sentences. The human-supplied sentence word pairs were then judged by human raters for reliability.

EVALution is distributed via web download.

2020 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) Mixer 4 and 5 Speech was developed by LDC and contains approximately 14,185 hours of audio recordings of conversational telephone speech, interviews, elicitation exercises and transcript readings involving 616 distinct speakers. The material was collected in 2007 as part of the Mixer project – which supported speaker recognition for a variety of research tasks – and recordings in this corpus were used in the 2008 NIST Speaker Recognition Evaluation.

The data in this release was collected by LDC at its Human Subjects Data Collection Laboratories in Philadelphia and by the International Computer Science Institute (ICSI) at the University of California, Berkeley, as a collaborative, carefully coordinated activity at both recording sites. The Mixer 4 and 5 collection contains 2,568 recordings made via the public telephone network and 2,152 sessions of multiple microphone recordings in office-room settings.

The telephone protocol connected recruited speakers through a robot operator to carry on casual conversations. In Mixer 4, 400 subjects made ten 10-minute calls; half of those subjects also visited one of the collection sites where they made two telephone calls while also being recorded on a cross-channel platform. In Mixer 5, 300 subjects each completed ten calls and six interview sessions at either LDC or ICSI; those sessions were conducted on a cross-channel platform and included a telephone call in one of three vocal-effort conditions - normal, high and low. Mixer participants were nearly all native English speakers, the rest being bilingual English speakers.

This release includes metadata about the calls and speakers, along with time-aligned entries for many of the component portions of the recording sessions.

Mixer 4 and 5 Speech is distributed via hard drive.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. This corpus is a members-only release and is not available for non-member licensing. Contact ldc@ldc.upenn.edu for information about membership.

Tuesday, September 17, 2019

LDC 2019 September Newsletter

LDC at Interspeech 2019

New Publications:
CALLFRIEND Canadian French Second Edition
BOLT Chinese-English Word Alignment and Tagging -- SMS/Chat Training
Machine Reading Phase 1 NFL Scoring Training Data

_____________________________________________________________________

LDC at Interspeech 2019

LDC is exhibiting at Interspeech 2019, September 15-19 in Graz, Austria. Stop by Booth F16 to learn more about recent developments at the Consortium and new publications.

Be on the lookout for The Second DIHARD Speech Diarization Challenge (DIHARD II), a special session co-organized by LDC, and the following presentations featuring LDC work:

The Second DIHARD Diarization Challenge: Dataset, task, and baselines
Neville Ryant, Christopher Cieri, Mark Liberman (LDC), Kenneth Church (Baidu, USA), Alejandrina Cristia (Laboratoire de Sciences Cognitives et Psycholinguistique), Jun Du (University of Science and Technology of China), Sriram Ganapathy (Indian Institute of Science)
Oral Session, Tuesday September 17, 10:00 – 10:20, Hall 3

Automatic Detection of Prosodic Focus in American English
Sunghye Cho and Mark Liberman (LDC), Yong-cheol Lee (Cheongju University)
Poster Session, Wednesday September 18, 16:00 – 18:00, Gallery B

Automatic detection of ASD in children using acoustic and text features from brief natural conversations
Sunghye Cho, Mark Liberman, Neville Ryant (LDC), Meredith Cola, Robert T. Schultz, Julia Parish-Morris (Children's Hospital of Philadelphia)
Oral Session, Wednesday September 18, 16:45 – 17:00, Hall 3

LDC will post conference updates via our Twitter feed and Facebook page. We hope to see you there!

New publications:

(1) CALLFRIEND Canadian French Second Edition was developed by LDC and consists of approximately 26 hours of unscripted telephone conversations between native speakers of Canadian French. This second edition updates the audio files to wav format, simplifies the directory structure and adds documentation and metadata. The first edition is available as CALLFRIEND Canadian French (LDC96S48).

All data was collected before July 1997. Participants could speak with a person of their choice on any topic; most called family members and friends. All calls originated in North America. The recorded conversations last up to 30 minutes.

CALLFRIEND Canadian French Second Edition is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) BOLT Chinese-English Word Alignment and Tagging -- SMS/Chat Training was developed by LDC for the DARPA BOLT (Broad Operational Language Translation) program and consists of 388,027 words of Chinese and English parallel text enhanced with linguistic tags to indicate word relations.

This release consists of Chinese source text and chat conversations collected using two methods: new collection via LDC's collection platform and donation of SMS and chat archives from BOLT collection participants. The source data is released as BOLT Chinese SMS/Chat (LDC2018T15).

The BOLT word alignment task was built on treebank annotation. LDC automatically extracted Chinese source tokens, including empty categories/traces, from word-segmented files provided by the BOLT Chinese Treebank annotation team at Brandeis University. The word-segmented tokens were then used to automatically generate ctb (Chinese Treebank) alignment. They were also tokenized for character alignment by inserting white space between characters.
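As a rough illustration of the difference between word-level and character-level tokenization described above, here is a minimal Python sketch. It is not LDC's pipeline: the function name char_tokenize and the sample tokens are hypothetical, and the sketch makes no attempt to treat empty categories/traces, digits or Latin-script strings specially.

```python
def char_tokenize(tokens):
    """Split word-segmented tokens into single characters separated by spaces.

    Assumption for illustration only: every character of every token becomes
    its own unit; no special handling of traces or non-Chinese strings.
    """
    return " ".join(ch for token in tokens for ch in token)


# Hypothetical word-segmented input.
words = ["我们", "明天", "见"]

print(" ".join(words))       # word-level view: tokens separated by spaces
print(char_tokenize(words))  # character-level view: one character per unit
```

Keeping both views of the same sentence allows alignments to be expressed at either the word or the character level.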

BOLT Chinese-English Word Alignment and Tagging -- SMS/Chat Training is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) Machine Reading Phase 1 NFL Scoring Training Data was developed by LDC for use in the DARPA (Defense Advanced Research Projects Agency) Machine Reading program. It contains 110 U.S. NFL (National Football League) scoring source documents and 110 standoff annotation files, manually annotated for instances of NFL Scoring annotation categories defined with respect to an NFL Scoring ontology.

The Machine Reading program aimed to develop automated reading systems to bridge the gap between knowledge contained in natural language texts and knowledge accessible to formal reasoning systems. The reading systems designed by program participants were required to extract and reason about facts from text in multiple domains.

The data in this release constitutes the training data for the NFL Scoring Use Cases evaluation, which tested the sports domain by extracting information about scoring events and game outcomes and aligning that information with an NFL Scoring ontology.

Machine Reading Phase 1 NFL Scoring Training Data is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Monday, July 15, 2019

LDC 2019 July Newsletter

In this newsletter:

Fall 2019 LDC Data Scholarship Program

LDC data and commercial technology development

New Publications:
The DKU-JNU-EMA Electromagnetic Articulography Database
Phrase Detectives Corpus Version 2
First DIHARD Challenge Evaluation - Nine Sources
First DIHARD Challenge Evaluation – SEEDLingS
__________________________________________________________

Fall 2019 LDC Data Scholarship Program

Student applications for the Fall 2019 LDC Data Scholarship program are being accepted now through September 15, 2019. This scholarship program provides eligible students with access to LDC data at no cost. Students must complete an application consisting of a data use proposal and letter of support from their advisor.

For application requirements and program rules, please visit the LDC Data Scholarship page.

LDC data and commercial technology development

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.
__________________________________________________________

New publications:

(1) The DKU-JNU-EMA Electromagnetic Articulography Database was developed by Duke Kunshan University and Jinan University and contains approximately 10 hours of articulography and speech data in Mandarin, Cantonese, Hakka, and Teochew Chinese from two to seven native speakers for each dialect.

Articulatory measurements were made using the NDI electromagnetic articulography wave research system to capture real-time vocal tract variable trajectories. Each subject had six sensors placed at various locations in the mouth and one reference sensor placed on the bridge of the nose. For simultaneous recording of speech signals, subjects also wore a head-mounted close-talk microphone.

Speakers took part in four types of recording sessions: one in which they read complete sentences or short texts, and three in which they read related words sharing a specific consonant, vowel, or tone.

DKU-JNU-EMA Electromagnetic Articulography Database is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $1000.

(2) Phrase Detectives Corpus Version 2 was developed by the School of Computer Science and Electronic Engineering at the University of Essex and consists of approximately 407,000 tokens across 537 documents anaphorically annotated by the Phrase Detectives Game, an online interactive "game-with-a-purpose" (GWAP) designed to collect data about English anaphoric coreference.

This release constitutes a new version of the Phrase Detectives Corpus (LDC2017T08), adding significantly more annotated tokens to the data set and supplying players’ judgments and a silver label annotation based on the probabilistic aggregation method for anaphoric information for each markable.

The documents in the corpus are taken from Wikipedia articles and from narrative text in Project Gutenberg. The annotation is a simplified form of the coding scheme used in The ARRAU Corpus of Anaphoric Information (LDC2013T22).

Phrase Detectives Corpus Version 2 is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data at no cost.

(3) First DIHARD Challenge Evaluation - Nine Sources was developed by LDC and contains approximately 18 hours of English and Chinese speech data along with corresponding annotations used in support of the First DIHARD Challenge.

The First DIHARD Challenge was an attempt to reinvigorate work on diarization through a shared task focusing on "hard" diarization; that is, speech diarization for challenging corpora where there was an expectation that existing state-of-the-art systems would fare poorly. As such, it included speech from a wide sampling of domains representing diversity in number of speakers, speaker demographics, interaction style, recording quality, and environmental conditions as follows (all sources are in English unless otherwise indicated):

Autism Diagnostic Observation Schedule (ADOS) interviews
Conversations in Restaurants
DCIEM/HCRC map task (LDC96S38)
Audiobook recordings from LibriVox
Meeting speech collected by LDC in 2001 for the ROAR project (see, e.g., ISL Meeting Speech Part 1 (LDC2004S05))
2001 U.S. Supreme Court oral arguments
Mixer 6 Speech (LDC2013S02)
Chinese video collected by LDC as part of the Video Annotation for Speech Technologies (VAST) project
YouthPoint radio interviews

This release, when combined with First DIHARD Challenge Evaluation - SEEDLingS (LDC2019S13), contains the evaluation set audio data and annotation as well as the official scoring tool. The development data for the First DIHARD Challenge is also available from LDC as Eight Sources (LDC2019S09) and SEEDLingS (LDC2019S10).

First DIHARD Challenge Evaluation - Nine Sources is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $300.

(4) First DIHARD Challenge Evaluation – SEEDLingS was developed by Duke University and LDC and contains approximately two hours of English child language recordings along with corresponding annotations used in support of the First DIHARD Challenge.

The source data was drawn from the SEEDLingS (The Study of Environmental Effects on Developing Linguistic Skills) corpus, designed to investigate how infants' early linguistic and environmental input plays a role in their learning. Recordings for SEEDLingS were generated in the home environment of 44 infants from 6 to 18 months of age in the Rochester, New York area. A subset of that data was annotated by LDC for use in the First DIHARD Challenge.

This release, when combined with First DIHARD Challenge Evaluation - Nine Sources (LDC2019S12), contains the evaluation set audio data and annotation as well as the official scoring tool. The development data for the First DIHARD Challenge is also available from LDC as Eight Sources (LDC2019S09) and SEEDLingS (LDC2019S10).

First DIHARD Challenge Evaluation – SEEDLingS is distributed via web download.

2019 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $50.