Linguistic Data Consortium: December 2018

LDC Membership Discounts for MY2019 Still Available

Spring 2019 LDC Data Scholarship Program - deadline approaching

New publications:

HUB5 Mandarin Telephone Speech and Transcripts Second Edition
Nautilus Speaker Characterization
TAC Relation Extraction Dataset
_______________________________________________________________

LDC Membership Discounts for MY2019 Still Available

Join LDC while membership savings are still available. Now through March 1, 2019, renewing MY2018 members will receive a 10% discount off the membership fee. New or non-consecutive member organizations will receive a 5% discount. Membership remains the most economical way to access LDC releases. Visit Join LDC for details on membership options and benefits.

Spring 2019 LDC Data Scholarship Program - deadline approaching

Students can apply for the Spring 2019 Data Scholarship Program now through January 15, 2019. The LDC Data Scholarship program provides students with access to LDC data at no cost. For more information on application requirements and program rules, please visit LDC Data Scholarships.

New publications:

(1) HUB5 Mandarin Telephone Speech and Transcripts Second Edition was developed by LDC in support of US government projects for language recognition and Large Vocabulary Conversational Speech Recognition (LVCSR). The first edition was released by LDC in two data sets, HUB5 Mandarin Telephone Speech Corpus (LDC98S69) and HUB5 Mandarin Transcripts (LDC98T26). This second edition merges the speech and transcript releases, updates the audio format, and adds Pinyin transcripts, forced alignment, and updated documentation and metadata.

This corpus contains approximately 19 hours of Mandarin speech from 42 unscripted telephone conversations between native speakers of Mandarin, together with associated transcripts of contiguous 5-30 minute segments from those conversations. The conversations are drawn from CALLFRIEND Mandarin Chinese-Mainland Dialect (LDC96S55), which has also been released in a second, updated edition (LDC2018S09).

Participants could speak with a person of their choice on any topic; most called family members and friends. The recorded conversations lasted up to 30 minutes. Transcripts were created manually by native Mandarin speakers using the GB2312 encoding. This release includes Pinyin transcripts and the original transcripts, both in UTF-8 format.
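For readers who still hold first-edition transcripts in GB2312, the move to UTF-8 can be illustrated with a minimal Python sketch. The sample string and the helper name gb2312_to_utf8 are assumptions made for illustration only; they are not taken from the corpus or its documentation, and this is not necessarily how LDC performed the conversion.

```python
# Illustrative only: the sample text is hypothetical, not from the corpus.
sample_text = "最近怎么样"


def gb2312_to_utf8(raw: bytes) -> bytes:
    """Decode GB2312-encoded bytes and re-encode the text as UTF-8."""
    return raw.decode("gb2312").encode("utf-8")


# Simulate a legacy GB2312 transcript line, then convert it.
gb_bytes = sample_text.encode("gb2312")
utf8_bytes = gb2312_to_utf8(gb_bytes)

# Each of these characters takes 2 bytes in GB2312 and 3 bytes in UTF-8.
print(len(gb_bytes), len(utf8_bytes))
```

The text itself is unchanged by the conversion; only the byte representation differs.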

HUB5 Mandarin Telephone Speech and Transcripts Second Edition is available via web download.

2018 Subscription Members will automatically receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) Nautilus Speaker Characterization was developed at the Technical University of Berlin and comprises approximately 155 hours of conversational speech from 300 German speakers aged 18 to 35 years (126 males and 174 females) with no marked dialect or accent, recorded in an acoustically isolated room. The corpus was designed to support research on the detection of speaker social characteristics, such as personality, charisma, and voice attractiveness.

Four scripted and four semi-spontaneous dialogs simulating telephone call inquiries were elicited from the speakers. Additionally, spontaneous neutral and emotional speech utterances (predominantly excitement or frustration) and questions were produced.

Speech corresponding to one of the semi-spontaneous dialogs was evaluated with respect to 34 continuous numeric labels of perceived interpersonal speaker characteristics (such as likable, attractive, competent, childish). For a set of 20 selected "extreme" speakers evaluated for their warmth-attractiveness, 34 naive voice descriptions (such as bright, creaky, articulate, melodious) were also evaluated. The corpus contains all labels, together with the speech recordings and the speakers' metadata (e.g., age, gender, place of birth, chronological places of residence and duration of stay, parents' place of birth, self-assessed personality).

Nautilus Speaker Characterization is available via web download.

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data at no cost.

(3) TAC Relation Extraction Dataset (TACRED) was developed by the Stanford NLP Group and is a large-scale relation extraction dataset with 106,264 examples built over English newswire and web text used in the NIST TAC KBP English slot filling evaluations during the period 2009-2014. The annotations were derived from TAC KBP relation types (see the guidelines), from human annotations developed by LDC, and from crowdsourcing using Mechanical Turk.

Source corpora used for this dataset were TAC KBP Comprehensive English Source Corpora 2009-2014 (LDC2018T03) and TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation Data 2009-2014 (LDC2018T22). For detailed information about the dataset and benchmark results, please refer to the TACRED paper.

TAC Relation Extraction Dataset is available via web download.

Linguistic Data Consortium

Monday, December 17, 2018

LDC 2018 December Newsletter