Linguistic Data Consortium: LDC August 2021 Newsletter

Monday, August 16, 2021

LDC August 2021 Newsletter

LDC at Interspeech 2021

Fall 2021 LDC Data Scholarship Program

New Publications:
Wikipedia Spanish Speech and Transcripts
BOLT Egyptian Arabic SMS/Chat Parallel Training Data

LDC at Interspeech 2021

LDC will be exhibiting at Interspeech 2021, held this year from August 30 to September 3 in a hybrid in-person and virtual format. Stop by our digital booth for a look at a selection of documents and videos describing recent developments at the Consortium and new publications. You can also contact us through the conference platform to schedule a chat session.

We’ll be hosting a live virtual video event highlighting LDC’s recent speech publications during the conference. Stay tuned for scheduling information!

LDC work will be featured in the following conference sessions:

Fearless Steps Challenge Phase-3 (FSC P3): Advancing SLT for Unseen Channel and Mission Data Across NASA Apollo Audio
Tuesday, August 31, 20:00
Session: In-person Oral: ASR Technologies and Systems 19:00-21:00

Using Games to Augment Corpora for Language Recognition and Confusability
Wednesday, September 1, 16:20-16:40
Session: In-person Oral: Speaker, Language, and Privacy 16:00-18:00

The Third DIHARD Diarization Challenge
Thursday, September 2, 16:00
Session: Virtual: Speaker Diarization II 16:00-18:00

LDC will post conference links and updates via our Twitter feed and Facebook page. We hope to “see” you at Interspeech 2021!

Fall 2021 LDC Data Scholarship Program
Student applications for the Fall 2021 LDC Data Scholarship program are being accepted now through September 15, 2021. This program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and letter of support from their advisor.

For application requirements and program rules, visit the LDC Data Scholarship page.

New publications:

(1) Wikipedia Spanish Speech and Transcripts consists of approximately 25 hours of Spanish read speech from Wikipedia Grabada, the Spanish version of WikiProject Spoken Wikipedia, and corresponding transcripts. Speakers (150 male, 43 female) read Wikipedia articles; the audio files were segmented and transcribed by native Spanish speakers. Speaker metadata is included in this release.

Wikipedia Spanish Speech and Transcripts is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) BOLT Egyptian Arabic SMS/Chat Parallel Training Data was developed by LDC and consists of approximately 723,000 tokens of Egyptian Arabic SMS/Chat data collected for the DARPA BOLT program along with their corresponding English translations.

The source data was manually reviewed to exclude any messages/conversations that were not in the target language or that had sensitive content, such as personal identifying information.

Data was manually selected for translation. Messages/conversations were arranged in chronological order, segmented into sentence units and assigned to translation vendors. Translators followed LDC's BOLT translation guidelines.

BOLT Egyptian Arabic SMS/Chat Parallel Training Data is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
