Linguistic Data Consortium: January 2018

Membership Discounts for MY2018 Still Available

New Publications:

___________________________________________________________________________

Membership Discounts for MY2018 Still Available

Join LDC while membership savings are still available. Now through March 1, 2018, renewing MY2017 members will receive a 10% discount on the membership fee. New or non-consecutive member organizations will receive a 5% discount. Membership remains the most economical way to access LDC releases. This year’s planned publications include Multilanguage Conversational Telephone Speech, IARPA Babel Language Packs (telephone speech and transcripts), DIRHA (Distant-speech Interaction for Robust Home Applications), TRAD (Chinese-French and Arabic-French parallel text), data from BOLT, DEFT, LORELEI, RATS and TAC KBP, and more. Browse the Members pages for details on membership options and benefits.

New Publications:

(1) DEFT Spanish Treebank was developed by LDC and the Language and Computation Center (CLiC), University of Barcelona. It contains treebank annotation of international Spanish newswire text and Latin American Spanish discussion forum data created for the DARPA Deep Exploration and Filtering of Text (DEFT) program. DEFT Spanish Treebank supported the program's goal of deep natural language understanding.

Newswire source files were selected from Spanish Gigaword Third Edition (LDC2011T12) and were manually sentence-segmented for DEFT. Discussion forum source files were selected from Spanish discussion forum source data collected by LDC, consisting of continuous multi-posts of 100-1000 words.

This release contains 114 files (54,394 tokens) of newswire data and 60 files (55,307 tokens) of discussion forum data, all of which were annotated with constituents and syntactic functions.

DEFT Spanish Treebank is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) DIRHA English WSJ Audio was developed as part of the Distant-Speech Interaction for Robust Home Applications (DIRHA) Project, which addressed natural spontaneous speech interaction with distant microphones in a domestic environment. It comprises approximately 85 hours of real and simulated read speech by six native American English speakers. The target utterances were taken from CSR-I (WSJ0) Complete (LDC93S6A), specifically the 5,000-word subset of read speech from Wall Street Journal news text.

Speech was collected in a real apartment setting with typical domestic background noise and inter/intra-room reverberation effects. Annotations, speaker metadata and images of the apartment setting are also included.

DIRHA English WSJ Audio is distributed via web download.

2018 Subscription Members will receive copies of this corpus, provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) TRAD Chinese-French Parallel Text -- Blog was developed by ELDA as part of the PEA-TRAD project. It contains French translations of a subset of approximately 10,000 Chinese words from GALE Phase 1 Chinese Blog Parallel Text (LDC2008T06).

The PEA-TRAD project (Translation as a Support for Document Analysis) was supported by the French Ministry of Defense (DGA). Its purpose was to develop speech-to-speech translation technology for multiple languages (e.g., Arabic, Chinese, Pashto) from a variety of domains.

The source data for TRAD Chinese-French Parallel Text is Chinese blog text collected and translated into English by LDC for the DARPA GALE (Global Autonomous Language Exploitation) program. Information about the ELDA translation team, translation guidelines and validation results is contained in the documentation accompanying this release.

TRAD Chinese-French Parallel Text -- Blog is distributed via web download.

Linguistic Data Consortium

Tuesday, January 16, 2018

LDC January 2018 Newsletter