Linguistic Data Consortium: February 2016

Only two weeks left to enjoy 2016 membership savings

Spring 2016 LDC Data Scholarship recipients

How to Share Data through LDC webinar on YouTube

New publications:

BOLT Chinese Discussion Forums

GALE Phase 3 Arabic Broadcast Conversation Speech Part 2

GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 2

_______________________________________________________________________

Only two weeks left to enjoy 2016 membership savings

There’s still time to save on 2016 membership fees. Now through March 1, all organizations receive a 5% discount when they join for MY2016. MY2015 members are eligible for an additional 5% off the fee (10% total savings) when they renew before March 1.

To join, create or sign in to your LDC user account, select your preferred membership type from the Catalog, add the item to your bin and follow the check-out process. The Membership Office will apply any discounts. Alternatively, if you have already received a renewal invoice from LDC, you can simply pay against that.

For more information on the benefits of membership, visit Join LDC.

Spring 2016 LDC Data Scholarship recipients

Congratulations to the recipients of LDC's Spring 2016 data scholarships:

Shefali Waldekar: Indian Institute of Technology Kharagpur (India), PhD Candidate, Electronics and Electrical Communications Engineering. Shefali is awarded copies of 2002 Rich Transcription Broadcast News and Conversational Telephone Speech and 2005 Spring NIST Rich Transcription (RT05-S) Evaluation Set for her research in audio diarization.

Nikola Ivanov Nikolov: University of Zurich and ETH Zurich (Switzerland), MSc candidate in Informatics. Nikola is awarded a copy of Annotated English Gigaword for his research in text summarization.

Om Prakash Singh: Indian Institute of Technology, Guwahati (India), Research scholar in spoken language identification. Om is awarded a copy of NIST Language Recognition Evaluation Test Set for his work in language identification.

Moshen Mohammadi: Iranian Research Institute for Electrical Engineering (Iran), PhD Candidate in Communications. Moshen is awarded copies of the 2008 NIST Speaker Recognition Evaluation Training Sets 1 and 2, the Evaluation Test Set and the Supplemental Set for his work in speaker recognition in noisy environments.

For program information visit the Data Scholarship page.

How to Share Data through LDC webinar on YouTube

LDC’s first webinar, How to Share Data through LDC, is now available for viewing on our YouTube page. Presented live on January 22, 2016, the webinar outlined in easy steps the process for submitting language resources to LDC for publication in the Catalog. In addition, discussion topics included the benefits of sharing data through LDC, the corpus life cycle, data delivery, quality control and more.

New Corpora

(1) BOLT Chinese Discussion Forums was developed by LDC and consists of 1,597,500 discussion forum threads in Chinese harvested from the Internet using a combination of manual and automatic processes.

The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference. The material in this release represents the Chinese source data in the discussion forum genre.

Collection was seeded based on the results of manual data scouting by native speaker annotators. When multiple threads from a forum were submitted, the entire forum was automatically harvested and added to the collection. The scale of the collection precluded manual review of all data. Only a small portion of the threads included in this release were manually reviewed, and it is expected that there may be some offensive or otherwise undesired content as well as some threads that contain a large amount of non-Chinese content. Language identification was performed on all threads in this corpus (using CLD2), and threads for which the results indicated a high probability of largely non-Chinese content are identified in this release.

BOLT Chinese Discussion Forums is distributed via web download as a multi-part zip file. Consult the Using LDC Data page (https://www.ldc.upenn.edu/data-management/using) for more information about this format.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) GALE Phase 3 Arabic Broadcast Conversation Speech Part 2 was developed by LDC and comprises approximately 129 hours of Arabic broadcast conversation speech collected in 2007 and 2008 by LDC, MediaNet (Tunis, Tunisia) and MTC (Rabat, Morocco) during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) program.

Corresponding transcripts are released as GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 2 (LDC2016T06).

These broadcast conversation recordings feature interviews, call-in programs and roundtable discussions focusing principally on current events. They are contained in 142 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Arabic speaker following Audit Procedure Specification Version 2.0, which is included in this release.
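As a rough sizing aid, the stated format (16000 Hz, single channel, 16-bit PCM) and the approximate total of 129 hours across 142 files imply the figures computed below. This is a back-of-the-envelope sketch and not part of the release: the function and constant names are illustrative, and the results describe decoded (uncompressed) PCM, not the size of the FLAC download.

```python
# Rough estimate of the decoded PCM size implied by the stated audio format.
# Inputs come from the corpus description; because the total duration is
# "approximately 129 hours", every result here is an approximation.

SAMPLE_RATE_HZ = 16_000   # 16000 Hz
BYTES_PER_SAMPLE = 2      # 16-bit PCM
CHANNELS = 1              # single channel
TOTAL_HOURS = 129         # approximate total duration
FILE_COUNT = 142          # number of audio files


def decoded_pcm_bytes(hours: float) -> int:
    """Bytes of raw PCM needed to hold `hours` of audio in this format."""
    seconds = hours * 3600
    return int(seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


def mean_minutes_per_file(hours: float, files: int) -> float:
    """Average file length in minutes, assuming the hours are spread over `files`."""
    return hours * 60 / files


if __name__ == "__main__":
    total = decoded_pcm_bytes(TOTAL_HOURS)
    print(f"decoded PCM: about {total / 1e9:.1f} GB")
    print(f"mean file length: about {mean_minutes_per_file(TOTAL_HOURS, FILE_COUNT):.1f} minutes")
```

Under these assumptions the decoded audio comes to roughly 14.9 GB, with files averaging about 54.5 minutes each. Individual file lengths will vary.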

GALE Phase 3 Arabic Broadcast Conversation Speech Part 2 is distributed via web download.

(3) GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 2 was developed by LDC and contains transcriptions of approximately 129 hours of Arabic broadcast conversation speech collected in 2007 and 2008 by LDC, MediaNet (Tunis, Tunisia) and MTC (Rabat, Morocco) during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) program.

Corresponding audio data is released as GALE Phase 3 Arabic Broadcast Conversation Speech Part 2 (LDC2016S01).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 845,791 tokens. The transcripts were created with the LDC tool, XTrans, which supports manual transcription and annotation of audio recordings.
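Because the transcripts are plain-text, tab-delimited and UTF-8 encoded, they can be read with standard tools. The following is a minimal sketch only: the column layout, the `;;` comment prefix and the choice of which column holds the transcript text are assumptions made for illustration, so consult the documentation included with the release for the actual field definitions. The whitespace token count shown here is also not necessarily the method behind the 845,791-token figure.

```python
import csv
import io
from typing import Iterable, Iterator, List, TextIO


def read_tdf_rows(stream: TextIO, comment_prefix: str = ";;") -> Iterator[List[str]]:
    """Yield each non-empty, non-comment row of a tab-delimited stream.

    The comment prefix is an assumption; pass a different one if the
    release documentation specifies otherwise.
    """
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith(comment_prefix):
            continue
        yield row


def count_tokens(rows: Iterable[List[str]], text_column: int) -> int:
    """Count whitespace-separated tokens in one column.

    `text_column` is a hypothetical index chosen by the caller.
    """
    return sum(len(row[text_column].split()) for row in rows if len(row) > text_column)


def open_tdf(path: str) -> TextIO:
    """Open a transcript file as UTF-8 text, as required by the csv module."""
    return open(path, encoding="utf-8", newline="")


if __name__ == "__main__":
    # Invented sample with a hypothetical four-column layout:
    # file, start, end, transcript.
    sample = (
        ";; illustrative comment line\n"
        "show_a\t0.0\t2.5\tمرحبا بكم\n"
        "show_a\t2.5\t4.0\tشكرا\n"
    )
    rows = list(read_tdf_rows(io.StringIO(sample)))
    print(len(rows), count_tokens(rows, text_column=3))
```

For a real file, pass the result of `open_tdf(path)` to `read_tdf_rows` in place of the in-memory sample, and drop any header row before counting.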

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and quick rich transcription specification (QRTR), both of which are included in the documentation with this release.

GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 2 is distributed via web download.

Linguistic Data Consortium

Monday, February 15, 2016

LDC 2016 February Newsletter