Linguistic Data Consortium: May 2016

LDC at LREC 2016

New publications:

SDP 2014 & 2015: Broad Coverage Semantic Dependency Parsing

GALE Phase 4 Chinese Broadcast Conversation Speech
GALE Phase 4 Chinese Broadcast Conversation Transcripts
_______________________________________________________________

LDC at LREC 2016

LDC will attend the 10th Language Resources and Evaluation Conference (LREC 2016), hosted by ELRA, the European Language Resources Association. The conference will be held in Portorož, Slovenia from May 23-28 and features a broad range of sessions on language resources and human language technologies research. Seven LDC staff members will be presenting current work on topics including trends in HLT research, building language resources for autism spectrum disorders, data management plans, rapid development of morphological analyzers for typologically diverse languages, selection criteria for low resource language programs, multi-language speech collection for NIST LRE, novel incentives for collecting data and annotation from people, and more.

Following the conference, LDC’s presented papers and posters will be available on LDC’s Papers Page.

New Corpora

(1) SDP 2014 & 2015: Broad Coverage Semantic Dependency Parsing was developed by the SDP task organizers. It consists of data, tools, system results, and publications associated with the 2014 and 2015 tasks on Broad-Coverage Semantic Dependency Parsing (SDP), which were conducted in conjunction with the International Workshop on Semantic Evaluation (SemEval).

SemEval is an ongoing series of evaluations of computational semantic analysis systems intended to explore the nature of meaning in language. It evolved from the Senseval word sense disambiguation series to include semantic analysis tasks outside of word sense disambiguation.

This release is based on English, Chinese and Czech data from the following resources: Treebank-2 LDC95T7, Proposition Bank I LDC2004T14, NomBank v 1.0 LDC2008T23 and CCGbank LDC2005T13 (English); Chinese Treebank (e.g., Chinese Treebank 8.0 LDC2013T21) (Chinese); and Prague Dependency Treebank (e.g., Prague Dependency Treebank 2.0 LDC2006T01) (Czech).

The results are presented as graphs in three target representations: MRS-Derived Semantic Dependencies (DM), Enju Predicate–Argument Structures (PAS), and Prague Semantic Dependencies (PSD). As a fourth, additional target representation, CCGbank was converted to semantic dependency graphs (in the subdirectory ‘ccd’).

SDP 2014 & 2015: Broad Coverage Semantic Dependency Parsing is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) GALE Phase 4 Chinese Broadcast Conversation Speech was developed by LDC and comprises approximately 172 hours of Mandarin Chinese broadcast conversation speech collected in 2008 by LDC and Hong Kong University of Science and Technology during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

Corresponding transcripts are released as GALE Phase 4 Chinese Broadcast Conversation Transcripts (LDC2016T12).

The broadcast conversation recordings in this release feature interviews, call-in programs and roundtable discussions focusing principally on current events. They are contained in 236 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Chinese speaker following Audit Procedure Specification Version 2.0, which is included in this release.
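
For planning storage or decoding time, the figures above give a rough sense of scale. The following is a minimal sketch, not part of the release: it takes the approximate 172-hour total at face value and estimates the uncompressed PCM size and the average file length. The function name pcm_bytes is invented for illustration, and actual .flac downloads will be smaller than the uncompressed estimate.

```python
# Figures taken from the release description; the hour count is approximate.
HOURS = 172
FILES = 236
SAMPLE_RATE_HZ = 16000
CHANNELS = 1
BYTES_PER_SAMPLE = 2  # 16-bit PCM


def pcm_bytes(seconds, rate=SAMPLE_RATE_HZ, channels=CHANNELS,
              width=BYTES_PER_SAMPLE):
    """Size in bytes of uncompressed PCM audio of the given duration."""
    return int(seconds * rate * channels * width)


total_seconds = HOURS * 3600
total_bytes = pcm_bytes(total_seconds)
avg_minutes = total_seconds / FILES / 60

print(f"{total_bytes / 1e9:.1f} GB uncompressed")
print(f"{avg_minutes:.1f} minutes per file on average")
```

Under these assumptions the corpus decodes to roughly 20 GB of raw audio, with files averaging a little under 45 minutes each.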

GALE Phase 4 Chinese Broadcast Conversation Speech is distributed via web download.

(3) GALE Phase 4 Chinese Broadcast Conversation Transcripts was developed by LDC and contains transcriptions of approximately 172 hours of Chinese broadcast conversation speech collected in 2008 by LDC and Hong Kong University of Science and Technology during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

Corresponding audio data is released as GALE Phase 4 Chinese Broadcast Conversation Speech (LDC2016S03).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 2,259,952 tokens.
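
Because the transcripts are plain UTF-8 text with tab-separated fields, they can be read with a few lines of code. The sketch below is an illustration under stated assumptions rather than a description of the actual files: the read_tdf helper, the ";;" comment prefix, the five-column layout and the sample rows are all invented here, so the corpus documentation should be consulted for the real column definitions.

```python
import io


def read_tdf(stream, comment_prefix=";;"):
    """Yield a list of fields for each non-empty, non-comment line.

    The comment prefix is an assumption; pass a different value (or a
    string that never occurs) if the real files use another convention.
    """
    for line in stream:
        line = line.rstrip("\r\n")
        if not line.strip() or line.startswith(comment_prefix):
            continue
        yield line.split("\t")


# Invented sample data: file, start, end, speaker, transcript.
sample = io.StringIO(
    ";; hypothetical comment line\n"
    "show_a\t0.00\t2.50\tspeaker1\t你好\n"
    "\n"
    "show_a\t2.50\t5.10\tspeaker2\t谢谢\n"
)

rows = list(read_tdf(sample))
for row in rows:
    print(row)
```

For a file on disk, the same function can be given the object returned by open(path, encoding="utf-8") in place of the in-memory sample.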

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC’s quick transcription guidelines (QTR) and quick rich transcription specification (QRTR). QTR transcription consists of quick (near-) verbatim, time-aligned transcripts plus speaker identification with minimal additional mark-up. QRTR adds structural information such as topic boundaries and manual sentence unit annotation.

GALE Phase 4 Chinese Broadcast Conversation Transcripts is distributed via web download.

Linguistic Data Consortium

Monday, May 16, 2016

LDC May 2016 Newsletter