LDC at LREC 2016
New publications:
GALE Phase 4 Chinese Broadcast Conversation Transcripts
LDC at LREC 2016
LDC will attend the 10th Language Resources and Evaluation
Conference (LREC 2016), hosted by ELRA, the European Language Resources
Association. The conference will be held in Portorož, Slovenia, from May 23-28
and features a broad range of sessions on language resources and human language
technologies research. Seven LDC staff members will be presenting current work
on topics including trends in HLT research, building language resources for
autism spectrum disorders, data management plans, rapid development of
morphological analyzers for typologically diverse languages, selection criteria
for low resource language programs, multi-language speech collection for NIST
LRE, novel incentives for collecting data and annotation from people, and more.
Following the conference, LDC’s presented papers and
posters will be available on LDC’s Papers
New Corpora
(1) SDP 2014 & 2015: Broad Coverage Semantic Dependency Parsing consists of data, tools, system results, and publications
associated with the 2014 and 2015 tasks on Broad-Coverage Semantic Dependency
Parsing (SDP) conducted in conjunction with the International Workshop
on Semantic Evaluation (SemEval) and was developed by the SDP task organizers.
This release is based on English, Chinese and Czech data from the following resources: Treebank-2 LDC95T7, Proposition Bank I LDC2004T14, NomBank v 1.0 LDC2008T23 and CCGbank LDC2005T13 (English); Chinese Treebank (e.g., Chinese Treebank 8.0 LDC2013T21) (Chinese); and Prague Dependency Treebank (e.g., Prague Dependency Treebank 2.0 LDC2006T01) (Czech).
The results are presented as graphs in three target representations: MRS-Derived Semantic Dependencies (DM), Enju Predicate–Argument Structures (PAS), and Prague Semantic Dependencies (PSD). As a fourth, additional target representation, CCGbank was converted to semantic dependency graphs (in the subdirectory ‘ccd’).
SDP 2014 & 2015: Broad
Coverage Semantic Dependency Parsing is distributed via web download.
2016 Subscription Members will automatically receive two
copies of this corpus. 2016 Standard Members may request a copy as part
of their 16 free membership corpora. Non-members may license this data
for a fee.
Corresponding transcripts are released as GALE Phase 4 Chinese Broadcast Conversation Transcripts (LDC2016T12).
The broadcast conversation recordings in this release feature interviews, call-in programs and roundtable discussions focusing principally on current events. They are contained in 236 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Chinese speaker following Audit Procedure Specification Version 2.0, which is included in this release.
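For readers planning storage, the stated audio format (16000 Hz, single-channel, 16-bit PCM) implies a fixed uncompressed data rate. The short Python sketch below works through that arithmetic. The function name is illustrative only, and the figures apply to decoded PCM; the FLAC files themselves will be smaller by an amount that depends on the content.

```python
# Format parameters as stated in the release description.
SAMPLE_RATE_HZ = 16000
BITS_PER_SAMPLE = 16
CHANNELS = 1


def pcm_bytes_per_second(sample_rate_hz, bits_per_sample, channels):
    """Return the uncompressed PCM data rate in bytes per second."""
    bytes_per_sample = bits_per_sample // 8
    return sample_rate_hz * bytes_per_sample * channels


rate = pcm_bytes_per_second(SAMPLE_RATE_HZ, BITS_PER_SAMPLE, CHANNELS)
print(rate)                # 32000 bytes per second of decoded audio
print(rate * 3600 / 1e6)   # 115.2 megabytes per hour of decoded audio
```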
GALE Phase 4 Chinese Broadcast
Conversation Speech is distributed via web download.
2016 Subscription Members will automatically receive two
copies of this corpus. 2016 Standard Members may request a copy as part of
their 16 free membership corpora. Non-members may license this data for a fee.
Corresponding audio data is released as GALE Phase 4 Chinese Broadcast Conversation Speech (LDC2016S03).
The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 2,259,952 tokens.
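As a rough illustration of working with tab-delimited, UTF-8 transcripts, the Python sketch below reads rows into dictionaries. The column names (start, end, speaker, transcript) and the sample row are hypothetical placeholders, not the actual TDF column layout; consult the documentation included with the corpus for the real field list.

```python
import csv
import io

# Hypothetical column names for illustration; the real TDF layout is
# defined in the corpus documentation.
FIELDS = ["start", "end", "speaker", "transcript"]


def read_tab_delimited(stream, field_names):
    """Yield one dict per non-empty row of a tab-delimited text stream."""
    # QUOTE_NONE keeps quotation marks in the transcript text literal.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        yield dict(zip(field_names, row))


# Made-up sample row; for a real file, use
# open(path, encoding="utf-8", newline="") in place of StringIO.
sample = io.StringIO("0.00\t2.50\tspeaker1\t你好\n")
rows = list(read_tab_delimited(sample, FIELDS))
print(rows[0]["speaker"], rows[0]["transcript"])
```

Disabling quote handling is a deliberate choice here: transcript text may contain stray quotation marks that a default CSV reader would otherwise treat as field delimiters.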
The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC’s quick transcription guidelines (QTR) and quick rich transcription specification (QRTR). QTR transcription consists of quick, (near-)verbatim, time-aligned transcripts plus speaker identification with minimal additional mark-up. QRTR adds structural information such as topic boundaries and manual sentence unit annotation.
GALE Phase 4 Chinese Broadcast
Conversation Transcripts is distributed via web download.
2016 Subscription Members will automatically receive two
copies of this corpus. 2016 Standard Members may request a copy as part
of their 16 free membership corpora. Non-members may license this data
for a fee.