New LDC Website Coming Soon
LDC Spoken Language Sampler - 2nd Release
New publications:
GALE Phase 2 Arabic Broadcast Conversation Speech Part 2
GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 2
Semantic Textual Similarity (STS) 2013 Machine Translation
New LDC Website Coming Soon
Look for LDC's new website in the coming weeks. We've revamped the design and site plan to make it easier than ever to find what you're looking for. The features you use the most -- the catalog, new corpus releases and user login -- will be a short click away. We expect the LDC website to be occasionally unavailable for a few days at the end of September as we make the switch, and we thank you in advance for your understanding.
LDC Spoken Language Sampler - 2nd Release
The LDC Spoken Language Sampler – 2nd Release is now
available. It contains speech and transcript samples from recent releases
and is available at no cost. Follow the link above to the catalog page,
download and browse.
New publications:
(1) GALE Phase 2 Arabic Broadcast Conversation Speech Part 2 was developed by LDC and comprises approximately 128 hours of Arabic broadcast conversation speech collected in 2007 by LDC as part of the DARPA GALE (Global Autonomous Language Exploitation) Program. The data was collected at LDC’s Philadelphia, PA, USA facilities and at three remote collection sites. The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources, for a total of over 30,000 hours of collected broadcast audio over the life of the program.
LDC's local broadcast collection
system is highly automated, easily extensible and robust, and is capable of
collecting, processing and evaluating hundreds of hours of content from several
dozen sources per day. The broadcast material is served to the system by a set
of free-to-air (FTA) satellite receivers, commercial direct satellite systems
(DSS) such as DirecTV, direct broadcast satellite (DBS) receivers, and cable
television (CATV) feeds. The mapping between receivers and recorders is dynamic
and modular; all signal routing is performed under computer control, using a
256x64 A/V matrix switch. Programs are recorded in a high bandwidth A/V format
and are then processed to extract audio, to generate keyframes and compressed
audio/video, to produce time-synchronized closed captions (in the case of North
American English) and to generate automatic speech recognition (ASR) output.
The broadcast conversation
recordings in this release feature interviews, call-in programs and round table
discussions focusing principally on current events from several sources. This
release contains 141 audio files presented in .wav, 16000 Hz single-channel
16-bit PCM. Each file was audited by a native Arabic speaker following Audit
Procedure Specification Version 2.0 which is included in this release.
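For readers who want to confirm that a downloaded file matches the stated format (16000 Hz, single-channel, 16-bit PCM), the following is a minimal sketch using Python's standard `wave` module. The function name `check_wav_format` is hypothetical and not part of the release, and the sketch assumes the files are standard RIFF WAV files that the `wave` module can open.

```python
import wave

# Format stated in the release description: 16000 Hz, single-channel, 16-bit PCM.
EXPECTED_RATE_HZ = 16000
EXPECTED_CHANNELS = 1
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits per sample


def check_wav_format(path):
    """Return (matches, duration_seconds) for the WAV file at `path`.

    `matches` is True only if sample rate, channel count and sample
    width all agree with the values stated for this release.
    """
    with wave.open(path, "rb") as wav:
        matches = (
            wav.getframerate() == EXPECTED_RATE_HZ
            and wav.getnchannels() == EXPECTED_CHANNELS
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
        )
        # Duration is the number of frames divided by frames per second.
        duration = wav.getnframes() / wav.getframerate()
    return matches, duration
```

Summing the returned durations over all 141 files would give a rough check against the approximately 128 hours described above.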
GALE Phase 2 Arabic Broadcast
Conversation Speech Part 2 is distributed on two DVD-ROMs.
2013 Subscription Members will automatically receive two copies of this data. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(2) GALE Phase 2 Arabic Broadcast Conversation Transcripts Part
2 was developed by LDC and contains transcriptions of approximately
128 hours of Arabic broadcast conversation speech collected in 2007 by LDC,
MediaNet (Tunis, Tunisia) and MTC (Rabat, Morocco) during Phase 2 of the DARPA
GALE (Global Autonomous Language Exploitation) program. The source broadcast
conversation recordings feature interviews, call-in programs and round table
discussions focusing principally on current events from several sources.
The transcript files are in
plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed
data totals 763,945 tokens. The transcripts were created with the LDC-developed
transcription tool, XTrans, a multi-platform, multilingual,
multi-channel transcription tool that supports manual transcription and
annotation of audio recordings.
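Because the transcripts are plain-text, tab-delimited UTF-8 files, they can be read without special tooling. The sketch below counts whitespace-separated tokens in one column of such a file. It rests on assumptions not stated above: the function name `count_transcript_tokens` is hypothetical, the position of the transcript column (`transcript_col`) and the presence of a single header line must be checked against the documentation included with the release, and whitespace splitting may not match the tokenization behind the 763,945 figure.

```python
def count_transcript_tokens(lines, transcript_col, has_header=True):
    """Count whitespace-separated tokens in one column of tab-delimited lines.

    `lines` is any iterable of strings, for example a file opened with
    open(path, encoding="utf-8"). `transcript_col` is the zero-based index
    of the column holding the transcript text (an assumption to verify
    against the release documentation).
    """
    total = 0
    for index, line in enumerate(lines):
        if has_header and index == 0:
            continue  # skip the assumed single header line
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) <= transcript_col:
            continue  # skip lines that do not have enough columns
        total += len(fields[transcript_col].split())
    return total
```

A typical call would look like `count_transcript_tokens(open(path, encoding="utf-8"), transcript_col=7)`, where the column index 7 is purely illustrative.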
The files in this corpus were
transcribed by LDC staff and/or by transcription vendors under contract to LDC.
Transcribers followed LDC’s quick transcription guidelines (QTR) and quick rich
transcription specification (QRTR), both of which are included in the
documentation with this release. QTR transcription consists of quick
(near-)verbatim, time-aligned transcripts plus speaker identification with
minimal additional mark-up. It does not include sentence unit annotation. QRTR
annotation adds structural information such as topic boundaries and manual
sentence unit annotation to the core components of a quick transcript.
GALE Phase 2 Arabic Broadcast
Conversation Transcripts - Part 2 is distributed via web download.
2013 Subscription Members will
automatically receive two copies of this data on disc. 2013 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.
*
(3) Semantic Textual Similarity (STS) 2013 Machine Translation
was developed as part of the STS 2013 Shared Task which was held in conjunction
with *SEM 2013, the Second Joint Conference on Lexical and
Computational Semantics, organized by the ACL (Association for Computational
Linguistics) interest groups SIGLEX and SIGSEM.
It consists of one text file containing 750 English sentence pairs
translated from Arabic and Chinese newswire and web data sources.
The goal of the Semantic Textual
Similarity (STS) task was to create a unified framework for the evaluation of
semantic textual similarity modules and to characterize their impact on natural
language processing (NLP) applications. STS measures the degree of semantic
equivalence between two texts. The STS task was proposed as an attempt at creating a unified
framework that allows for an extrinsic evaluation of multiple semantic
components that otherwise have historically tended to be evaluated
independently and without characterization of impact on NLP applications. More
information is available at the STS 2013 Shared Task homepage.
The source data is Arabic and
Chinese newswire and web data collected by LDC that was translated and used in
the DARPA GALE (Global Autonomous Language Exploitation) program and in several
NIST Open Machine Translation evaluations. Of the 750 sentence pairs, 150 pairs
are from the GALE Phase 5 collection and 600 pairs are from NIST 2008-2012 Open
Machine Translation (OpenMT) Progress Test Sets (LDC2013T07).
The data was built to identify
semantic textual similarity between two short text passages. Each line of the
corpus contains two tab-delimited sentences: the first is a translation and the
second is a post-edited version of that translation. Post-editing is the
process of improving machine translation output with a minimum of manual labor. The
gold standard similarity values and other STS datasets can be obtained from the
STS homepage, linked above.
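The two-sentences-per-line layout can be read in a few lines of code. The following is a minimal sketch, assuming each non-empty line holds exactly two tab-separated fields and the file is UTF-8 text; the function names `read_sentence_pairs` and `count_unchanged` are hypothetical and not part of the release.

```python
def read_sentence_pairs(lines):
    """Parse tab-delimited lines into (translation, post_edited) tuples.

    `lines` is any iterable of strings, for example a file opened with
    open(path, encoding="utf-8"). Blank lines are skipped; a line without
    exactly two fields raises ValueError so format problems are not hidden.
    """
    pairs = []
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"line {line_number}: expected 2 tab-delimited fields, "
                f"got {len(fields)}"
            )
        pairs.append((fields[0], fields[1]))
    return pairs


def count_unchanged(pairs):
    """Count pairs where the post-edited sentence equals the translation."""
    return sum(1 for translation, post_edited in pairs if translation == post_edited)
```

If the layout assumption holds, `len(read_sentence_pairs(...))` on the distributed file should equal the 750 pairs described above.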
Semantic Textual Similarity (STS)
2013 Machine Translation is distributed via web download.
2013 Subscription Members will
automatically receive two copies of this data on disc. 2013 Standard Members
may request a copy as part of their 16 free membership corpora. Non-members
may request this data by submitting a signed copy of the LDC User Agreement for Non-members. This
data is available at no cost.