New publications
LDC2012T16
LDC2012T15
The Future of Language Resources: LDC 20th Anniversary Workshop Summary
Thanks to the members,
friends and staff who made
our 20th Anniversary Workshop (September 6-7) a fruitful and fun
experience. The speakers, from academia, industry and government,
engaged participants and provoked discussion with their talks
about the ways in which language resources contribute to research
in language-related fields and other disciplines and with their
insights into the future. The result was much food for thought as
we enter our third decade.
Visit the workshop
page for the proceedings and to learn more about the event.
English Treebanking at LDC
As part of our 20th anniversary celebration, the coming newsletters
will include features that provide an overview of the broad
range of LDC’s activities. This month, we'll examine English
treebanking efforts at LDC. The English treebanking team is led
by Ann Bies, Senior Research Coordinator. The association of treebanks
with LDC began with the publication of the original Penn English
Treebank (Treebank-2)
in 1995. Since that time
the need for new varieties of English treebank data has continued
to grow, and LDC has expanded its expertise to address new
research challenges. This work
includes the development of treebanked data for additional domains,
such as conversational speech and web text, as well as the
creation of parallel treebank data.
Speech data presents challenges
not found in edited text, such as disfluencies
and hesitations. Penn
Treebank contains conversational speech data from the Switchboard telephone
collection, which has been tagged, dysfluency-annotated, and
parsed. LDC’s more recent publication, English CTS Treebank with Structural
Metadata,
builds on that annotation and includes new data. The development
of that corpus was motivated by the need to have both structural
metadata and syntactic structure annotated in order to support
work on speech parsing and structural event detection. The
annotation involved a two-pass approach to annotating metadata,
speech effects and syntactic structure in transcribed
conversational speech: separately annotating for structural
metadata, or structural events, and for syntactic structure. The
two annotations were then combined into a single aligned
representation.
More recently, LDC has undertaken
complex syntactic annotation of data collected over the web. Because most parsers are
trained on newswire, they achieve better accuracy on similar,
heavily edited texts. LDC,
through a gift from Google Inc., developed English
Web
Treebank to improve parsing, translation and information
extraction on unedited domains, such as blogs, newsgroups, and
consumer reviews. LDC’s
annotation guidelines were adapted to handle unique features of
web text such as inconsistent punctuation and capitalization as
well as the increased use of slang, technical jargon and
ungrammatical sentences.
LDC and its research partners are
also involved in the creation of parallel treebanks used for word
alignment tasks. Parallel
treebanks contain annotated morphological and syntactic structures
that are aligned at the sentence and sub-sentence levels. These
resources are used for improving machine translation quality. To
create such treebanks, English files (translated from
the source Arabic or Chinese) are first automatically part-of-speech
tagged and parsed and then hand-corrected at each stage. The quality control process
consists of a series of specific searches for over 100 types of
potential inconsistency and parser or annotation error. Parallel treebank data in the
LDC catalog includes the English Translation Treebank: An Nahar Newswire,
whose files are parallel with those in Arabic Treebank: Part 3 v 3.2.
English treebanking at
LDC is ongoing; new titles are in progress and will be added to
our catalog.
New Publications
(1) GALE
Chinese-English
Word Alignment and Tagging Training Part 1 -- Newswire and Web
was developed by LDC and contains 150,068 tokens of word-aligned
Chinese and English parallel text enriched with linguistic tags.
This material was used as training data in the DARPA GALE
(Global Autonomous Language Exploitation) program. This release consists of Chinese
source newswire and web data (newsgroup, weblog) collected by LDC
in 2008.
Some approaches to statistical machine
translation include the incorporation of linguistic knowledge in
word-aligned text as a means to improve automatic word alignment
and machine translation quality. This is accomplished with two
annotation schemes: alignment and tagging. Alignment identifies
minimum translation units and translation relations by using
minimum-match and attachment annotation approaches. The tagging
scheme defines a set of word tags and alignment link tags to
describe these translation units and relations. Tagging adds
contextual, syntactic and language-specific features to the
alignment annotation.
The Chinese word alignment tasks consisted of
the following components:
-Identifying, aligning, and tagging 8 different types of links
-Identifying, attaching, and tagging local-level unmatched words
-Identifying and tagging sentence/discourse-level unmatched words
-Identifying and tagging all instances of Chinese 的 (DE) except when they were a part of a semantic link.
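To make the alignment and tagging schemes described above more concrete, here is a minimal sketch of how an alignment link carrying a link tag and word tags might be represented, and how source tokens left outside every link could be found. The class, the field names and the tag values ("SEM", "DE") are hypothetical illustrations; they are not the file format or tag set used in the corpus.

```python
from dataclasses import dataclass, field


@dataclass
class AlignmentLink:
    """One translation unit: source tokens aligned to target tokens.

    Hypothetical structure for illustration only; not the corpus format.
    """
    source_ids: tuple   # 1-based indices of Chinese tokens in the link
    target_ids: tuple   # 1-based indices of English tokens in the link
    link_tag: str       # hypothetical label for the translation relation
    word_tags: dict = field(default_factory=dict)  # token index -> word tag


def unmatched_source_tokens(num_source_tokens, links):
    """Return indices of source tokens that no link covers."""
    covered = {i for link in links for i in link.source_ids}
    return [i for i in range(1, num_source_tokens + 1) if i not in covered]


# A four-token source sentence with two links; token 4 is left unmatched.
links = [
    AlignmentLink((1,), (1,), "SEM"),
    AlignmentLink((2, 3), (2,), "SEM", word_tags={3: "DE"}),
]
remaining = unmatched_source_tokens(4, links)
```

In this sketch, any token returned by `unmatched_source_tokens` would still need to be attached to a link or tagged as unmatched, as in the task components listed above.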
GALE Chinese-English Word Alignment and Tagging
Training Part 1 -- Newswire and Web is distributed via web
download. 2012 Subscription Members will automatically
receive two copies of this data on CD. 2012 Standard Members
may request a copy as part of their 16 free membership corpora.
*
(2) MADCAT
Phase 1 Training Set contains all training data created by
LDC to support Phase 1 of the DARPA MADCAT Program. The data in
this release consists of handwritten Arabic documents scanned at
high resolution and annotated for the physical coordinates of each
line and token. Digital transcripts and English translations of
each document are also provided, with the various content and
annotation layers integrated in a single MADCAT XML output.
The goal of the MADCAT program is to
automatically convert foreign text images into English
transcripts. MADCAT Phase 1 data was collected by LDC from Arabic
source documents in three genres: newswire, weblog and newsgroup
text. Arabic-speaking "scribes" copied documents by hand,
following specific instructions on writing style (fast, normal,
careful), writing implement (pen, pencil) and paper (lined,
unlined). Prior to assignment, source documents were processed to
optimize their appearance for the handwriting task, which resulted
in some original source documents being broken into multiple
"pages" for handwriting. Each resulting handwritten page was
assigned to up to five independent scribes, using different
writing conditions.
The handwritten, transcribed documents were checked for quality and
completeness, then each page was scanned at a high resolution (600
dpi, greyscale) to create a digital version of the handwritten
document. The scanned images were then annotated to indicate the
physical coordinates of each line and token. Explicit reading
order was also labeled, along with any errors produced by the
scribes when copying the text.
The final step was to produce a unified data
format that takes multiple data streams and generates a single XML
output file which contains all required information. The resulting
XML file has these
distinct components: a text layer that consists of the source
text, tokenization and sentence segmentation; an image layer that
consists of bounding boxes; a scribe demographic layer that
consists of scribe ID and partition (train/test); and a document
metadata layer. This release includes 9693 annotation files in
MADCAT XML format (.madcat.xml) along with their corresponding
scanned image files in TIFF format.
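As an illustration of how the layers in a single XML file can be read together, the following sketch joins a text layer with an image layer by token ID. The element names, attribute names and sample content are invented for this example and are not the actual MADCAT XML schema; consult the documentation included with the release for the real format.

```python
import xml.etree.ElementTree as ET

# Hypothetical layered document; NOT the real MADCAT XML schema.
# Token text is a Latin-script placeholder.
SAMPLE = """\
<document id="doc1">
  <text>
    <token id="t1">kitab</token>
    <token id="t2">jadid</token>
  </text>
  <image>
    <box token="t1" x="120" y="40" width="80" height="30"/>
    <box token="t2" x="30" y="40" width="85" height="30"/>
  </image>
  <scribe id="s01" partition="train"/>
</document>
"""


def tokens_with_boxes(xml_string):
    """Pair each token in the text layer with its bounding box, if any."""
    root = ET.fromstring(xml_string)
    # Index the image layer by the token ID each box refers to.
    boxes = {
        box.get("token"): tuple(
            int(box.get(name)) for name in ("x", "y", "width", "height")
        )
        for box in root.iter("box")
    }
    return [
        (tok.get("id"), tok.text, boxes.get(tok.get("id")))
        for tok in root.iter("token")
    ]


pairs = tokens_with_boxes(SAMPLE)
```

Each entry in `pairs` holds a token ID, its text and an `(x, y, width, height)` tuple, or `None` when the image layer has no box for that token.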
MADCAT Phase 1 Training Set is distributed on
two DVD-ROMs. 2012 Subscription Members will automatically
receive two copies of this data. 2012 Standard Members may
request a copy as part of their 16 free membership corpora.