The Future of Language Resources: LDC 20th Anniversary Workshop Summary
Thanks to the members, friends and staff who made our 20th Anniversary Workshop (September 6-7) a fruitful and fun experience. The speakers, from academia, industry and government, engaged participants and provoked discussion with their talks about the ways in which language resources contribute to research in language-related fields and other disciplines, and with their insights into the future. The result was much food for thought as we enter our third decade.
Visit the workshop page for the proceedings and to learn more about the event.
English Treebanking at LDC
As part of our 20th anniversary celebration, the coming newsletters will include features that provide an overview of the broad range of LDC’s activities. This month, we'll examine English treebanking efforts at LDC. The English treebanking team is led by Ann Bies, Senior Research Coordinator. The association of treebanks with LDC began with the publication of the original Penn English Treebank (Treebank-2) in 1995. Since that time, the need for new varieties of English treebank data has continued to grow, and LDC has expanded its expertise to address new research challenges. This includes the development of treebanked data for additional domains, including conversational speech and web text, as well as the creation of parallel treebank data.
Speech data presents unique challenges not inherent in edited text, such as speech disfluency and hesitations. Penn Treebank contains conversational speech data from the Switchboard telephone collection which has been tagged, dysfluency-annotated, and parsed. LDC’s more recent publication, English CTS Treebank with Structural Metadata, builds on that annotation and includes new data. The development of that corpus was motivated by the need to have both structural metadata and syntactic structure annotated in order to support work on speech parsing and structural event detection. The annotation involved a two-pass approach to annotating metadata, speech effects and syntactic structure in transcribed conversational speech: separately annotating for structural metadata, or structural events, and for syntactic structure. The two annotations were then combined into a single aligned representation.
Also recently, LDC has undertaken complex syntactic annotation of data collected over the web. Since most parsers are trained using newswire, they achieve better accuracy on similar heavily edited texts. LDC, through a gift from Google Inc., developed English Web Treebank to improve parsing, translation and information extraction on unedited domains, such as blogs, newsgroups, and consumer reviews. LDC’s annotation guidelines were adapted to handle unique features of web text such as inconsistent punctuation and capitalization as well as the increased use of slang, technical jargon and ungrammatical sentences.
LDC and its research partners are also involved in the creation of parallel treebanks used for word alignment tasks. Parallel treebanks are annotated morphological and syntactic structures that are aligned at the sentence level as well as at sub-sentence levels. These resources are used for improving machine translation quality. To create such treebanks, English files (translated from the source Arabic or Chinese) are first automatically part-of-speech tagged and parsed and then hand-corrected at each stage. The quality control process consists of a series of specific searches for over 100 types of potential inconsistency and parser or annotation error. Parallel treebank data in the LDC catalog includes the English Translation Treebank: An Nahar Newswire, whose files are parallel with those in Arabic Treebank: Part 3 v 3.2.
English treebanking at LDC is ongoing; new titles are in progress and will be added to our catalog.
(1) GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web was developed by LDC and contains 150,068 tokens of word aligned Chinese and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program. This release consists of Chinese source newswire and web data (newsgroup, weblog) collected by LDC in 2008.
Some approaches to statistical machine translation include the incorporation of linguistic knowledge in word aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. A set of word tags and alignment link tags are designed in the tagging scheme to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.
The Chinese word alignment tasks consisted of the following components:
-Identifying, aligning, and tagging 8 different types of links
-Identifying, attaching, and tagging local-level unmatched words
-Identifying and tagging sentence/discourse-level unmatched words
-Identifying and tagging all instances of Chinese 的 (DE) except when they were a part of a semantic link.
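As a rough illustration of the tasks listed above, the following Python sketch stores tagged alignment links and reports source words that no link covers. It is not the format or tag set of the released corpus: the `AlignmentLink` class, the `unlinked_source` helper, the 0-based token indices and the "semantic" tag label are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class AlignmentLink:
    """One hypothetical alignment link between Chinese and English tokens."""
    source: Tuple[int, ...]  # 0-based indices into the Chinese token list
    target: Tuple[int, ...]  # 0-based indices into the English token list
    tag: str                 # illustrative link tag, not the corpus tag set


def unlinked_source(n_source: int, links: List[AlignmentLink]) -> List[int]:
    """Return the indices of source tokens that no link covers."""
    covered = {i for link in links for i in link.source}
    return [i for i in range(n_source) if i not in covered]


# Toy sentence pair: "他 的 书" / "his book"
chinese = ["他", "的", "书"]
english = ["his", "book"]
links = [
    AlignmentLink(source=(0,), target=(0,), tag="semantic"),
    AlignmentLink(source=(2,), target=(1,), tag="semantic"),
]

# Tokens left over here are candidates for attachment and tagging.
leftover = [chinese[i] for i in unlinked_source(len(chinese), links)]
print(leftover)
```

In this toy pair the only source word left outside a link is 的, the kind of token the last task in the list singles out for its own tag.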
GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web is distributed via web download. 2012 Subscription Members will automatically receive two copies of this data on CD. 2012 Standard Members may request a copy as part of their 16 free membership corpora.
(2) MADCAT Phase 1 Training Set contains all training data created by LDC to support Phase 1 of the DARPA MADCAT Program. The data in this release consists of handwritten Arabic documents scanned at high resolution and annotated for the physical coordinates of each line and token. Digital transcripts and English translations of each document are also provided, with the various content and annotation layers integrated in a single MADCAT XML output.
The goal of the MADCAT program is to automatically convert foreign text images into English transcripts. MADCAT Phase 1 data was collected by LDC from Arabic source documents in three genres: newswire, weblog and newsgroup text. Arabic-speaking "scribes" copied documents by hand, following specific instructions on writing style (fast, normal, careful), writing implement (pen, pencil) and paper (lined, unlined). Prior to assignment, source documents were processed to optimize their appearance for the handwriting task, which resulted in some original source documents being broken into multiple "pages" for handwriting. Each resulting handwritten page was assigned to up to five independent scribes, using different writing conditions.
The handwritten, transcribed documents were checked for quality and completeness, then each page was scanned at a high resolution (600 dpi, greyscale) to create a digital version of the handwritten document. The scanned images were then annotated to indicate the physical coordinates of each line and token. Explicit reading order was also labeled, along with any errors produced by the scribes when copying the text.
The final step was to produce a unified data format that takes multiple data streams and generates a single XML output file which contains all required information. The resulting XML file has these distinct components: a text layer that consists of the source text, tokenization and sentence segmentation; an image layer that consists of bounding boxes; a scribe demographic layer that consists of scribe ID and partition (train/test); and a document metadata layer. This release includes 9693 annotation files in MADCAT XML format (.madcat.xml) along with their corresponding scanned image files in TIFF format.
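One way to picture how these layers fit together is the minimal Python sketch below. It does not parse MADCAT XML and does not reflect the actual schema: the `Token` and `Page` classes, their field names, the pixel box convention and the transliterated sample tokens are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # assumed convention: (left, top, right, bottom) in pixels


@dataclass
class Token:
    text: str           # text layer: transcribed token
    reading_order: int  # explicit reading-order label
    box: Box            # image layer: bounding box on the scanned page


@dataclass
class Page:
    scribe_id: str      # scribe demographic layer (hypothetical ID format)
    partition: str      # "train" or "test"
    tokens: List[Token]


def transcript(page: Page) -> str:
    """Join token text in labeled reading order, not in stored order."""
    ordered = sorted(page.tokens, key=lambda t: t.reading_order)
    return " ".join(t.text for t in ordered)


def page_extent(page: Page) -> Box:
    """Smallest box that encloses every token box on the page."""
    lefts, tops, rights, bottoms = zip(*(t.box for t in page.tokens))
    return (min(lefts), min(tops), max(rights), max(bottoms))


# Placeholder tokens (transliterated), stored left to right as they appear in the image
page = Page(
    scribe_id="scribe-001",
    partition="train",
    tokens=[
        Token("jadid", reading_order=2, box=(100, 50, 300, 120)),
        Token("kitab", reading_order=1, box=(320, 50, 520, 120)),
    ],
)
print(transcript(page))
print(page_extent(page))
```

Keeping the reading order as an explicit field, rather than inferring it from box positions, mirrors the labeling step described above and avoids guessing direction from layout.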
MADCAT Phase 1 Training Set is distributed on two DVD-ROMs. 2012 Subscription Members will automatically receive two copies of this data. 2012 Standard Members may request a copy as part of their 16 free membership corpora.