Linguistic Data Consortium: June 2014

LDC at ACL 2014: June 23-25, Baltimore, MD
Early renewing members save on fees

Commercial use and LDC data

New publications:
Abstract Meaning Representation (AMR) Annotation Release 1.0

ETS Corpus of Non-Native Written English

GALE Phase 2 Chinese Broadcast News Parallel Text Part 2

MADCAT Chinese Pilot Training Set

LDC at ACL 2014: June 23-25, Baltimore, MD
ACL has returned to North America and LDC is taking this opportunity to interact with top HLT researchers gathering in Baltimore, MD. LDC’s exhibition table will feature information on new developments at the consortium and some interesting giveaways.
LDC’s Seth Kulick will present research results on “Parser Evaluation Using Derivation Trees: A Complement to evalb” (SP88) during Tuesday’s Long Paper, Short Paper, Poster & Dinner Session II (June 24, 16:50-19:20). This paper was coauthored by LDCers Ann Bies, Justin Mott, and Mark Liberman and Penn linguists Anthony Kroch and Beatrice Santorini.

LDC staff will also participate in the post-conference 2nd Workshop on EVENTS: Definition, Detection, Coreference and Representation (https://sites.google.com/site/wsevents2014/home) on Friday, June 27, with presentations at the poster session:

· Inter-annotator Agreement for ERE annotation: Seth Kulick, Ann Bies and Justin Mott

· A Comparison of the Events and Relations Across ACE, ERE, TAC-KBP, and FrameNet Annotation Standards: Stephanie Strassel, Zhiyi Song, Joe Ellis (all LDC) and Jacqueline Aguilar, Charley Beller, Paul McNamee, Benjamin van Durme

Early renewing members save on fees

LDC's early renewal discount program has resulted in significant savings for Membership Year (MY) 2014 members! The 100 organizations that renewed their membership or joined early for MY2014 saved over US$60,000 on membership fees. MY2013 members can still take advantage of savings and are eligible for a 5% discount when renewing for MY2014. This discount will apply throughout 2014.
Organizations joining LDC can take advantage of membership benefits including free membership year data as well as discounts on older LDC corpora. For-profit members can use most LDC data for commercial applications.

Commercial use and LDC data

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. For further information, visit our Licensing page: https://www.ldc.upenn.edu/data-management/using/licensing.

New publications

Abstract Meaning Representation (AMR) Annotation Release 1.0 was developed by LDC, SDL/Language Weaver, Inc., the University of Colorado's Center for Computational Language and Educational Research and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of over 13,000 English natural language sentences from newswire, weblogs and web discussion forums.

AMR captures “who is doing what to whom” in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree-structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independent of its syntax.
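
As a rough illustration of the graph idea, the sketch below encodes the AMR for "The boy wants to go" as plain Python triples. This sentence is a commonly used example from AMR documentation and is not drawn from this release. The variable names and the helper functions `incoming_edges` and `reentrant_nodes` are illustrative assumptions, not part of any AMR tool.

```python
from collections import Counter

# AMR for "The boy wants to go", in the usual bracketed notation:
#   (w / want-01 :ARG0 (b / boy) :ARG1 (g / go-01 :ARG0 b))
# Concepts label the graph's nodes; role triples are its labeled edges.
concepts = {"w": "want-01", "b": "boy", "g": "go-01"}
roles = [
    ("w", "ARG0", "b"),  # the boy is the wanter
    ("w", "ARG1", "g"),  # the going is what is wanted
    ("g", "ARG0", "b"),  # the boy is also the goer
]


def incoming_edges(role_triples):
    """Count how many edges point at each variable."""
    return Counter(target for _, _, target in role_triples)


def reentrant_nodes(role_triples):
    """Return the variables that have more than one incoming edge."""
    return sorted(v for v, n in incoming_edges(role_triples).items() if n > 1)


print(reentrant_nodes(roles))
```

In this sketch the variable `b` has two incoming edges. That reuse of a single node is how within-sentence coreference appears in the graph: the same boy fills a role in both the wanting and the going.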

The source data includes discussion forums collected for the DARPA BOLT program, Wall Street Journal and translated Xinhua news texts, various newswire data from NIST OpenMT evaluations and weblog data used in the DARPA GALE program. The following table summarizes the number of training, dev, and test AMRs for each dataset in the release. Totals are also provided by partition and dataset:

Dataset	Training	Dev	Test	Totals
BOLT DF MT	1061	133	133	1327
Weblog and WSJ	0	100	100	200
BOLT DF English	1703	210	229	2142
2009 Open MT	204	0	0	204
Xinhua MT	741	99	86	926
Totals	3709	542	548	4799
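
The totals in the table above can be checked with a few lines of Python. This is only a sketch: the counts are copied from the table by hand, and the `counts` dictionary and variable names are illustrative.

```python
# (training, dev, test) AMR counts per dataset, copied from the table.
counts = {
    "BOLT DF MT": (1061, 133, 133),
    "Weblog and WSJ": (0, 100, 100),
    "BOLT DF English": (1703, 210, 229),
    "2009 Open MT": (204, 0, 0),
    "Xinhua MT": (741, 99, 86),
}

# Row totals: sum across partitions for each dataset.
row_totals = {name: sum(parts) for name, parts in counts.items()}

# Column totals: sum across datasets for each partition.
train_total, dev_total, test_total = (sum(col) for col in zip(*counts.values()))

# Grand total over every dataset and partition.
grand_total = sum(row_totals.values())

print(row_totals)
print(train_total, dev_total, test_total, grand_total)
```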

Abstract Meaning Representation (AMR) Annotation Release 1.0 is distributed via web download.
2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$300.

ETS Corpus of Non-Native Written English was developed by Educational Testing Service and comprises 12,100 English essays written by speakers of 11 non-English native languages as part of an international test of academic English proficiency, TOEFL (Test of English as a Foreign Language). The test includes reading, writing, listening, and speaking sections and is delivered by computer in a secure test center. This release contains 1,100 essays for each of the 11 native languages sampled from eight topics with information about the score level (low/medium/high) for each essay.

The corpus was developed with the specific task of native language identification in mind, but is likely to support tasks and studies in the educational domain, including grammatical error detection and correction and automatic essay scoring, in addition to a broad range of research studies in the fields of natural language processing and corpus linguistics. For the task of native language identification, the following division is recommended: 82% as training data, 9% as development data and 9% as test data, split according to the file IDs accompanying the data set.

The data is sampled from essays written in 2006 and 2007 by test takers whose native languages were Arabic, Chinese, French, German, Hindi, Italian, Japanese, Korean, Spanish, Telugu, and Turkish. Original raw files for 11,000 of the 12,100 tokenized files are included in this release along with prompts (topics) for the essays and metadata about the test takers’ proficiency level. The data is presented in UTF-8 formatted text files.

ETS Corpus of Non-Native Written English is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc provided they have completed the user license agreement. 2014 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

GALE Phase 2 Chinese Broadcast News Parallel Text Part 2 was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Chinese source text and corresponding English translations selected from broadcast news (BN) data collected by LDC between 2005 and 2007 and transcribed by LDC or under its direction.

This release includes 30 source-translation document pairs, comprising 206,737 characters of translated material. Data is drawn from 12 distinct Chinese BN programs broadcast by China Central TV, a national and international broadcaster in Mainland China; New Tang Dynasty TV, a broadcaster based in the United States; and Phoenix TV, a Hong Kong-based satellite television station. The broadcast news recordings in this release focus principally on current events.

The data was transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with Quick Rich Transcription guidelines developed by LDC. Transcribers indicated sentence boundaries in addition to transcribing the text. Data was manually selected for translation according to several criteria, including linguistic features, transcription features and topic features. The transcribed and segmented files were then reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Chinese to English translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations.

GALE Phase 2 Chinese Broadcast News Parallel Text Part 2 is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

MADCAT (Multilingual Automatic Document Classification Analysis and Translation) Chinese Pilot Training Set contains all training data created by LDC to support a Chinese pilot collection in the DARPA MADCAT Program. The data in this release consists of handwritten Chinese documents, scanned at high resolution and annotated for the physical coordinates of each line and token. Digital transcripts and English translations of each document are also provided, with the various content and annotation layers integrated in a single MADCAT XML output.

The goal of the MADCAT program was to automatically convert foreign text images into English transcripts. MADCAT Chinese pilot data was collected from Chinese source documents in three genres: newswire, weblog and newsgroup text. Chinese-speaking "scribes" copied documents by hand, following specific instructions on writing style (fast, normal, careful), writing implement (pen, pencil) and paper (lined, unlined). Prior to assignment, source documents were processed to optimize their appearance for the handwriting task, which resulted in some original source documents being broken into multiple "pages" for handwriting. Each resulting handwritten page was assigned to up to five independent scribes, using different writing conditions.

The handwritten, transcribed documents were next checked for quality and completeness, then each page was scanned at a high resolution (600 dpi, greyscale) to create a digital version of the handwritten document. The scanned images were then annotated to indicate the physical coordinates of each line and token. Explicit reading order was also labeled, along with any errors produced by the scribes when copying the text.

The final step was to produce a unified data format that takes multiple data streams and generates a single MADCAT XML output file which contains all required information. The resulting madcat.xml file contains distinct components: a text layer that consists of the source text, tokenization and sentence segmentation; an image layer that consists of bounding boxes; a scribe demographic layer that consists of scribe ID and partition (train/test); and a document metadata layer.

This release includes 22,284 annotation files in both GEDI XML and MADCAT XML formats (.gedi.xml and .madcat.xml) along with their corresponding scanned image files in TIFF format. The annotation results in GEDI XML files include ground truth annotations and source transcripts.

MADCAT (Multilingual Automatic Document Classification Analysis and Translation) Chinese Pilot Training Set is distributed on five DVD-ROMs.

2014 Subscription Members will automatically receive two copies of this data. 2014 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Linguistic Data Consortium

Thursday, June 19, 2014

LDC June 2014 Newsletter