Early renewing members save on fees
Commercial use and LDC data
New publications:
Abstract Meaning Representation (AMR) Annotation Release 1.0
ETS Corpus of Non-Native Written English
GALE Phase 2 Chinese Broadcast News Parallel Text Part 2
MADCAT Chinese Pilot Training Set
LDC at ACL 2014: June 23-25, Baltimore, MD
ACL has returned to North America and LDC is taking this opportunity to interact with top HLT researchers gathering in Baltimore, MD. LDC’s exhibition table will feature information on new developments at the consortium and some interesting giveaways.
LDC’s Seth Kulick will present research results on “Parser Evaluation Using Derivation Trees: A Complement to evalb” (SP88) during Tuesday’s Long Paper, Short Paper, Poster & Dinner Session II (June 24, 16:50-19:20). This paper was coauthored by LDCers Ann Bies, Justin Mott, and Mark Liberman and Penn linguists Anthony Kroch and Beatrice Santorini.
LDC staff will also participate in the post-conference 2nd Workshop on EVENTS: Definition, Detection, Coreference and Representation (https://sites.google.com/site/wsevents2014/home) on Friday, June 27, with presentations at the poster session:
· Inter-annotator Agreement for ERE annotation: Seth Kulick, Ann Bies and Justin Mott
· A Comparison of the Events and Relations Across ACE, ERE, TAC-KBP, and FrameNet Annotation Standards: Stephanie Strassel, Zhiyi Song, Joe Ellis (all LDC) and Jacqueline Aguilar, Charley Beller, Paul McNamee, Benjamin van Durme
Early renewing members save on fees
LDC's early renewal discount program has resulted in significant savings for Membership Year (MY) 2014 members! The 100 organizations that renewed their membership or joined early for MY2014 saved over US$60,000 on membership fees. MY2013 members can still take advantage of savings and are eligible for a 5% discount when renewing for MY2014. This discount will apply throughout 2014.
Organizations joining LDC can take advantage of membership benefits including free membership year data as well as discounts on older LDC corpora. For-profit members can use most LDC data for commercial applications.
Commercial use and LDC data
For-profit organizations are reminded that an
LDC membership is a prerequisite for obtaining a commercial
license to almost all LDC databases. Non-member organizations,
including non-member for-profit organizations, cannot use LDC data
to develop or test products for commercialization, nor can they
use LDC data in any commercial product or for any commercial
purpose. LDC data users should consult corpus-specific license
agreements for limitations on the use of certain corpora. Visit
our Licensing page for further information, https://www.ldc.upenn.edu/data-management/using/licensing.
New publications
Abstract Meaning
Representation (AMR) Annotation Release 1.0 was developed by
LDC, SDL/Language
Weaver, Inc., the University of Colorado's Center for
Computational Language and Educational Research and the Information Sciences Institute
at the University of Southern California. It contains a sembank
(semantic treebank) of over 13,000 English natural language
sentences from newswire, weblogs and web discussion forums.
AMR captures “who is doing what to whom” in a
sentence. Each sentence is paired with a graph that represents its
whole-sentence meaning in a tree-structure. AMR utilizes PropBank
frames, non-core semantic roles, within-sentence coreference,
named entity annotation, modality, negation, questions,
quantities, and so on to represent the semantic structure of a
sentence largely independent of its syntax.
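As an illustration only, the sentence "The boy wants to go" (a standard example from the AMR literature, not a sentence taken from this release) is usually written in AMR notation as (w / want-01 :ARG0 (b / boy) :ARG1 (g / go-01 :ARG0 b)). The sketch below stores that graph as plain triples; the triple layout and the helper name roles_filled_by are assumptions made for this example and say nothing about how the release itself is formatted.

```python
# Illustrative AMR for "The boy wants to go" as (source, role, target) triples.
# The variables (w, b, g) and this triple layout are assumptions for the sketch.
AMR_TRIPLES = [
    ("w", "instance", "want-01"),
    ("b", "instance", "boy"),
    ("g", "instance", "go-01"),
    ("w", "ARG0", "b"),  # the boy is the one who wants
    ("w", "ARG1", "g"),  # what is wanted is the going event
    ("g", "ARG0", "b"),  # the boy is also the one who goes
]


def roles_filled_by(variable, triples):
    """Return (concept, role) pairs for every event that takes `variable` as an argument."""
    concepts = {src: tgt for src, role, tgt in triples if role == "instance"}
    return [
        (concepts[src], role)
        for src, role, tgt in triples
        if role != "instance" and tgt == variable
    ]


if __name__ == "__main__":
    # The boy (variable "b") takes part in both the wanting and the going.
    print(roles_filled_by("b", AMR_TRIPLES))
```

Reusing the variable b is how one participant can fill roles in more than one event, which is the "who is doing what to whom" information the annotation is meant to capture.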
The source data includes discussion forums
collected for the DARPA BOLT program, Wall Street Journal and
translated Xinhua news texts, various newswire data from NIST
OpenMT evaluations and weblog data used in the DARPA GALE program.
The following table summarizes the number of training, dev, and
test AMRs for each dataset in the release. Totals are also
provided by partition and dataset:
| Dataset | Training | Dev | Test | Totals |
|---|---|---|---|---|
| BOLT DF MT | 1061 | 133 | 133 | 1327 |
| Weblog and WSJ | 0 | 100 | 100 | 200 |
| BOLT DF English | 1703 | 210 | 229 | 2142 |
| 2009 Open MT | 204 | 0 | 0 | 204 |
| Xinhua MT | 741 | 99 | 86 | 926 |
| Totals | 3709 | 542 | 548 | 4799 |
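The totals in the table can be recomputed from the per-dataset counts. The short sketch below does so; the counts are copied from the table, while the dictionary layout and function names are simply choices made for this example.

```python
# Per-dataset counts copied from the table above: (training, dev, test).
COUNTS = {
    "BOLT DF MT": (1061, 133, 133),
    "Weblog and WSJ": (0, 100, 100),
    "BOLT DF English": (1703, 210, 229),
    "2009 Open MT": (204, 0, 0),
    "Xinhua MT": (741, 99, 86),
}


def dataset_totals(counts):
    """Sum training, dev and test AMRs for each dataset (the last column)."""
    return {name: sum(parts) for name, parts in counts.items()}


def partition_totals(counts):
    """Sum each partition across all datasets (the last row)."""
    train, dev, test = (sum(column) for column in zip(*counts.values()))
    return {"training": train, "dev": dev, "test": test}


if __name__ == "__main__":
    print(dataset_totals(COUNTS))
    print(partition_totals(COUNTS))
    print(sum(dataset_totals(COUNTS).values()))
```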
Abstract Meaning Representation (AMR) Annotation Release 1.0 is distributed via web download.
2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$300.
*
ETS Corpus of
Non-Native Written English was developed by Educational Testing Service and
comprises 12,100 English essays written by speakers of 11
non-English native languages as part of an international test of
academic English proficiency, TOEFL (Test of
English as a Foreign Language). The test includes reading,
writing, listening, and speaking sections and is delivered by
computer in a secure test center. This release contains 1,100
essays for each of the 11 native languages sampled from eight
topics with information about the score level (low/medium/high)
for each essay.
The corpus was developed with the specific task
of native language identification in mind, but is likely to
support tasks and studies in the educational domain, including
grammatical error detection and correction and automatic essay
scoring, in addition to a broad range of research studies in the
fields of natural language processing and corpus linguistics. For
the task of native language identification, the following division
is recommended: 82% as training data, 9% as development data and
9% as test data, split according to the file IDs accompanying the
data set.
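A minimal sketch of applying an ID-based split of this kind follows. The file IDs and the dev and test ID lists are hypothetical stand-ins; the actual ID files and their format ship with the corpus and are not described here. The sketch also assumes that any file not listed for dev or test is training data.

```python
from collections import Counter


def assign_partitions(file_ids, dev_ids, test_ids):
    """Map each file ID to 'train', 'dev' or 'test'.

    IDs listed in dev_ids or test_ids go to those partitions;
    every other ID is treated as training data (an assumption of this sketch).
    """
    dev_ids, test_ids = set(dev_ids), set(test_ids)
    overlap = dev_ids & test_ids
    if overlap:
        raise ValueError(f"IDs listed for both dev and test: {sorted(overlap)}")
    partitions = {}
    for file_id in file_ids:
        if file_id in dev_ids:
            partitions[file_id] = "dev"
        elif file_id in test_ids:
            partitions[file_id] = "test"
        else:
            partitions[file_id] = "train"
    return partitions


def partition_shares(partitions):
    """Return each partition's share of the files as a rounded percentage."""
    counts = Counter(partitions.values())
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


if __name__ == "__main__":
    # Toy data: 11 hypothetical file IDs, one held out for dev and one for test.
    ids = [f"essay_{i:02d}" for i in range(11)]
    parts = assign_partitions(ids, dev_ids=["essay_09"], test_ids=["essay_10"])
    print(partition_shares(parts))
```

Driving the split from fixed ID lists rather than random sampling keeps the partitions identical across experiments, which makes results comparable between studies.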
The data is sampled from essays written in 2006
and 2007 by test takers whose native languages were Arabic,
Chinese, French, German, Hindi, Italian, Japanese, Korean,
Spanish, Telugu, and Turkish. Original raw files for 11,000 of the
12,100 tokenized files are included in this release along with
prompts (topics) for the essays and metadata about the test
takers’ proficiency level. The data is presented in UTF-8
formatted text files.
ETS Corpus of Non-Native Written English is
distributed via web download.
2014 Subscription Members will automatically receive two copies of this data on disc provided they have completed the user license agreement. 2014 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
GALE Phase 2
Chinese Broadcast News Parallel Text Part 2 was developed
by LDC. Along with other corpora, the parallel text in this
release comprised training data for Phase 2 of the DARPA GALE
(Global Autonomous Language Exploitation) Program. This corpus
contains Chinese source text and corresponding English
translations selected from broadcast news (BN) data collected by
LDC between 2005 and 2007 and transcribed by LDC or under its
direction.
This release includes 30 source-translation
document pairs, comprising 206,737 characters of translated
material. Data is drawn from 12 distinct Chinese BN programs
broadcast by China Central TV, a national and international
broadcaster in Mainland China; New Tang Dynasty TV, a broadcaster
based in the United States; and Phoenix TV, a Hong Kong-based
satellite television station. The broadcast news recordings in
this release focus principally on current events.
The data was transcribed by LDC staff and/or
transcription vendors under contract to LDC in accordance with
Quick Rich Transcription guidelines developed by LDC. Transcribers
indicated sentence boundaries in addition to transcribing the
text. Data was manually selected for translation according to
several criteria, including linguistic features, transcription
features and topic features. The transcribed and segmented files
were then reformatted into a human-readable translation format and
assigned to translation vendors. Translators followed LDC's
Chinese to English translation guidelines. Bilingual LDC staff
performed quality control procedures on the completed
translations.
GALE Phase 2 Chinese Broadcast News Parallel
Text Part 2 is distributed via web download.
2014 Subscription Members will automatically
receive two copies of this data on disc. 2014 Standard Members
may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.
*
MADCAT
(Multilingual Automatic Document Classification Analysis and
Translation) Chinese Pilot Training Set contains all
training data created by LDC to support a Chinese pilot collection
in the DARPA MADCAT Program. The data in this release consists of
handwritten Chinese documents, scanned at high resolution and
annotated for the physical coordinates of each line and token.
Digital transcripts and English translations of each document are
also provided, with the various content and annotation layers
integrated in a single MADCAT XML output.
The goal of the MADCAT program was to
automatically convert foreign text images into English
transcripts. MADCAT Chinese pilot data was collected from Chinese
source documents in three genres: newswire, weblog and newsgroup
text. Chinese-speaking "scribes" copied documents by hand,
following specific instructions on writing style (fast, normal,
careful), writing implement (pen, pencil) and paper (lined,
unlined). Prior to assignment, source documents were processed to
optimize their appearance for the handwriting task, which resulted
in some original source documents being broken into multiple
"pages" for handwriting. Each resulting handwritten page was
assigned to up to five independent scribes, using different
writing conditions.
The handwritten, transcribed documents were
next checked for quality and completeness, then each page was
scanned at a high resolution (600 dpi, greyscale) to create a
digital version of the handwritten document. The scanned images
were then annotated to indicate the physical coordinates of each
line and token. Explicit reading order was also labeled, along
with any errors produced by the scribes when copying the text.
The final step was to produce a unified data
format that takes multiple data streams and generates a single
MADCAT XML output file which contains all required information.
The resulting madcat.xml file contains distinct components: a text
layer that consists of the source text, tokenization and sentence
segmentation; an image layer that consists of bounding boxes; a
scribe demographic layer that consists of scribe ID and partition
(train/test); and a document metadata layer.
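The four layers can be pictured as a simple record type. The sketch below is not the MADCAT XML schema: every class name, field name and the (x, y, width, height) box convention is a hypothetical stand-in chosen only to mirror the layers listed above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TextLayer:
    source_text: str
    tokens: List[str]
    sentences: List[List[int]]  # token indices grouped by sentence


@dataclass
class ImageLayer:
    # One (x, y, width, height) box per token; this convention is assumed.
    token_boxes: List[Tuple[int, int, int, int]]


@dataclass
class ScribeLayer:
    scribe_id: str
    partition: str  # "train" or "test"


@dataclass
class MadcatDocument:
    text: TextLayer
    image: ImageLayer
    scribe: ScribeLayer
    metadata: Dict[str, str] = field(default_factory=dict)

    def token_with_box(self, index):
        """Pair a token with its bounding box by position."""
        return self.text.tokens[index], self.image.token_boxes[index]


if __name__ == "__main__":
    # Toy values only; none of this comes from the release.
    doc = MadcatDocument(
        text=TextLayer("你好 世界", ["你好", "世界"], [[0, 1]]),
        image=ImageLayer([(10, 20, 80, 40), (100, 20, 80, 40)]),
        scribe=ScribeLayer("scribe-001", "train"),
        metadata={"genre": "newswire"},
    )
    print(doc.token_with_box(1))
```

Keeping the layers as separate parts of one record reflects the idea of a single output file that still lets text, image coordinates, scribe information and document metadata be read independently.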
This release includes 22,284 annotation files
in both GEDI XML and MADCAT XML formats (.gedi.xml and .madcat.xml)
along with their corresponding scanned image files in TIFF format.
The annotation results in GEDI XML files include ground truth
annotations and source transcripts.
MADCAT (Multilingual Automatic Document
Classification Analysis and Translation) Chinese Pilot Training
Set is distributed on five DVD-ROMs.
2014 Subscription Members will automatically receive two copies of this data. 2014 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.