2013 LDC Podcast Available from LDC Blog
Kicking off the new year is the fourth podcast
in our 20th Anniversary series, featuring LDC Senior Researcher
Mohamed Maamouri.
Mohamed directs the Arabic Treebank group and
spearheads the development of Arabic resources and projects. The
latter include his leading role in LDC’s collaboration with
Georgetown University Press to develop updated versions of three
dialectal Arabic dictionaries (Iraqi, Moroccan, Syrian). In this
podcast, he reflects on his personal and professional experiences
and comments on Arabic resource development at LDC.
Click here for Mohamed’s podcast.
Other podcasts will be published via the LDC Blog, so stay
tuned to that space.
Membership Discounts for MY2013 Still Available
If you are considering joining for Membership
Year 2013 (MY2013), there is still time to save on membership
fees. Any organization which joins or renews membership for 2013
through Friday, March 1, 2013, is entitled to a 5% discount on
membership fees. Organizations which held membership for MY2012
can receive a 10% discount on fees provided they renew prior to
March 1, 2013. For further information on pricing, please consult
our Announcements
page or contact LDC.
Penn Discourse Treebank Version 2.0 Update - RTE data
A Recognizing Textual Entailment (RTE) update
is now available for Penn Discourse Treebank Version 2.0 LDC2008T05
(PDTB). This data has been used to run the textual entailment
experiments described in: Sara Tonelli and Elena Cabrio, "Hunting
for Entailing Pairs in the Penn Discourse Treebank", in
Proceedings of COLING 2012, Mumbai, India. The files contain
Text-Hypothesis pairs in the standard RTE XML format (for more
details, see RTE Challenge at TAC 2011), which have been manually annotated
as entailing or not entailing. All sentence pairs have been
extracted from the Penn Discourse Treebank and are therefore
connected by a discourse relation label.
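As a rough illustration of how such Text-Hypothesis pairs might be read, here is a minimal Python sketch. It assumes the layout commonly seen in RTE challenge files: `pair` elements carrying `id` and `entailment` attributes, with the text in a `t` child and the hypothesis in an `h` child. Those element and attribute names are assumptions and should be checked against the files in the actual download; the function name `read_rte_pairs` is hypothetical.

```python
import xml.etree.ElementTree as ET


def read_rte_pairs(xml_text):
    """Return one dict per <pair> element found in an RTE-style XML string.

    Assumed layout (verify against the real files):
      <pair id="..." entailment="..."><t>text</t><h>hypothesis</h></pair>
    """
    root = ET.fromstring(xml_text)
    pairs = []
    for pair in root.iter("pair"):
        pairs.append({
            "id": pair.get("id"),
            "entailment": pair.get("entailment"),
            # findtext returns None when the child is missing, hence the fallback
            "text": (pair.findtext("t") or "").strip(),
            "hypothesis": (pair.findtext("h") or "").strip(),
        })
    return pairs
```

To read a file rather than a string, the same loop can be run over `ET.parse(path).getroot()`.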
The data are not included in the general
release of Penn Discourse Treebank Version 2.0, but are freely
available for download from the catalog page.
New Publications
(1) Chinese-English Biology and Chemistry Abstract Parallel Text
was developed by The MITRE Corporation. It
consists of parallel sentences from a collection of chemistry and
biology-related scientific article abstracts published in Mandarin
and translated into English by translators with particular
expertise in the technical area. Translators were instructed to
err on the side of literal translation if required, but to
maintain the technical writing style of the source and make the
resulting English as natural as possible. The translators were
given specific guidelines for translation, and those are included
in this distribution.
This release contains 2,239 lines of parallel
Mandarin and English, with a total of 156,445 characters of
Mandarin and 75,515 words of English, presented in a separate
UTF-8 plain text file for each language. The sentences were
translated in sequential order but are presented in scrambled order;
lines at identical line numbers in the two files are translations
of each other. For example, the 31st line of the English file is a
translation of the 31st line of the Mandarin file. The original
line sequence is not provided.
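Because the alignment is purely by line number, the two files can be paired with a few lines of Python. The following is a minimal sketch, not part of the distribution; the function name `read_parallel` and any file names passed to it are hypothetical, and it assumes only what is stated above (two UTF-8 plain text files with one sentence per line).

```python
def read_lines(path):
    """Read a UTF-8 text file and return its lines without line endings."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def read_parallel(mandarin_path, english_path):
    """Pair line i of the Mandarin file with line i of the English file."""
    zh_lines = read_lines(mandarin_path)
    en_lines = read_lines(english_path)
    # A length mismatch means the line-number alignment cannot be trusted.
    if len(zh_lines) != len(en_lines):
        raise ValueError(
            f"line counts differ: {len(zh_lines)} vs {len(en_lines)}"
        )
    return list(zip(zh_lines, en_lines))
```

With the returned list, `pairs[30]` would hold the 31st Mandarin line and its English translation, since Python lists are indexed from zero.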
Chinese-English Biology and Chemistry Abstract
Parallel Text is distributed via web download. 2013 Subscription Members will automatically
receive two copies of this data on disc. 2013 Standard Members
may request a copy as part of their 16 free membership corpora.
*
(2) GALE Phase 2 Arabic Web Parallel Text was developed by LDC. Along
with other corpora, the parallel text in this release comprised
training data for Phase 2 of the DARPA GALE (Global Autonomous
Language Exploitation) Program. This corpus contains Modern
Standard Arabic source text and corresponding English translations
selected from web data collected in 2007 by LDC and transcribed by
LDC or under its direction. GALE Phase 2 Arabic Web Parallel Text
includes 60 source-translation document pairs, comprising 42,089
words of Arabic source text and its English translation. Data was
drawn from various Arabic weblog and newsgroup sources.
The files in this release were transcribed by
LDC staff and/or transcription vendors under contract to LDC in
accordance with the Quick Rich Transcription guidelines developed
by LDC. Transcribers
indicated sentence boundaries in addition to transcribing the
text. Data was manually selected for translation according to
several criteria, including linguistic features, transcription
features and topic features. The transcribed and segmented files
were then reformatted into a human-readable translation format and
assigned to translation vendors. Translators followed LDC's Arabic
to English translation guidelines.
Bilingual LDC staff performed
quality control procedures on the completed translations. Source
data and translations are distributed in TDF format. TDF files are
tab-delimited files containing one segment of text along with meta
information about that segment.
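A tab-delimited file of this kind can be loaded with plain string splitting. The sketch below makes no assumption about the number, order, or names of the TDF columns, which are not described here; it simply returns each non-empty line as a list of fields. The function name `read_tab_delimited` is hypothetical, and any header or comment lines a real TDF file may contain are returned as ordinary rows for the caller to handle.

```python
def read_tab_delimited(path):
    """Return each non-empty line of a UTF-8 file as a list of tab-separated fields."""
    rows = []
    # newline="" keeps line endings untouched so they can be stripped explicitly.
    with open(path, encoding="utf-8", newline="") as f:
        for line in f:
            line = line.rstrip("\r\n")
            if not line:
                continue  # skip blank lines
            rows.append(line.split("\t"))
    return rows
```

Plain `str.split` is used instead of the `csv` module so that quotation marks inside the segment text are kept as-is rather than being interpreted as field quoting.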
GALE Phase 2 Arabic Web Parallel Text is
distributed via web download. 2013 Subscription Members will automatically
receive two copies of this data on disc. 2013 Standard Members
may request a copy as part of their 16 free membership corpora.