Spring 2014 LDC Data Scholarship Program - deadline approaching
LDC to close for Winter Break
New publications:
The deadline for the Spring 2014 LDC Data
Scholarship Program is right around the corner. Student
applications are being accepted now through January 15, 2014,
11:59PM EST. The LDC Data Scholarship program provides university
students with access to LDC data at no cost. This program is open
to students pursuing both undergraduate and graduate studies in an
accredited college or university. LDC Data Scholarships are not
restricted to any particular field of study; however, students
must demonstrate a well-developed research agenda and a bona fide
inability to pay.
Students will need to complete an application which consists of a data use proposal and letter of support from their adviser. For further information on application materials and program rules, please visit the LDC Data Scholarship page.
Students can email their applications to the LDC Data Scholarships program. Decisions will be sent by email from the same address.
LDC will be closed from Wednesday, December 25,
2013 through Wednesday, January 1, 2014 in accordance with the
University of Pennsylvania Winter Break Policy. Our offices will
reopen on Thursday, January 2, 2014. Requests received for
membership renewals and corpora during the Winter Break will be
processed at that time.
Best wishes for a happy holiday season!
New publications
GALE
Chinese-English Word Alignment and Tagging -- Broadcast Training
Part 1 was developed by LDC and contains 179,842 tokens of
word aligned Chinese and English parallel text enriched with
linguistic tags. This material was used as training data in the DARPA
GALE (Global Autonomous Language Exploitation) program.
Some approaches to statistical machine
translation incorporate linguistic knowledge into
word-aligned text to improve automatic word alignment
and machine translation quality. This is accomplished with two
annotation schemes: alignment and tagging. Alignment identifies
minimum translation units and translation relations by using
minimum-match and attachment annotation approaches. The tagging
scheme defines a set of word tags and alignment link tags to
describe these translation units and relations. Tagging adds
contextual, syntactic and language-specific features to the
alignment annotation.
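As a rough illustration of how alignment and tagging fit together, the sketch below models a link as a pair of token index sets plus a link tag, and reports tokens left uncovered by any link. It is not the corpus file format: the class name, the `LINK_A`/`LINK_B` labels and the sample sentence pair are hypothetical placeholders chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignmentLink:
    """One translation unit: source tokens linked to target tokens."""
    source: tuple   # 0-based indices of Chinese tokens
    target: tuple   # 0-based indices of English tokens
    link_tag: str   # hypothetical label, not an actual tag from the release


def unmatched(n_tokens, links, side):
    """Return indices on `side` ("source" or "target") covered by no link."""
    covered = {i for link in links for i in getattr(link, side)}
    return [i for i in range(n_tokens) if i not in covered]


# Made-up sentence pair for illustration only.
chinese = ["我", "的", "书"]
english = ["my", "book"]

links = [
    # 的 is attached to 我 so that the pair forms one unit linked to "my".
    AlignmentLink(source=(0, 1), target=(0,), link_tag="LINK_A"),
    AlignmentLink(source=(2,), target=(1,), link_tag="LINK_B"),
]

print(unmatched(len(chinese), links, "source"))
print(unmatched(len(english), links, "target"))
```

Keeping the link tag on the link itself, rather than on individual tokens, mirrors the distinction drawn above between alignment link tags and word tags.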
This release consists of Chinese source
broadcast conversation (BC) and broadcast news (BN) programming
collected by LDC in 2005 - 2007.
The Chinese word alignment tasks consisted of
the following components:
- Identifying, aligning, and tagging 8 different types of links
- Identifying, attaching, and tagging local-level unmatched words
- Identifying and tagging sentence/discourse-level unmatched words
- Identifying and tagging all instances of Chinese 的(DE) except when they were a part of a semantic link.
GALE Chinese-English Word Alignment and Tagging
-- Broadcast Training Part 1 is distributed via web download. 2013 Subscription Members will automatically
receive two copies of this data on disc. 2013 Standard Members may
request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.
*
Maninkakan
Lexicon was developed by LDC and contains 5,834 entries of
the Maninkakan language presented as a Maninkakan-English lexicon
and a Maninkakan-French lexicon. It is the second publication in
an ongoing LDC project to build an electronic dictionary of
four Mandekan languages: Mawukakan, Maninkakan, Bambara and Jula.
These are Eastern Manding languages in the Mande Group of the
Niger-Congo language family. LDC released a Mawukakan Lexicon (LDC2005L01)
in 2005.
More information about LDC’s work in the
languages of West Africa and the challenges those languages
present for language resource development can be found here.
Maninkakan is written using Latin script,
Arabic script and the NKo alphabet.
This lexicon is presented using a Latin-based transcription system
because the Latin alphabet is familiar to the majority of Mandekan
language speakers and because it is expected to facilitate the
work of researchers interested in this resource.
The dictionary is provided in two formats,
Toolbox and XML. Toolbox is a
version of the widely used SIL Shoebox
program adapted to display Unicode. The Toolbox files are
provided in two fonts, Arial and Doulos SIL. The Arial files
should display using the Arial font, which is standard on most
operating systems. Doulos
SIL, available as a free download, is a robust font that
should display all characters without issue. Users should launch
Toolbox using the *.prj files in the Arial or Doulous_SIL folders.
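For readers who would rather process the Toolbox files in a script than open them in Toolbox, the following is a minimal sketch of reading backslash-marker ("standard format") text into records. It assumes each entry begins with an `\lx` marker and uses `\ps` and `\ge` for part of speech and English gloss, following the common MDF convention; the markers actually used in this release may differ, and the sample entries are invented placeholders rather than Maninkakan data.

```python
def parse_sfm(text, record_marker="lx"):
    """Split standard-format text into records of (marker, value) pairs.

    Assumes a new record starts at each line beginning with
    "\\" + record_marker; lines before the first record are ignored.
    """
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if not line.startswith("\\"):
            # Continuation of the previous field's value.
            if current:
                marker, value = current[-1]
                current[-1] = (marker, (value + " " + line).strip())
            continue
        marker, _, value = line[1:].partition(" ")
        if marker == record_marker:
            current = []
            records.append(current)
        if current is not None:
            current.append((marker, value.strip()))
    return records


# Invented placeholder entries, not taken from the lexicon.
sample = r"""
\lx word1
\ps n
\ge gloss-one

\lx word2
\ge gloss-two
"""

for record in parse_sfm(sample):
    print(dict(record))
```

The XML version of the dictionary can be read with any standard XML parser instead, so this approach is only needed when working from the Toolbox files directly.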
Maninkakan Lexicon is distributed via web
download. 2013 Subscription Members will automatically
receive two copies of this data on disc. 2013 Standard Members may
request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
The ARRAU
(Anaphora Resolution and Underspecification) Corpus of Anaphoric
Information was developed by the University of Essex and
the University of Trento. It
contains annotations of multi-genre English texts for anaphoric
relations with information about agreement and explicit
representation of multiple antecedents for ambiguous anaphoric
expressions and discourse antecedents for expressions which refer
to abstract entities such as events, actions and plans.
The source texts in this release include
task-oriented dialogues from the TRAINS-91
and TRAINS-93
corpora (the latter released through LDC, TRAINS Spoken Dialog
Corpus LDC95S25), narratives from the English Pear Stories,
articles from the Wall Street Journal portions of the Penn Treebank (Treebank-2
LDC95T7) and the RST Discourse Treebank LDC2002T07, and the
Vieira/Poesio Corpus which consists of training and test files
from Treebank-2 and RST Discourse Treebank.
The texts were annotated using the ARRAU
guidelines which treat all noun phrases (NPs) as markables.
Different semantic roles are recognized by distinguishing between
referring expressions (that update or refer to a discourse model),
and non-referring ones (including expletives, predicative
expressions, quantifiers, and coordination). A variety of
linguistic features were also annotated, including morphosyntactic
agreement, grammatical function, semantic type (person, animate,
concrete, action, time, other abstract) and genericity. The
annotation was carried out using the MMAX2 annotation tool
which allows text units to be marked at different levels.
The files in MMAX format have been organized so
that they can be visualized using the MMAX2 tool or directly used
as input/output for the BART
toolkit which performs automatic coreference resolution
including all necessary preprocessing steps.
The ARRAU Corpus of Anaphoric Information is
distributed via web download.
2013 Subscription Members will automatically
receive two copies of this data on disc. 2013 Standard Members may
request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.