Two New LDC Podcasts for your Listening Pleasure
New publications:
The deadline for
the Spring 2013 LDC Data Scholarship Program is one month away! Student
applications are being accepted now through January 15, 2013, 11:59PM
EST. The LDC Data Scholarship program provides university students with
access to LDC data at no cost. This program is open to students pursuing
both undergraduate and graduate studies in an accredited college or university.
LDC Data Scholarships are not restricted to any particular field of study;
however, students must demonstrate a well-developed research agenda and a bona
fide inability to pay.
Students will need to complete an application which consists of a data use proposal and letter of support from their adviser. For further information on application materials and program rules, please visit the LDC Data Scholarship page.
Students can
email their applications to the LDC Data Scholarship program.
Decisions will be sent by email from the same address.
Two new podcasts
are available on the LDC
blog, continuing the 20th Anniversary
series. The first features Natalia Bragilevskaya, LDC’s Business Administrator;
Ilya Ahtaridis, Membership Coordinator; and Marian Reed, Marketing Coordinator.
They recall the early days of LDC and describe the growth of sponsored projects
work and LDC’s interactions with its membership.
Click
here for Natalia, Ilya and
Marian’s podcast.
The third podcast
in the series introduces the community to two LDC researchers, Yiwola
Awoyale and Moussa Bamba, whose work focuses on West African languages.
Yiwola has been
teaching Linguistics, Yoruba language studies and various aspects of African
linguistics since 1975. At LDC, he developed the Global Yoruba Lexical Database, a set of related dictionaries based on Yoruba
and its diaspora. Moussa’s work in the Manding languages of the Niger-Congo
family has resulted in the release of the Mawukakan Lexicon, to be
followed by similar resources for Maninkakan, Bambara, and Jula.
In their podcast,
Yiwola and Moussa discuss how they came to LDC, their current research
and how it benefits multiple communities. Click here for Yiwola and
Moussa’s podcast.
Other podcasts
will be published via the LDC blog, so stay tuned to that space.
The developers of the Penn Discourse
Treebank Version 2.0 LDC2008T05 (PDTB) have
updated this release to add metadata to the Wall Street Journal (WSJ) news
stories in the corpus. The goal is to aid in understanding PDTB files as texts and
to support distinguishing texts from different genres within the WSJ.
The metadata includes the following fields:
- DD: the date the article appeared in the WSJ
- AN: unique identifier for the article
- HL: the column name (for regular features such as Who's News, Marketing & Media, Technology), its headline and by-line
- SO: the source of the article
- IN: manually-assigned codes or keywords for the article
- CO: manually-assigned codes for companies or other organizations
- DATELINE: normally the location where the article was filed, but sometimes has very unexpected contents
- GV: branch of government or government agency mentioned in the article
- SBREAKS: the byte position of section breaks present in the file
- ARTICLEBREAK: separates files that contain more than one article
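As a rough illustration of how byte positions such as those in SBREAKS could be used, the Python sketch below splits a file's raw bytes at a list of offsets. It is not taken from the PDTB distribution: the function name `split_at_byte_offsets` is hypothetical, and how the offsets are stored in the updated files is not described here, so the sketch simply assumes they have already been read into a list of integers.

```python
from typing import List


def split_at_byte_offsets(data: bytes, offsets: List[int]) -> List[bytes]:
    """Split raw file bytes into sections at the given byte positions.

    Assumption: each offset marks the first byte of a new section.
    Offsets outside the open interval (0, len(data)) are ignored.
    """
    inner = sorted(o for o in set(offsets) if 0 < o < len(data))
    bounds = [0] + inner + [len(data)]
    # Pair each boundary with the next one to get (start, end) slices.
    return [data[start:end] for start, end in zip(bounds, bounds[1:])]
```

The sketch works on bytes rather than decoded text because byte positions and character positions differ whenever a file contains multi-byte characters.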
All new downloads of PDTB will contain
the complete updated corpus. Current PDTB licensees can re-download the
file to obtain the updated data.
LDC will be closed from Monday,
December 24, 2012 through Tuesday, January 1, 2013 in accordance with the
University of Pennsylvania Winter Break Policy. Our offices will reopen
on Wednesday, January 2, 2013. Requests received for membership renewals
and corpora during the Winter Break will be processed at that time.
Best wishes for a happy and safe holiday season!
New publications
(1) GALE Chinese-English Word Alignment
and Tagging Training Part 3 -- Web was developed by LDC and contains
154,541 tokens of word aligned Chinese and English parallel text enriched with
linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language
Exploitation) program.
Some approaches
to statistical machine translation include the incorporation of linguistic
knowledge in word aligned text as a means to improve automatic word alignment
and machine translation quality. This is accomplished with two annotation
schemes: alignment and tagging. Alignment identifies minimum translation units
and translation relations by using minimum-match and attachment annotation
approaches. A set of word tags and alignment link tags are designed in the
tagging scheme to describe these translation units and relations. Tagging adds
contextual, syntactic and language-specific features to the alignment annotation.
GALE
Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web
(LDC2012T16) and GALE Chinese-English Word Alignment and Tagging Training Part
2 -- Newswire (LDC2012T20) are also available through LDC.
This release consists of Chinese source web data (newsgroup, weblog) collected by LDC in 2008 and 2009. The distribution by words, character tokens and segments appears below:
Language: Chinese
Files: 1249
Words: 103027
CharTokens: 154541
Segments: 4842
Note that all token counts are based on the Chinese data only. One token is
equivalent to one character and one word is equivalent to 1.5 characters.
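The relationship between the character token and word counts can be checked with a few lines of Python. This is only a sketch of the arithmetic stated above: the function name `estimate_words` is hypothetical, and truncating the result to a whole number is an assumption, chosen because it reproduces the word count in the table.

```python
CHARS_PER_WORD = 1.5  # ratio stated in the release notes


def estimate_words(char_tokens: int) -> int:
    """Estimate the word count from the character token count.

    Assumption: the fractional part is truncated, not rounded up.
    """
    return int(char_tokens / CHARS_PER_WORD)


print(estimate_words(154541))  # 103027, the word count listed above
```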
The Chinese word
alignment tasks consisted of the following components:
- Identifying, aligning, and tagging 8 different types of links
- Identifying, attaching, and tagging local-level unmatched words
- Identifying and tagging sentence/discourse-level unmatched words
- Identifying and tagging all instances of Chinese 的(DE) except when they were a part of a semantic link.
GALE
Chinese-English Word Alignment and Tagging Training Part 3 -- Web is
distributed via web download. 2012 Subscription
Members will automatically receive two copies of this data on disc. 2012
Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for US$1750.
*
(2) Russian-English Computer Security
Parallel Text was developed by The MITRE Corporation. It
consists of parallel sentences from a set of computer security reports
published in Russian and translated into English by translators with particular
expertise in the technical area. Translators were instructed to err on the side
of literal translation if required, but to maintain the technical writing style
of the source and to make the resulting English as natural as possible. The
translators followed specific guidelines for translation, and those are
included in this distribution.
There are 6,276
lines of parallel Russian and English, with a total of 60,059 words of Russian
and 76,437 words of English, presented in a separate UTF-8 plain text file for
each language. The sentences were translated in sequential order and presented
in a scrambled order, such that parallel sentences at identical line numbers
are translations. For example, the 31st line of the English file is a
translation of the 31st line of the Russian file. The original line sequence is
not provided. 1,694 untranslated lines (such as code snippets) are included as
a separate file.
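Because the two files are aligned by line number, sentence pairs can be recovered by reading both files and pairing lines at the same position. The Python sketch below shows one way to do this. The function names are hypothetical, the file paths are supplied by the caller because the actual file names in the distribution are not given here, and the line-count check is an added safeguard rather than part of the release.

```python
from typing import List, Tuple


def read_lines(path: str) -> List[str]:
    """Read a UTF-8 text file and return its lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def load_parallel(russian_path: str, english_path: str) -> List[Tuple[str, str]]:
    """Pair the Russian and English sentences found at identical line numbers."""
    russian = read_lines(russian_path)
    english = read_lines(english_path)
    if len(russian) != len(english):
        # Line-aligned files must have the same number of lines.
        raise ValueError(
            f"line counts differ: {len(russian)} Russian vs {len(english)} English"
        )
    return list(zip(russian, english))
```

Python lists are zero-indexed, so the pair at index 30 of the returned list corresponds to the 31st line of each file.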
Russian-English
Computer Security Parallel Text is distributed via web download. 2012 Subscription
Members will automatically receive two copies of this data on disc. 2012
Standard Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for US$1500.