New publications:
LDC2012T20
LDC2012T18
Fall 2012 LDC Data Scholarship Recipients
LDC is pleased to announce the student recipients
of the Fall 2012 LDC Data Scholarship program! This program
provides university and college students with access to LDC data
at no cost. Students were asked to complete an application that
consisted of a proposal describing their intended use of the
data, as well as a letter of support from their thesis adviser.
We received many solid applications and have chosen six proposals to support. The
following students will receive no-cost copies of LDC data:
Jaffar Atwan - National University of Malaysia (Malaysia), PhD candidate, Information Science and Technology. Jaffar has been awarded a copy of Arabic Newswire Part 1 (LDC2001T55) for his work in information retrieval.
Sarath Chandar - Indian Institute of Technology, Madras (India), MS candidate, Computer Science and Engineering. Sarath has been awarded a copy of Treebank-3 (LDC99T42) for his work in grammar induction.
Kuruvachan K. George - Amrita Vishwa Vidyapeetham (India), PhD candidate, Electrical and Computer Engineering. Kuruvachan has been awarded a copy of Fisher English Part 2 (LDC2005S13/T19) and 2008 NIST Speaker Recognition Evaluation data (LDC2011S05/07/08/11) for his work in speaker recognition.
Eduardo Motta - Pontifícia Universidade Católica do Rio de Janeiro (Brazil), PhD candidate, Information Sciences. Eduardo has been awarded a copy of English Web Treebank (LDC2012T13) for his work in machine learning.
Genevieve Sapijaszko - University of Central Florida (USA), PhD candidate, Electrical and Computer Engineering. Genevieve has been awarded a copy of TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1) and YOHO Speaker Verification (LDC94S16) for her work in digital signal processing.
John Steinberg - Temple University (USA), MS candidate, Electrical and Computer Engineering. John has been awarded a copy of CALLHOME Mandarin Chinese Lexicon (LDC96L15) and CALLHOME Mandarin Chinese Transcripts (LDC96T16) for his work in speech recognition.
LDC will be exhibiting at the 41st New Ways of
Analyzing Variation Conference (NWAV 41) in late
October. This marks the fifth time that LDC has been an NWAV
exhibitor and we are proud to show our continued support of the
sociolinguistic research community.
The conference runs from October 25-28 and the exhibition
hall will be open from October 26-28, 2012. Please stop by
to say hello!
In early September, LDC hosted a workshop
entitled “The Future of Language Resources” in celebration of our 20th anniversary. Visit
the Program
page to browse speaker abstracts and to access PDFs of the
presentations. Thanks to
the speakers and attendees for making the workshop a success!
To further celebrate our 20th Anniversary, LDC is
conducting interviews of
long-time staff members for their unique perspectives on the
Consortium’s growth and evolution over the past two decades. The
first interview podcast debuts this month and features Dave
Graff, LDC’s Lead Programmer. Visit the LDC blog to access the podcast.
Other podcasts will be
published via the LDC
blog, so stay tuned to that space.
The Language Resource Wiki
catalogs data, software, descriptive grammars and other resources
for a variety of languages but especially those with a paucity of
generally available resources for research. LDC is actively
seeking editors knowledgeable in these and other languages to
develop and maintain the pages, which are readable by anyone but
writable only by editors. The wiki currently has resource listings
for: Bengali, Berber, Breton, Ewe, Greek (Ancient), Indonesian,
Hindi, Latin, Panjabi, Pashto, Sorani (Central Kurdish), Russian,
Tagalog, Tamil, and Urdu, and for the following Sign Languages:
American, British, Catalan, Dutch, Flemish, German, Japanese, New
Zealand, Polish, Spanish, and Swiss German.
New publications
(1) GALE Chinese-English Word Alignment and Tagging Training Part 2 -- Newswire
was developed by LDC and contains 169,080 tokens of word-aligned
Chinese and English parallel text enriched with linguistic tags.
This material was used as training data in the DARPA
GALE (Global Autonomous Language Exploitation) program.
Some approaches to statistical machine translation
incorporate linguistic knowledge into word-aligned
text as a means to improve automatic word alignment and
machine translation quality. This is accomplished with two
annotation schemes: alignment and tagging. Alignment identifies
minimum translation units and translation relations by using
minimum-match and attachment annotation approaches. The tagging
scheme defines a set of word tags and alignment link tags
to describe these translation units and relations.
Tagging adds contextual, syntactic and language-specific
features to the alignment annotation.
The Chinese word alignment tasks consisted of the following components:
Identifying, aligning, and tagging 8 different types of links
Identifying, attaching, and tagging local-level unmatched words
Identifying and tagging sentence/discourse-level unmatched words
Identifying and tagging all instances of Chinese 的 (DE) except when they were a part of a semantic link.
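For readers who want a concrete picture of the link and unmatched-word notions described above, here is a minimal Python sketch. It is not the corpus file format or LDC's tooling: the `AlignmentLink` class, the `unaligned` helper, the tag name `HYPOTHETICAL_LINK_TAG`, and the toy sentence pair are all assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AlignmentLink:
    """One tagged link between source and target token positions (hypothetical structure)."""
    source: Tuple[int, ...]   # 0-based positions in the Chinese sentence
    target: Tuple[int, ...]   # 0-based positions in the English sentence
    link_tag: str             # placeholder label, not a tag from the real scheme


def unaligned(n_tokens: int, links: List[AlignmentLink], side: str) -> List[int]:
    """Return the token positions on one side that no link covers."""
    covered = set()
    for link in links:
        covered.update(link.source if side == "source" else link.target)
    return [i for i in range(n_tokens) if i not in covered]


# Toy sentence pair, invented for this sketch.
zh_tokens = ["我", "的", "书"]
en_tokens = ["my", "book"]

links = [
    AlignmentLink(source=(0,), target=(0,), link_tag="HYPOTHETICAL_LINK_TAG"),
    AlignmentLink(source=(2,), target=(1,), link_tag="HYPOTHETICAL_LINK_TAG"),
]

zh_unmatched = unaligned(len(zh_tokens), links, "source")
en_unmatched = unaligned(len(en_tokens), links, "target")
print([zh_tokens[i] for i in zh_unmatched])
```

In this toy example the only uncovered Chinese token is 的, the kind of word that the components listed above would either attach to a neighboring link or tag as unmatched.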
GALE Chinese-English Word Alignment and Tagging
Training Part 2 -- Newswire is distributed via web download. 2012 Subscription Members will automatically
receive two copies of this data on disc. 2012 Standard Members
may request a copy as part of their 16 free membership corpora.
*
(2) GALE Phase 2 Arabic Broadcast News Parallel Text was developed by
LDC. Along with other corpora, the parallel text in this
release comprised training data for Phase 2 of the DARPA GALE
(Global Autonomous Language Exploitation) Program. This corpus
contains Modern Standard Arabic source text and corresponding
English translations selected from broadcast news (BN) data
collected by LDC between 2005 and 2007 and transcribed by LDC or
under its direction.
GALE Phase 2 Arabic Broadcast News Parallel Text
includes seven source-translation pairs, comprising 29,210 words
of Arabic source text and its English translation. Data is drawn
from six distinct Arabic programs broadcast between 2005 and
2007 from Abu Dhabi TV, based in Abu Dhabi, United Arab
Emirates; Al Alam News Channel, based in Iran; Aljazeera, a
regional broadcast programmer based in Doha, Qatar; Dubai TV,
based in Dubai, United Arab Emirates; and Kuwait TV, a national
television station based in Kuwait. The BN programming in this
release focuses on current events topics.
The files in this release were transcribed by LDC
staff and/or transcription vendors under contract to LDC in
accordance with the Quick Rich Transcription guidelines developed by LDC. Transcribers
indicated sentence boundaries in addition to transcribing the
text. Data was manually selected for translation according to
several criteria, including linguistic features, transcription
features and topic features. The transcribed and segmented files
were then reformatted into a human-readable translation format
and assigned to translation vendors. Translators followed LDC's
Arabic-to-English translation guidelines. Bilingual LDC staff
performed quality control procedures on the completed
translations.
GALE Phase 2 Arabic Broadcast News Parallel Text is
distributed via web download.
2012 Subscription Members will automatically receive two copies
of this data on disc. 2012 Standard Members may request a copy
as part of their 16 free membership corpora.