LDC’s 20th Anniversary: Concluding a Year of Celebration
New publications:
1993-2007 United Nations Parallel Text
GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web
LDC’s 20th Anniversary: Concluding a Year of Celebration
We’ve enjoyed celebrating our 20th Anniversary this last year (April 2012 - March 2013) and would like to review some highlights before its close.
Our 2012 User Survey, circulated early in 2012, included a special Anniversary section in which respondents were asked to reflect on their opinions of, and dealings with, LDC over the years. We were humbled by the response. Multiple users mentioned that they would not be able to conduct their research without LDC and its data. For a full list of survey testimonials, please click here.
LDC also developed its first-ever timeline (initially published in the April 2012 Newsletter) marking significant milestones in the consortium’s founding and growth.
In September, we hosted a 20th Anniversary Workshop that brought together many friends and collaborators to discuss the present and future of language resources.
Throughout the year, we conducted several interviews of long-time LDC staff members to document their unique recollections of LDC history and to solicit their opinions on the future of the Consortium. These interviews are available as podcasts on the LDC Blog.
As our Anniversary year draws to a close, one task remains – to thank all of LDC’s past, present and future members and other friends of the Consortium for their loyalty and for their contributions to the community. LDC would not exist if not for its supporters. The variety of relationships that LDC has built over the years is a direct reflection of the vitality, strength and diversity of the community. We thank you all and hope that we continue to serve your needs in our third decade and beyond.
For a last treat, please visit LDC’s newly-launched YouTube channel to enjoy this video montage of the LDC staff interviews featured in the podcast series.
Thank you again for your continued support!
New publications
(1) 1993-2007 United Nations Parallel Text was developed by Google Research. It consists of United Nations (UN) parliamentary documents from 1993 through 2007 in the official languages of the UN: Arabic, Chinese, English, French, Russian, and Spanish.
UN parliamentary documents are available from the UN Official Document System (UN ODS). UN ODS, in its main UNDOC database, contains the full text of all types of UN parliamentary documents. It has complete coverage dating from 1993 and variable coverage before that. Documents exist in one or more of the official languages of the UN: Arabic, Chinese, English, French, Russian, and Spanish. UN ODS also contains a large number of German documents, marked with the language "other", but these are not included in this dataset.
LDC has also released parallel UN parliamentary documents in English, French and Spanish spanning the period 1988-1993 as UN Parallel Text (Complete) (LDC94T4A).
The data is presented as raw text and word-aligned text. There are 673,670 raw text documents and 520,283 word-aligned documents. The raw text is very close to what was extracted from the original word processing documents in UN ODS (e.g., Word, WordPerfect, PDF), converted to UTF-8 encoding. The word-aligned text was normalized, tokenized, aligned at the sentence level, further broken into sub-sentential chunk pairs, and then aligned at the word level. The sentence, chunk, and word alignment operations were performed separately for each individual language pair.
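To give a sense of what "each individual language pair" implies, the short sketch below enumerates the pairs that can be formed from the six official languages. It assumes unordered pairs (English-French counted once, not twice); the release description does not say whether alignment was directional, nor whether every possible pair is present in the data, so treat the count as an upper bound on unordered pairs rather than a statement about the corpus contents.

```python
from itertools import combinations

# The six official UN languages named in the corpus description.
UN_LANGUAGES = ["Arabic", "Chinese", "English", "French", "Russian", "Spanish"]

# Assumption: pairs are unordered, so (English, French) and
# (French, English) are treated as the same pair.
language_pairs = list(combinations(UN_LANGUAGES, 2))

for first, second in language_pairs:
    print(f"{first}-{second}")

print(len(language_pairs), "unordered language pairs")
```

If alignment were instead run in both directions, the number of runs would double.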
1993-2007 United Nations Parallel Text is distributed on three DVD-ROMs. 2013 Subscription Members will automatically receive two copies of this data provided they have completed the UN Parallel Text Corpus User Agreement. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(2) GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web was developed by LDC and contains 158,387 tokens of word-aligned Chinese and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.
Some approaches to statistical machine translation incorporate linguistic knowledge in word-aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. The tagging scheme defines a set of word tags and alignment link tags to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.
This release consists of Chinese source web data (newsgroup, weblog) collected by LDC between 2005 and 2010. The distribution by words, character tokens and segments appears below:
Language | Files | Words | CharTokens | Segments
Chinese | 1,224 | 105,591 | 158,387 | 4,836
Note that all token counts are based on the Chinese data only. One token is equivalent to one character and one word is equivalent to 1.5 characters.
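The relationship between the two counts can be checked with a few lines of arithmetic. The sketch below takes the character-token figure from the table and applies the stated 1.5 characters-per-word conversion; the variable names are illustrative, and the rounding step is an assumption, since the text does not say how the word count was rounded.

```python
# Figures reported for the Chinese data in this release.
char_tokens = 158_387
reported_words = 105_591

# Conversion stated in the text: one word is equivalent to 1.5 characters.
CHARS_PER_WORD = 1.5

estimated_words = char_tokens / CHARS_PER_WORD

# Assumption: the published word count is the estimate rounded to an integer.
print(round(estimated_words) == reported_words)
```

The word count therefore appears to be derived from the character count, not from a separate word segmentation pass.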
The Chinese word alignment tasks consisted of the following components:
Identifying, aligning, and tagging 8 different types of links
Identifying, attaching, and tagging local-level unmatched words
Identifying and tagging sentence/discourse-level unmatched words
Identifying and tagging all instances of Chinese 的 (DE) except when they were a part of a semantic link
GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web is distributed via web download. 2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.