Friday, March 15, 2013

LDC March 2013 Newsletter

LDC’s 20th Anniversary: Concluding a Year of Celebration

New publications:
1993-2007 United Nations Parallel Text
GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web



LDC’s 20th Anniversary: Concluding a Year of Celebration

We’ve enjoyed celebrating our 20th Anniversary this last year (April 2012 - March 2013) and would like to review some highlights before its close.

Our 2012 User Survey, circulated early in 2012, included a special Anniversary section in which respondents were asked to reflect on their opinions of, and dealings with, LDC over the years. We were humbled by the response. Multiple users mentioned that they would not be able to conduct their research without LDC and its data. For a full list of survey testimonials, please click here.

LDC also developed its first-ever timeline (initially published in the April 2012 Newsletter) marking significant milestones in the consortium’s founding and growth.

In September, we hosted a 20th Anniversary Workshop that brought together many friends and collaborators to discuss the present and future of language resources.

Throughout the year, we conducted several interviews of long-time LDC staff members to document their unique recollections of LDC history and to solicit their opinions on the future of the Consortium. These interviews are available as podcasts on the LDC Blog.

As our Anniversary year draws to a close, one task remains – to thank all of LDC’s past, present and future members and other friends of the Consortium for their loyalty and for their contributions to the community. LDC would not exist if not for its supporters.  The variety of relationships that LDC has built over the years is a direct reflection of the vitality, strength and diversity of the community. We thank you all and hope that we continue to serve your needs in our third decade and beyond.


For a last treat, please visit LDC’s newly-launched YouTube channel to enjoy this video montage of the LDC staff interviews featured in the podcast series.

Thank you again for your continued support!

New publications

(1) 1993-2007 United Nations Parallel Text was developed by Google Research. It consists of United Nations (UN) parliamentary documents from 1993 through 2007 in the official languages of the UN: Arabic, Chinese, English, French, Russian, and Spanish. 

UN parliamentary documents are available from the UN Official Document System (UN ODS). UN ODS, in its main UNDOC database, contains the full text of all types of UN parliamentary documents. It has complete coverage dating from 1993 and variable coverage before that. Documents exist in one or more of the official languages of the UN: Arabic, Chinese, English, French, Russian, and Spanish. UN ODS also contains a large number of German documents, marked with the language "other", but these are not included in this dataset.

LDC has also released parallel UN parliamentary documents in English, French and Spanish spanning the period 1988-1993 as UN Parallel Text (Complete) (LDC94T4A).

The data is presented as raw text and word-aligned text. There are 673,670 raw text documents and 520,283 word-aligned documents. The raw text is very close to what was extracted from the original word processing documents in UN ODS (e.g., Word, WordPerfect, PDF), converted to UTF-8 encoding. The word-aligned text was normalized, tokenized, aligned at the sentence level, further broken into sub-sentential chunk pairs, and then aligned at the word level. The sentence, chunk, and word alignment operations were performed separately for each individual language pair.

1993-2007 United Nations Parallel Text is distributed on three DVD-ROMs. 2013 Subscription Members will automatically receive two copies of this data provided they have completed the UN Parallel Text Corpus User Agreement. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web was developed by LDC and contains 158,387 tokens of word-aligned Chinese and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.

Some approaches to statistical machine translation incorporate linguistic knowledge in word-aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. In the tagging scheme, a set of word tags and alignment link tags is designed to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.

This release consists of Chinese source web data (newsgroup, weblog) collected by LDC between 2005 and 2010. The distribution by words, character tokens and segments appears below:

Language    Files    Words      CharTokens    Segments
Chinese     1,224    105,591    158,387       4,836

Note that all token counts are based on the Chinese data only. One token is equivalent to one character and one word is equivalent to 1.5 characters.
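The relationship between the two counts can be checked with a short calculation. The following is a minimal sketch in Python; the function name and constant are hypothetical and chosen for illustration only, and it assumes the word count is simply the character-token count divided by 1.5 and rounded to the nearest whole number.

```python
# Hypothetical helper for illustration; not part of the corpus tooling.
CHARS_PER_WORD = 1.5  # ratio stated in the note above


def estimate_words(char_tokens: int, chars_per_word: float = CHARS_PER_WORD) -> int:
    """Estimate a word count from a character-token count."""
    return round(char_tokens / chars_per_word)


# 158,387 character tokens is the figure reported for this release.
print(estimate_words(158_387))  # 105591
```

Under that assumption, the result agrees with the 105,591 words reported in the table.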

The Chinese word alignment tasks consisted of the following components: 

Identifying, aligning, and tagging 8 different types of links
Identifying, attaching, and tagging local-level unmatched words
Identifying and tagging sentence/discourse-level unmatched words
Identifying and tagging all instances of Chinese (DE) except when they were a part of a semantic link.

GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web is distributed via web download. 2013 Subscription Members will automatically receive two copies of this data on disc.  2013 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.
