Linguistic Data Consortium: March 2013

Friday, March 15, 2013

LDC March 2013 Newsletter

LDC’s 20th Anniversary: Concluding a Year of Celebration

New publications:

1993-2007 United Nations Parallel Text

GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web

LDC’s 20th Anniversary: Concluding a Year of Celebration

We’ve enjoyed celebrating our 20th Anniversary over the past year (April 2012 - March 2013) and would like to review some highlights before it closes.

Our 2012 User Survey, circulated early in 2012, included a special Anniversary section in which respondents were asked to reflect on their opinions of, and dealings with, LDC over the years. We were humbled by the response. Multiple users mentioned that they would not be able to conduct their research without LDC and its data. For a full list of survey testimonials, see the 2012 User Survey Testimonials post on the LDC Blog.

LDC also developed its first-ever timeline (initially published in the April 2012 Newsletter) marking significant milestones in the consortium’s founding and growth.

In September, we hosted a 20th Anniversary Workshop that brought together many friends and collaborators to discuss the present and future of language resources.

Throughout the year, we conducted several interviews with long-time LDC staff members to document their unique recollections of LDC history and to solicit their opinions on the future of the Consortium. These interviews are available as podcasts on the LDC Blog.

As our Anniversary year draws to a close, one task remains – to thank all of LDC’s past, present and future members and other friends of the Consortium for their loyalty and for their contributions to the community. LDC would not exist if not for its supporters. The variety of relationships that LDC has built over the years is a direct reflection of the vitality, strength and diversity of the community. We thank you all and hope that we continue to serve your needs in our third decade and beyond.

For a last treat, please visit LDC’s newly launched YouTube channel to enjoy a video montage of the LDC staff interviews featured in the podcast series.

Thank you again for your continued support!

New publications

(1) 1993-2007 United Nations Parallel Text was developed by Google Research. It consists of United Nations (UN) parliamentary documents from 1993 through 2007 in the official languages of the UN: Arabic, Chinese, English, French, Russian, and Spanish.

UN parliamentary documents are available from the UN Official Document System (UN ODS). UN ODS, in its main UNDOC database, contains the full text of all types of UN parliamentary documents. It has complete coverage dating from 1993 and variable coverage before that. Documents exist in one or more of the official languages of the UN: Arabic, Chinese, English, French, Russian, and Spanish. UN ODS also contains a large number of German documents, marked with the language “other”, but these are not included in this dataset.

LDC has previously released parallel UN parliamentary documents in English, French and Spanish spanning the period 1988-1993 as UN Parallel Text (Complete) (LDC94T4A).

The data is presented as raw text and word-aligned text. There are 673,670 raw text documents and 520,283 word-aligned documents. The raw text is very close to what was extracted from the original word processing documents in UN ODS (e.g., Word, WordPerfect, PDF), converted to UTF-8 encoding. The word-aligned text was normalized, tokenized, aligned at the sentence level, further broken into sub-sentential chunk pairs, and then aligned at the word level. The sentence, chunk, and word alignment operations were performed separately for each individual language pair.

1993-2007 United Nations Parallel Text is distributed on three DVD-ROMs. 2013 Subscription Members will automatically receive two copies of this data provided they have completed the UN Parallel Text Corpus User Agreement. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web was developed by LDC and contains 158,387 tokens of word-aligned Chinese and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.

Some approaches to statistical machine translation incorporate linguistic knowledge in word-aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. The tagging scheme defines a set of word tags and alignment link tags to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.

This release consists of Chinese source web data (newsgroup, weblog) collected by LDC between 2005 and 2010. The distribution by words, character tokens and segments appears below:

Language	Files	Words	CharTokens	Segments
Chinese	1,224	105,591	158,387	4,836

Note that all token counts are based on the Chinese data only. One token is equivalent to one character and one word is equivalent to 1.5 characters.
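As a quick check of the note above, the word count in the table can be reproduced from the character-token count using the stated ratio of 1.5 characters per word. The following is only a minimal sketch: the function name is hypothetical, and truncating the result to a whole number is an assumption, since the release does not say how the figure was rounded.

```python
def estimate_words(char_tokens, chars_per_word=1.5):
    """Estimate a Chinese word count from a character-token count.

    Uses the ratio stated in the release notes (one word = 1.5 characters).
    Truncating to an integer is an assumption made for this sketch.
    """
    return int(char_tokens / chars_per_word)


# Character-token count taken from the table in this release description.
print(estimate_words(158_387))
```

Under these assumptions, the result agrees with the 105,591 words listed in the table.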

The Chinese word alignment tasks consisted of the following components:

Identifying, aligning, and tagging 8 different types of links

Identifying, attaching, and tagging local-level unmatched words

Identifying and tagging sentence/discourse-level unmatched words

Identifying and tagging all instances of Chinese 的(DE) except when they were a part of a semantic link.

GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web is distributed via web download. 2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Thursday, March 7, 2013

LDC Timeline: 1992 - 2012

LDC Timeline – Two Decades of Milestones

April 15, 2012 marked the “official” 20th anniversary of LDC’s founding. As our Anniversary year draws to a close, LDC would like to share with the blogging community a brief timeline of some significant milestones.

1992: The University of Pennsylvania is chosen as the host site for LDC in response to a call for proposals issued by DARPA; the mission of the new consortium is to operate as a specialized data publisher and archive guaranteeing widespread, long-term availability of language resources. DARPA provides seed money with the stipulation that LDC become self-sustaining within five years. Mark Liberman assumes duties as LDC’s Director with a staff that grows to four, including Jack Godfrey, the Consortium’s first Executive Director.
1993: LDC’s catalog debuts. Early releases include benchmark data sets such as TIMIT, TIPSTER, CSR and Switchboard, shortly followed by the Penn Treebank.
1994: LDC and NIST (the National Institute of Standards and Technology) enter into a Cooperative R&D Agreement that provides the framework for the continued collaboration between the two organizations.
1995: Collection of conversational telephone speech and broadcast programming and transcription commences. LDC begins its long and continued support for NIST common task evaluations by providing custom data sets for participants. Membership and data license fees prove sufficient to support LDC operations, satisfying the requirement that the Consortium be self-sustaining.
1996: The Lexicon Development Project, under the direction of Dr. Cynthia McLemore, begins releasing pronouncing lexicons in Mandarin, German, Egyptian Colloquial Arabic, Spanish, Japanese and American English. By 1997, all are published.
1997: LDC announces LDC Online, a searchable index of newswire and speech data with associated tools to compute n-gram models, mutual information and other analyses.
1998: LDC adds annotation to its task portfolio. Christopher Cieri joins LDC as Executive Director and develops the annotation operation.
1999: Steven Bird joins LDC; the organization begins to develop tools and best practices for general use. The Annotation Graph Toolkit results from this effort.
2000: LDC expands its support of common task evaluations from providing corpora to coordinating language resources across the program. Early examples include the DARPA TIDES, EARS and GALE programs.
2001: The Arabic treebank project begins.
2002: LDC moves to its current facilities at 3600 Market Street, Philadelphia with a full-time staff of approximately 40 persons.
2004: LDC introduces the Standard and Subscription membership options, allowing members to choose whether to receive all or a subset of the data sets released in a membership year.
2005: LDC makes task specifications and guidelines available through its projects web pages.
2008: LDC introduces programs that provide discounts for continuing members and those who renew early in the year.
2010: LDC inaugurates the Data Scholarship program for students with a demonstrable need for data.
2012: LDC’s full-time staff of 50 and 196 part-time staff support ongoing projects and operations which include collecting, developing and archiving data, data annotation, tool development, sponsored-project support and multiple collaborations with various partners. The general catalog contains over 500 holdings in more than 50 languages. Over 85,000 copies of more than 1,300 titles have been distributed to over 3,200 organizations in 70 countries.

2012 User Survey Testimonials

As LDC's 20th Anniversary Year draws to a close, we would like to take this opportunity to share a few more Anniversary year activities with you.

In early 2012, LDC circulated a user survey to recent members and data licensees. Part of this survey focused on our then-forthcoming Anniversary year and asked if respondents would provide anonymous testimonials supporting LDC. We are happy to report that many respondents took part, and you may browse a selection of their comments below. Many humored LDC by playing along with the suggestion to describe the Consortium in one word or to compare LDC to a color, fruit or animal. LDC was humbled by the outpouring of support and would like to again thank all of our members and the entire community for their continued support of the Consortium.

2012 LDC User Survey Testimonials

· If LDC did not exist, it would have to be invented. It provides critical resources for the speech technology community.

· I wish that I were more ambitious and could use all of the datasets the LDC provides!

· I recently started as a new Assistant Professor in an undergraduate college with little access to research funds. The LDC staff bent over backward to allow me access to the materials I needed without the budget of a research university.

· Thanks for the good work.

· Keep on publishing.

· Timely and competent follow-up from LDC staff regarding any queries or problems

· I like LDC because they are very professional, very responsive, charge reasonable fees and have very friendly and helpful personnel

· Researchers in public institutions need organizations like LDC.

· Happy birthday LDC. Keep up the good work!

· Congratulations for your hard work, and for sharing tools with the world

· LDC is a great speech corpus provider for worldwide languages.

· LDC is best of breed in providers of high quality curated textual data, including some very large data sets.

· LDC is a great resource for researchers - keeping up with the times with new databases each year.

· (Organization name withheld) would like to extend sincere greetings to the LDC and to its great team, and a sincere "THANK YOU" for the wonderful service you have provided. May we celebrate your 100th anniversary!

· I like LDC because they provide good service at a reasonable price for academic institutions.

· There's no data like more data, and LDC is where it's at.

· I like LDC because it relieves us from troublesome negotiations with each provider of language resources.

· LDC is great. If it were a color, it would be teal (very hip).

· Blue as the sea because it helps researcher irrigate their research ideas.

· I would like to consider LDC as watermelon for its skin is green which is the symbol of flourishing life, the pulp is red which is the symbol of hope and success and the black seed is the essence of cohesion. In all, for researchers, LDC is very essential.

· Fruit: pomegranate, single body, many multiple frutties

· Description of LDC in 7 words: many corpora of a very high quality

· Describe LDC in one word: Astronomical