Friday, November 16, 2012

LDC November 2012 Newsletter



Spring 2013 LDC Data Scholarship Program
Invitation to Join for Membership Year 2013
2012 User Survey Results
LDC to Close for Thanksgiving Break

New publications:
Annotated English Gigaword
Chinese-English Semiconductor Parallel Text
GALE Phase 2 Arabic Newswire Parallel Text

Spring 2013 LDC Data Scholarship Program
Applications are now being accepted through January 15, 2013, 11:59 PM EST for the Spring 2013 LDC Data Scholarship program! The LDC Data Scholarship program provides university students with access to LDC data at no cost. During previous program cycles, LDC has awarded no-cost copies of LDC data to over 25 individual students and student research groups.

This program is open to students pursuing undergraduate or graduate studies at an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay. The selection process is highly competitive.

The application consists of two parts:

(1)  Data Use Proposal. Applicants must submit a proposal describing their intended use of the data. The proposal should state which data the student plans to use and how the data will benefit their research project as well as information on the proposed methodology or algorithm.

Applicants should consult the LDC Corpus Catalog for a complete list of data distributed by LDC. Due to licensing restrictions, a handful of LDC corpora are available only to members of the Consortium. Applicants are advised to select at most one or two datasets; students may apply for additional datasets in a following cycle once they have completed processing of the initial datasets and have published or presented the work in a juried venue.

(2)  Letter of Support. Applicants must submit one letter of support from their thesis adviser or department chair. The letter must verify the student's need for data and confirm that the department or university lacks the funding to pay the full Non-member Fee for the data or to join the consortium.

For further information on application materials and program rules, please visit the LDC Data Scholarship page. Students can email their applications to the LDC Data Scholarship program. Decisions will be sent by email from the same address. The deadline for the Spring 2013 program cycle is January 15, 2013, 11:59 PM EST.

Invitation to Join for Membership Year 2013

Membership Year (MY) 2013 is open for joining! We invite all current and previous LDC members to renew their membership, and we welcome new organizations to join the consortium. For MY2013, LDC is pleased to maintain membership fees at last year’s rates – membership fees will not increase.
Additionally, LDC will extend discounts on membership fees to members who keep their membership current and to those who join early in the year.

The details of our early renewal discounts for MY2013 are as follows:

    ·  Organizations that joined for MY2012 will receive a 5% discount when renewing. This discount will apply throughout 2013, regardless of the time of renewal. MY2012 members renewing before March 1, 2013 will receive an additional 5% discount, for a total discount of 10% off the membership fee.

    ·  New members, as well as organizations that did not join for MY2012 but held membership in any of the previous MYs (1993-2011), will also be eligible for a 5% discount provided that they join or renew before March 1, 2013.
The following table provides exact pricing information.

 
                                MY2013 Fee     MY2013 Fee           MY2013 Fee
                                               with 5% Discount*    with 10% Discount**

Not-for-Profit / US Government
    Standard                    US$2,400       US$2,280             US$2,160
    Subscription                US$3,850       US$3,658             US$3,465

For-Profit
    Standard                    US$24,000      US$22,800            US$21,600
    Subscription                US$27,500      US$26,125            US$24,750

*  For MY2012 members renewing for MY2013 (at any time during 2013), and for new members or previous-year members (MY1993-2011) who join or renew before March 1, 2013

** For MY2012 members renewing before March 1, 2013


Publications for MY2013 are still being planned; here are the working titles of data sets we intend to provide:

-      Arabic Treebank – Weblog
-      Chinese-English Biomedical Parallel Text
-      GALE data – all phases and tasks
-      Hispanic-English Speech
-      Maninkakan Lexicon
-      OpenMT 2008-2012 Progress set

In addition to receiving new publications, current-year members of LDC also enjoy the benefit of licensing older data at reduced cost; current-year for-profit members may use most data for commercial applications.

This past year, LDC members who joined early or kept their membership current saved almost US$70,000 collectively on membership fees. Be sure to keep an eye on your mail - all previous and current LDC members will be sent an invitation-to-join letter and a renewal invoice for MY2013. Renew early for MY2013 to save today!

Why become an LDC member?

LDC is offering early renewal discounts on membership fees for Membership Year 2013, making now a good time to consider joining or renewing membership. LDC membership has the following advantages:

  • LDC membership provides cost-effective access to an extensive and growing catalog that spans 20 years and includes over 500 multilingual speech, text, and video resources. Even if your organization only needs a few datasets from a given membership year, membership is often the most economical way to obtain current corpora. Additionally, the generous discounts that member organizations receive on older corpora reduce the cost of acquiring such datasets.

  • All members enjoy unlimited use of LDC data within their organizations.  For universities, there is no difference in cost between a departmental membership and one that is university-wide. Departments can therefore combine resources and establish one LDC membership for use by the entire university community.  Likewise, for-profit members with multiple branches can maintain one membership for use by their entire organizations.

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations, including commercial restrictions, on the use of certain corpora. For a small group of corpora, commercial licenses must be obtained separately from the owners of the data.

2012 User Survey Results

 Earlier this year, LDC sent a survey to its user communities. Like previous iterations in 2006 and 2007, the survey solicited community input and suggestions on key LDC-related topics, including:

-      Satisfaction levels with LDC’s data, homepage and Catalog
-      Reflections on LDC’s 20th Anniversary year
-      Suggestions for future publications
-      Speculations on the future of HLT-related fields, specifically on mobile technologies, cloud computing, social networking and open data

Survey respondents were generally satisfied with LDC’s data, membership options, homepage and Catalog, though there were requests for additional data options and data acquisition methods. Some of the data respondents requested are already in our pipeline for the end of 2012 or for Membership Year (MY) 2013, so please be on the lookout for Publications updates.

Respondents were also very supportive of LDC’s 20th Anniversary, posting testimonials and well-wishes in the 20th Anniversary section.

LDC would like to thank all survey participants; they will receive access to the full survey results shortly.

LDC to Close for Thanksgiving Break

LDC will be closed on Thursday, November 22, 2012 and Friday, November 23, 2012 in observance of the US Thanksgiving Holiday.  Our offices will reopen on Monday, November 26, 2012.

New publications

(1) Annotated English Gigaword was developed by Johns Hopkins University's Human Language Technology Center of Excellence. It adds automatically generated syntactic and discourse structure annotation to English Gigaword Fifth Edition (LDC2011T07) and also contains an API and tools for reading the dataset's XML files. The goal of the annotation is to provide a standardized corpus for knowledge extraction and distributional semantics, enabling broader researcher involvement in large-scale knowledge-acquisition efforts.

Annotated English Gigaword contains the nearly ten million documents (over four billion words) of the original English Gigaword Fifth Edition from seven news sources:

  • Agence France-Presse, English Service (afp_eng)
  • Associated Press Worldstream, English Service (apw_eng)
  • Central News Agency of Taiwan, English Service (cna_eng)
  • Los Angeles Times/Washington Post Newswire Service (ltw_eng)
  • Washington Post/Bloomberg Newswire Service (wpb_eng)
  • New York Times Newswire Service (nyt_eng)
  • Xinhua News Agency, English Service (xin_eng)

The following layers of annotation were added:

  • Tokenized and segmented sentences
  • Treebank-style constituent parse trees
  • Syntactic dependency trees
  • Named entities
  • In-document coreference chains

The annotation was performed in a three-step process: (1) the data was preprocessed and sentences selected for annotation (sentences with more than 100 tokens were excluded); (2) syntactic parses were derived; and (3) the parsed output was post-processed to derive syntactic dependencies, named entities and coreference chains. Over 183 million sentences were parsed.
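The release includes its own API and tools for reading the XML files. Independently of those tools, the Treebank-style constituent parses in the annotation can be inspected with any Penn Treebank-style reader. Below is a minimal sketch using NLTK (which is not part of this distribution) on an illustrative bracketed parse string, not one taken from the corpus:

    # A minimal sketch, assuming NLTK is installed; the bracketed parse below is
    # illustrative only. Use the corpus's bundled API/tools to pull actual parses
    # out of the annotated XML files.
    from nltk import Tree

    parse = Tree.fromstring(
        "(S (NP (NNP Xinhua)) (VP (VBD reported) (NP (DT the) (NN story))))"
    )
    print(parse.leaves())   # ['Xinhua', 'reported', 'the', 'story']
    print(parse.pos())      # (token, part-of-speech) pairs
    parse.pretty_print()    # ASCII rendering of the constituent tree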

Annotated English Gigaword is distributed on one hard drive. 2012 Subscription Members will automatically receive one copy of this data on hard drive.  2012 Standard Members may request a copy as part of their 16 free membership corpora. 2011 Members who licensed English Gigaword Fifth Edition (LDC2011T07) may request a no-cost copy of Annotated English Gigaword. Non-member organizations who licensed English Gigaword Fifth Edition may request a copy of Annotated English Gigaword for the US$200 media fee.

*

(2) Chinese-English Semiconductor Parallel Text was developed by The MITRE Corporation. It consists of parallel sentences from a collection of abstracts from scientific articles on semiconductors published in Mandarin and translated into English by translators with particular expertise in the technical area. Translators were instructed to err on the side of literal translation where necessary, but to maintain the technical writing style of the source and to make the resulting English as natural as possible. The translators followed specific guidelines for translation, which are included in this distribution.

There are 2,169 lines of parallel Mandarin and English, totaling 125,302 characters of Mandarin and 64,851 words of English, presented in a separate UTF-8 plain text file for each language. The sentences were translated in their original sequential order but are presented in a scrambled order that is identical across the two files, so that sentences at the same line number are translations of one another. For example, the 31st line of the English file is a translation of the 31st line of the Mandarin file. The original line sequence is not provided.
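Because the two files are line-aligned, pairing the translations is straightforward. The sketch below assumes filenames such as semiconductor.cmn.txt and semiconductor.eng.txt; the actual names in the release may differ:

    # Minimal sketch: pair line-aligned Mandarin/English sentences.
    # The filenames are assumptions; substitute the names used in the release.
    from itertools import zip_longest

    def read_parallel(cmn_path="semiconductor.cmn.txt", eng_path="semiconductor.eng.txt"):
        with open(cmn_path, encoding="utf-8") as cmn, open(eng_path, encoding="utf-8") as eng:
            for cmn_line, eng_line in zip_longest(cmn, eng):
                if cmn_line is None or eng_line is None:
                    raise ValueError("files differ in length; line alignment is broken")
                yield cmn_line.strip(), eng_line.strip()

    pairs = list(read_parallel())
    print(len(pairs))   # expected: 2,169
    print(pairs[30])    # the 31st Mandarin sentence and its English translation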

Chinese-English Semiconductor Parallel Text is distributed via web download.
2012 Subscription Members will automatically receive two copies of this data on disc. 2012 Standard Members may request a copy as part of their 16 free membership corpora.
*

(3) GALE Phase 2 Arabic Newswire Parallel Text was developed by LDC.  Along with other corpora, the parallel text in this release comprised training data for Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Modern Standard Arabic source text and corresponding English translations selected from newswire data collected in 2007 by LDC and transcribed by LDC or under its direction.

GALE Phase 2 Arabic Newswire Parallel Text includes 400 source-translation pairs, comprising 181,704 tokens of Arabic source text and its English translation. Data is drawn from six distinct Arabic newswire sources: Al Ahram, Al Hayat, Al-Quds Al-Arabi, An Nahar, Asharq Al-Awsat and Assabah.

The files in this release were transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with the Quick Rich Transcription guidelines developed by LDC. Transcribers indicated sentence boundaries in addition to transcribing the text. Data was manually selected for translation according to several criteria, including linguistic features, transcription features and topic features. The transcribed and segmented files were then reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Arabic to English translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations.

GALE Phase 2 Arabic Newswire Parallel Text is distributed via web download.
2012 Subscription Members will automatically receive two copies of this data on disc. 2012 Standard Members may request a copy as part of their 16 free membership corpora. 


Thursday, October 18, 2012

LDC October 2012 Newsletter


New publications:
LDC2012T20
LDC2012T18




Fall 2012 LDC Data Scholarship Recipients
LDC is pleased to announce the student recipients of the Fall 2012 LDC Data Scholarship program! This program provides university and college students with access to LDC data at no cost. Students were asked to complete an application consisting of a proposal describing their intended use of the data and a letter of support from their thesis adviser. We received many solid applications and have chosen six proposals to support. The following students will receive no-cost copies of LDC data:
Jaffar Atwan - National University of Malaysia (Malaysia), PhD candidate, Information Science and Technology. Jaffar has been awarded a copy of Arabic Newswire Part 1 (LDC2001T55) for his work in information retrieval.

Sarath Chandar - Indian Institute of Technology, Madras (India), MS candidate, Computer Science and Engineering. Sarath has been awarded a copy of Treebank-3 (LDC99T42) for his work in grammar induction.

Kuruvachan K. George - Amrita Vishwa Vidyapeetham (India), PhD candidate, Electrical and Computer Engineering. Kuruvachan has been awarded a copy of Fisher English Part 2 (LDC2005S13/T19) and 2008 NIST Speaker Recognition Evaluation data (LDC2011S05/07/08/11) for his work in speaker recognition.

Eduardo Motta - Pontifícia Universidade Católica do Rio de Janeiro (Brazil), PhD candidate, Information Sciences. Eduardo has been awarded a copy of English Web Treebank (LDC2012T13) for his work in machine learning.

Genevieve Sapijaszko - University of Central Florida (USA), PhD candidate, Electrical and Computer Engineering. Genevieve has been awarded a copy of the TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1) and YOHO Speaker Verification (LDC94S16) for her work in digital signal processing.

John Steinberg - Temple University (USA), MS candidate, Electrical and Computer Engineering. John has been awarded a copy of CALLHOME Mandarin Chinese Lexicon (LDC96L15) and CALLHOME Mandarin Chinese Transcripts (LDC96T16) for his work in speech recognition.
LDC Exhibiting at NWAV 41
LDC will be exhibiting at the 41st New Ways of Analyzing Variation Conference (NWAV 41) in late October. This marks the fifth time that LDC has been an NWAV exhibitor and we are proud to show our continued support of the sociolinguistic research community.
The conference runs from October 25-28 and the exhibition hall will be open from October 26-28, 2012. Please stop by to say hello!

LDC 20th Anniversary Workshop Wrap-up
In early September, LDC hosted a workshop entitled “The Future of Language Resources” in celebration of our 20th anniversary. Visit the Program page to browse speaker abstracts and to access PDFs of the presentations. Thanks to the speakers and attendees for making the workshop a success!

LDC 20th Anniversary Podcasts
To further celebrate our 20th Anniversary, LDC is conducting interviews with long-time staff members for their unique perspectives on the Consortium’s growth and evolution over the past two decades. The first interview podcast debuts this month and features Dave Graff, LDC’s Lead Programmer. Visit the LDC blog to access the podcast.
Additional podcasts will be published via the LDC blog, so stay tuned to that space.

Language Resource Wiki
The Language Resource Wiki catalogs data, software, descriptive grammars and other resources for a variety of languages, especially those with a paucity of generally available research resources. The wiki currently has resource listings for Bengali, Berber, Breton, Ewe, Greek (Ancient), Indonesian, Hindi, Latin, Panjabi, Pashto, Sorani (Central Kurdish), Russian, Tagalog, Tamil, and Urdu, and for the following Sign Languages: American, British, Catalan, Dutch, Flemish, German, Japanese, New Zealand, Polish, Spanish, and Swiss German. LDC is actively seeking editors knowledgeable in these and other languages to develop and maintain the pages, which are readable by anyone but writable only by editors.

New publications
(1) GALE Chinese-English Word Alignment and Tagging Training Part 2 -- Newswire was developed by LDC and contains 169,080 tokens of word aligned Chinese and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program. 
Some approaches to statistical machine translation include the incorporation of linguistic knowledge in word aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. A set of word tags and alignment link tags are designed in the tagging scheme to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.
The Chinese word alignment tasks consisted of the following components:
-Identifying, aligning, and tagging 8 different types of links
-Identifying, attaching, and tagging local-level unmatched words
-Identifying and tagging sentence/discourse-level unmatched words
-Identifying and tagging all instances of Chinese 的 (DE) except when they were a part of a semantic link.
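As a rough illustration of the kind of record this annotation yields, one might represent a single aligned unit as follows; the field names below are hypothetical and do not reflect the corpus's actual file format:

    # Hypothetical sketch of an alignment-link record; field and tag names are
    # illustrative only, not the corpus's actual representation.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class AlignmentLink:
        link_type: str                  # one of the 8 link types (e.g., a semantic link)
        zh_token_ids: List[int]         # indices of the Chinese tokens in the unit
        en_token_ids: List[int]         # indices of the English tokens in the unit
        word_tags: List[str] = field(default_factory=list)  # contextual/syntactic word tags
        link_tag: str = ""              # tag describing the translation relation

    # One Chinese token aligned to two English tokens:
    link = AlignmentLink("semantic", zh_token_ids=[3], en_token_ids=[4, 5], link_tag="SEM")
    print(link)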
GALE Chinese-English Word Alignment and Tagging Training Part 2 -- Newswire is distributed via web download. 2012 Subscription Members will automatically receive two copies of this data on disc. 2012 Standard Members may request a copy as part of their 16 free membership corpora. 
*

(2) GALE Phase 2 Arabic Broadcast News Parallel Text was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Modern Standard Arabic source text and corresponding English translations selected from broadcast news (BN) data collected by LDC between 2005 and 2007 and transcribed by LDC or under its direction.
GALE Phase 2 Arabic Broadcast News Parallel Text includes seven source-translation pairs, comprising 29,210 words of Arabic source text and its English translation. Data is drawn from six distinct Arabic programs broadcast between 2005 and 2007 by Abu Dhabi TV, based in Abu Dhabi, United Arab Emirates; Al Alam News Channel, based in Iran; Aljazeera, a regional broadcaster based in Doha, Qatar; Dubai TV, based in Dubai, United Arab Emirates; and Kuwait TV, a national television station based in Kuwait. The BN programming in this release focuses on current events topics.
The files in this release were transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with the Quick Rich Transcription guidelines developed by LDC. Transcribers indicated sentence boundaries in addition to transcribing the text. Data was manually selected for translation according to several criteria, including linguistic features, transcription features and topic features. The transcribed and segmented files were then reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Arabic to English translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations.
GALE Phase 2 Arabic Broadcast News Parallel Text is distributed via web download. 2012 Subscription Members will automatically receive two copies of this data on disc. 2012 Standard Members may request a copy as part of their 16 free membership corpora. 

Thursday, October 11, 2012

LDC 20th Anniversary Podcasts: David Graff

As part of our 20th Anniversary celebrations, LDC is conducting interviews of long-time staff members for their unique perspectives on the Consortium's growth and evolution over the past two decades and for some insights into the future. We expect to make these interviews available as audio, video and text. The interviews are conducted by John Vogel, LDC part-time staffer, musician and video artist.

We begin with a series of podcasts. The first podcast features David Graff, LDC's Lead Programmer. Dave has been at LDC since its first days as a small organization occupying one of the many offices in University of Pennsylvania's Williams Hall. Dave has been involved in many aspects of LDC's work over the years; he currently designs tools that support corpus creation, annotation and quality assessment and has a direct role in the production of most LDC publications.

We hope you enjoy Dave's reflections on life at LDC.

Click here for Dave's podcast.

Monday, September 17, 2012

LDC September 2012 Newsletter

New publications
LDC2012T16
LDC2012T15



The Future of Language Resources: LDC 20th Anniversary Workshop Summary 
Thanks to the members, friends and staff who made our 20th Anniversary Workshop (September 6-7) a fruitful and fun experience. The speakers – from academia, industry and government – engaged participants and provoked discussion with their talks about the ways in which language resources contribute to research in language-related fields and other disciplines and with their insights into the future. The result was much food for thought as we enter our third decade.
Visit the workshop page for the proceedings and to learn more about the event.
English Treebanking at LDC
As part of our 20th anniversary celebration, the coming newsletters will include features that provide an overview of the broad range of LDC’s activities. This month, we'll examine English treebanking efforts at LDC. The English treebanking team is led by Ann Bies, Senior Research Coordinator. The association of treebanks with LDC began with the publication of the original Penn English Treebank (Treebank-2) in 1995. Since that time the need for new varieties of English treebank data has continued to grow, and LDC has expanded its expertise to address new research challenges. This includes the development of treebanked data for additional domains, including conversational speech and web text, as well as the creation of parallel treebank data.
Speech data presents unique challenges not inherent in edited text, such as disfluencies and hesitations. Penn Treebank contains conversational speech data from the Switchboard telephone collection which has been tagged, disfluency-annotated, and parsed. LDC’s more recent publication, English CTS Treebank with Structural Metadata, builds on that annotation and includes new data. The development of that corpus was motivated by the need to have both structural metadata and syntactic structure annotated in order to support work on speech parsing and structural event detection. The annotation involved a two-pass approach to annotating metadata, speech effects and syntactic structure in transcribed conversational speech: separately annotating for structural metadata, or structural events, and for syntactic structure. The two annotations were then combined into a single aligned representation.
Also recently, LDC has undertaken complex syntactic annotation of data collected over the web. Since most parsers are trained using newswire, they achieve better accuracy on similar heavily edited texts. LDC, through a gift from Google Inc., developed English Web Treebank to improve parsing, translation and information extraction on unedited domains, such as blogs, newsgroups, and consumer reviews. LDC’s annotation guidelines were adapted to handle unique features of web text such as inconsistent punctuation and capitalization as well as the increased use of slang, technical jargon and ungrammatical sentences.
LDC and its research partners are also involved in the creation of parallel treebanks used for word alignment tasks. Parallel treebanks are annotated morphological and syntactic structures that are aligned at the sentence as well as sub-sentence levels. These resources are used for improving machine translation quality. To create such treebanks, English files (translated from the source Arabic or Chinese) are first automatically part-of-speech tagged and parsed and then hand-corrected at each stage. The quality control process consists of a series of specific searches for over 100 types of potential inconsistency and parser or annotation error. Parallel treebank data in the LDC catalog includes the English Translation Treebank: An Nahar Newswire, whose files are parallel with those in Arabic Treebank: Part 3 v 3.2.
English treebanking at LDC is ongoing; new titles are in progress and will be added to our catalog.


New publications
(1) GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web was developed by LDC and contains 150,068 tokens of word aligned Chinese and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program. This release consists of Chinese source newswire and web data (newsgroup, weblog) collected by LDC in 2008.
Some approaches to statistical machine translation include the incorporation of linguistic knowledge in word aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. A set of word tags and alignment link tags are designed in the tagging scheme to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation. 
The Chinese word alignment tasks consisted of the following components: 
-Identifying, aligning, and tagging 8 different types of links
-Identifying, attaching, and tagging local-level unmatched words
-Identifying and tagging sentence/discourse-level unmatched words
-Identifying and tagging all instances of Chinese 的 (DE) except when they were a part of a semantic link.
GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web is distributed via web download. 2012 Subscription Members will automatically receive two copies of this data on CD. 2012 Standard Members may request a copy as part of their 16 free membership corpora.  
*
(2) MADCAT Phase 1 Training Set contains all training data created by LDC to support Phase 1 of the DARPA MADCAT Program. The data in this release consists of handwritten Arabic documents scanned at high resolution and annotated for the physical coordinates of each line and token. Digital transcripts and English translations of each document are also provided, with the various content and annotation layers integrated in a single MADCAT XML output. 
The goal of the MADCAT program is to automatically convert foreign text images into English transcripts. MADCAT Phase 1 data was collected by LDC from Arabic source documents in three genres: newswire, weblog and newsgroup text. Arabic speaking "scribes" copied documents by hand, following specific instructions on writing style (fast, normal, careful), writing implement (pen, pencil) and paper (lined, unlined). Prior to assignment, source documents were processed to optimize their appearance for the handwriting task, which resulted in some original source documents being broken into multiple "pages" for handwriting. Each resulting handwritten page was assigned to up to five independent scribes, using different writing conditions. 
The handwritten, transcribed documents were checked for quality and completeness, then each page was scanned at high resolution (600 dpi, greyscale) to create a digital version of the handwritten document. The scanned images were then annotated to indicate the physical coordinates of each line and token. Explicit reading order was also labeled, along with any errors produced by the scribes when copying the text.
The final step was to produce a unified data format that takes multiple data streams and generates a single XML output file containing all required information. The resulting XML file has these distinct components: a text layer that consists of the source text, tokenization and sentence segmentation; an image layer that consists of bounding boxes; a scribe demographic layer that consists of scribe ID and partition (train/test); and a document metadata layer. This release includes 9,693 annotation files in MADCAT XML format (.madcat.xml) along with their corresponding scanned image files in TIFF format.
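As an illustration, the per-document .madcat.xml files can be walked with a standard XML parser. The element names in the sketch below are assumptions made for illustration; consult the documentation included in the release for the actual MADCAT XML schema:

    # Hedged sketch: iterate over MADCAT XML annotation files and count token
    # elements. The element name "token" and the directory layout are guesses;
    # check the release documentation for the real schema and paths.
    import glob
    import xml.etree.ElementTree as ET

    for path in glob.glob("data/**/*.madcat.xml", recursive=True):
        root = ET.parse(path).getroot()
        tokens = root.findall(".//token")   # token-level elements (name assumed)
        print(path, len(tokens))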
MADCAT Phase 1 Training Set is distributed on two DVD-ROMs. 2012 Subscription Members will automatically receive two copies of this data. 2012 Standard Members may request a copy as part of their 16 free membership corpora.