Thursday, August 16, 2012

LDC August 2012 Newsletter


New publications:

LDC2012T13 - English Web Treebank

LDC2012T14 - GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2

LDC2012T12 - Spanish TimeBank 1.0


Google Inc. and the Linguistic Data Consortium (LDC) have collaborated to develop new syntactically annotated language resources that enable computers to better understand human language. The project, funded through a gift from Google in 2010, has resulted in the development of the English Web Treebank LDC2012T13, containing over 250,000 words of weblogs, newsgroups, email, reviews and question-answers manually annotated for syntactic structure. This resource will allow language technology researchers to develop and evaluate the robustness of parsing methods in various new web domains. It was used in the 2012 shared task on parsing English web text for the First Workshop on Syntactic Analysis of Non-Canonical Language (SANCL), which took place at NAACL-HLT in Montreal on June 8, 2012. The English Web Treebank is available to the research community through LDC’s Catalog.

Natural language processing (NLP) is a field of computational linguistic research concerned with the interactions between human language and computers. Parsing is a discipline within NLP in which computers analyze text and determine its syntactic structure. While syntactic parsing is already practically useful, Google funded this effort to help the research community develop better parsers for web text. The web texts collected and annotated by LDC provide new, diverse data for training parsing systems.

Google chose LDC for this work based on the Consortium’s experience in developing and creating syntactic annotations, also known as treebanks. Treebanks are critically important to parsing research since they provide human-analyzed sentence structures that facilitate training and testing scenarios in NLP research. This work extends the existing relationship between LDC and Google.  LDC has published four other Google-developed data sets in the past six years: English, Chinese, Japanese and European language n-grams used principally for language modeling.
 
 
The Future of Language Resources: LDC 20th Anniversary Workshop
 
LDC’s 20th Anniversary Workshop is rapidly approaching! The event will take place on the University of Pennsylvania’s campus on September 6-7, 2012.
 
Workshop themes include:

- the developments in human language technologies (HLT) and the associated resources that have brought us to our current state;
- the language resources required by the technical approaches taken, and the impact of these resources on HLT progress;
- the applications of HLT and resources to other disciplines, including law, medicine, economics, political science and psychology;
- the impact of HLT and related technologies on linguistic analysis and on novel approaches in fields as widespread as phonetics, semantics, language documentation, sociolinguistics and dialect geography; and
- the impact of any of these developments on the ways in which language resources are created, shared and exploited, and on the specific resources required.
 
Please read more here.

Applications are now being accepted for the Fall 2012 LDC Data Scholarship program through September 17, 2012, 11:59PM EST! The LDC Data Scholarship program provides university students with access to LDC data at no cost. During previous program cycles, LDC has awarded no-cost copies of LDC data to over 20 individual students and student research groups.

This program is open to students pursuing undergraduate or graduate studies at an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay. The selection process is highly competitive.

The application consists of two parts:

(1) Data Use Proposal. Applicants must submit a proposal describing their intended use of the data. The proposal should state which data the student plans to use, explain how the data will benefit the research project, and include information on the proposed methodology or algorithm.

Applicants should consult the LDC Corpus Catalog for a complete list of data distributed by LDC. Note that a handful of LDC corpora are restricted to members of the Consortium. Applicants are advised to select at most one or two datasets; students may apply for additional datasets in a following cycle once they have completed processing of the initial datasets and published or presented the work in some juried venue.

(2) Letter of Support. Applicants must submit one letter of support from their thesis adviser or department chair. The letter must confirm that the department or university lacks the funding to pay the full Non-member Fee for the data and verify the student's need for the data.

For further information on application materials and program rules, please visit the LDC Data Scholarship page.

Students can email their applications to the LDC Data Scholarship program. Decisions will be sent by email from the same address.

The deadline for the Fall 2012 program cycle is September 17, 2012, 11:59PM EST.

Spotlight on HAVIC

As part of our 20th anniversary celebration, the coming newsletters will include features that provide an overview of the broad range of LDC’s activities. To begin, we'll examine the Heterogeneous Audio Visual Internet Collection (HAVIC), one of the many projects handled by LDC’s Collection/Annotation Group led by Senior Associate Director Stephanie Strassel.

Under the supervision of Senior Research Coordinator Amanda Morris, the HAVIC team is developing a large corpus of unconstrained multimedia data drawn from user-generated videos on the web and annotated for a variety of features. The HAVIC corpus has been designed with an eye toward providing increased challenges for both acoustic and video processing technologies, focusing on multi-dimensional variation inherent in user-generated content. Over the past three years the corpus has provided training, development and test data for the NIST TRECVID Multimedia Event Detection (MED) Evaluation Track, whose goal is to assemble core detection technologies into a system that can search multimedia recordings for user-defined events based on pre-computed metadata.

For each MED evaluation, LDC and NIST have collaborated to define many new events, such as “making a cake” or “assembling a shelter”. Each event requires an Event Kit, consisting of a textual description of the event’s properties along with a few exemplar videos depicting the event. A large team of LDC data scouts searches for videos that contain each event, along with videos that are only indirectly or superficially related to defined events, as well as background videos that are unrelated to any defined event. After finding suitable content, data scouts label each video for a variety of features, including the presence of audio, visual or text evidence that a particular event has occurred. This work is done using LDC’s AScout framework, consisting of a browser plug-in, a database backend and processing scripts that together permit data scouts to efficiently search for videos, annotate the multimedia content, and initiate download and post-processing of the data. Collected data is converted to MPEG-4 format with H.264 video encoding and AAC audio encoding, and the original video resolution and audio/video bitrates are retained.
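The conversion step above amounts to re-encoding each downloaded video into an MPEG-4 container with H.264 video and AAC audio. The exact flags used by LDC's processing scripts are not published; the sketch below builds an illustrative ffmpeg command under assumed codec choices (libx264 and the built-in aac encoder), with no scaling filter so the source resolution is preserved:

```python
def build_convert_command(src_path: str, dst_path: str) -> list[str]:
    """Build an illustrative ffmpeg command line that re-encodes a video
    to MPEG-4 with H.264 video and AAC audio. No scale filter is given,
    so ffmpeg keeps the source resolution; explicit bitrate flags would
    be added here to match source bitrates exactly."""
    return [
        "ffmpeg",
        "-i", src_path,     # input video as downloaded
        "-c:v", "libx264",  # H.264 video encoding
        "-c:a", "aac",      # AAC audio encoding
        "-f", "mp4",        # MPEG-4 container
        dst_path,
    ]

# Usage (hypothetical paths):
# subprocess.run(build_convert_command("raw/clip.flv", "mp4/clip.mp4"), check=True)
```

Keeping the command construction separate from its execution makes a batch pipeline of this kind easy to test and log.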

To date, LDC has collected and labeled well over 100,000 videos as part of the HAVIC Project, and the corpus will ultimately comprise thousands of hours of labeled data. Look for portions of the corpus to appear among LDC’s future releases.

New publications
(1) English Web Treebank was developed by the Linguistic Data Consortium (LDC) with funding through a gift from Google Inc. It consists of over 250,000 words of English weblogs, newsgroups, email, reviews and question-answers manually annotated for syntactic structure and is designed to allow language technology researchers to develop and evaluate the robustness of parsing methods in those web domains.

This release contains 254,830 word-level tokens and 16,624 sentence-level tokens of webtext in 1,174 files annotated for sentence- and word-level tokenization, part-of-speech, and syntactic structure. The data is roughly evenly divided across five genres: weblogs, newsgroups, email, reviews, and question-answers. The files were manually annotated following the sentence-level tokenization guidelines for web text and the word-level tokenization guidelines developed for English treebanks in the DARPA GALE project. Only text from the subject line and message body of posts, articles, messages and question-answers was collected and annotated.
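Treebank syntactic annotation of this kind is conventionally distributed as bracketed parse trees in the Penn Treebank style. As a rough illustration of what "annotated for syntactic structure" means in practice, the sketch below parses one such bracketed string into nested lists (the sample sentence is invented, not drawn from the corpus):

```python
def parse_tree(s: str):
    """Parse a Penn Treebank-style bracketed string, e.g.
    "(S (NP (DT the) (NN dog)) (VP (VBZ barks)))",
    into nested lists of the form [label, child, child, ...]."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()

    def helper(i):
        assert tokens[i] == "("          # every constituent opens with "("
        label = tokens[i + 1]            # phrase or part-of-speech label
        i += 2
        children = []
        while tokens[i] != ")":
            if tokens[i] == "(":         # nested constituent
                child, i = helper(i)
                children.append(child)
            else:                        # terminal word
                children.append(tokens[i])
                i += 1
        return [label] + children, i + 1

    tree, _ = helper(0)
    return tree
```

For example, `parse_tree("(S (NP (DT the) (NN dog)) (VP (VBZ barks)))")` yields `["S", ["NP", ["DT", "the"], ["NN", "dog"]], ["VP", ["VBZ", "barks"]]]`.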

English Web Treebank is distributed via web download. 2012 Subscription Members will receive two copies of this data on disc. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data by completing the LDC User Agreement for Non-members. The agreement can be faxed to +1 215 573 2175 or scanned and emailed to this address. The first fifty copies of this publication are being made available at no charge.
*

(2) GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2 was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Modern Standard Arabic source text and corresponding English translations selected from broadcast conversation (BC) data collected by LDC between 2004 and 2007 and transcribed by LDC or under its direction.

GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2 includes 29 source-translation document pairs, comprising 169,488 words of Arabic source text and its English translation. Data is drawn from eight distinct Arabic programs broadcast between 2004 and 2007 by Aljazeera, a regional broadcaster based in Doha, Qatar, and by Nile TV, an Egyptian broadcaster. The programs in this release focus on current events topics.

The files in this release were transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with the Quick Rich Transcription guidelines developed by LDC. Transcribers indicated sentence boundaries in addition to transcribing the text. Data was manually selected for translation according to several criteria, including linguistic features, transcription features and topic features. The transcribed and segmented files were then reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Arabic to English translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations.

GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2 is distributed via web download. 2012 Subscription Members will receive two copies of this data on disc. 2012 Standard Members may request a copy as part of their 16 free membership corpora.
*

(3) Spanish TimeBank 1.0 was developed by researchers at Barcelona Media and consists of Spanish texts in the AnCora corpus annotated with temporal and event information according to the TimeML specification language.

Spanish TimeBank 1.0 contains stand-off annotations for 210 documents with over 75,800 tokens (including punctuation marks) and 68,000 tokens (excluding punctuation). The source documents are news stories and fiction from the AnCora corpus.
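Stand-off annotation means the temporal and event labels are stored separately from the source text and refer into it by character offsets, leaving the text itself unmodified. A minimal sketch of the idea, with invented Spanish text, offsets and attribute values (the tag names EVENT and TIMEX3 come from the TimeML specification; the actual Spanish TimeBank file format is not reproduced here):

```python
# Source text stays untouched; annotations point into it by offset.
text = "Las elecciones se celebraron el lunes."
annotations = [
    {"tag": "EVENT",  "start": 18, "end": 28, "class": "OCCURRENCE"},
    {"tag": "TIMEX3", "start": 32, "end": 37, "type": "DATE"},
]

# Resolve each annotation to the span of text it labels.
for ann in annotations:
    span = text[ann["start"]:ann["end"]]
    print(ann["tag"], repr(span))
```

One advantage of this design, reflected in the AnCora note below, is that independently produced annotation layers over the same source text can be mapped to one another through the shared offsets.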

The AnCora corpus is the largest multilayer annotated corpus of Spanish and Catalan. AnCora contains 400,000 words in Spanish and 275,000 words in Catalan. The AnCora documents are annotated on many linguistic levels including structure, syntax, dependencies, semantics and pragmatics. That information is not included in this release, but it can be mapped to the present annotations. The corpus is freely available from the Centre de Llenguatge i Computació (CLiC).

Spanish TimeBank 1.0 is distributed by web download. 2012 Subscription Members will receive two copies of this data on disc. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data by completing the LDC User Agreement for Non-members. The agreement can be faxed to +1 215 573 2175 or scanned and emailed to this address. The publication is being made available at no charge.

LDC and Google Collaboration Results in New Syntactically-Annotated Language Resources



Philadelphia, PA; Mountain View, CA, August 16, 2012 (443 words)

Google Inc. (NASDAQ: GOOG) and the Linguistic Data Consortium (LDC) at the University of Pennsylvania have collaborated to develop new syntactically-annotated language resources that enable computers to better understand human language. The project, funded through a gift from Google in 2010, has resulted in the development of the English Web Treebank LDC2012T13, containing over 250,000 words of weblogs, newsgroups, email, reviews and question-answers manually annotated for syntactic structure. This resource will allow language technology researchers to develop and evaluate the robustness of parsing methods in various new web domains. It was used in the 2012 shared task on parsing English web text for the First Workshop on Syntactic Analysis of Non-Canonical Language (SANCL), https://sites.google.com/site/sancl2012/, which took place at NAACL-HLT in Montreal on June 8, 2012. The English Web Treebank is available to the research community through LDC’s Catalog, http://www.ldc.upenn.edu/Catalog/.
Natural language processing (NLP) is a field of computational linguistic research concerned with the interactions between human language and computers. Parsing is a discipline within NLP in which computers analyze text and determine its syntactic structure. While syntactic parsing is already practically useful, Google funded this effort to help the research community develop better parsers for web text. The web texts collected and annotated by LDC provide new, diverse data for training parsing systems.
Google chose LDC for this work based on the Consortium’s experience in developing and creating syntactic annotations, also known as treebanks. Treebanks are critically important to parsing research since they provide human-analyzed sentence structures that facilitate training and testing scenarios in NLP research. This work extends the existing relationship between LDC and Google.  LDC has published four other Google-developed data sets in the past six years: English, Chinese, Japanese and European language n-grams used principally for language modeling.
Google is an industry-leading multinational organization headquartered in Mountain View, CA that develops Internet-based services and products and whose research includes work on NLP technologies. LDC is hosted by the University of Pennsylvania and was founded in 1992 by LDC Director, Dr. Mark Y. Liberman, Christopher H. Browne Distinguished Professor of Linguistics at the University of Pennsylvania. LDC is a nonprofit consortium that produces and distributes linguistic resources to researchers, technology developers and universities around the globe. The Penn Treebank, developed at the University of Pennsylvania over 20 years ago, is distributed by LDC and continues to be an important resource for the NLP community. 
The Google collections, as well as all other LDC data publications, can be found in the LDC Catalog, www.ldc.upenn.edu/Catalog, which contains over 500 holdings.

-30-

Media Contact
Marian Reed
Marketing Coordinator
Linguistic Data Consortium
+1.215.898.2561