Linguistic Data Consortium: LDC and Google Collaboration Results in New Syntactically-Annotated Language Resources

Thursday, August 16, 2012

Philadelphia, PA; Mountain View, CA, August 16, 2012 (443 words)

Google Inc. (NASDAQ: GOOG) and the Linguistic Data Consortium (LDC) at the University of Pennsylvania have collaborated to develop new syntactically-annotated language resources that enable computers to better understand human language. The project, funded through a gift from Google in 2010, has produced the English Web Treebank (LDC2012T13), which contains over 250,000 words of weblogs, newsgroups, email, reviews and question-answers manually annotated for syntactic structure. This resource will allow language technology researchers to develop parsing methods and evaluate their robustness in various new web domains. It was used in the 2012 shared task on parsing English web text for the First Workshop on Syntactic Analysis of Non-Canonical Language (SANCL), https://sites.google.com/site/sancl2012/, which took place at NAACL-HLT in Montreal on June 8, 2012. The English Web Treebank is available to the research community through LDC’s Catalog, http://www.ldc.upenn.edu/Catalog/.

Natural language processing (NLP) is a field of computational linguistic research concerned with the interactions between human language and computers. Parsing is a discipline within NLP in which computers analyze text and determine its syntactic structure. While syntactic parsing is already practically useful, Google funded this effort to help the research community develop better parsers for web text. The web texts collected and annotated by LDC provide new, diverse data for training parsing systems.

Google chose LDC for this work based on the Consortium’s experience in developing syntactically annotated corpora, also known as treebanks. Treebanks are critically important to parsing research because they provide human-analyzed sentence structures for training and testing parsers. This work extends the existing relationship between LDC and Google. LDC has published four other Google-developed data sets in the past six years: English, Chinese, Japanese and European language n-grams used principally for language modeling.

Google is an industry-leading multinational organization headquartered in Mountain View, CA that develops Internet-based services and products and whose research includes work on NLP technologies. LDC is hosted by the University of Pennsylvania and was founded in 1992 by LDC Director Dr. Mark Y. Liberman, Christopher H. Browne Distinguished Professor of Linguistics at the University of Pennsylvania. LDC is a nonprofit consortium that produces and distributes linguistic resources to researchers, technology developers and universities around the globe. The Penn Treebank, developed at the University of Pennsylvania over 20 years ago, is distributed by LDC and continues to be an important resource for the NLP community.

The Google collections, as well as all other LDC data publications, can be found in the LDC Catalog, www.ldc.upenn.edu/Catalog, which contains over 500 holdings.

-30-

Media Contact

Marian Reed

Marketing Coordinator

Linguistic Data Consortium

mreed@ldc.upenn.edu

+1.215.898.2561
