Philadelphia, PA; Mountain View, CA, August 16, 2012 (443 words)
Google Inc. (NASDAQ: GOOG) and the
Linguistic Data Consortium (LDC) at the University of Pennsylvania have
collaborated to develop new syntactically-annotated language resources that
enable computers to better understand human language. The project, funded through
a gift from Google in 2010, has resulted in the development of the English Web
Treebank (LDC2012T13), which contains over 250,000 words of weblogs, newsgroups,
email, reviews and question-answers manually annotated for syntactic structure.
This resource will allow language technology researchers to develop and evaluate
the robustness of parsing methods in various new web domains. It was used in
the 2012 shared task on parsing English web text for the First Workshop on
Syntactic Analysis of Non-Canonical Language (SANCL), https://sites.google.com/site/sancl2012/,
which took place at NAACL-HLT in Montreal on June 8, 2012. The English Web
Treebank is available to the research community through LDC’s Catalog, http://www.ldc.upenn.edu/Catalog/.
Natural language processing (NLP) is a field of
computational linguistic research concerned with the interactions between human
language and computers. Parsing is a discipline within NLP in which computers
analyze text and determine its syntactic structure. While syntactic parsing is
already practically useful, Google funded this effort to help the research
community develop better parsers for web text. The web texts collected and
annotated by LDC provide new, diverse data for training parsing systems.
Google chose LDC for this work based on the Consortium’s experience in creating
syntactically annotated corpora, also known as treebanks. Treebanks are critically
important to parsing research since they provide human-analyzed sentence
structures that facilitate training and testing scenarios in NLP research. This
work extends the existing relationship between LDC and Google. LDC has published four other Google-developed
data sets in the past six years: English, Chinese, Japanese and European
language n-grams used principally for language modeling.
Google is an
industry-leading multinational organization headquartered in Mountain View, CA
that develops Internet-based services and products and whose research includes
work on NLP technologies. LDC is hosted by the University of Pennsylvania and
was founded in 1992 by LDC Director Dr. Mark Y. Liberman, Christopher H.
Browne Distinguished Professor of Linguistics at the University of Pennsylvania.
LDC is a nonprofit consortium that produces and distributes linguistic
resources to researchers, technology developers and universities around the
globe. The Penn Treebank, developed at the University of Pennsylvania over 20
years ago, is distributed by LDC and continues to be an important resource for
the NLP community.
The Google collections, as well as all other LDC data publications, can be found in the
LDC Catalog, http://www.ldc.upenn.edu/Catalog/,
which contains over 500 holdings.
-30-
Media Contact
Marian Reed
Marketing Coordinator
Linguistic Data Consortium
+1.215.898.2561