New Publications:
BOLT English PropBank and Sense – Discussion Forum, SMS/Chat and Conversational Telephone Speech
LORELEI Tigrinya Incident Language Pack
Chinese Lexical Resources for Gender, Number, Animacy
(1) BOLT English PropBank and Sense – Discussion Forum, SMS/Chat and
Conversational Telephone Speech was developed by the University of Colorado,
Boulder – CLEAR (Computational Language and Education Research) and consists
of PropBank and verb sense disambiguation annotation on English discussion
forum (DF), SMS/Chat, and conversational telephone speech data. Annotation was
applied to each predicate verb tree in LDC’s BOLT phrase structure treebanks.
PropBank annotation, which provides a layer of semantic annotation over the
treebank, was performed on all three genres. DF and SMS/Chat data were also
annotated for verb sense disambiguation using VerbNet 3.2 classes.
The DARPA BOLT
(Broad Operational Language Translation) program developed machine translation
and information retrieval for less formal genres, focusing particularly on
user-generated content. LDC supported the BOLT program by collecting informal
data sources -- discussion forums, text messaging, and chat -- in Chinese,
Egyptian Arabic, and English. The collected data was translated and annotated
for various tasks including word alignment, treebanking, propbanking, and
co-reference.
BOLT English PropBank and Sense – Discussion Forum, SMS/Chat and Conversational
Telephone Speech is distributed via web download.
2020 Subscription Members will automatically receive copies of this corpus.
2020 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.
*
(2) LORELEI Tigrinya Incident Language Pack was developed by LDC and is comprised
of approximately 4.5 million words of Tigrinya monolingual text, 25,000 words
of English monolingual text, 235,000 words of parallel and comparable
Tigrinya-English text, and 50,000 words of data annotated for Entity Discovery
and Linking and for Situation Frames. It contains all of the text data,
annotations, supplemental resources, and related software tools for the Tigrinya
language that were used in the DARPA LORELEI /
LoReHLT 2017 Evaluation.
The evaluation protocol was based on a scenario in which an
unforeseen event triggered a need for humanitarian and logistical support in a
region where the incident language had received little or no attention in NLP
research. Evaluation participants provided NLP solutions, including information
extraction and machine translation, with limited resources and limited
development time.
Data was collected from news, social network, weblog, newsgroup, discussion
forum, and reference material. Entity Detection and Linking and Situation Frame
annotations identified “entities,” “needs” (such as a need for food), and “issues”
(such as civil unrest) to be detected by systems for scoring purposes.
Situation frame analysis was designed to extract basic information that would
be useful for planning a disaster response effort.
The knowledge base for the entity linking annotation in this corpus is
available separately as LORELEI Entity Detection and Linking Knowledge Base
(LDC2020T10).
LORELEI Tigrinya Incident Language Pack is distributed via web download.
2020 Subscription Members will automatically receive copies of this corpus.
2020 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.
*
(3) Chinese Lexical Resources for Gender, Number, Animacy was developed by LDC
and consists of gender, number,
and animacy lexicons produced in support of the DARPA DEFT program. Gender,
number, and animacy are lexical indicators useful for named entity tagging,
including the detection of person mentions in text.
This corpus was created by extracting information from newswire texts in Chinese Gigaword Fifth Edition (LDC2011T13)
in the following steps: (1) segmenting source documents into sentences; (2)
converting any traditional Chinese script to simplified Chinese; (3) tagging
all sentences for parts-of-speech; (4) developing queries to detect patterns;
and (5) building lexicons based on frequency counts and entity types.
The resulting resources include dictionaries of Chinese animate nominals and
names; Chinese nominals and names with predicted gender and number; and other
dictionaries of Chinese nominals, names, verbs, and pronouns. Each dictionary
contains frequency information as well as the features in question.
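
The five steps above can be illustrated with a minimal Python sketch. It is not LDC's actual tooling. The input is assumed to be whitespace-segmented, and the conversion table `TRAD_TO_SIMP`, the lookup tagger `TOY_POS`, and the "noun followed by 们" query are all hypothetical stand-ins chosen to keep the example self-contained.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, tiny traditional-to-simplified table for this demo only;
# a real pipeline would use a complete conversion resource.
TRAD_TO_SIMP = str.maketrans(
    {"們": "们", "學": "学", "師": "师", "說": "说", "來": "来"}
)

# Hypothetical toy part-of-speech lookup standing in for a real tagger.
TOY_POS = {"老师": "NN", "学生": "NN", "说": "VV", "来": "VV", "们": "SFX", "了": "AS"}

SENTENCE_END = "。！？"


def split_sentences(text):
    """Step 1: split the text after sentence-final punctuation."""
    parts = re.split(r"(?<=[。！？])", text)
    return [p.strip() for p in parts if p.strip()]


def to_simplified(text):
    """Step 2: map traditional characters to simplified ones."""
    return text.translate(TRAD_TO_SIMP)


def tag(sentence):
    """Step 3: tag whitespace-separated tokens with the toy lookup."""
    tokens = sentence.strip(SENTENCE_END + " ").split()
    return [(tok, TOY_POS.get(tok, "X")) for tok in tokens]


def build_lexicon(text):
    """Steps 4 and 5: run one pattern query and keep frequency counts."""
    lexicon = defaultdict(Counter)
    for sentence in split_sentences(text):
        tagged = tag(to_simplified(sentence))
        for i, (word, pos) in enumerate(tagged):
            if pos != "NN":
                continue
            lexicon[word]["total"] += 1
            # Illustrative query: a noun followed by the plural marker 们
            # is counted as evidence for a plural, animate nominal.
            if i + 1 < len(tagged) and tagged[i + 1][0] == "们":
                lexicon[word]["plural_animate"] += 1
    return lexicon


SAMPLE = "老師 們 說 。 學生 們 來 了 。 老師 說 。"
lexicon = build_lexicon(SAMPLE)
for word, counts in lexicon.items():
    print(word, dict(counts))
```

Each entry pairs a nominal with its overall frequency and the number of times the pattern fired, which mirrors the idea of storing frequency information next to the predicted features.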
DARPA's Deep Exploration and Filtering of Text (DEFT) program aimed to address
remaining capability gaps in state-of-the-art natural language processing
technologies related to inference, causal relationships and anomaly detection.
LDC supported the DEFT program by collecting, creating and annotating a variety
of data sources.
Chinese Lexical Resources for Gender, Number, Animacy is distributed via web
download.