CFP for LREC 2016 Novel Incentives Workshop
LDC Membership Discounts for MY 2016 Still Available
New publications:
______________________________________________________________
CFP for LREC 2016 Novel Incentives Workshop
The first workshop on novel incentives in linguistic data
collection will take place on May 28, 2016 in conjunction with the Tenth
International Conference on Language Resources and Evaluation (LREC2016)
in Portoroz, Slovenia.
The workshop, Novel Incentives for Collecting Linguistic Data and
Annotation from People: types, implementation, tasking requirements,
workflow and results, opens the discussion on incentives in data
collection, describing novel approaches and comparing them with
traditional monetary incentives.
The workshop is accepting papers through February 6,
2016. For more information visit the workshop
webpage.
LDC Membership Discounts for MY 2016 Still Available
If you are considering joining LDC for Membership Year
2016 (MY2016), there is still time to save on membership fees. Any
organization which joins or renews membership for 2016 through March 1, 2016,
is entitled to a 5% discount on membership fees. Organizations which held
membership for MY2015 can receive a 10% discount on fees provided they renew
prior to March 1, 2016. Publications planned for release in 2016 include
multilingual language packs, BOLT discussion forum and DEFT narrative text
corpora, HAVIC video clips and transcripts and the latest Arabic and Chinese
treebanks.
New publications
(1) Arabic Treebank - Weblog was developed by LDC and consists of
Arabic weblog data with part-of-speech, morphology, gloss and
syntactic tree annotation.
The ongoing Penn Arabic
Treebank Project (PATB) supports research in Arabic-language natural language
processing and human language technology development. Generally, the PATB
consists of two distinct phases: (a) part-of-speech (POS) tagging, which
divides the text into lexical tokens and gives relevant information about each
token such as lexical category, inflectional features, and a gloss (referred to
as POS for convenience, although it includes morphological and gloss
information not traditionally included with part-of-speech annotation), and (b)
Arabic treebanking, which characterizes the constituent structures of word
sequences, provides categories for each non-terminal node, and identifies null
elements, co-reference, traces and so on.
The data contains 243,117
source tokens before clitics were split, and 308,996 tree tokens after clitics
were separated for treebank annotation. The source material is weblogs
collected by LDC from various sources.
Arabic Treebank - Weblog is distributed via web download.
2016 Subscription Members will automatically receive two
copies of this corpus. 2016 Standard Members may request a copy as part
of their 16 free membership corpora. Non-members may license this data
for a fee.
*
(2) NewSoMe Corpus of Opinion in Blogs was compiled at Barcelona Media and consists of
English and Spanish blogs annotated for opinions. It is part of the NewSoMe
(News and Social Media) set of corpora presenting opinion annotations across
several genres and covering multiple languages. NewSoMe is the result of an
effort to build a unifying annotation framework for analyzing opinion in
different genres, ranging from controlled text, such as news reports, to
diverse types of user-generated content that includes blogs, product reviews
and microblogs. LDC has also released NewSoMe Corpus of Opinion in News Reports (LDC2015T17).
The data consists of 108 English documents and 191 Spanish documents. The annotation was carried out manually through the crowdsourcing platform CrowdFlower with seven annotations per layer that were aggregated for this data set. The layers annotated were topic, segment, cue, subjectivity, polarity and intensity.
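The announcement does not say how the seven annotations per layer were aggregated. As a rough illustration only, the sketch below assumes a simple majority vote over the judgments for a single item. The function name, the example labels and the agreement score are hypothetical and are not taken from the corpus documentation.

```python
from collections import Counter


def aggregate_judgments(judgments):
    """Return the most frequent label and the share of annotators who chose it.

    Assumption: plain majority vote. Ties are not handled specially; the
    label that was counted first among the tied ones is returned.
    """
    if not judgments:
        raise ValueError("at least one judgment is required")
    counts = Counter(judgments)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(judgments)


# Seven hypothetical polarity judgments for one annotated segment.
polarity_judgments = [
    "negative", "negative", "positive", "negative",
    "neutral", "negative", "negative",
]

label, agreement = aggregate_judgments(polarity_judgments)
print(label, round(agreement, 2))
```

An odd number of judgments per item rules out ties when a layer has only two possible values, which may be one reason to collect seven. Layers with more values could still tie and would need an explicit tie-breaking rule.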
NewSoMe Corpus of Opinion in Blogs is distributed via web
download.
2016 Subscription Members will automatically receive two
copies of this corpus. 2016 Standard Members may request a copy as part
of their 16 free membership corpora. Non-members may license this data
for a fee.
*
(3) GALE Phase 4 Chinese Weblog Parallel Sentences was developed by LDC. Along with other corpora, the parallel
text in this release comprised training data for Phase 4 of the DARPA GALE
(Global Autonomous Language Exploitation) Program. This corpus contains Chinese
source sentences and corresponding English translations selected from newsgroup
and weblog data collected by LDC and translated by LDC or under its direction.
Sentences were selected for translation in two steps. First, files were chosen using sentence selection scripts provided by GALE program participants SRI International and IBM. The output was then manually reviewed by LDC staff to eliminate problematic sentences. Selected files were reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Chinese to English translation guidelines and were provided with the full source documents containing the target sentences for their reference. Bilingual LDC staff performed quality control procedures on the completed translations.
GALE Phase 4 Chinese Weblog
Parallel Sentences is distributed via web download.
2016 Subscription Members will automatically receive two
copies of this corpus. 2016 Standard Members may request a copy as part
of their 16 free membership corpora. Non-members may license this data
for a fee.