New Publications:
______________________________________________________________________
(1) BOLT Arabic Discussion Forums was developed by LDC and consists of 813,080
discussion forum threads in Egyptian Arabic harvested from the Internet using a
combination of manual and automatic processes. The DARPA BOLT
(Broad Operational Language Translation) program developed machine translation
and information retrieval for less formal genres, focusing particularly on user-generated
content. The material in this release represents the unannotated Arabic source
data in the discussion forum genre.
Collection was seeded based on the results of manual data
scouting by native speaker annotators. Scouts were instructed to seek content
in Egyptian Arabic that was original, interactive and informal. Upon locating
an appropriate thread, scouts submitted the URL and some simple judgments about
it to a database, via a web browser plug-in. The scale of the collection
precluded manual review of all data. Only a small portion of the threads
included in this release were manually reviewed, and it is expected that there
may be some offensive or otherwise undesired content as well as some threads
that contain a large amount of non-Arabic content. It should also be noted that
many threads may contain a mixture of Egyptian and other varieties of Arabic,
even among the threads that are primarily Arabic.
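
Because some threads contain a large amount of non-Arabic content, users may
want to screen threads before processing. The following is a minimal sketch,
not part of the release: it estimates the share of Arabic-script letters in a
thread's text. The function names and the 0.5 threshold are illustrative
assumptions, and a script-based check cannot distinguish Egyptian Arabic from
other varieties of Arabic.

```python
def arabic_ratio(text: str) -> float:
    """Return the fraction of letters that fall in the basic Arabic Unicode block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    # U+0600..U+06FF is the basic Arabic block; extended blocks are ignored here.
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return arabic / len(letters)


def mostly_arabic(text: str, threshold: float = 0.5) -> bool:
    """Hypothetical filter: keep a thread if enough of its letters are Arabic."""
    return arabic_ratio(text) >= threshold
```

A thread could then be kept or set aside with `mostly_arabic(thread_text)`,
where `thread_text` is whatever plain text the user has extracted from a
thread.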
BOLT Arabic Discussion Forums is distributed via web
download.
2018 Subscription Members will receive copies of this
corpus. 2018 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for a fee.
(2) LORELEI Somali Representative Language Pack - Monolingual and Parallel Text
was developed by LDC and comprises approximately 13 million words of
monolingual Somali text, approximately 800,000 of which are translated into
English. Another 100,000 words are translated from English into Somali.
The LORELEI (Low Resource Languages for Emergent Incidents) Program is
concerned with building Human Language Technology for low resource languages in
the context of emergent situations like natural disasters or disease outbreaks.
Data was collected in the following genres: discussion
forums, news, reference, social network and weblog. Both monolingual text
collection and parallel text creation involved a combination of manual and
automatic methods, which are detailed in the included documentation. All
harvested content was initially converted from its original HTML form into a
relatively uniform XML format. Also included in this release are two tools: one
to recreate original source data from the processed XML material and the other
to condition text data that users download from Twitter.
LORELEI Somali Representative Language Pack - Monolingual
and Parallel Text is distributed via web download.
2018 Subscription Members will receive copies of this
corpus. 2018 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for a fee.
(3) SPADE (Syntactic Phrase Alignment Dataset for Evaluation) consists of
annotated parse trees and phrase alignments for English sentential paraphrases
extracted from machine translation evaluation corpora, separated into
development and test sets.
Reference translations from machine translation evaluation
corpora were used as sentential paraphrases. They were sourced from the
following data sets released by LDC from the NIST (National Institute of
Standards and Technology) open machine translation evaluation series (OpenMT):
LDC2010T14, LDC2010T17, LDC2010T21, and LDC2013T03.
Reference translations of 10 to 30 words were randomly
extracted for annotation from NIST OpenMT corpora. Gold standard HPSG
(head-driven phrase structure grammar) trees and phrase alignments were then
annotated, yielding 15,721 paraphrase alignments and 20,276 phrases extracted
from 201 sentential paraphrases.
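
The length-restricted random extraction described above can be illustrated
with a minimal sketch. This is not the procedure used to build SPADE: the
function name, the whitespace-based word count and the fixed seed are all
assumptions made for illustration.

```python
import random


def sample_by_length(sentences, k, min_words=10, max_words=30, seed=0):
    """Randomly pick up to k sentences whose word count is within the range.

    Words are counted by splitting on whitespace, which is an assumption;
    the source does not say how word counts were determined.
    """
    eligible = [s for s in sentences
                if min_words <= len(s.split()) <= max_words]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(eligible, min(k, len(eligible)))
```

Calling `sample_by_length(references, 201)` on a list of reference
translations would return at most 201 sentences, each 10 to 30 words long.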
SPADE is distributed via web download.
2018 Subscription Members will
receive copies of this corpus. 2018 Standard Members may request a copy as part
of their 16 free membership corpora. Non-members may license this data for a
fee.