New publications:
(1) DEFT Narrative Text
(2) GALE Phase 3 and 4 Arabic Web Parallel Text
(3) GALE Phase 3 and 4 Chinese Broadcast Conversation Parallel Text
_____________________________________________________________________
New Corpora
(1) DEFT Narrative Text was developed by LDC and contains proxy reports and their
source newswire used to support DARPA's Deep Exploration and Filtering of Text
(DEFT) program. One of the goals of the DEFT program was to develop
technologies that can perform various NLP tasks on data in a variety of genres,
both formal and informal.
LDC provided source data and annotations for DEFT system
development. DEFT Narrative Text consists of "proxy reports" (and
"multi-proxy reports") in English. (Multi-)proxy reports are intended
to mimic the format and other features of some types of government analyst
reports using content from newswire articles. The corresponding English
newswire source documents are also included in the release.
LDC staff manually selected the source newswire from English
Gigaword Fifth Edition (LDC2011T07).
The newswire source documents are XML files following the
Gigaword corpus format. The proxy reports are in plain text format.
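As a minimal sketch of reading one newswire source document, assuming the files use the usual Gigaword-style layout (a DOC element with id and type attributes, a HEADLINE, and a TEXT body split into P paragraphs): the sample document, its id, and the helper name parse_gigaword_doc below are hypothetical and are not taken from the release itself, so check them against the corpus documentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical document in the assumed Gigaword-style layout.
# The id, headline, and text are invented for illustration only.
SAMPLE = """<DOC id="XXX_ENG_20100101.0001" type="story">
<HEADLINE>
Sample headline
</HEADLINE>
<DATELINE>
CITY, Jan. 1 (Agency)
</DATELINE>
<TEXT>
<P>
First paragraph of the story.
</P>
<P>
Second paragraph of the story.
</P>
</TEXT>
</DOC>"""


def parse_gigaword_doc(xml_string):
    """Return the id, type, headline, and paragraphs of one DOC element."""
    doc = ET.fromstring(xml_string)
    headline = doc.findtext("HEADLINE", default="").strip()
    # Collapse the line breaks and padding inside each paragraph.
    paragraphs = [" ".join(p.text.split()) for p in doc.iter("P") if p.text]
    return {
        "id": doc.get("id"),
        "type": doc.get("type"),
        "headline": headline,
        "paragraphs": paragraphs,
    }


record = parse_gigaword_doc(SAMPLE)
print(record["id"], record["headline"], len(record["paragraphs"]))
```

The proxy reports are plain text, so they can be read directly without a parser.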
DEFT Narrative Text is distributed via web download.
2016 Subscription Members will automatically receive two
copies of this corpus. 2016 Standard Members may request a copy as part of
their 16 free membership corpora. Non-members may license this data for
a fee.
*
(2) GALE Phase 3 and 4 Arabic Web Parallel Text was developed by LDC. Along
with other corpora, the parallel text in this release comprised training data
for Phases 3 and 4 of the DARPA GALE (Global Autonomous Language Exploitation)
Program. This corpus contains Modern Standard Arabic source text and
corresponding English translations selected from weblog and newsgroup data
collected by LDC and translated by LDC or under its direction.
The data includes 124 source-translation document pairs,
comprising 61,662 tokens of Arabic source text and its English translation.
Data is drawn from four Arabic weblog and newsgroup sources.
GALE Phase 3 and 4 Arabic Web Parallel Text is
distributed via web download.
2016 Subscription Members will automatically receive two
copies of this corpus. 2016 Standard Members may request a copy as part of
their 16 free membership corpora. Non-members may license this data for
a fee.
*
(3) GALE Phase 3 and 4 Chinese Broadcast Conversation Parallel Text was
developed by LDC. Along with other corpora, the
parallel text in this release comprised training data for Phases 3 and 4 of the
DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains
Chinese source text and corresponding English translations selected from
broadcast conversation data collected by LDC between 2006 and 2008 and
transcribed and translated by LDC or under its direction.
This data includes 63 source-translation document pairs,
comprising 487,466 tokens of Chinese source text and its English translation.
Data is drawn from 19 distinct Chinese programs broadcast between 2006 and 2008.
Data was manually selected for translation according to
several criteria, including linguistic features, transcription features and
topic features. The transcribed and segmented files were then reformatted into
a human-readable translation format and assigned to translation vendors.
Translators followed LDC's Chinese-to-English translation guidelines. Bilingual
LDC staff performed quality control procedures on the completed translations.
GALE Phase 3 and 4 Chinese Broadcast Conversation
Parallel Text is distributed via web download.
2016 Subscription Members will automatically receive two
copies of this corpus. 2016 Standard Members may request a copy as part of
their 16 free membership corpora. Non-members may license this data for
a fee.