Linguistic Data Consortium: LDC March 2016 Newsletter

Tuesday, March 15, 2016

New publications:

DEFT Narrative Text
GALE Phase 3 and 4 Arabic Web Parallel Text
GALE Phase 3 and 4 Chinese Broadcast Conversation Parallel Text
_____________________________________________________________________

New Corpora

(1) DEFT Narrative Text was developed by LDC and contains proxy reports and their source newswire used to support DARPA's Deep Exploration and Filtering of Text (DEFT) program. One of the goals of the DEFT program was to develop technologies that can perform various NLP tasks on data in a variety of genres, both formal and informal.

LDC provided source data and annotations for DEFT system development. DEFT Narrative Text consists of "proxy reports" (and "multi-proxy reports") in English. (Multi-)proxy reports are intended to mimic the format and other features of some types of government analyst reports using content from newswire articles. The corresponding English newswire source documents are also included in the release.

LDC staff manually selected the source newswire from English Gigaword Fifth Edition (LDC2011T07).

The newswire source documents are XML files following the Gigaword corpus format. The proxy reports are in plain text format.
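For readers planning to process the newswire files, the following is a minimal sketch of reading one Gigaword-style document in Python. It assumes the general Gigaword layout of a DOC element with id and type attributes, a HEADLINE, and a TEXT element containing P paragraphs. The sample document, its id, and the function name parse_doc are hypothetical and for illustration only; the actual files in this release may differ in detail.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the general Gigaword layout; not taken from the corpus.
SAMPLE = """<DOC id="XXX_ENG_20100101.0001" type="story">
<HEADLINE>
Example headline
</HEADLINE>
<TEXT>
<P>
First paragraph of the article.
</P>
<P>
Second paragraph
of the article.
</P>
</TEXT>
</DOC>"""


def _clean(text):
    """Collapse runs of whitespace and newlines into single spaces."""
    return " ".join((text or "").split())


def parse_doc(xml_text):
    """Parse one DOC element into a plain dictionary (assumed layout)."""
    doc = ET.fromstring(xml_text)
    headline = doc.find("HEADLINE")
    body = doc.find("TEXT")
    paragraphs = []
    if body is not None:
        paragraphs = [_clean(p.text) for p in body.findall("P")]
    return {
        "id": doc.get("id"),
        "type": doc.get("type"),
        "headline": _clean(headline.text) if headline is not None else "",
        "paragraphs": paragraphs,
    }


record = parse_doc(SAMPLE)
print(record["id"], len(record["paragraphs"]))
```

The proxy reports are plain text and can be read with ordinary file operations, so no parser is sketched for them here.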

DEFT Narrative Text is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) GALE Phase 3 and 4 Arabic Web Parallel Text was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phases 3 and 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Modern Standard Arabic source text and corresponding English translations selected from weblog and newsgroup data collected by LDC and translated by LDC or under its direction.

The data includes 124 source-translation document pairs, comprising 61,662 tokens of Arabic source text and its English translation. Data is drawn from four Arabic weblog and newsgroup sources.

GALE Phase 3 and 4 Arabic Web Parallel Text is distributed via web download.

(3) GALE Phase 3 and 4 Chinese Broadcast Conversation Parallel Text was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phases 3 and 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Chinese source text and corresponding English translations selected from broadcast conversation data collected by LDC between 2006 and 2008 and transcribed and translated by LDC or under its direction.

This data includes 63 source-translation document pairs, comprising 487,466 tokens of Chinese source text and its English translation. Data is drawn from 19 distinct Chinese programs broadcast between 2006 and 2008.

Data was manually selected for translation according to several criteria, including linguistic features, transcription features and topic features. The transcribed and segmented files were then reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Chinese to English translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations.

GALE Phase 3 and 4 Chinese Broadcast Conversation Parallel Text is distributed via web download.
