New Publications
_______________________________________________________________
(1) GALE Chinese-English Word Alignment and Tagging -- Broadcast
Training Part 4 was developed by LDC and contains
243,038 tokens of word aligned Chinese and English parallel text enriched with
linguistic tags. This material was used as training data in the DARPA GALE
(Global Autonomous Language Exploitation) program.
Some approaches to statistical
machine translation include the incorporation of linguistic knowledge in word
aligned text as a means to improve automatic word alignment and machine
translation quality.
In this release, that knowledge is added through two annotation schemes:
alignment and tagging. Alignment identifies minimum translation units and
translation relations by using minimum-match and attachment annotation
approaches. The tagging scheme defines a set of word tags and alignment link
tags to describe these translation units and relations. Tagging adds
contextual, syntactic and language-specific features to the alignment
annotation.
This release consists of Chinese source broadcast conversation (BC) and broadcast news (BN) programming collected by LDC in 2008 and 2009. The distribution by genre, words, character tokens and segments appears below:
Language | Genre | Files | Words   | CharTokens | Segments
Chinese  | BC    | 69    | 67,782  | 101,674    | 2,276
Chinese  | BN    | 29    | 94,242  | 141,364    | 3,152
Total    |       | 98    | 162,024 | 243,038    | 5,428
Note that all token counts are based
on the Chinese data only. One token is equivalent to one character and one word
is equivalent to 1.5 characters.
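The conversion described in the note can be checked with a short sketch. The character counts are copied from the table; the function name `words_from_chars` and the choice to truncate rather than round are assumptions made for illustration, since the announcement does not say how fractional words were handled.

```python
# Character-token counts per genre, copied from the table in this announcement.
CHAR_TOKENS = {"BC": 101_674, "BN": 141_364}


def words_from_chars(char_tokens: int) -> int:
    """Estimate words as characters / 1.5, truncated (assumed rounding)."""
    # chars / 1.5 equals chars * 2 / 3; integer arithmetic avoids float error.
    return char_tokens * 2 // 3


words = {genre: words_from_chars(n) for genre, n in CHAR_TOKENS.items()}
total_chars = sum(CHAR_TOKENS.values())
total_words = sum(words.values())

print(words)                     # {'BC': 67782, 'BN': 94242}
print(total_chars, total_words)  # 243038 162024
```

Under this assumption the per-genre values sum to 162,024, which matches the table. Applying the same conversion directly to the 243,038 total would give 162,025, so the published total appears to be the sum of the per-genre word counts.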
The Chinese word alignment tasks
consisted of the following components:
- Identifying, aligning, and tagging eight different
types of links
- Identifying, attaching, and tagging local-level
unmatched words
- Identifying and tagging sentence/discourse-level
unmatched words
- Identifying and tagging all instances of Chinese 的 (DE) except when they were a part of a semantic link
GALE Chinese-English Word Alignment
and Tagging -- Broadcast Training Part 4 is distributed via web download.
2015 Subscription Members will automatically receive two copies of this corpus on disc. 2015 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(2) GALE Phase 3 and 4 Arabic Newswire Parallel Text was developed by LDC.
Along with other corpora, the parallel text in this release comprised training
data for Phases 3 and 4 of the DARPA GALE (Global Autonomous Language
Exploitation) Program. This corpus contains Modern Standard Arabic source text
and corresponding English translations selected from newswire data collected by
LDC in 2007 and 2008 and transcribed and translated by LDC or under its
direction.
This data includes 551
source-translation document pairs, comprising 156,775 tokens of Arabic source
text and its English translation. Data is drawn from seven distinct Arabic
newswire sources: Agence France Presse, Al Ahram, Al Hayat, Al-Quds Al-Arabi,
An Nahar, Asharq Al-Awsat and Assabah.
The files in this release were
transcribed by LDC staff and/or transcription vendors under contract to LDC in
accordance with the Quick Rich Transcription guidelines developed by LDC. The
transcribed and segmented files were reformatted into a human-readable
translation format and assigned to translation vendors. Translators followed
LDC's Arabic to English translation guidelines. Bilingual LDC staff performed
quality control procedures on the completed translations. Source data and
translations are distributed in TDF format.
GALE Phase 3 and 4 Arabic Newswire
Parallel Text is distributed via web download.
2015 Subscription Members will
automatically receive two copies of this corpus on disc. 2015 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.
*
(3) NewSoMe Corpus of Opinion in News Reports was compiled at Barcelona
Media and consists of Spanish, Catalan
and Portuguese news reports annotated for opinions. It is part of the NewSoMe
(News and Social Media) set of corpora presenting opinion annotations across
several genres and covering multiple languages. NewSoMe is the result of an
effort to build a unifying annotation framework for analyzing opinion in
different genres, ranging from controlled text, such as news reports, to
diverse types of user-generated content that includes blogs, product reviews
and microblogs.
The source data in this release was
obtained from various newspaper websites and consists of approximately 200
documents in each of Spanish, Catalan and Portuguese. The annotation was
carried out manually through the crowdsourcing platform CrowdFlower with seven annotations per layer that were aggregated for
this data set. The layers annotated were topic, segment, cue, subjectivity,
polarity and intensity.
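The announcement does not say how the seven annotations per layer were aggregated. As a rough illustration only, the sketch below uses simple majority voting, a common choice for crowdsourced labels; the function name `aggregate`, the label values and the sample judgments are all hypothetical and are not taken from the corpus.

```python
from collections import Counter


def aggregate(labels):
    """Return the most frequent label and its share of the judgments."""
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)


# Hypothetical polarity judgments from seven annotators for one segment.
judgments = ["negative", "negative", "neutral", "negative",
             "positive", "negative", "neutral"]

label, agreement = aggregate(judgments)
print(label, round(agreement, 2))  # negative 0.57
```

A real pipeline would also need an explicit rule for ties and might weight annotators by reliability; neither is modeled here.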
NewSoMe Corpus of Opinion in News
Reports is distributed via web download.
2015 Subscription Members will automatically receive two copies of this
corpus on disc. 2015 Standard Members may request a copy as part of their
16 free membership corpora. Non-members may license this data for a fee.