LDC at Interspeech 2014, Singapore
New publications:

LDC at Interspeech 2014, Singapore
LDC is off to Singapore to participate in Interspeech 2014. This year’s conference will be held from September 14-18 at Singapore’s Max Atria at the Expo Center. Please stop by LDC’s exhibition booth to learn more about recent developments at the Consortium and new publications. LDC will continue to post conference updates via our Facebook page. We hope to see you there!
New publications
(1) ACE 2007 Multilingual Training Corpus was developed by LDC and contains the complete set of Arabic and Spanish training data for the 2007 Automatic Content Extraction (ACE) technology evaluation, specifically, Arabic and Spanish newswire data and Arabic weblogs annotated for entities and temporal expressions. The objective of the ACE program was to develop automatic content extraction technology to support automatic processing of human language in text form from a variety of sources including newswire, broadcast programming and weblogs. In the 2007 evaluation, participants were tested on system performance for the recognition of entities, values, temporal expressions, relations, and events in Chinese and English and for the recognition of entities and temporal expressions in Arabic and Spanish. LDC's work in the ACE program is described in more detail on the LDC ACE project pages.
The Arabic data is composed of newswire (60%) published between October and
December 2000 and weblogs (40%) published between November 2004 and February
2005. The Spanish data set consists entirely of newswire material from
multiple sources published between January and April 2005. A document
pool was established for each language based on genre and epoch
requirements. Humans reviewed the pool to select individual
documents suitable for ACE annotation, such as documents that were
representative of their genre and contained targeted ACE entity
types. One annotator completed the entity and temporal expression
(TIMEX2) markup in the first pass annotation. This work was
reviewed in the second pass by a senior annotator. TIMEX2 values
were normalized by an annotator specifically trained for that
task.
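To make the normalization step concrete, the hypothetical Python sketch below shows the kind of mapping TIMEX2 normalization produces: each temporal expression found in the text receives a machine-readable VAL attribute, with relative expressions resolved against the document date. The surface strings and values are illustrative assumptions, not examples drawn from the corpus.

    # Illustrative only: TIMEX2 normalization assigns each temporal expression a
    # machine-readable VAL (ISO 8601-style), resolving relative expressions
    # against the document's dateline. None of these examples come from the corpus.
    EXAMPLE_NORMALIZATIONS = {
        "October 2000": "2000-10",
        "yesterday": "2000-10-03",               # assuming a 2000-10-04 dateline
        "the fourth quarter of 2004": "2004-Q4",
    }

    for surface, val in EXAMPLE_NORMALIZATIONS.items():
        print(f'{surface!r} -> VAL="{val}"')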
The table below describes the amount of data
included in the current release and its annotation status. Corpus
content for each language and data type is represented in the
three stages of annotation: first pass annotation (1P), second
pass annotation (2P) and TIMEX2 normalization and additional
quality control (NORM).
Arabic

             Words                        Files
             1P       2P       NORM       1P     2P     NORM
  NW         58,015   58,015   58,015     257    257    257
  WL         40,338   40,338   40,338     121    121    121
  Total      98,353   98,353   98,353     378    378    378
Spanish

             Words                          Files
             1P        2P        NORM       1P     2P     NORM
  NW         100,401   100,401   100,401    352    352    352
  Total      100,401   100,401   100,401    352    352    352
For a given document, there is a source .sgm
file together with the .ag.xml and .apf.xml annotation files in
each of the three directories "1p", "2p" and "timex2norm". In
other words, for each newswire story or weblog entry, the three
annotation directories each contain an identical copy of the
source text (SGML .sgm file) along with distinct versions of the
associated annotations (XML .ag.xml and .apf.xml files and plain text
.tab files). All files are presented in UTF-8.
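As a rough illustration of this layout, the Python sketch below pairs each source .sgm file in the timex2norm directories with its .ag.xml and .apf.xml annotation files. The corpus root path is a hypothetical placeholder, and any directory structure above the three annotation directories is an assumption rather than something stated here.

    from pathlib import Path

    # A minimal sketch, assuming the corpus is unpacked under CORPUS_ROOT and that
    # the "1p", "2p" and "timex2norm" directories sit somewhere beneath it.
    # The root path below is a hypothetical placeholder.
    CORPUS_ROOT = Path("ace_2007_multilingual_training")

    for sgm in sorted(CORPUS_ROOT.rglob("timex2norm/*.sgm")):
        doc_id = sgm.stem
        ag_xml = sgm.parent / f"{doc_id}.ag.xml"    # annotation-graph file
        apf_xml = sgm.parent / f"{doc_id}.apf.xml"  # APF annotation file
        if ag_xml.exists() and apf_xml.exists():
            print(f"{doc_id}: source and both annotation files found")
        else:
            print(f"{doc_id}: missing annotation file(s)")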
ACE 2007 Multilingual Training Corpus is
distributed via web download.
2014 Subscription Members will automatically
receive two copies of this data on disc. 2014 Standard Members
may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.
*
(2) GALE
Arabic-English Word Alignment -- Broadcast Training Part 1
was developed by LDC and contains 267,257 tokens of word-aligned
Arabic and English parallel text enriched with linguistic tags.
This material was used as training data in the DARPA GALE (Global
Autonomous Language Exploitation) program.
Some approaches to statistical machine translation incorporate linguistic
knowledge into word-aligned text to improve automatic word alignment and
machine translation quality. This is accomplished with two annotation
schemes: alignment and tagging. Alignment identifies minimum translation
units and translation relations using minimum-match and attachment
annotation approaches. The tagging scheme defines a set of word tags and
alignment link tags to describe these translation units and relations.
Tagging adds contextual, syntactic and language-specific features to the
alignment annotation.
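The Python sketch below shows one way such an annotation record might be represented: minimum translation units on each side, a link between them, and tags on the link and on individual words. The class, field names and tag values are illustrative assumptions; they do not reproduce the release's actual file format or tag inventory.

    from dataclasses import dataclass, field

    # Illustrative representation only; field names and tag values are assumptions,
    # not the release's actual format or tag set.
    @dataclass
    class AlignmentLink:
        source_tokens: list[int]   # indices into the Arabic token sequence
        target_tokens: list[int]   # indices into the English token sequence
        link_tag: str              # type of translation relation
        word_tags: dict[int, str] = field(default_factory=dict)  # per-token tags

    # Hypothetical example: one Arabic token aligned to two English tokens,
    # with a tag on the attached English function word.
    link = AlignmentLink(
        source_tokens=[3],
        target_tokens=[4, 5],
        link_tag="semantic",
        word_tags={5: "function-word"},
    )
    print(link)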
This release consists of Arabic source
broadcast news and broadcast conversation data collected by LDC
from 2007-2009. The distribution by genre, words, tokens and
segments appears below:
  Language   Genre   Files   Words     Tokens    Segments
  Arabic     BC      231     79,485    103,816   4,114
  Arabic     BN      92      131,789   163,441   7,227
  Totals             323     211,274   267,257   11,341
Note that word count is based on the
untokenized Arabic source, and token count is based on the
tokenized Arabic source.
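As a rough illustration of that distinction, the sketch below counts one short segment both ways. The segment and its segmentation (shown in Buckwalter-style transliteration) are hypothetical and are neither taken from the corpus nor produced by LDC's tokenizer.

    # Word count is taken over the untokenized source; token count is taken over
    # the tokenized source, in which clitics have been split off. Both strings
    # below are hypothetical illustrations in Buckwalter-style transliteration.
    raw_segment = "wbAlktAb Aljdyd"            # untokenized: 2 words
    tokenized_segment = "w+ b+ AlktAb Aljdyd"  # after clitic segmentation: 4 tokens

    print(len(raw_segment.split()), len(tokenized_segment.split()))  # -> 2 4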
The Arabic word alignment tasks consisted of the following components:

- Normalizing tokens as needed
- Identifying different types of links
- Identifying sentence segments not suitable for annotation
- Tagging unmatched words attached to other words or phrases
GALE Arabic-English Word Alignment -- Broadcast
Training Part 1 is distributed via web download.
2014 Subscription Members will automatically
receive two copies of this data on disc. 2014 Standard Members
may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.
*
(3) GALE Phase 2
Chinese Newswire Parallel Text Part 2 was developed by LDC.
Along with other corpora, the parallel text in this release
comprised training data for Phase 2 of the DARPA GALE (Global
Autonomous Language Exploitation) Program. This corpus contains
117,895 tokens of Chinese source text and corresponding English
translations selected from newswire data collected by LDC in 2007
and translated by LDC or under its direction.
This release includes 177 source-translation
document pairs, comprising 117,895 tokens of translated data. Data
is drawn from four distinct Chinese newswire sources: China News
Service, Guangming Daily, People's Daily and People's Liberation
Army Daily.
Data was manually selected for translation
according to several criteria, including linguistic features and
topic features. The files were formatted into a human-readable
translation format and assigned to translation vendors.
Translators followed LDC's Chinese to English translation
guidelines. Bilingual LDC staff performed quality control
procedures on the completed translations.
Source data and translations are distributed in TDF format. TDF files are
tab-delimited files in which each line contains one segment of text along
with metadata about that segment. Each field in the TDF file is described
in TDF_format.text. All data are encoded in UTF-8.
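A minimal reading sketch in Python appears below. It assumes only what is stated here: tab-delimited lines, UTF-8 encoding, and one text segment per line plus metadata fields. The filename and the position of the text field are hypothetical placeholders; consult TDF_format.text for the actual column layout.

    import csv

    TDF_PATH = "example_source.tdf"  # hypothetical filename
    TEXT_FIELD_INDEX = 7             # hypothetical position of the segment text

    with open(TDF_PATH, encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            if not row or row[0].startswith(";;"):  # skip comment/header lines, if any
                continue
            print(row[TEXT_FIELD_INDEX])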
GALE Phase 2 Chinese Newswire Parallel Text
Part 2 is distributed via web download.
2014 Subscription Members will automatically
receive two copies of this data on disc. 2014 Standard Members
may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.