New publications:
Domain-Specific Hyponym Relations
GALE Arabic-English Parallel Aligned Treebank -- Web Training
Multi-Channel WSJ Audio
(1) Domain-Specific Hyponym Relations was developed by the Shaanxi Province Key Laboratory of Satellite and Terrestrial Network Technology at Xi’an Jiaotong University, Xi’an, Shaanxi, China. It provides more than 5,000 English hyponym relations in five domains: data mining, computer networks, data structures, Euclidean geometry and microbiology. All hypernym and hyponym words were taken from Wikipedia article titles.
A hyponym relation is a word-sense
relation of the IS-A type. For example, dog is a hyponym of animal and
binary tree is a hyponym of tree structure. Applications for
domain-specific hyponym relations include taxonomy and ontology learning, query
result organization in faceted search, and knowledge organization and
automated reasoning in knowledge-rich applications.
The data is presented in XML format,
and each file provides hyponym relations in one domain. Within each file, the
term, Wikipedia URL, hyponym relation and the names of the hyponym and hypernym
words are included. The distribution of terms and relations is set forth in the
table below:
Dataset | Terms | Hyponym Relations
Data Mining | 278 | 364
Computer Network | 336 | 399
Data Structure | 315 | 578
Euclidean Geometry | 455 | 690
Microbiology | 1,028 | 3,533
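Because each file holds one domain's relations in XML, a consumer might load the hyponym/hypernym pairs and group them into a one-level taxonomy. The following is a minimal Python sketch under stated assumptions: the element names (`domain`, `relation`, `hyponym`, `hypernym`), the `url` attribute and the sample content are hypothetical and are not taken from the corpus documentation, so they would need to be adapted to the actual files.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample: element names, attributes and content are assumptions
# made for illustration, not the corpus's documented schema.
SAMPLE = """\
<domain name="Data Structure">
  <relation>
    <hyponym url="https://en.wikipedia.org/wiki/Binary_tree">Binary tree</hyponym>
    <hypernym url="https://en.wikipedia.org/wiki/Tree_structure">Tree structure</hypernym>
  </relation>
  <relation>
    <hyponym url="https://en.wikipedia.org/wiki/Binary_search_tree">Binary search tree</hyponym>
    <hypernym url="https://en.wikipedia.org/wiki/Binary_tree">Binary tree</hypernym>
  </relation>
</domain>
"""


def load_relations(xml_text):
    """Return a list of (hyponym, hypernym) pairs from one domain file."""
    root = ET.fromstring(xml_text)
    pairs = []
    for rel in root.iter("relation"):
        hypo = rel.findtext("hyponym")
        hyper = rel.findtext("hypernym")
        # Skip relations that are missing either side.
        if hypo and hyper:
            pairs.append((hypo.strip(), hyper.strip()))
    return pairs


def children_by_hypernym(pairs):
    """Group hyponyms under each hypernym (one level of a taxonomy)."""
    tree = defaultdict(list)
    for hypo, hyper in pairs:
        tree[hyper].append(hypo)
    return dict(tree)


pairs = load_relations(SAMPLE)
taxonomy = children_by_hypernym(pairs)
print(taxonomy)
```

Grouping by hypernym is only one possible view of the data; chaining the pairs (binary search tree IS-A binary tree IS-A tree structure) would give a deeper taxonomy of the kind used in ontology learning.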
Domain-Specific Hyponym Relations is distributed via web download.
2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora. This data is made available at no cost to LDC members and non-members under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 license.
*
(2) GALE Arabic-English Parallel Aligned Treebank -- Web Training was developed by LDC and contains 69,766 tokens of word-aligned Arabic and English parallel text with treebank annotations. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.
Parallel aligned treebanks are
treebanks annotated with morphological and syntactic structures aligned at the
sentence level and the sub-sentence level. Such data sets are useful for
natural language processing and related fields, including automatic word
alignment system training and evaluation, transfer-rule extraction, word sense
disambiguation, translation lexicon extraction and cultural heritage and
cross-linguistic studies. For machine translation system
development, parallel aligned treebanks may improve system performance through
enhanced syntactic parsers, better rules and knowledge about language pairs, and
a reduced word error rate.
In this release, the source Arabic
data was translated into English. Arabic and English treebank annotations were
performed independently. The parallel texts were then word aligned.
LDC previously released
Arabic-English Parallel Aligned Treebanks as follows:
This release consists of Arabic
source web data (newsgroups, weblogs) collected by LDC in 2004 and 2005. All
data is encoded as UTF-8. A count of files, words, tokens and segments is
below.
Language | Files | Words | Tokens | Segments
Arabic | 162 | 46,710 | 69,766 | 3,178
Note: Word count is based on the
untokenized Arabic source; token count is based on the ATB-tokenized Arabic
source.
The purpose of the GALE word
alignment task was to find correspondences between words, phrases or groups of
words in a set of parallel texts. Arabic-English word alignment annotation
consisted of the following tasks:
- Identifying different types of links: translated
(correct or incorrect) and not translated (correct or incorrect)
- Identifying sentence segments not suitable for
annotation, e.g., blank segments, incorrectly-segmented segments, segments
with foreign languages
- Tagging unmatched words attached to other words or phrases
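The link types listed above can be pictured as labels on pairs of token groups. The following Python sketch is illustrative only: the class name `AlignmentLink`, the label strings and the index-based representation are assumptions, and the release's actual file format and tag names may differ.

```python
from collections import Counter
from dataclasses import dataclass

# Labels paraphrase the task description; they are not the corpus's tag names.
LINK_TYPES = {
    "translated_correct",
    "translated_incorrect",
    "not_translated_correct",
    "not_translated_incorrect",
}


@dataclass(frozen=True)
class AlignmentLink:
    source_tokens: tuple  # indices of Arabic tokens (empty if none)
    target_tokens: tuple  # indices of English tokens (empty if none)
    link_type: str

    def __post_init__(self):
        # Reject labels outside the assumed inventory.
        if self.link_type not in LINK_TYPES:
            raise ValueError(f"unknown link type: {self.link_type}")


def count_link_types(links):
    """Tally how many links carry each label."""
    return Counter(link.link_type for link in links)


# One hypothetical segment: two translated links and one untranslated token.
links = [
    AlignmentLink((0,), (0, 1), "translated_correct"),
    AlignmentLink((1, 2), (2,), "translated_correct"),
    AlignmentLink((3,), (), "not_translated_correct"),
]
counts = count_link_types(links)
print(counts)
```

Storing token indices rather than strings lets a link cover a single word, a phrase or a group of words, which matches the stated goal of aligning words, phrases or groups of words.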
GALE Arabic-English Parallel Aligned Treebank -- Web Training is distributed via web download.
2014 Subscription Members will
automatically receive two copies of this data on disc. 2014 Standard Members
may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.
*
(3) Multi-Channel
WSJ Audio (MCWSJ) was developed by the Centre for Speech
Technology Research at the University of Edinburgh and contains
approximately 100 hours of recorded speech from 45 British English speakers.
Participants read Wall Street Journal texts published in 1987-1989 in three recording
scenarios: a single stationary speaker, two stationary overlapping speakers and
a single moving speaker.
This corpus was designed to address
the challenges of speech recognition in meetings, which often occur in rooms
with non-ideal acoustic conditions and significant background noise, and may
contain large sections of overlapping speech. Using headset microphones
represents one approach, but meeting participants may be reluctant to wear
them. Microphone arrays are another option. MCWSJ supports research in large
vocabulary tasks using microphone arrays. The news sentences read by speakers
are taken from WSJCAM0 Cambridge Read News, a corpus originally
developed for large vocabulary continuous speech recognition experiments, which
in turn was based on CSR-I (WSJ0) Complete, made available by LDC to
support large vocabulary continuous speech recognition initiatives.
Speakers reading news text from
prompts were recorded using a headset microphone, a lapel microphone and an
eight-channel microphone array. In the single speaker scenario, participants
read from six fixed positions. Fixed positions were assigned for the entire
recording in the overlapping scenario. For the moving scenario, participants
moved from one position to the next while reading.
Fifteen speakers were recorded for
the single scenario, nine pairs for the overlapping scenario and nine
individuals for the moving scenario. Each read approximately 90 sentences.
Multi-Channel WSJ Audio is
distributed on two DVD-ROMs.
2014 Subscription Members will
receive a copy of this data provided that they have completed the User License Agreement for Multi-Channel WSJ Audio
LDC2014S03. 2014 Standard Members may request a copy as part of
their 16 free membership corpora. Non-members may license this data for a
fee.