LDC at NWAV 43
LDC Data Scholarship Update
New publications:
Chinese Discourse Treebank 0.5
GALE Arabic-English Word Alignment -- Broadcast Training Part 2
United Nations Proceedings Speech
________________________________________________________________
LDC at NWAV 43
LDC will be exhibiting at the 43rd New Ways of Analyzing Variation Conference (NWAV 43) held this year October 23-26 in Chicago, Illinois. Please stop by our table in the Old Town Room on the third floor of the Hilton to learn more about the most recent developments at the Consortium and to check out our latest giveaways. As always, LDC will post conference updates via our Facebook page. We hope to see you in Chicago!
LDC Data Scholarship Update
LDC received many solid applications for the Fall 2014 LDC Data Scholarship
Program. We are in the process of reviewing submissions and will announce
recipients soon. The LDC Data Scholarship program provides university students
with access to LDC data at no cost. Students were asked to complete an
application consisting of a proposal describing their intended use of the data,
as well as a letter of support from their thesis adviser.
Data use
proposals in this cycle included a range of research interests from opinion
mining tagging to deceptive speech classification.
New publications
(1) Chinese
Discourse Treebank 0.5 was developed at Brandeis University as part
of the Chinese Treebank Project and consists of approximately 73,000
words of Chinese newswire text annotated for discourse relations. It follows
the lexically grounded approach of the Penn Discourse Treebank (PDTB) (LDC2008T05)
with adaptations based on the linguistic and statistical characteristics of Chinese
text. Discourse relations are lexically anchored by discourse connectives
(e.g., because, but, therefore), which are viewed as predicates that take
abstract objects such as propositions, events and states as their arguments.
Along with PDTB-style schemes for English, Turkish, Hindi and Czech, Chinese
Discourse Treebank provides an additional perspective on how the PDTB approach
can be extended for cross-lingual annotation of discourse relations.
Data was selected
from the newswire material in Chinese Treebank 8.0 (LDC2013T21),
specifically, from Xinhua News Agency stories. There are approximately 5,500
annotation instances. Following the PDTB format, each annotation instance
consists of 27 vertical-bar-delimited fields. The fields specify the attributes
of the discourse relation as a whole, as well as the attributes of its two
arguments. Not all fields are filled in this release. Filled fields are
indicated by a pair of angle brackets; the remaining fields are placeholders
for future releases.
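As a rough illustration of the format described above, the sketch below splits one annotation instance into its 27 fields and reports which positions carry a value. It assumes that unfilled placeholder fields are empty strings and that no field value itself contains a vertical bar; neither assumption is confirmed by this announcement, and the function names are hypothetical. The corpus documentation remains the authority on what each field means.

```python
from typing import List

# Field count taken from the description above.
EXPECTED_FIELDS = 27


def split_annotation(line: str) -> List[str]:
    """Split one annotation instance into its vertical-bar-delimited fields.

    Assumption: field values never contain a literal "|" character.
    """
    fields = line.rstrip("\n").split("|")
    if len(fields) != EXPECTED_FIELDS:
        raise ValueError(
            f"expected {EXPECTED_FIELDS} fields, got {len(fields)}"
        )
    return fields


def filled_positions(fields: List[str]) -> List[int]:
    """Return zero-based positions of fields that hold a value.

    Assumption: placeholder fields are empty (or whitespace only).
    """
    return [i for i, value in enumerate(fields) if value.strip()]


if __name__ == "__main__":
    # Synthetic record for demonstration only; not taken from the corpus.
    record = "|".join(["first"] + [""] * 25 + ["last"])
    fields = split_annotation(record)
    print(len(fields), filled_positions(fields))
```

Checking the field count before interpreting a record is a cheap way to catch truncated or malformed lines early.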
Chinese Discourse
Treebank 0.5 is distributed via web download.
2014 Subscription
Members will automatically receive two copies of this data on disc. 2014
Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.
*
(2) GALE Arabic-English Word Alignment -- Broadcast Training Part 2
consists of Arabic source broadcast news and broadcast conversation data
collected by LDC from 2007-2009. The Arabic word alignment tasks consisted of
the following components:
Normalizing tokenized tokens as needed
Identifying different types of links
Identifying sentence segments not suitable for annotation
Tagging unmatched words attached to other words or phrases
GALE
Arabic-English Word Alignment – Broadcast Training Part 2 is distributed via
web download.
2014 Subscription
Members will automatically receive two copies of this data on disc. 2014
Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.
*
(3) United Nations Proceedings Speech was developed by the United Nations
(UN) and contains approximately 8,500 hours of recorded proceedings in the six
official UN languages: Arabic, Chinese, English, French, Russian and Spanish.
The data was recorded in 2009-2012 from sessions 64-66 of the General Assembly
(GA) and First Committee (FC) (Disarmament and International Security), and
meetings 6434-6763 of the Security Council.
Recordings were made using a customized system following a daily, internally
circulated instruction from the Meetings Management Section. Most of the
subjects and information related to a particular meeting or session are
published in a UN Journal, which can be found here.
Data is presented as either mp3 or flac-compressed wav files. The files are
16-bit, single-channel audio sampled at either 22,050 or 8,000 Hz, and are
organized by committee and session number, then by language. The folder
labeled "Floor" contains the recording from the microphone used by the
particular speaker. Those files may include other languages, for instance,
if the speaker's language was not among the six official UN languages.
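For readers who plan to index the recordings, the following sketch shows one way to read the committee, session and language from a file's relative path. The three-level layout (committee/session/language-or-Floor/file), the example folder names and the helper names are all assumptions made for illustration; the actual directory names on the hard drive may differ and should be checked against the corpus documentation.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class RecordingLocation(NamedTuple):
    committee: str  # hypothetical label, e.g. "GA"
    session: str    # hypothetical label, e.g. "64"
    channel: str    # a language folder, or "Floor"
    filename: str


def locate(relative_path: str) -> RecordingLocation:
    """Split a relative path into committee, session, channel and file name.

    Assumption: exactly three folder levels precede the file name.
    """
    parts = PurePosixPath(relative_path).parts
    if len(parts) != 4:
        raise ValueError(f"unexpected path depth: {relative_path!r}")
    return RecordingLocation(*parts)


def is_floor(location: RecordingLocation) -> bool:
    """True when the file comes from the folder labeled "Floor"."""
    return location.channel == "Floor"


if __name__ == "__main__":
    # Hypothetical path; not an actual file name from the corpus.
    loc = locate("GA/64/Floor/example.flac")
    print(loc, is_floor(loc))
```

Separating "Floor" files from the per-language folders matters because, as noted above, floor recordings may contain speech outside the six official languages.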
United Nations
Proceedings Speech is distributed on one hard drive.
2014 Subscription Members will receive one copy of this data, provided they have completed the user license agreement. 2014 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.