New Publications:
________________________________________________________________________
(1) 2015-2016 CoNLL Shared Task contains the Chinese and English training,
development and test data for the 2015 and 2016 CoNLL (Conference on
Computational Natural Language Learning) Shared Task Evaluation, which focused
on shallow discourse parsing. This release consists of tokenized, tagged, and
parsed data in English and Chinese. The English train, dev and test data are
from Wall Street Journal material in Penn Discourse Treebank Version 2.0
(LDC2008T05); English blind test data are from Wikinews. Chinese train, dev and
test data are news material from Chinese Discourse Treebank 0.5 (LDC2014T21);
Chinese blind test data are from Wikinews.
LDC has also released the following CoNLL Shared Task data sets:
· 2006 CoNLL Shared Task - Ten Languages (LDC2015T11)
· 2006 CoNLL Shared Task - Arabic & Czech (LDC2015T12)
· 2008 CoNLL Shared Task Data (LDC2009T12)
· 2009 CoNLL Shared Task Part 1 (LDC2012T03)
· 2009 CoNLL Shared Task Part 2 (LDC2012T04)
2015-2016 CoNLL Shared Task is distributed via web download.
2017 Subscription Members will receive copies of this corpus. 2017 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.
*
(2) IARPA Babel Zulu Language Pack IARPA-babel206b-v0.1e was developed by Appen
for the IARPA (Intelligence Advanced Research Projects Activity) Babel program.
It contains approximately 211 hours of Zulu conversational and scripted
telephone speech collected in 2012 and 2013 along with corresponding
transcripts.
The Babel program focuses on underserved languages and seeks to
develop speech recognition technology that can be rapidly applied to any human
language to support keyword search performance over large amounts of recorded
speech.
The Zulu speech in this release represents that spoken in the urban dialect
region of KwaZulu-Natal (KZN), South Africa. The gender distribution among
speakers is
approximately equal; speakers' ages range from 16 years to 70 years. Calls were
made using different telephones (e.g., mobile, landline) from a variety of
environments including the street, a home or office, a public place, and inside
a vehicle.
IARPA Babel Zulu Language Pack IARPA-babel206b-v0.1e is distributed via web
download.
2017 Subscription Members will receive copies of this corpus provided they have
submitted a completed copy of the special license agreement. 2017 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.
*
(3) SRI-FRTIV (Five-way Recorded Toastmaster Intrinsic Variation) was developed
by SRI International in 2007-2008 and comprises approximately 232 hours of
English speech from thirty-four speakers who were members of Toastmasters
clubs.
Participants were asked to speak at three different levels of effort (low,
normal and high) in four different styles (interview, conversation, reading and
oration) to study the question of how intrinsic variations -- associated with
the speaker rather than the recording environment -- affect text-independent
speaker verification.
Participants were native speakers of North American
English who were members of local Toastmasters clubs and had experience in
public speaking. This release includes demographic information for 30 speakers
(15 male, 15 female), including gender, birth year, height, education level,
years in Toastmasters, and a self-evaluation of speaking skills.
SRI-FRTIV is distributed via web download.
2017 Subscription Members will receive copies of this corpus. 2017 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.
*
(4) Vehicle City Voices Corpus – Part I was developed at the University of
Michigan-Flint as part of an ongoing oral history project and survey of English
language variation in Flint, Michigan. It contains approximately 16 hours of
speech with
corresponding transcripts from interviews of Flint residents conducted between
2012 and 2015. The corpus was designed to provide high-quality recordings for
acoustic analysis and to examine narrative structure and discursive
construction of individual and collective identity in urban spaces.
This release comprises 21 interviews conducted by undergraduate and graduate
students for civic engagement projects in linguistics courses and by a
graduate student research assistant. Participants (11 female, 10 male) were
born between 1935 and 1991 and represented a range of ages, genders, and
ethnicities. Of the interviewees, 11 were Black/African American, 8 were
White/Caucasian, and 2 were biracial/mixed ethnic heritage.
Metadata (where provided by participants) includes information on
gender, ethnicity, year of birth, level of education, field of employment,
average income, length of time living in Flint and its surrounding areas, as
well as interviewer age, gender, and ethnicity. Original interview durations,
edited interview durations, interview year, and transcript word counts are also
provided in the metadata file.
Vehicle City Voices Corpus – Part I is available as a web download.
2017 Subscription Members will receive copies of this corpus provided they have
submitted a completed copy of the special license agreement. 2017 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.