Linguistic Data Consortium: September 2017

New Publications:

________________________________________________________________________

(1) 2015-2016 CoNLL Shared Task contains the Chinese and English training, development and test data for the 2015 and 2016 CoNLL (Conference on Computational Natural Language Learning) Shared Task Evaluation, which focused on shallow discourse parsing. This release consists of the tokenized, tagged, and parsed data in English and Chinese. The English train, dev and test data are from Wall Street Journal material in Penn Discourse Treebank Version 2.0 (LDC2008T05); English blind test data are from Wikinews. Chinese train, dev and test data are news material from Chinese Discourse Treebank 0.5 (LDC2014T21); Chinese blind test data are from Wikinews.

LDC has also released the following CoNLL Shared Task data sets:

· 2006 CoNLL Shared Task - Ten Languages (LDC2015T11)

· 2006 CoNLL Shared Task - Arabic & Czech (LDC2015T12)

· 2008 CoNLL Shared Task Data (LDC2009T12)

· 2009 CoNLL Shared Task Part 1 (LDC2012T03)

· 2009 CoNLL Shared Task Part 2 (LDC2012T04)

2015-2016 CoNLL Shared Task is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) IARPA Babel Zulu Language Pack IARPA-babel206b-v0.1e was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 211 hours of Zulu conversational and scripted telephone speech collected in 2012 and 2013 along with corresponding transcripts.

The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

The Zulu speech in this release represents that spoken in the KZN (KwaZulu-Natal)-urban dialect region of South Africa. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 70 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Zulu Language Pack IARPA-babel206b-v0.1e is distributed via web download.

2017 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) SRI-FRTIV (Five-way Recorded Toastmaster Intrinsic Variation) was developed by SRI International in 2007-2008 and comprises approximately 232 hours of English speech from thirty-four speakers who were members of Toastmasters clubs. Participants were asked to speak at three levels of effort (low, normal and high) in four styles (interview, conversation, reading and oration) so that researchers could study how intrinsic variations, those associated with the speaker rather than the recording environment, affect text-independent speaker verification.

Participants were native speakers of North American English who were members of local Toastmasters clubs and had experience in public speaking. This release includes demographic information for 30 speakers (15 male, 15 female), including gender, birth year, height, education level, years in Toastmasters, and a self-evaluation of speaking skills.

SRI-FRTIV is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(4) Vehicle City Voices Corpus – Part I was developed at the University of Michigan-Flint and is an ongoing oral history project and survey of English language variation in Flint, Michigan. It contains approximately 16 hours of speech with corresponding transcripts from interviews of Flint residents conducted between 2012 and 2015. The corpus was designed to provide high-quality recordings for acoustic analysis and to examine narrative structure and discursive construction of individual and collective identity in urban spaces.

This release comprises 21 interviews conducted by undergraduate and graduate students for civic engagement projects in linguistics courses and by a graduate student research assistant. Participants (11 female, 10 male) were born between 1935 and 1991 and represented a range of ages, genders, and ethnicities. Of the interviewees, 11 were Black/African American, 8 were White/Caucasian, and 2 were of biracial/mixed ethnic heritage.

Metadata (where provided by participants) includes information on gender, ethnicity, year of birth, level of education, field of employment, average income, and length of time living in Flint and its surrounding areas, as well as interviewer age, gender, and ethnicity. The metadata file also provides original interview durations, edited interview durations, interview year, and transcript word counts.

Vehicle City Voices Corpus – Part I is available as a web download.

Linguistic Data Consortium

Thursday, September 14, 2017
