Linguistic Data Consortium: February 2017

LDC Director Mark Liberman receives the IEEE James L. Flanagan Speech and Audio Processing Award

Only two weeks left to enjoy 2017 membership discounts

Spring 2017 LDC Data Scholarship recipients

New publications:

First-Year Law Students' Court Memoranda

IARPA Babel Haitian Creole Language Pack IARPA-babel201b-v0.2b

GALE Phase 3 Arabic Broadcast News Speech Part 2

GALE Phase 3 Arabic Broadcast News Transcripts Part 2

______________________________________________________________

LDC Director Mark Liberman receives the IEEE James L. Flanagan Speech and Audio Processing Award

LDC Director Mark Liberman is the 2017 recipient of the IEEE James L. Flanagan Speech and Audio Processing Award. Established in 2002, this annual award recognizes an individual for his or her outstanding contribution to the advancement of speech and/or audio processing. Liberman’s pioneering contributions and continued leadership in robust, replicable, and data-driven speech and language science and engineering have fueled the development and advancement of human language technologies including speech and speaker recognition, machine translation, and semantic analysis. As LDC’s founder, Mark has shepherded the Consortium from a small organization to the largest developer of shared language resources, distributing more than 120,000 copies of over 2,000 databases covering 91 different languages to more than 3,600 organizations in over 70 countries.

Liberman will receive the award at ICASSP 2017 in New Orleans (March 5-9). LDC will be an exhibitor at Booth 43. Please stop by and say hello. We hope to see you there.

Only two weeks left to enjoy 2017 membership discounts

There is still time to save on 2017 membership fees. Through March 1, all organizations receive a discount on the 2017 membership fee (up to 10%) when they choose to join or renew.

For more information on membership benefits, visit Join LDC.

Spring 2017 LDC Data Scholarship recipients

Congratulations to the recipients of LDC's Spring 2017 data scholarship:

Umad Ul Hassan and Muhammad Awais Zulfiqar: National University of Sciences and Technology (Pakistan); BS Computer Science. Hassan and Zulfiqar are awarded copies of CSLU: Kids’ Speech Version 1.1 and The CMU Kids Corpus for their research in speech recognition for children with learning difficulties.

For information about the program, visit the Data Scholarship page.

New publications:

(1) First-Year Law Students' Court Memoranda consists of 197 English writing samples by law students, in the form of legal briefs annotated for certain characteristics, along with accompanying survey responses from the student writers.

The briefs were created in writing classes at two law schools in the US Midwest during the 2011-12 academic year. Students who agreed to participate in this study uploaded their briefs to an online survey instrument and answered questions regarding their age, gender, level of education, most recent writing course and method of learning English. The study's purpose was to apply natural language processing approaches to determine any differences in the briefs' language attributable to the students' self-reported genders.

The samples were imported into the General Architecture for Text Engineering (GATE) and annotated by two human coders who identified large text segments specific to the legal genre in which the students wrote, such as text headings, citations, block quotes and footnotes.

Writing samples are presented as MS Word documents; annotations and survey responses are presented in XML format. The data has been anonymized to remove names and other identifying information about the student participants.

First-Year Law Students' Court Memoranda is distributed via web download.

2017 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) IARPA Babel Haitian Creole Language Pack IARPA-babel201b-v0.2b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 203 hours of Haitian Creole conversational and scripted telephone speech collected in 2012 and 2013 along with corresponding transcripts.

The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

The Haitian Creole speech in this release represents that spoken in the Northern, Western and Southern dialect regions in Haiti. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 75 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

Transcripts are encoded in UTF-8.

IARPA Babel Haitian Creole Language Pack IARPA-babel201b-v0.2b is distributed via web download.

(3) GALE Phase 3 Arabic Broadcast News Speech Part 2 was developed by LDC and comprises approximately 128 hours of Arabic broadcast news speech collected in 2007 by LDC, MediaNet (Tunis, Tunisia) and MTC (Rabat, Morocco) during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) program.

Corresponding transcripts are released as GALE Phase 3 Arabic Broadcast News Transcripts Part 2 (LDC2017T04).

The recordings in this corpus feature news broadcasts focusing principally on current events from various broadcast programmers including Abu Dhabi TV, Al Alam News Channel, Al Arabiya, Al Iraqiyah, Aljazeera, Al Ordiniyah, Dubai TV, Kuwait TV, Lebanese Broadcast Corporation, Nile TV, Saudi TV and Syria TV.

This release contains 175 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Arabic speaker.
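For planning storage, the audio parameters above give a rough upper bound on the decompressed size of the release. The following is a minimal sketch, assuming the approximate 128-hour figure from the description; the function name `pcm_bytes` is illustrative and not part of any LDC tool, and the FLAC files as distributed will be smaller than this estimate.

```python
def pcm_bytes(hours, sample_rate_hz=16000, channels=1, bits_per_sample=16):
    """Estimate the size in bytes of uncompressed PCM audio."""
    seconds = hours * 3600
    bytes_per_sample = bits_per_sample // 8
    return int(seconds * sample_rate_hz * channels * bytes_per_sample)


# 128 hours is the approximate total stated for this corpus.
size = pcm_bytes(128)
print(f"{size / 1e9:.1f} GB uncompressed")  # prints "14.7 GB uncompressed"
```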

GALE Phase 3 Arabic Broadcast News Speech Part 2 is distributed via web download.

2017 Subscription Members will automatically receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(4) GALE Phase 3 Arabic Broadcast News Transcripts Part 2 was developed by LDC and contains transcriptions of approximately 128 hours of Arabic broadcast news speech collected in 2007 by LDC, MediaNet (Tunis, Tunisia) and MTC (Rabat, Morocco) during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) program.

Corresponding audio data is released as GALE Phase 3 Arabic Broadcast News Speech Part 2 (LDC2017S02).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 721,846 tokens. The transcripts were created with the LDC tool, XTrans, which supports manual transcription and annotation of audio recordings.
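Because the transcripts are plain-text, tab-delimited and UTF-8 encoded, they can be read with standard tools. The sketch below shows one way to count tokens in a transcript column. Several details are assumptions rather than facts from this announcement: the column position (`transcript_col`), the `;;` comment prefix, the presence of a header row, and whitespace tokenization, which may not reproduce the 721,846 figure. The sample rows are invented for illustration; consult the corpus documentation for the actual layout.

```python
import io


def count_tokens(lines, transcript_col=7, comment_prefix=";;", has_header=True):
    """Count whitespace-separated tokens in one column of a tab-delimited file.

    The column index, comment prefix, and header row are assumptions;
    adjust them to match the corpus documentation.
    """
    total = 0
    header_pending = has_header
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith(comment_prefix):
            continue
        # Skip the first remaining line if it is a header row.
        if header_pending:
            header_pending = False
            continue
        fields = line.split("\t")
        if len(fields) > transcript_col:
            total += len(fields[transcript_col].split())
    return total


# Invented two-row sample. A real file would be opened with
# open(path, encoding="utf-8") and passed in the same way.
sample = io.StringIO(
    "file\tchannel\tstart\tend\tspeaker\ttype\tdialect\ttranscript\n"
    ";; hypothetical comment line\n"
    "show_a\t0\t0.00\t2.50\tspk1\tmale\tnative\tمرحبا بكم في النشرة\n"
    "show_a\t0\t2.50\t4.00\tspk2\tfemale\tnative\tشكرا لكم\n"
)
print(count_tokens(sample))
```

Passing an open file object instead of the `io.StringIO` sample works unchanged, since the function only iterates over lines.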

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and quick rich transcription specification (QRTR), both of which are included in the documentation with this release.

GALE Phase 3 Arabic Broadcast News Transcripts Part 2 is distributed via web download.

Linguistic Data Consortium

Wednesday, February 15, 2017

LDC February 2017 Newsletter