LDC Director
Mark Liberman receives the IEEE James L. Flanagan Speech and Audio Processing
Award
Only two weeks left to enjoy 2017 membership discounts
Spring 2017 LDC Data
Scholarship recipients
New publications:
______________________________________________________________
LDC Director
Mark Liberman receives the IEEE James L. Flanagan Speech and Audio Processing
Award
LDC Director
Mark Liberman is the 2017 recipient of the IEEE James L. Flanagan Speech and Audio Processing
Award. Established in
2002, this annual award recognizes an individual for his or her outstanding
contribution to the advancement of speech and/or audio processing. Liberman’s pioneering
contributions and continued leadership in robust, replicable, and data-driven
speech and language science and engineering have fueled the development and advancement of human language
technologies including speech and speaker recognition, machine translation, and
semantic analysis. As LDC’s founder, Mark has shepherded the Consortium from a
small organization to the largest developer of shared language resources,
distributing more than 120,000 copies of over 2,000 databases covering 91
different languages to more than 3,600 organizations in over 70 countries.
Liberman will receive the award at ICASSP
2017 in New Orleans (March 5-9). LDC will be an exhibitor at
Booth 43. Please stop by and say hello. We hope to see you
there.
Only two weeks left to enjoy 2017 membership discounts
There is still time to save on 2017 membership fees. Through March 1, all
organizations receive a discount of up to 10% on the 2017 membership fee when
they choose to join or renew.
For more information on membership benefits, visit Join LDC.
Spring 2017 LDC Data Scholarship recipients
Congratulations to the recipients of LDC's Spring 2017 data scholarship:
For information about the program, visit the Data
Scholarship page.
New publications:
(1) First-Year Law Students' Court Memoranda consists
of 197 legal briefs written in English by law students and annotated for
certain characteristics, along with accompanying survey responses from the
student writers.
The samples were imported into the General Architecture for Text Engineering (GATE) and annotated by two human coders who identified large text segments specific to the legal genre in which the students wrote, such as text headings, citations, block quotes and footnotes.
Writing samples are presented as MS Word documents; annotations and survey responses are presented in XML format. The data has been anonymized to remove names and other identifying information about the student participants.
First-Year Law Students' Court
Memoranda is distributed via web download.
2017 Subscription Members will receive copies of this
corpus provided they have submitted
a completed copy of the special license agreement. 2017 Standard Members
may request a copy as part of their 16 free membership corpora. Non-members may
license this data for a fee.
*
(2) IARPA
Babel Haitian Creole Language Pack IARPA-babel201b-v0.2b was developed by
Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel
program. It contains approximately 203
hours of Haitian Creole conversational and scripted telephone speech collected
in 2012 and 2013 along with corresponding transcripts.
The Babel program focuses on underserved languages and
seeks to develop speech recognition technology that can be rapidly applied to
any human language to support keyword search performance over large amounts of
recorded speech.
The Haitian Creole speech in this release represents that spoken in the
Northern, Western and Southern dialect regions in Haiti. The gender
distribution among speakers is approximately equal; speakers' ages range from 16 years to 75 years.
Calls were made using different telephones (e.g., mobile, landline) from a
variety of environments including the street, a home or office, a public place,
and inside a vehicle.
Transcripts are encoded in UTF-8.
IARPA Babel Haitian Creole Language Pack
IARPA-babel201b-v0.2b is distributed via web download.
2017 Subscription Members will receive copies of this
corpus provided they have submitted
a completed copy of the special license agreement. 2017 Standard Members
may request a copy as part of their 16 free membership corpora. Non-members may
license this data for a fee.
*
(3) GALE Phase 3 Arabic Broadcast
News Speech Part 2 was developed by LDC
and comprises approximately 128 hours of Arabic broadcast news
speech collected in 2007 by LDC, MediaNet (Tunis, Tunisia) and MTC (Rabat,
Morocco) during Phase 3 of the DARPA GALE (Global Autonomous Language
Exploitation) program.
Corresponding transcripts are released as GALE Phase 3 Arabic Broadcast News
Transcripts Part 2 (LDC2017T04).
The recordings in this corpus feature news broadcasts focusing principally on
current events from various broadcast programmers including Abu Dhabi TV, Al
Alam News Channel, Al Arabiya, Al Iraqiyah, Aljazeera, Al Ordiniyah, Dubai TV,
Kuwait TV, Lebanese Broadcast Corporation, Nile TV, Saudi TV and Syria TV.
This release contains 175 audio files presented in FLAC-compressed Waveform
Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was
audited by a native Arabic speaker.
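For planning storage, the figures above allow a rough estimate of the decompressed audio size. The following is a minimal sketch that uses only the numbers stated in this announcement (approximately 128 hours, 175 files, 16000 Hz, single-channel, 16-bit); the function and variable names are illustrative, and actual sizes will differ because the hour count is approximate and file headers are ignored.

```python
# Back-of-the-envelope sizing from the figures stated in the announcement.
HOURS = 128             # approximate total duration
NUM_FILES = 175         # audio files in the release
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2    # 16-bit PCM
CHANNELS = 1            # single-channel


def uncompressed_bytes(hours, sample_rate_hz, bytes_per_sample, channels):
    """Size of the raw PCM payload, ignoring any container headers."""
    seconds = hours * 3600
    return int(seconds * sample_rate_hz * bytes_per_sample * channels)


total_bytes = uncompressed_bytes(HOURS, SAMPLE_RATE_HZ, BYTES_PER_SAMPLE, CHANNELS)
mean_minutes_per_file = HOURS * 60 / NUM_FILES

print(f"{total_bytes / 1e9:.1f} GB of raw PCM")
print(f"{mean_minutes_per_file:.1f} minutes per file on average")
```

Under these assumptions the raw PCM comes to roughly 14.7 GB, with files averaging about 44 minutes each; the FLAC download will be smaller.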
GALE Phase 3 Arabic Broadcast News Speech Part 2 is distributed via web
download.
2017 Subscription Members will automatically receive copies of this corpus.
2017 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.
*
(4) GALE Phase 3 Arabic Broadcast News Transcripts Part 2 was
developed by LDC and contains transcriptions of approximately 128 hours of
Arabic broadcast news speech collected in 2007 by LDC, MediaNet (Tunis,
Tunisia) and MTC (Rabat, Morocco) during Phase 3 of the DARPA GALE (Global
Autonomous Language Exploitation) program.
Corresponding audio data is released as GALE Phase 3 Arabic
Broadcast News Speech Part 2 (LDC2017S02).
The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8
encoding, and the transcribed data totals 721,846 tokens. The transcripts were
created with the LDC tool XTrans, which supports manual transcription and
annotation of audio recordings.
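Because the transcripts are tab-delimited UTF-8 text, they can be read with standard tools. The sketch below shows one way to count whitespace-separated tokens in a transcript column using Python's csv module. The column layout in the sample (file, start, end, speaker, transcript) and the column index are hypothetical, chosen only for illustration; consult the documentation included with the release for the actual field order and for any header or comment lines, which this sketch does not handle. Whitespace splitting is also an assumption and may not match how the 721,846-token total was computed.

```python
import csv
import io


def count_tokens(tdf_stream, transcript_column):
    """Count whitespace-separated tokens in one column of tab-delimited text."""
    # QUOTE_NONE keeps quotation marks inside transcripts as literal text.
    reader = csv.reader(tdf_stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    total = 0
    for row in reader:
        if len(row) <= transcript_column:
            continue  # skip short or blank lines
        total += len(row[transcript_column].split())
    return total


# Hypothetical sample rows: file, start, end, speaker, transcript.
sample = io.StringIO(
    "file_a\t0.00\t1.50\tspeaker1\tمرحبا بكم\n"
    "file_a\t1.50\t2.10\tspeaker2\tشكرا\n"
)
sample_token_count = count_tokens(sample, transcript_column=4)
print(sample_token_count)
```

For a file on disk, the same function can be given a handle opened with `open(path, encoding="utf-8", newline="")`.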
The files in this corpus were transcribed by LDC staff and/or by transcription
vendors under contract to LDC. Transcribers followed LDC's quick transcription
guidelines (QTR) and quick rich transcription specification (QRTR), both of
which are included in the documentation with this release.
GALE Phase 3 Arabic Broadcast News Transcripts Part 2 is distributed via web
download.
2017 Subscription Members will automatically receive copies of this corpus.
2017 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.