LDC at Interspeech 2015
2013 Data Pack deadline is September 15
LDC co-organizes LSA2016 Pre-conference Workshop
New publications:
Fall 2015 LDC Data Scholarship program - September 15 deadline approaching
Student applications for the Fall 2015 LDC Data
Scholarship program are being accepted now through Tuesday,
September 15, 2015, 11:59PM EST. The LDC Data Scholarship program
provides university students with access to LDC data at no cost.
This program is open to students pursuing both undergraduate and
graduate studies in an accredited college or university. LDC Data
Scholarships are not restricted to any particular field of study;
however, students must demonstrate a well-developed research
agenda and a bona fide inability to pay.
Students will need to complete an application, which consists of a data use proposal and a letter of support from their adviser. For further information on application materials and program rules, please visit the LDC Data Scholarship page.
Applicants can email their materials to the LDC Data Scholarship program.
LDC at Interspeech 2015
LDC will once again be exhibiting at Interspeech, held this year September 7-10 in Dresden, Germany. Stop by booth 20 to learn more about recent developments at the Consortium and new publications.
Also, be on the lookout for the following presentations featuring LDC work:
Investigating Consonant Reduction in Mandarin Chinese with Improved Forced Alignment: Jiahong Yuan and Mark Liberman (both LDC)
Wednesday 9 September, Oral Session 36-5, 17:50-18:10
The Effect of Spectral Slope on Pitch Perception: Jianjing Kuang (UPenn) and Mark Liberman (LDC)
2013 Data Pack deadline is September 15
One month remains for not-for-profit and
government organizations to create a custom data collection of
eight corpora from among LDC’s 2013 releases. Selection options
include: 1993-2007 United Nations Parallel Text, Chinese Treebank
8.0, CSC Deceptive Speech, GALE Arabic and Chinese speech and text
releases, Greybeard, MADCAT training data, NIST 2012 Open Machine
Translation (OpenMT) evaluation and progress sets, and more. The 2013 Data Pack is
available for a flat rate of $3500 through September 15, 2015.
LDC co-organizes LSA2016 Pre-conference Workshop
University of Arizona’s Malcah Yaeger-Dror and LDC’s Chris Cieri are organizing the upcoming LSA 2016 workshop “Preparing your Corpus for Archival Storage”. The session is sponsored by the National Science Foundation (BCS #1549994) and will be held on Thursday, January 7, 2016 in Washington, DC before the start of the 90th Annual Meeting of the Linguistic Society of America (LSA 2016).
There will be no additional registration fees to attend the session for those already taking part in the annual meeting. Students who are about to carry out their own fieldwork, or who have begun doing so, are eligible to apply for funding by November 2, 2015 to help defray the extra costs of attending the workshop. For more information about the speakers and topics, visit LDC’s workshop page.
New publications
(1) Arabic Learner Corpus was developed at the University of Leeds and
consists of written essays and spoken recordings by Arabic
learners collected in Saudi Arabia in 2012 and 2013. The corpus
includes 282,732 words in 1,585 materials, produced by 942
students from 67 nationalities studying at pre-university and
university levels. The average length of an essay is 178 words.
Two tasks were used to collect the written
data, and participants had the choice to do one or both of them.
In each of those tasks, learners were asked to write a narrative
about a vacation trip and a discussion about the participant's
study interest. Those choosing the first task wrote a 40-minute
timed essay without the use of any language reference
materials. In the second task, participants completed the writing
as a take-home assignment over two days and were permitted to use
language reference materials.
The audio recordings were collected by giving
students a limited amount of time to talk about the topics above
without using language reference materials.
The original handwritten essays were
transcribed into an electronic text format. The corpus data
consists of three types: (1) handwritten sheets scanned in PDF
format; (2) audio recordings in MP3 format; and (3) textual
Unicode data in plain text and XML formats (including the
transcribed audio and transcripts of the handwritten essays). The
audio files are either 44100 Hz 2-channel or 16000 Hz 1-channel MP3
files.
Arabic Learner Corpus is distributed via web
download.
2015 Subscription Members will automatically receive two copies of this corpus provided that they have completed the license agreement. 2015 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(2) GALE Phase 3 Arabic Broadcast Conversation Speech Part 1 was
developed by LDC and comprises approximately 123 hours of Arabic
broadcast conversation speech collected in 2007 by LDC, MediaNet
(Tunis, Tunisia) and MTC (Rabat, Morocco) during Phase 3 of the DARPA
GALE (Global Autonomous Language Exploitation) program.
Corresponding transcripts are released as GALE
Phase 3 Arabic Broadcast Conversation Transcripts Part 1 (LDC2015T16).
Broadcast audio for the GALE program was
collected at LDC’s Philadelphia, PA, USA facilities and at three
remote collection sites. The combined local and outsourced
broadcast collection supported GALE at a rate of approximately 300
hours per week of programming from more than 50 broadcast sources
for a total of over 30,000 hours of collected broadcast audio over
the life of the program.
The broadcast conversation recordings in this
release feature interviews, call-in programs and roundtable
discussions focusing principally on current events from the
following sources: Abu Dhabi TV, Al Alam News Channel, Al Arabiya,
Aljazeera, Al Ordiniyah, Dubai TV, Lebanese Broadcasting Corporation,
Oman TV, Saudi TV, and Syria TV.
This release contains 149 audio files presented
in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz
single-channel 16-bit PCM.
GALE Phase 3 Arabic Broadcast Conversation
Speech Part 1 is distributed via web download.
2015 Subscription Members will automatically
receive two copies of this corpus. 2015 Standard Members may
request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.
*
(3) GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 1 was
developed by LDC and contains transcriptions of approximately 123
hours of Arabic broadcast conversation speech collected in 2007 by
LDC, MediaNet (Tunis, Tunisia) and MTC (Rabat, Morocco) during Phase
3 of the DARPA GALE (Global Autonomous Language Exploitation)
program.
Corresponding audio data is released as GALE
Phase 3 Arabic Broadcast Conversation Speech Part 1 (LDC2015S11).
The transcript files are in plain-text,
tab-delimited format (TDF) with UTF-8 encoding, and the
transcribed data totals 733,233 tokens. The transcripts were
created with the LDC-developed transcription tool, XTrans,
a multi-platform, multilingual, multi-channel transcription tool
that supports manual transcription and annotation of audio
recordings.
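For readers who want to check a token count on tab-delimited, UTF-8 transcript text like that described above, the following Python sketch may be a useful starting point. It is an illustration only: the function names, the position of the transcript column (`transcript_col`), and the `;;` comment prefix are assumptions made for this example, not details confirmed by this announcement, so the documentation shipped with the corpus should be consulted for the actual field layout.

```python
def count_tokens(tdf_text, transcript_col=7, comment_prefix=";;"):
    """Count whitespace-separated tokens in one column of tab-delimited text.

    transcript_col and comment_prefix are hypothetical defaults chosen for
    this sketch; adjust them to match the real file layout.
    """
    total = 0
    for line in tdf_text.splitlines():
        # Skip blank lines and lines treated as comments.
        if not line.strip() or line.startswith(comment_prefix):
            continue
        fields = line.split("\t")
        # Ignore rows that are too short to contain the transcript column.
        if len(fields) <= transcript_col:
            continue
        total += len(fields[transcript_col].split())
    return total


def count_tokens_in_file(path, transcript_col=7):
    """Read a UTF-8 encoded file and count tokens in the chosen column."""
    with open(path, encoding="utf-8") as handle:
        return count_tokens(handle.read(), transcript_col=transcript_col)


# Made-up sample with six tab-separated fields; the transcript is field 5.
sample = (
    ";; hypothetical comment line\n"
    "show_a\t0\t0.00\t2.50\tspk1\tmarhaban bikum\n"
    "show_a\t0\t2.50\t5.10\tspk2\tshukran jazilan lakum\n"
)
print(count_tokens(sample, transcript_col=5))  # prints 5
```

Splitting on whitespace is only one possible definition of a token, so a count produced this way will not necessarily match the figure reported for the corpus.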
The files in this corpus were transcribed by
LDC staff and/or by transcription vendors under contract to LDC.
Transcribers followed LDC's quick transcription guidelines (QTR)
and quick rich transcription specification (QRTR), both of which
are included in the documentation with this release.
GALE Phase 3 Arabic Broadcast Conversation
Transcripts Part 1 is distributed via web download.
2015 Subscription Members will automatically receive two copies of this corpus. 2015 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.