New publications:
High School students use LDC data
A team of students at Thomas Jefferson High School for Science and
Technology in Alexandria, VA, USA, has used an LDC database to
develop a device that helps autistic children recognize emotions.
The team was funded by a grant from the Lemelson-MIT InvenTeam
Initiative Program. InvenTeams are groups of high school students,
teachers, and mentors that receive grants of up to US$10,000 each to
invent technological solutions to real-world problems.
The team set out to invent an emotive aid in
the form of a bracelet that uses a computational algorithm to
extract emotional signatures from speech and display expressed
emotions in real time during a conversation. Potential
beneficiaries include children with autism, Asperger’s syndrome,
or similar conditions that impair the ability to detect emotion. The
algorithm employed machine learning and neural network-based
techniques to improve accuracy and efficiency relative to current
methods.
The students used speech samples from the LDC database Emotional Prosody Speech and Transcripts (LDC2002S28), as well as the Berlin Database of Emotional Speech, for training and testing their algorithm. Although the samples proved too small to produce a highly accurate algorithm, the team's algorithm did demonstrate some degree of success. The students will present their results at Eurekafest at MIT in June.
LDC thanks the InvenTeam’s teacher, Mark
Hannum, and group leader, Suhas Gondi, for contributing to this
article.
New publications
(1) GALE Phase 2 Chinese Broadcast Conversation Parallel Text Part 1 was
developed by LDC. Along with other corpora, the parallel text in
this release comprised training data for Phase 2 of the DARPA GALE
(Global Autonomous Language Exploitation) Program. This corpus
contains Chinese source text and corresponding English
translations selected from broadcast conversation (BC) data
collected by LDC in 2006 and 2007 and transcribed by LDC or under
its direction.
This release includes 21 source-translation
document pairs, comprising 146,082 characters of Chinese source
text and its English translation. Data is drawn from seven
distinct Chinese programs broadcast in 2006 and 2007 by two
sources: China Central TV, a national and international
broadcaster in Mainland China, and Phoenix TV, a Hong
Kong-based satellite television station. Broadcast conversation
programming is generally more interactive than traditional news
broadcasts and includes talk shows, interviews, call-in programs
and roundtable discussions. The programs in this release focus on
current events topics.
The data was transcribed by LDC staff and/or
transcription vendors under contract to LDC in accordance with
Quick Rich Transcription guidelines developed by LDC. Transcribers
indicated sentence boundaries in addition to transcribing the
text. Data was manually selected for translation according to
several criteria, including linguistic features, transcription
features and topic features. The transcribed and segmented files
were then reformatted into a human-readable translation format and
assigned to translation vendors. Translators followed LDC's
Chinese-to-English translation guidelines. Bilingual LDC staff performed
quality control procedures on the completed translations.
GALE Phase 2 Chinese Broadcast Conversation
Parallel Text Part 1 is distributed via web download. 2013 Subscription Members will automatically
receive two copies of this data on disc. 2013 Standard Members may
request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
(2) Greybeard was developed by LDC and comprises approximately 590
hours of English conversational telephone speech collected by LDC in
October and November 2008. The goal was to record new telephone
conversations among subjects who had participated in one or more
previous LDC telephone collections, from Switchboard-1 (1991)
through the Mixer studies (2006).
A total of 172 subjects were enrolled in the
Greybeard collection, all of whom had participated in one of the
following:
- Switchboard-1 (LDC97S62), 1991-1992: 2 subjects
- Switchboard-2 (LDC98S75, LDC99S79, LDC2002S06), 1996-1997: 16 subjects
- Mixer 1 and 2, 2003-2005: 103 subjects
- Mixer 3, 2006: 51 subjects
Most Greybeard participants completed 12 calls.
Some subjects completed up to 24 calls. Calls were made or
received via an automatic operator system at LDC which connected
two participants and announced a topic for discussion.
This release consists of 4680 calls: the
complete set of calls recorded during the Greybeard collection
(1098 calls) as well as all calls from the legacy collections that
involved the Greybeard speakers.
The audio from each call was captured digitally
by the operator system and stored in a separate file as raw mu-law
sample data. As the recordings were uploaded daily from the robot
operator to network disk storage, automated processes reformatted
the audio into a 2-channel SPHERE-format file for each
conversation and queued the recordings for manual audit to verify
speaker identification and to check other aspects of the
recording.
Auditors provided impressionistic judgments on overall
audio quality, the presence of background noise and cross-channel
echo, and any other technical difficulties with the call, in addition
to confirming the speaker ID on each channel.
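For readers unfamiliar with raw mu-law sample data, the following is a minimal sketch of how 8-bit mu-law bytes can be expanded to 16-bit linear samples. It assumes the standard G.711 mu-law encoding; the function names are illustrative only, and nothing here describes LDC's own conversion tools or the layout of the Greybeard files.

```python
"""Expand 8-bit G.711 mu-law bytes to 16-bit linear PCM (illustrative sketch)."""

BIAS = 0x84  # offset the G.711 mu-law encoder adds to sample magnitudes


def mulaw_to_linear(code: int) -> int:
    """Return the signed linear sample for one mu-law code byte (0-255)."""
    u = ~code & 0xFF            # mu-law bytes are stored bit-inverted
    sign = u & 0x80             # top bit marks a negative sample
    exponent = (u >> 4) & 0x07  # 3-bit segment number
    mantissa = u & 0x0F         # 4-bit step within the segment
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude


def decode_mulaw(raw: bytes) -> list[int]:
    """Decode a buffer of mu-law bytes into a list of linear samples."""
    return [mulaw_to_linear(b) for b in raw]


if __name__ == "__main__":
    # Hypothetical three-byte buffer, not taken from the corpus.
    print(decode_mulaw(bytes([0x00, 0x80, 0xFF])))
```

Run as a script, this prints [-32124, 32124, 0]: the two extreme codes and digital silence.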
Greybeard is distributed on five DVDs. 2013 Subscription Members will automatically
receive two copies of this data on disc. 2013 Standard Members may
request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
(3) Manually Annotated Sub-Corpus Third Release (MASC) was developed
as part of the American National Corpus
project and consists of approximately 500,000 words of
contemporary American English written and spoken data annotated
for a wide variety of linguistic phenomena.
The MASC project was established to address, to
the extent possible, many of the obstacles to the creation of
large-scale, robust, multiply-annotated corpora of English
covering a wide range of genres of written and spoken language
data. The project provides appropriate data and annotations to
serve as the base for a community-wide annotation effort, together
with an infrastructure that enables the incorporation of
contributed annotations into a single, usable format that can then
be analyzed as it is or transduced to any of a variety of other
formats. Further information about the project is available at the
MASC website.
The source texts were drawn from the open portion of the American
National Corpus Second Release and from the Language Understanding
Annotation Corpus. MASC Third Release includes the contents of MASC
First Release (LDC2010T22) (82,000 words), which is also available
from LDC. There is no second release.
All data in this release was annotated for
logical structure (paragraph, headings, etc.), token and sentence
boundaries, part of speech and lemma, shallow parse (noun and verb
chunks) and named entities (person, organization, location and
date). Portions of the corpus were also annotated for FrameNet
frames (40k full text), Penn Treebank syntax (82k) and opinion
(50k).
Manually Annotated Sub-Corpus Third Release is
distributed via web download.
2013 Subscription Members will automatically
receive two copies of this data on disc. 2013 Standard Members may
request a copy as part of their 16 free membership corpora. Non-members may request this data by submitting a signed copy of the LDC User Agreement for Non-members. This data is available at no cost.