2013 Data Pack available through September 15
LDC supports NSF data management plans
New publications
________________________________________________________________________
2013 Data Pack available through September 15
Not-for-profit and government organizations can now create a custom data
collection from among LDC’s 2013 releases. The 2013 Data Pack allows users
to license eight corpora published in 2013 for a flat rate of US$3500.
Selection options include Greybeard, NIST 2012 Open Machine Translation
(OpenMT) evaluation and progress sets, Chinese Treebank 8.0, GALE Arabic and
Chinese speech and text releases, 1993-2007 United Nations Parallel Text,
MADCAT training data, CSC Deceptive Speech and more. Organizations acquire perpetual rights to the
corpora licensed through the pack. The Data Pack is not a membership, and organizations
must request all eight data sets at the time of purchase. The 2013 Data Pack is
available to not-for-profit and government organizations for a limited time only, through
September 15.
To license the Data Pack and select eight corpora, log in or register for an
LDC user account, then add the 2013 Data Pack and each of the eight data sets
to your bin. Follow the check-out procedure,
sign all applicable user agreements and select payment via wire transfer,
purchase order or check. LDC will adjust the invoice total to reflect the data
pack fee.
To pay via credit card, add the 2013 Data Pack to your bin
and check out using the system prompts. At the completion of the transaction,
send an email to ldc@ldc.upenn.edu
indicating the eight data sets to include in your order.
As always, users can contact ldc@ldc.upenn.edu for help with the
transaction.
LDC supports NSF data management plans
This month’s publication of The Subglottal Resonances
Database is the latest in a series of releases of data developed with National
Science Foundation (NSF) funding. Long before researchers were required to
develop data management plans, they deposited their research data at LDC in
accordance with NSF’s longstanding desire that data generated with program
funds should be readily accessible at a reasonable cost. Well-known data sets in the series include
The Santa Barbara Corpus of Spoken American English (multiple parts), Propbank
and Grassfields Bantu Fieldwork.
NSF now requires researchers to deposit funded data in an
accessible, trustworthy archive. LDC’s expertise in data curation, distribution
and management and its commitment to the broad accessibility of linguistic data
make it the repository of choice for NSF-funded data. To learn more about how
LDC can assist in developing and implementing data management plans, see the
Data Management Plans section of our website or contact LDC Data Management Plans.
The Subglottal Resonances Database was developed with the
support of NSF Grant No. 0905250. It is available to LDC members at no cost;
non-members may license the data set for a fee of $30 plus shipping.
New publications
(1) GALE Phase 3 and 4 Arabic Broadcast News Parallel Text includes 86
source-translation document pairs, comprising 325,538 words of Arabic source
text and its English translation. Data is drawn from 28 distinct Arabic
programs broadcast between 2007 and 2008 from Abu Dhabi TV, Al Alam News
Channel, Al Arabiya, Al Baghdadya, Alhurra, Al Iraqiyah, Aljazeera, Al
Ordiniyah, Al Sharqiya, Dubai TV, Kuwait TV, Lebanese Broadcasting Corporation,
Oman TV, Radio Sawa, Saudi TV, and Syria TV. Broadcast news programming
consists of news programs focusing principally on current events.
The files in this release were transcribed by LDC staff
and/or transcription vendors under contract to LDC in accordance with the Quick
Rich Transcription guidelines developed by LDC. Transcribers indicated sentence
boundaries in addition to transcribing the text. Data was manually selected for
translation according to several criteria, including linguistic features,
transcription features and topic features. The transcribed and segmented files
were then reformatted into a human-readable translation format and assigned to
translation vendors. Translators followed LDC's Arabic-to-English translation
guidelines. Bilingual LDC staff performed quality control procedures on the completed
translations.
Source data and translations are distributed in TDF format.
TDF files are tab-delimited files containing one segment of text along with
meta information about that segment. Each field in the TDF file is described in
TDF_format.txt. All data are encoded in UTF-8.
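As a rough illustration of working with tab-delimited data of this kind, the sketch below splits each non-empty line into its fields. It is not LDC's own tooling. The function name read_tab_delimited and the sample field values are hypothetical, and the sketch assumes one record per line. The actual field names and their order are defined in TDF_format.txt and are deliberately not guessed here.

```python
import io


def read_tab_delimited(stream):
    """Yield one list of field strings per non-empty line of a text stream.

    Assumption: one record per line, fields separated by a single tab.
    The meaning of each field must be taken from TDF_format.txt.
    """
    for line in stream:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        yield line.split("\t")


# Hypothetical in-memory sample; the values are placeholders, not real TDF fields.
sample = io.StringIO("field_a\tfield_b\tfield_c\n\nx\ty\tz\n")
rows = list(read_tab_delimited(sample))
```

For a file on disk, the same function could be given a handle opened with open(path, encoding="utf-8"), since all data in the release are encoded in UTF-8.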
GALE Phase 3 and 4 Arabic Broadcast News Parallel Text is
distributed via web download. 2015 Subscription Members will
automatically receive two copies of this corpus on disc. 2015 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.
*
(2) Mandarin Chinese Phonetic Segmentation and Tone was developed by LDC and contains
7,849 Mandarin Chinese "utterances" and their phonetic segmentation
and tone labels separated into training and test sets. The utterances were
derived from 1997 Mandarin Broadcast News Speech and Transcripts (HUB4-NE) (LDC98S73 and LDC98T24, respectively). That
collection consists of approximately 30 hours of Chinese broadcast news
recordings from Voice of America, China Central TV and KAZN-AM, a commercial
radio station based in Los Angeles, CA. The ability to use large speech corpora
for research in phonetics, sociolinguistics and psychology, among other fields,
depends on the availability of phonetic segmentation and transcriptions. This
corpus was developed to investigate the use of phone boundary models in forced
alignment of Mandarin Chinese. Using embedded tone modeling (an approach also
used to incorporate tones in automatic speech recognition), the forced
alignment performance of tone-dependent and tone-independent models was
compared.
Utterances were defined as the time-stamped between-pause
units in the transcribed news recordings. Those containing background noise or
music, or produced by unidentified or accented speakers, were excluded. A test set was
developed with 300 utterances randomly selected from six speakers (50
utterances for each speaker). The remaining 7,549 utterances formed a training
set.
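The split described above can be sketched as follows. This is a minimal illustration and not the procedure LDC actually ran. The function split_test_train, the speaker labels, the utterance IDs and the per-speaker utterance counts are all invented for the example. Only the totals (7,849 utterances, 50 test utterances for each of six speakers) come from the description.

```python
import random


def split_test_train(utterances_by_speaker, test_speakers, per_speaker=50, seed=0):
    """Randomly pick `per_speaker` utterances from each test speaker.

    Everything not picked, from any speaker, goes to the training set.
    """
    rng = random.Random(seed)
    test = []
    for speaker in test_speakers:
        test.extend(rng.sample(utterances_by_speaker[speaker], per_speaker))
    chosen = set(test)
    train = [
        utt
        for utts in utterances_by_speaker.values()
        for utt in utts
        if utt not in chosen
    ]
    return test, train


# Hypothetical corpus layout: six speakers with 1,000 utterances each plus
# 1,849 utterances from other speakers, chosen only so the total is 7,849.
corpus = {f"spk{i}": [f"spk{i}_utt{j:04d}" for j in range(1000)] for i in range(1, 7)}
corpus["other"] = [f"other_utt{j:04d}" for j in range(1849)]

test_set, train_set = split_test_train(corpus, [f"spk{i}" for i in range(1, 7)])
```

With these invented inputs the test set holds 300 utterances and the training set the remaining 7,549, matching the counts reported for the corpus.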
The utterances in the test set were manually labeled and
segmented into initials and finals in Pinyin, a Roman alphabet system for
transcribing Chinese characters. Tones were marked on the finals, including
Tone1 through Tone4, and Tone0 for the neutral tone. The Sandhi Tone3 was
labeled as Tone2. The training set was automatically segmented and transcribed
using the LDC forced aligner, which is a Hidden Markov Model (HMM) aligner
trained on the same utterances (Yuan et al. 2014). The aligner achieved 93.1%
agreement (of phone boundaries) within 20 ms on the test set compared to manual
segmentation. The quality of the phonetic transcription and tone labels of the
training set was evaluated by checking 100 utterances randomly selected from
it. The 100 utterances contained 1,252 syllables: 15 syllables had mistaken
tone transcriptions; two syllables showed mistaken transcriptions of the final,
and there were no syllables with transcription errors on the initial.
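To make the agreement figure concrete, here is a minimal sketch of how boundary agreement within a tolerance could be computed. It is an illustration under stated assumptions, not the evaluation code used for the corpus. It assumes manual and automatic boundaries are already paired one to one, that times are in whole milliseconds, and that a difference of exactly 20 ms counts as agreement. The function names and the sample boundary times are hypothetical. The spot-check counts (15 and 2 out of 1,252 syllables) are taken from the description above.

```python
def boundary_agreement(manual_ms, automatic_ms, tolerance_ms=20):
    """Fraction of paired boundaries whose times differ by at most tolerance_ms."""
    if len(manual_ms) != len(automatic_ms):
        raise ValueError("boundary lists must have the same length")
    if not manual_ms:
        raise ValueError("at least one boundary is required")
    hits = sum(
        1 for m, a in zip(manual_ms, automatic_ms) if abs(m - a) <= tolerance_ms
    )
    return hits / len(manual_ms)


def error_rate_percent(errors, total):
    """Errors as a percentage of the items checked."""
    return 100.0 * errors / total


# Hypothetical boundary times in milliseconds (differences: 5, 25, 5, 20).
manual = [100, 250, 400, 610]
automatic = [105, 275, 395, 630]
agreement = boundary_agreement(manual, automatic)

# Spot-check figures reported for the training set.
tone_error = error_rate_percent(15, 1252)
final_error = error_rate_percent(2, 1252)
```

On the reported spot check, the tone error rate works out to roughly 1.2% of syllables and the error rate on finals to roughly 0.16%.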
Each utterance has three associated files: a FLAC-compressed
WAV file, a word transcript file, and a phonetic boundaries and label file.
Mandarin Chinese Phonetic Segmentation and Tone is
distributed via web download.
2015 Subscription Members will automatically receive two copies of this corpus
on disc, provided that they have submitted a completed copy of the user license
agreement. 2015 Standard Members may request a copy as part of their 16 free
membership corpora. As a members-only release, Mandarin Chinese Phonetic
Segmentation and Tone is not available for non-member licensing.
*
(3) The Subglottal Resonances Database was developed by Washington University
and the University of California, Los Angeles and
consists of 45 hours of simultaneous microphone and subglottal accelerometer
recordings of 25 adult male and 25 adult female speakers of American English
between 22 and 25 years of age.
The subglottal system is composed of the airways of the
tracheobronchial tree and the surrounding tissues. It powers airflow through
the larynx and vocal tract, allowing for the generation of most of the sound
sources used in languages around the world. The subglottal resonances (SGRs)
are the natural frequencies of the subglottal system. During speech, the
subglottal system is acoustically coupled to the vocal tract via the larynx.
SGRs can be measured from accelerometer recordings of the vibration of the skin
of the neck during phonation, much as speech formants are measured
from microphone recordings. SGRs have received attention in studies of
speech production, perception and technology. They affect voice production,
divide vowels and consonants into discrete categories, affect vowel perception
and can be useful in automatic speech recognition.
Speakers were recruited by Washington University's
Psychology Department. The majority of the participants were Washington
University students who represented a wide range of American English dialects,
although most were speakers of the mid-American English dialect. The corpus
consists of 35 monosyllables in a phonetically neutral carrier phrase (“I said
a ____ again”), with 10 repetitions of each word by each speaker, resulting in
17,500 individual microphone (and accelerometer) waveforms. The monosyllables
consisted of 14 hVd words and 21 CVb words, where C was b, d or g and V
included all American English monophthongs and diphthongs. The target vowel in each utterance
was hand-labeled to indicate the start, stop, and steady-state parts of the
vowel. For diphthongs, the steady state refers to the diphthong nucleus, which
occurs early in the vowel.
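The waveform count follows directly from the design figures given above, as this short check shows. The variable names are illustrative only, and every number comes from the corpus description.

```python
# Design figures from the corpus description.
HVD_WORDS = 14      # hVd monosyllables
CVB_WORDS = 21      # CVb monosyllables
REPETITIONS = 10    # repetitions of each word by each speaker
SPEAKERS = 25 + 25  # adult male plus adult female speakers

monosyllables = HVD_WORDS + CVB_WORDS
tokens_per_speaker = monosyllables * REPETITIONS
waveforms_per_channel = tokens_per_speaker * SPEAKERS
```

This gives 35 monosyllables, 350 tokens per speaker and 17,500 waveforms for each of the microphone and accelerometer channels.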
Audio files are presented as single-channel, 16-bit FLAC-compressed
WAV files with sample rates of 48 kHz or 16 kHz. Image files are bitmap image
files and plain text is encoded in UTF-8.
The Subglottal Resonances Database is distributed on one USB
drive.
2015 Subscription Members will automatically receive a copy
of this corpus. 2015 Standard Members may request a copy as part of their 16
free membership corpora. Non-members may license this data for a fee.