New Publications:
__________________________________________________________
(1) Rhythm and Pitch contains approximately 27 minutes of spontaneous English
conversations and radio news stories annotated with the Rhythm and Pitch (RaP)
scheme. Speech data for annotation was taken from two corpora released by LDC:
CALLHOME American English Speech (LDC97S42) and Boston
University Radio Speech Corpus (LDC96S36).
The RaP system permits the capture of both intonational and
rhythmic aspects of speech. Four labeling tiers are used for annotating speech
prosody. These tiers carry information about the syllabic organization and
orthography of the speech, its rhythmic structure, tonal patterns, and other
information. More information about the RaP system is available on the RaP homepage.
Speech data are presented as FLAC-compressed 16-bit WAV
files. The Boston data are single-channel 16 kHz files, while the CALLHOME data are
either one- or two-channel 8 kHz files. Annotations are UTF-8 encoded Praat TextGrids.
Rhythm and Pitch is distributed via web download.
2018 Subscription Members will
receive copies of this corpus. 2018 Standard Members may request a copy as part
of their 16 free membership corpora. Non-members may license this data for a
fee.
*
(2) GALE Phase 4 Arabic Broadcast News Speech was developed by LDC and comprises
approximately 37 hours of Arabic broadcast news speech collected in 2008 and
2009 by LDC; MediaNet, Tunis, Tunisia; and MTC, Rabat, Morocco during Phase 4
of the DARPA GALE (Global Autonomous Language Exploitation) Program.
Corresponding transcripts are released as GALE Phase 4
Arabic Broadcast News Transcripts (LDC2018T14).
The recordings in this release feature news broadcasts
focusing principally on current events from the following sources: Abu Dhabi
TV, a television station based in Abu Dhabi, United Arab Emirates; Al Arabiya,
a news television station based in Dubai; Al Baghdadya, an Iraqi broadcast
programmer; Alhurra, a U.S. government-funded regional broadcaster; Al Iraqiyah,
an Iraqi television station; Aljazeera, a regional broadcaster located in Doha,
Qatar; Al Ordiniyah, a national broadcast station in Jordan; Kuwait TV, a
national broadcast station based in Kuwait; Radio Sawa, a U.S.
government-funded regional broadcaster; Saudi TV, a national television station
based in Saudi Arabia; Syria TV, the national television station in Syria; and
Yemen TV, a television station based in Yemen.
This release contains 51 audio files presented in
FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel
16-bit PCM. Each file was audited by a native Arabic speaker following Audit
Procedure Specification Version 2.0, which is included in this release.
GALE Phase 4 Arabic Broadcast News Speech is distributed via
web download.
2018 Subscription Members will receive copies of this
corpus. 2018 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for a fee.
*
(3) GALE Phase 4 Arabic Broadcast News Transcripts was developed by LDC and contains
transcriptions of approximately 37 hours of Arabic broadcast news speech
collected in 2008 and 2009 by LDC; MediaNet, Tunis, Tunisia; and MTC, Rabat, Morocco
during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program.
Corresponding audio data is released as GALE Phase 4 Arabic
Broadcast News Speech (LDC2018S05).
The transcript files are in plain-text, tab-delimited format
(TDF) with UTF-8 encoding, and the transcribed data totals 204,735 tokens. The
transcripts were created with the LDC tool XTrans, which supports manual
transcription and annotation of audio recordings.
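As a rough illustration of working with tab-delimited, UTF-8 transcript text, the Python sketch below counts whitespace-separated tokens in one column. The column layout, the header row, the ";;" comment prefix, and the sample rows are assumptions made for this example and are not taken from the release documentation; the function name count_tokens is hypothetical. The count it produces will not necessarily match the tokenization behind the figure quoted above.

```python
import csv
import io


def count_tokens(tdf_text, transcript_col, skip_header=True):
    """Count whitespace-separated tokens in one column of tab-delimited text.

    transcript_col is a zero-based column index chosen by the caller; the
    real column position in the released files is not assumed here.
    """
    reader = csv.reader(
        io.StringIO(tdf_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    total = 0
    for i, row in enumerate(reader):
        if skip_header and i == 0:
            continue  # assumed: first line holds field names
        if not row or row[0].startswith(";;"):
            continue  # assumed: blank lines and ';;' lines carry no transcript
        if len(row) <= transcript_col:
            continue  # malformed or short row
        total += len(row[transcript_col].split())
    return total


# Hypothetical sample with an invented four-column layout.
sample = (
    "file\tstart\tend\ttranscript\n"
    "demo.flac\t0.00\t1.50\tمرحبا بكم\n"
    "demo.flac\t1.50\t3.00\tفي نشرة الأخبار\n"
)

print(count_tokens(sample, transcript_col=3))
```

To apply this to a real file, read it with open(path, encoding="utf-8") and pass the contents to count_tokens, after checking the actual column order in the documentation shipped with the corpus.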
GALE Phase 4 Arabic Broadcast News Transcripts is
distributed via web download.
2018 Subscription Members will receive copies of this
corpus. 2018 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for a fee.