LDC
Membership Discounts for MY 2015 Still Available
If you are considering joining LDC for
Membership Year 2015 (MY2015), there is still time to save on
membership fees. Any organization which joins or renews
membership for 2015 through Monday, March 2, 2015, is entitled to
a 5% discount on membership fees. Organizations which held
membership for MY2014 can receive a 10% discount on fees provided
they renew prior to March 2, 2015. For further information on
planned publications for MY2015, please visit
or contact LDC.
New publications
GALE Phase 2 Arabic Broadcast News Speech Part 2 was developed by LDC and comprises approximately 170 hours of Arabic broadcast news speech collected in 2007 by LDC, MediaNet (Tunis, Tunisia) and MTC (Rabat, Morocco) during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. Corresponding transcripts are released as GALE Phase 2 Arabic Broadcast News Transcripts Part 2 (LDC2015T01).
Broadcast audio for the GALE program was
collected at LDC’s Philadelphia, PA USA facilities and at three
remote collection sites: Hong Kong University of Science and
Technology, Hong Kong (Chinese); MediaNet, Tunis, Tunisia
(Arabic); and MTC, Rabat, Morocco (Arabic). The combined local
and outsourced broadcast collection supported GALE at a rate of
approximately 300 hours per week of programming from more than 50
broadcast sources for a total of over 30,000 hours of collected
broadcast audio over the life of the program.
The broadcast recordings in this release
feature news programs focusing principally on current events from
the following sources: Abu Dhabi TV, a television station based in
Abu Dhabi, United Arab Emirates; Al Alam News Channel, based in
Iran; Aljazeera, a regional broadcaster located in Doha, Qatar;
Al Ordiniyah, a national broadcast station in Jordan; Dubai TV,
based in Dubai, United Arab Emirates; Al Iraqiyah, a television
network based in Iraq; Kuwait TV, a national television station
based in Kuwait; Lebanese Broadcasting Corporation, a Lebanese
television station; Nile TV, a broadcast programmer based in
Egypt; Saudi TV, a national television station based in Saudi
Arabia; and Syria TV, the national television station in Syria.
This release contains 204 audio files presented in FLAC-compressed
Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit
PCM. Each file was audited by a native Arabic speaker following
Audit Procedure Specification Version 2.0, which is included in
this release. The broadcast auditing process served three
principal goals: as a check on the operation of the broadcast
collection system equipment by identifying failed, incomplete or
faulty recordings; as an indicator of broadcast schedule changes
by identifying instances when the incorrect program was recorded;
and as a guide for data selection by retaining information about a
program’s genre, data type and topic.
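As a rough illustration of what the stated audio format implies, the sketch below estimates the uncompressed size of 16000 Hz, single-channel, 16-bit PCM audio. The 170-hour figure is the approximate total quoted above; the function name is ours, and the FLAC-compressed size of the actual release is not stated here and will be smaller.

```python
# Back-of-envelope size of uncompressed 16 kHz, mono, 16-bit PCM audio.
# Constants come from the format described in the release notes.
SAMPLE_RATE_HZ = 16_000
CHANNELS = 1
BYTES_PER_SAMPLE = 2  # 16 bits per sample


def pcm_bytes(hours: float) -> int:
    """Return the number of raw PCM bytes needed for `hours` of audio."""
    seconds = hours * 3600
    return int(seconds * SAMPLE_RATE_HZ * CHANNELS * BYTES_PER_SAMPLE)


if __name__ == "__main__":
    approx_hours = 170  # "approximately 170 hours" per the description
    print(f"{pcm_bytes(approx_hours) / 1e9:.1f} GB uncompressed")
```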
GALE Phase 2 Arabic Broadcast News Speech Part
2 is distributed on three DVD-ROMs.
2015 Subscription Members will automatically
receive two copies of this corpus. 2015 Standard Members may
request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.
*
GALE Phase 2
Arabic Broadcast News Transcripts Part 2 was developed by
LDC and contains transcriptions of approximately 170 hours of
Arabic broadcast news speech collected in 2007 by LDC, MediaNet,
Tunis, Tunisia and MTC, Rabat, Morocco during Phase 2 of the DARPA
GALE (Global Autonomous Language Exploitation) program.
Corresponding audio data is released as GALE Phase 2 Arabic
Broadcast News Speech Part 2 (LDC2015S01).
The transcript files are in plain-text,
tab-delimited format (TDF) with UTF-8 encoding, and the
transcribed data totals 920,730 tokens. The transcripts were
created with the LDC-developed transcription tool, XTrans,
a multi-platform, multilingual, multi-channel transcription tool
that supports manual transcription and annotation of audio
recordings.
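For readers who want to inspect tab-delimited, UTF-8 transcripts like these, here is a minimal Python sketch that counts whitespace-separated tokens in one column. The column layout in the sample (file, start, end, speaker, text) and the column index are assumptions made for illustration only; the actual TDF field order is defined in the documentation shipped with the corpus, and LDC's own token count may follow different tokenization rules.

```python
import csv
import io


def count_tokens(tdf_text: str, text_column: int) -> int:
    """Count whitespace-separated tokens in one column of tab-delimited text."""
    reader = csv.reader(
        io.StringIO(tdf_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    total = 0
    for row in reader:
        # Skip blank lines and rows too short to hold the text column.
        if len(row) <= text_column:
            continue
        total += len(row[text_column].split())
    return total


# Hypothetical two-row sample; NOT taken from the corpus.
SAMPLE = (
    "file1\t0.00\t1.50\tspeaker1\tمساء الخير\n"
    "file1\t1.50\t4.00\tspeaker2\tأهلا بكم في النشرة\n"
)

if __name__ == "__main__":
    print(count_tokens(SAMPLE, text_column=4))
```

To run this on a real file, read it with `open(path, encoding="utf-8", newline="")` and pass the contents to `count_tokens` with the column index given in the corpus documentation.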
The files in this corpus were transcribed by
LDC staff and/or by transcription vendors under contract to LDC.
Transcribers followed LDC's quick transcription guidelines (QTR)
and quick rich transcription specification (QRTR), both of which
are included in the documentation with this release. QTR
transcription consists of quick (near-)verbatim, time-aligned
transcripts plus speaker identification with minimal additional
mark-up. It does not include sentence unit annotation. QRTR
annotation adds structural information such as topic boundaries
and manual sentence unit annotation to the core components of a
quick transcript. Files with QTR in the filename were
developed using QTR transcription. Files with QRTR in the filename
were developed using QRTR transcription.
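The filename convention above can be applied mechanically. The following is a minimal sketch, assuming only that the marker appears somewhere in the filename in either upper or lower case; the example filenames are hypothetical and are not taken from the release.

```python
def transcription_style(filename: str) -> str:
    """Classify a transcript file as 'QTR', 'QRTR', or 'unknown' by its name."""
    upper = filename.upper()
    # Neither marker is a substring of the other, so the order of the
    # checks does not affect the result for names with a single marker.
    if "QRTR" in upper:
        return "QRTR"
    if "QTR" in upper:
        return "QTR"
    return "unknown"


if __name__ == "__main__":
    # Hypothetical filenames, for illustration only.
    for name in ("program_a.qtr.tdf", "program_b.qrtr.tdf", "notes.txt"):
        print(name, "->", transcription_style(name))
```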
GALE Phase 2 Arabic Broadcast News Transcripts
Part 2 is distributed via web download.
2015 Subscription Members will automatically
receive two copies of this corpus. 2015 Standard Members may
request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.
*
SenSem (Sentence
Semantics) Databank was developed by GRIAL, the Linguistic
Applications Inter-University Research Group that includes the
following Spanish institutions: the Universitat Autonoma de
Barcelona, the Universitat de Barcelona, the Universitat de Lleida
and the Universitat Oberta de Catalunya. It contains syntactic and semantic
annotation for over 35,000 sentences, approximately one million
words of Spanish and approximately 700,000 words of Catalan
translated from the Spanish. GRIAL's work focuses on resources for
applied linguistics, including lexicography, translation and
natural language processing.
Each sentence in SenSem Databank was labeled
according to the verb sense it exemplifies, the type of complement
it takes (arguments or adjuncts) and the syntactic category and
function. Each argument was also labeled with a semantic role.
Further information about the SenSem project can be obtained from
the GRIAL website.
The Spanish source data includes texts from
news journals (30,000 sentences) and novels (5,299 sentences).
Those sentences represent around 1,000 different verb meanings
that correspond to the 250 most frequent Spanish verbs. Verb
frequencies were retrieved from a quantitative analysis of around
13 million words.
The Catalan corpus was developed by translating
the news journal portion of the Spanish data set, resulting in a
resource of over 700,000 words from which 391,267 sentences
were annotated. Sentences were automatically translated and
manually post-edited; some were re-annotated for sentence
complements. Semantic information was the same for both languages.
The Catalan sentences represent close to 1,300 different verbs.
SenSem Databank is distributed via web
download.
2015 Subscription Members will automatically
receive two copies of this corpus on disc. 2015 Standard Members
may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee. This data is made
available to LDC not-for-profit members and all non-members under
the Creative Commons Attribution-Noncommercial Share Alike 3.0 license
and to LDC for-profit members under the terms of the For-Profit
Membership Agreement.