Only two weeks left to enjoy 2016 membership savings
Spring 2016 LDC Data Scholarship recipients
How to Share Data through LDC webinar on YouTube
New publications:
_______________________________________________________________________
Only two weeks left to enjoy 2016 membership savings
There’s still time to save on 2016 membership fees. Now
through March 1, all organizations receive a 5% discount when they join for
MY2016. MY2015 members are eligible for an additional 5% off the fee (10% total
savings) when they renew before March 1.
To join, create or sign in to your LDC user account, select your preferred
membership type from the Catalog, add the item to your bin and follow the
check-out process. The Membership Office will apply any discounts.
Alternatively, if you have already received a renewal invoice from LDC, you
can simply pay against that.
Spring 2016 LDC Data Scholarship recipients
Congratulations to the recipients of LDC's Spring 2016 data scholarships:
Shefali Waldekar: Indian Institute of Technology Kharagpur (India), PhD Candidate, Electronics and Electrical Communications Engineering. Shefali is awarded copies of 2002 Rich Transcription Broadcast News and Conversational Telephone Speech and 2005 Spring NIST Rich Transcription (RT05-S) Evaluation Set for her research in audio diarization.
Nikola Invanov Nikolov: University of Zurich and ETH
Zurich (Switzerland), MSc candidate in Informatics. Nikola is awarded a copy of
Annotated English Gigaword for his research in text summarization.
Om Prakash Singh: Indian Institute of Technology, Guwahati
(India), Research scholar in spoken language identification. Om is awarded a
copy of NIST Language Recognition Evaluation Test Set for his work in language
identification.
Moshen Mohammadi: Iranian Research Institute for
Electrical Engineering (Iran), PhD Candidate in Communications. Moshen is
awarded copies of the 2008 NIST Speaker Recognition Evaluation Training Sets 1
and 2, the Evaluation Test Set and the Supplemental Set for his work in speaker
recognition in noisy environments.
For program information visit the Data
Scholarship page.
How to Share Data through LDC webinar on YouTube
LDC’s first webinar, How to Share Data through LDC, is now
available for viewing on our YouTube
page. Presented live on January 22, 2016, the webinar outlined in easy steps
the process for submitting language resources to LDC for publication in the
Catalog. In addition, discussion topics included the benefits of sharing data
through LDC, the corpus life cycle, data delivery, quality control and more.
New Corpora
(1) BOLT Chinese Discussion Forums was developed by LDC and consists of
1,597,500 discussion forum threads in Chinese harvested from the Internet
using a combination of manual and automatic processes.
The DARPA BOLT (Broad Operational Language Translation) program developed
machine translation and information retrieval for less formal genres, focusing
particularly on user-generated content. LDC supported the BOLT program by
collecting informal data sources -- discussion forums, text messaging and chat
-- in Chinese, Egyptian Arabic and English. The collected data was translated
and annotated for various tasks including word alignment, treebanking,
propbanking and co-reference. The material in this release represents the
Chinese source data in the discussion forum genre.
Collection was seeded with the results of manual data scouting by native
speaker annotators. When multiple threads from a forum were submitted, the
entire forum was automatically harvested and added to the collection. The
scale of the collection precluded manual review of all data. Only a small
portion of the threads in this release was manually reviewed, so the release
may contain some offensive or otherwise undesired content, as well as some
threads with a large amount of non-Chinese content. Language identification
was performed on all threads in this corpus (using CLD2), and threads for
which the results indicated a high probability of largely non-Chinese content
are identified in this release.
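As a rough illustration of how a user might run a similar check on their own copy of the threads, the sketch below flags text with a low share of Chinese characters. It is not CLD2 and does not reproduce the labels shipped with the release; it is a simple character-ratio heuristic, and the function names, the 0.3 threshold and the sample threads are all assumptions made for this example.

```python
# Heuristic sketch only: this is NOT CLD2. It estimates how much of a thread
# is written in Han characters and flags threads that fall below a threshold.
# The threshold and the sample data are hypothetical.

def han_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the CJK Unified Ideographs block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    han = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return han / len(chars)


def flag_non_chinese(threads: dict, threshold: float = 0.3) -> list:
    """Return the sorted IDs of threads whose Han-character ratio is below threshold."""
    return sorted(tid for tid, text in threads.items() if han_ratio(text) < threshold)


# Hypothetical thread IDs and contents, for illustration only.
threads = {
    "t1": "今天天气很好，我们去公园吧。",
    "t2": "This thread is written almost entirely in English.",
    "t3": "我喜欢 Python 编程",
}

print(flag_non_chinese(threads))
```

A ratio like this cannot tell Chinese from other languages written with Han characters, so a real language identifier remains the better choice where one is available.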
BOLT Chinese Discussion
Forums is distributed via web download as a multi-part zip file. Consult
the Using LDC Data page (https://www.ldc.upenn.edu/data-management/using)
for more information about this format.
2016 Subscription Members will automatically receive two
copies of this corpus. 2016 Standard Members may request a copy as part
of their 16 free membership corpora. Non-members may license this data
for a fee.
*
(2) GALE Phase 3 Arabic Broadcast Conversation Speech Part 2 was developed by LDC
and comprises approximately 129 hours of Arabic broadcast conversation
speech collected in 2007 and 2008 by LDC, MediaNet (Tunis, Tunisia) and MTC
(Rabat, Morocco) during Phase 3 of the DARPA GALE (Global Autonomous Language
Exploitation) program. Corresponding transcripts are released as GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 2 (LDC2016T06).
These broadcast conversation recordings feature interviews, call-in programs and roundtable discussions focusing principally on current events. They are contained in 142 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Arabic speaker following Audit Procedure Specification Version 2.0, which is included in this release.
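For planning storage, the stated audio parameters give a quick estimate of the uncompressed size of the collection. The sketch below is back-of-the-envelope arithmetic only: it takes the approximate 129-hour figure at face value, and the function name is an assumption made for this example. FLAC is lossless, so the download itself will be smaller, by an amount not stated here.

```python
# Estimate the decoded (uncompressed PCM) size of the audio from the stated
# format: 16000 Hz, single channel, 16-bit samples, approximately 129 hours.

SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2      # 16-bit PCM
CHANNELS = 1              # single channel
APPROX_HOURS = 129        # approximate figure from the corpus description


def pcm_bytes(hours, sample_rate=SAMPLE_RATE_HZ,
              bytes_per_sample=BYTES_PER_SAMPLE, channels=CHANNELS):
    """Bytes of raw PCM audio for the given duration and format."""
    seconds = hours * 3600
    return int(seconds * sample_rate * bytes_per_sample * channels)


total = pcm_bytes(APPROX_HOURS)
print(f"{total:,} bytes, about {total / 1e9:.1f} GB uncompressed")
```

Under these assumptions the decoded audio comes to roughly 14.9 GB (decimal), which is worth allowing for if the files will be converted to WAV.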
GALE Phase 3 Arabic Broadcast
Conversation Speech Part 2 is distributed via web download.
2016 Subscription Members will automatically receive two
copies of this corpus. 2016 Standard Members may request a copy as part
of their 16 free membership corpora. Non-members may license this data
for a fee.
*
(3) GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 2 was developed by LDC
and contains transcriptions of approximately 129 hours of Arabic broadcast
conversation speech collected in 2007 and 2008 by LDC, MediaNet (Tunis, Tunisia)
and MTC (Rabat, Morocco) during Phase 3 of the DARPA GALE (Global Autonomous
Language Exploitation) program. Corresponding audio data is released as GALE Phase 3 Arabic Broadcast Conversation Speech Part 2 (LDC2016S01).
The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 845,791 tokens. The transcripts were created with the LDC tool, XTrans, which supports manual transcription and annotation of audio recordings.
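Because the transcripts are tab-delimited UTF-8 text, they can be read with standard tools. The sketch below parses a small in-memory sample and counts whitespace-separated tokens. The column names and sample rows are hypothetical and chosen only for illustration; the actual TDF column layout is defined in the release documentation, and whitespace tokenization will not necessarily match the token count reported above.

```python
import csv
import io

# Hypothetical tab-delimited sample. The real TDF header and columns are
# described in the corpus documentation and may differ from these names.
SAMPLE_TDF = (
    "file\tchannel\tstart\tend\tspeaker\ttranscript\n"
    "show_001\t0\t0.00\t2.35\tspk1\tمرحبا بكم في البرنامج\n"
    "show_001\t0\t2.35\t4.10\tspk2\tشكرا جزيلا\n"
)


def count_tokens(stream, transcript_field="transcript"):
    """Sum whitespace-separated tokens in the transcript column of a TDF stream."""
    # QUOTE_NONE keeps any quotation marks in the transcripts as literal text.
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return sum(len(row[transcript_field].split()) for row in reader)


print(count_tokens(io.StringIO(SAMPLE_TDF)))
```

For a file on disk, the same function can be passed a handle opened with `open(path, encoding="utf-8", newline="")`.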
The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and quick rich transcription specification (QRTR), both of which are included in the documentation with this release.
GALE Phase 3 Arabic Broadcast
Conversation Transcripts Part 2 is distributed via web download.
2016 Subscription Members will automatically receive two
copies of this corpus. 2016 Standard Members may request a copy as part
of their 16 free membership corpora. Non-members may license this data
for a fee.