Tuesday, November 15, 2016

LDC November 2016 Newsletter



In this newsletter:

Join LDC for Membership Year 2017
Commercial use and LDC data
Spring 2017 Data Scholarship Program
LDC closed November 24-25 for US Thanksgiving Holiday

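New publications:

JANA: A Human-Human Dialogues Corpus for Egyptian Dialect
Multi-Language Conversational Telephone Speech 2011 – Slavic Group
IARPA Babel Georgian Language Pack IARPA-babel404b-v1.0a
GALE Phase 3 and 4 Chinese Newswire Parallel Text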

Join LDC for Membership Year 2017

Organizations engaged in language-related research, education and technology development are invited to join LDC for Membership Year (MY) 2017. Consortium members enjoy unparalleled access and continuing rights to new data releases and to an archive of close to 700 holdings.

Membership fees have not increased for 2017. In addition, discounts are available for organizations who keep their membership current and for those who join before March 1, 2017.

• MY2016 members receive a 10% discount if they renew their membership before March 1, 2017. After March 1, MY2016 members receive a 5% discount if they renew their membership any time in 2017.
• New members and returning former members receive a 5% discount off the membership fee if they join/renew before March 1, 2017.

Plans for MY2017 publications are in progress. Among the expected releases are:

2010 NIST Speaker Recognition Evaluation data set
Multilanguage conversational telephone speech: developed to support language identification research in related languages
UCLA High Speed Laryngeal Database: audio recordings and high-speed videoendoscopic images of the vocal folds while sustaining vowels
Noisy TIMIT: TIMIT with added artificial noise
CHiME shared task data: noisy read WSJ speech
First Year Law Students’ Memoranda: memos to a hypothetical court with annotations
IARPA Babel Language Packs: languages include Vietnamese, Haitian Creole, Zulu, Kazakh and Lithuanian
BOLT: source, parallel and word-aligned data in all languages
RATS Keyword Spotting data set
GALE Phases 3 and 4: all tasks and languages   

Visit Join LDC for details on membership, user accounts and payment.

Commercial use and LDC data

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

Spring 2017 Data Scholarship Program

Applications are being accepted through January 15, 2017 for the Spring 2017 LDC Data Scholarship program, which provides university students with no-cost access to LDC data. Consult the LDC Data Scholarship page for further information about program rules and submission requirements.

LDC closed November 24-25 for US Thanksgiving Holiday

LDC will be closed on Thursday, November 24, 2016 and Friday, November 25, 2016 in observance of the US Thanksgiving Holiday. The office will reopen on Monday, November 28, 2016.

New Corpora

(1) JANA: A Human-Human Dialogues Corpus for Egyptian Dialect was developed by researchers at Cairo University. This is a special release in addition to the LDC scheduled corpora for membership year 2016, available under separate terms.

This corpus consists of 82 transcribed dialogues from call center inquiries annotated for dialogue acts. Data was collected from call centers for banks, airlines and mobile network providers in the form of spontaneous spoken telephone dialogues (52) and instant messaging dialogues (30) amounting to over 20,000 words.

Not-for-profit organizations may license this data set for a fee under the LDC Not-for-Profit Membership Agreement or under the LDC User Agreement for Non-Members for use in linguistic research, education and non-commercial technology development. For-profit organizations may license this data for a fee under a commercial license.

(2) Multi-Language Conversational Telephone Speech 2011 – Slavic Group was developed by LDC and comprises approximately 60 hours of telephone speech in Polish, Russian and Ukrainian. The data was collected to support research and technology evaluation in automatic language identification, specifically language pair discrimination for closely related languages/dialects.

Calls were made using LDC’s telephone collection infrastructure. Human auditors labeled calls for gender, dialect type and noise. Audio data is presented in FLAC-compressed MS-WAV (RIFF) file format. Each uncompressed file is two channels, recorded at 8000 samples/second with samples stored as 16-bit signed integers.
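For readers who want to confirm that a decoded file matches the format described above, the following is a minimal sketch using only Python's standard `wave` module. It assumes the FLAC file has already been decompressed to WAV (for example with the `flac` command-line decoder). The function names and the silent stand-in file are hypothetical and are not part of the corpus or its documentation.

```python
import io
import wave

# Parameters stated in the corpus description for each uncompressed file:
# two channels, 8000 samples/second, 16-bit (2-byte) samples.
EXPECTED = {"channels": 2, "sample_rate": 8000, "sample_width_bytes": 2}


def describe_wav(source):
    """Return basic audio parameters of a WAV file path or file-like object."""
    with wave.open(source, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_rate": wav.getframerate(),
            "sample_width_bytes": wav.getsampwidth(),
            "duration_seconds": wav.getnframes() / wav.getframerate(),
        }


def matches_expected(info):
    """True if channel count, sample rate and sample width match EXPECTED."""
    return all(info[key] == value for key, value in EXPECTED.items())


def make_silent_wav(seconds=1):
    """Build an in-memory stand-in file (silence), for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(2)
        wav.setsampwidth(2)
        wav.setframerate(8000)
        # bytes = channels * bytes per sample * samples per second * seconds
        wav.writeframes(b"\x00" * (2 * 2 * 8000 * seconds))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    info = describe_wav(make_silent_wav())
    print(info, matches_expected(info))
```

To check a real decoded file, pass its path to `describe_wav` in place of the in-memory stand-in.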

Multi-Language Conversational Telephone Speech 2011 – Slavic Group is distributed via web download.

2016 Subscription Members will receive copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) IARPA Babel Georgian Language Pack IARPA-babel404b-v1.0a was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 190 hours of Georgian conversational and scripted telephone speech collected in 2014-2015 along with corresponding transcripts.

The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

The Georgian speech in this release represents that spoken in the Eastern and Western dialect regions in Georgia. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 73 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

Transcripts are encoded in UTF-8.

IARPA Babel Georgian Language Pack IARPA-babel404b-v1.0a is distributed via web download.

2016 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(4) GALE Phase 3 and 4 Chinese Newswire Parallel Text was developed by LDC and contains Chinese source text and corresponding English translations selected from newswire data collected by LDC in 2007-2008 and translated by LDC or under its direction.

This release includes 367 source-translation document pairs drawn from five distinct newswire sources, comprising 210,048 tokens of Chinese source text and its English translation. Source data and translations are distributed in TDF format. All data is encoded in UTF-8.

GALE Phase 3 and 4 Chinese Newswire Parallel Text is distributed via web download.

2016 Subscription Members will receive copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Wednesday, October 19, 2016

LDC October 2016 Newsletter



In this newsletter:

Fall 2016 LDC Data Scholarship recipients

Chilin HK and LDC partner on distribution of parallel patent data

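New publications:

IARPA Babel Turkish Language Pack IARPA-babel105b-v0.5
KAFD: Arabic Font Database
Richer Event Description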

Fall 2016 LDC Data Scholarship recipients

Congratulations to the recipients of LDC's Fall 2016 data scholarships:

Tiba Zaki Abdulhameed: Western Michigan University (USA); PhD Candidate, Computer Science. Tiba is awarded copies of GALE Phase 2 Arabic Broadcast Conversation Speech and Transcripts for her research in dialectal ASR.

Abhishek Abhishek: Indian Institute of Technology Guwahati (India); PhD Candidate, Computer Science and Engineering. Abhishek is awarded copies of ACE 2004 Multilingual Training Corpus and The New York Times Annotated Corpus for his research in coreference resolution and relation extraction.

Sara Ebrahim: Ain Shams University (Egypt); MSc, Computer Science. Sara is awarded copies of LDC Standard Arabic Morphological Analyzer and NIST OpenMT 2008 Evaluation Selected References and System Translations for her work in machine translation.

Katherine Metcalf: Indiana University (USA); PhD Candidate, Computer Science. Katherine is awarded a copy of Emotional Prosody Speech and Transcripts for her research in acoustic/prosodic approaches to classifying emotional states.

Mousmita Sarma: Gauhati University (India); Post-Masters Research, Electronics and Communications Technology. Mousmita is awarded copies of Switchboard 1-Release 2 and IARPA Babel Assamese Language Pack for her research in Assamese dialect identification.

For program information visit the Data Scholarship page.


Chilin HK and LDC partner on distribution of parallel patent data

Chilin HK Limited (Chilin) and LDC are pleased to announce that the parallel data source developed by Chilin, A Corpus of Chinese-English Parallel Sentences Extracted from Patents, is now available through the LDC Catalog. This is a special release in addition to the LDC scheduled corpora for membership year 2016, available under separate terms.

The Chilin Corpus derives primarily from the training corpus and test sets developed specifically for the Tokyo-based NTCIR 2009 & 2010 competitions on Patent MT (machine translation), which drew more than 30 international teams.


The training corpus is drawn from a much larger curated corpus of parallel Chinese-English sentences and sentence fragments which have been winnowed from an even larger corpus of more than 300k parallel Chinese-English patents in different fields, initially at the Research Centre on Language Information Sciences, City University of Hong Kong (authors:  Benjamin Tsou, Bin Lu, and Kapo Chow). This data set is available from LDC under the following reference:


Not-for-profit organizations may license this data set for US$25.00 under the LDC Not-for-Profit Membership Agreement or under the LDC User Agreement for Non-Members for use in linguistic research, education and non-commercial technology development. For-profit organizations may license this data for US$5000, discounted to US$4000 for LDC for-profit members, under a commercial license.

New Corpora

(1) IARPA Babel Turkish Language Pack IARPA-babel105b-v0.5 was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 213 hours of Turkish conversational and scripted telephone speech collected in 2012 along with corresponding transcripts.

The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

The Turkish speech in this release represents that spoken in seven dialect regions in Turkey. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 70 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

Transcripts are encoded in UTF-8.

IARPA Babel Turkish Language Pack IARPA-babel105b-v0.5 is distributed via web download.

2016 Subscription Members will receive two copies of this corpus provided they have submitted a completed copy of the special license agreement. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) KAFD: Arabic Font Database was developed by King Fahd University of Petroleum & Minerals and Qassim University. It comprises approximately 2.5 million scanned Arabic printed pages in a variety of fonts, sizes and resolutions along with corresponding transcripts. KAFD was designed for research in Arabic text recognition.

The scanned Arabic texts were collected from publications covering various subjects such as religion, medicine, science and history. Texts were printed in 40 different fonts, 10 sizes and four styles. Scans were made at 100, 200, 300 and 600 dpi (dots per inch).

The database is available in two formats: at the page level and at the line level. Images are presented as TIFF images and transcripts are in plain text format. Individual font folders are compressed into RAR archives.

The data is divided into training, validation and test sets.

2016 Subscription Members will automatically receive two copies of this corpus.  2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

(3) Richer Event Description was developed by the University of Colorado Boulder-CLEAR (Computational Language and Education Research), Carnegie Mellon University and LDC. It consists of coreference, bridging and event-event relations (temporal, causal, subevent and reporting relations) annotations over 95 English newswire, discussion forum and narrative text documents, covering all events, times and non-eventive entities within each document.

RED annotation is intended to join different annotation layers and to provide a rich representation of event phenomena.

Documents were annotated twice: in a markable pass and in an event annotation phase. Annotation and source documents are divided into three partitions: (1) 20 newswire summarization documents, (2) 20 discussion forum documents and newswire annotations used in the original RED pilot annotations, and (3) 55 documents annotated by a range of DEFT (Deep Exploration and Filtering of Text) annotation formats. Data is presented as UTF-8 encoded XML and plain text.

2016 Subscription Members will automatically receive two copies of this corpus.  2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.