Wednesday, February 15, 2017

LDC February 2017 Newsletter

LDC Director Mark Liberman receives the IEEE James L. Flanagan Speech and Audio Processing Award
Only two weeks left to enjoy 2017 membership discounts
Spring 2017 LDC Data Scholarship recipients
New publications:
______________________________________________________________

LDC Director Mark Liberman receives the IEEE James L. Flanagan Speech and Audio Processing Award

LDC Director Mark Liberman is the 2017 recipient of the IEEE James L. Flanagan Speech and Audio Processing Award. Established in 2002, this annual award recognizes an individual for his or her outstanding contribution to the advancement of speech and/or audio processing. Liberman’s pioneering contributions and continued leadership in robust, replicable, and data-driven speech and language science and engineering have fueled the development and advancement of human language technologies including speech and speaker recognition, machine translation, and semantic analysis. As LDC’s founder, Mark has shepherded the Consortium from a small organization to the largest developer of shared language resources, distributing more than 120,000 copies of over 2,000 databases covering 91 different languages to more than 3,600 organizations in over 70 countries. 

Liberman will receive the award at ICASSP 2017 in New Orleans (March 5-9). LDC will be an exhibitor at Booth 43. Please stop by and say hello. We hope to see you there.   

Only two weeks left to enjoy 2017 membership discounts
There is still time to save on 2017 membership fees. Through March 1, all organizations receive a discount on the 2017 membership fee (up to 10%) when they choose to join or renew.  

For more information on membership benefits, visit Join LDC.

Spring 2017 LDC Data Scholarship recipients

Congratulations to the recipients of LDC's Spring 2017 data scholarship:

Umad Ul Hassan and Muhammad Awais Zulfiqar: National University of Sciences and Technology (Pakistan); BS Computer Science. Hassan and Zulfiqar are awarded copies of CSLU: Kids’ Speech Version 1.1 and The CMU Kids Corpus for their research in speech recognition for children with learning difficulties.

For information about the program, visit the Data Scholarship page.

New publications:
(1) First-Year Law Students' Court Memoranda consists of 197 English-language legal briefs written by first-year law students and annotated for certain characteristics, along with accompanying survey responses from the student writers.

The briefs were written in writing classes at two law schools in the US Midwest during the 2011-12 academic year. Students who agreed to participate in this study uploaded their briefs to an online survey instrument and answered questions regarding their age, gender, level of education, most recent writing course and method of learning English. The study's purpose was to apply natural language processing approaches to determine any differences in the briefs' language attributable to the students' self-reported genders.

The samples were imported into the General Architecture for Text Engineering (GATE) and annotated by two human coders who identified large text segments specific to the legal genre in which the students wrote, such as text headings, citations, block quotes and footnotes.

Writing samples are presented as MS Word documents, and annotations and survey responses are presented in XML format. The data has been anonymized to remove names and other identifying information about the student participants.

First-Year Law Students' Court Memoranda is distributed via web download.

2017 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*
(2) IARPA Babel Haitian Creole Language Pack IARPA-babel201b-v0.2b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 203 hours of Haitian Creole conversational and scripted telephone speech collected in 2012 and 2013 along with corresponding transcripts.

The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

The Haitian Creole speech in this release represents that spoken in the Northern, Western and Southern dialect regions in Haiti. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 75 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

Transcripts are encoded in UTF-8.

IARPA Babel Haitian Creole Language Pack IARPA-babel201b-v0.2b is distributed via web download.

2017 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*
(3) GALE Phase 3 Arabic Broadcast News Speech Part 2 was developed by LDC and comprises approximately 128 hours of Arabic broadcast news speech collected in 2007 by LDC, MediaNet, Tunis, Tunisia and MTC, Rabat, Morocco during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) program.

Corresponding transcripts are released as GALE Phase 3 Arabic Broadcast News Transcripts Part 2 (LDC2017T04).

The recordings in this corpus feature news broadcasts focusing principally on current events from various broadcast programmers including Abu Dhabi TV, Al Alam News Channel, Al Arabiya, Al Iraqiyah, Aljazeera, Al Ordiniyah, Dubai TV, Kuwait TV, Lebanese Broadcast Corporation, Nile TV, Saudi TV and Syria TV.

This release contains 175 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Arabic speaker.
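
As a rough guide to storage needs, the audio parameters above imply the following uncompressed size. This is only a back-of-the-envelope sketch: the 128-hour figure is approximate, the helper name `pcm_bytes` is made up for illustration, file headers are ignored, and the FLAC-compressed download will be smaller by an amount that depends on the audio content.

```python
def pcm_bytes(duration_seconds, sample_rate_hz, channels, bits_per_sample):
    """Size in bytes of raw (uncompressed) PCM audio with the given parameters."""
    bytes_per_sample = bits_per_sample // 8
    return int(duration_seconds * sample_rate_hz * channels * bytes_per_sample)

# Figures taken from the corpus description: about 128 hours,
# 16000 Hz sampling rate, single channel, 16-bit samples.
total = pcm_bytes(128 * 3600, 16000, 1, 16)
print(f"{total / 1e9:.1f} GB before FLAC compression")
```

The same helper can be reused for other releases by changing the sample rate and channel count.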

GALE Phase 3 Arabic Broadcast News Speech Part 2 is distributed via web download.

2017 Subscription Members will automatically receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*

(4) GALE Phase 3 Arabic Broadcast News Transcripts Part 2 was developed by LDC and contains transcriptions of approximately 128 hours of Arabic broadcast news speech collected in 2007 by LDC, MediaNet, Tunis, Tunisia and MTC, Rabat, Morocco during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) program.

Corresponding audio data is released as GALE Phase 3 Arabic Broadcast News Speech Part 2 (LDC2017S02). 

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 721,846 tokens. The transcripts were created with the LDC tool, XTrans, which supports manual transcription and annotation of audio recordings.
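
Because the transcripts are tab-delimited UTF-8 text, they can be read with standard tools. The sketch below is illustrative only: the sample rows, the column order, and the `;;` comment prefix are assumptions made for the example, not the documented layout, so the documentation included with the release should be consulted for the actual fields.

```python
import csv
import io

def read_tdf_rows(stream, comment_prefix=";;"):
    """Yield each non-empty, non-comment line of a tab-delimited stream as a list of fields."""
    # QUOTE_NONE keeps quotation marks inside transcripts as literal text.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith(comment_prefix):
            continue
        yield row

def count_tokens(rows, text_column):
    """Count whitespace-separated tokens in one column across all rows."""
    return sum(len(row[text_column].split()) for row in rows)

# Hypothetical sample: file name, start time, end time, transcript text.
sample = io.StringIO(
    ";; illustrative comment line\n"
    "show_a\t0.00\t2.50\ttok1 tok2 tok3\n"
    "show_a\t2.50\t5.10\ttok4 tok5\n"
)
rows = list(read_tdf_rows(sample))
print(len(rows), count_tokens(rows, text_column=3))
```

For a real file, the stream would come from `open(path, encoding="utf-8", newline="")`. Whitespace splitting is only an approximation of a token count and may not match the figure reported for the corpus.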

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and quick rich transcription specification (QRTR), both of which are included in the documentation with this release.

GALE Phase 3 Arabic Broadcast News Transcripts Part 2 is distributed via web download.

2017 Subscription Members will automatically receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.



Thursday, January 19, 2017

LDC January 2017 Newsletter

LDC Membership Discounts for MY2017 Still Available

New publications:
___________________________________________________________________
LDC Membership Discounts for MY2017 Still Available
Join LDC now while membership savings are still available. 2016 members receive a 10% discount when renewing before March 1, 2017, or a 5% discount when renewing later in 2017. Non-consecutive members and new members receive a 5% discount when joining or renewing before March 1, 2017. Membership remains the most economical way to access LDC releases. This year’s planned publications include the 2010 NIST Speaker Recognition Evaluation data set, Multilanguage Conversational Telephone Speech, Noisy TIMIT, IARPA Babel Language Packs, RATS Keyword Spotting, BOLT parallel and word-aligned data in all languages, and more. Browse the Members pages for details on membership options and benefits.

New Corpora

(1) Arabic Speech Recognition Pronunciation Dictionary was developed by the Qatar Computing Research Institute. It contains approximately two million pronunciation entries for 526,000 Modern Standard Arabic words, for an average of 3.84 pronunciations for each grapheme word. The dictionary was developed from news archive resources, including the Arabic news website Aljazeera.net. The selected words were those that occurred more than once in the news collection. 
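
The figures quoted above are mutually consistent, as a quick check shows. This is just arithmetic on the rounded numbers in the description, not a count taken from the dictionary itself.

```python
# Rounded figures from the description above.
words = 526_000            # Modern Standard Arabic words
avg_pronunciations = 3.84  # average pronunciations per word

entries = words * avg_pronunciations
print(round(entries))  # close to the "approximately two million" entries stated
```
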
Arabic Speech Recognition Pronunciation Dictionary is distributed via web download.
2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(2) IARPA Babel Vietnamese Language Pack IARPA-babel107b-v0.7 was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 201 hours of Vietnamese conversational and scripted telephone speech collected in 2012 along with corresponding transcripts.

The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

The Vietnamese speech in this release represents that spoken in the North, North-Central, Central and Southern dialect regions in Vietnam. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 64 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

Transcripts are encoded in UTF-8.

IARPA Babel Vietnamese Language Pack IARPA-babel107b-v0.7 is distributed via web download.

2017 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(3) MWE-Aware English Dependency Corpus was developed by the Nara Institute of Science and Technology Computational Linguistics Laboratory and consists of English compound function words annotated in dependency format. The data is derived from the Wall Street Journal portion of OntoNotes Release 5.0 (LDC2013T19).
Compound function words are a type of multiword expression (MWE). MWEs are groups of tokens that can be treated as a single semantic or syntactic unit. Doing so facilitates natural language processing tasks such as constituency and dependency parsing.
MWE-Aware English Dependency Corpus is distributed via web download.
2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(4) GALE Phase 3 and 4 Chinese Web Parallel Text was developed by LDC and contains Chinese source text and corresponding English translations selected from weblog and newsgroup data collected by LDC and translated by LDC or under its direction.
The data includes 88 source-translation document pairs, comprising 67,514 tokens of Chinese source text and its English translation.
GALE Phase 3 and 4 Chinese Web Parallel Text is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Friday, December 16, 2016

LDC December 2016 Newsletter

Renew your LDC membership today
Spring 2017 LDC Data Scholarship Program - deadline approaching
LDC to close for Winter Break
New publications:




_____________________________________________________________________
Renew your LDC membership today
Membership Year 2017 (MY2017) is open for joining, and discounts are available for those who keep their membership current and join early in the year. Current MY2016 members who renew before March 1, 2017 will receive a 10% discount on the membership fee. New or returning organizations will receive a 5% discount through March 1.

In addition to receiving new publications, current year LDC members also enjoy the benefit of licensing older data at reduced costs from our Catalog of almost 700 holdings; current year for-profit members may use most data for commercial applications.
Plans for MY2017 publications are in progress. Among the expected releases are:
  • 2010 NIST Speaker Recognition Evaluation data set
  • Multilanguage conversational telephone speech: developed to support language identification research in related languages
  • UCLA High Speed Laryngeal Database: audio recordings and high-speed videoendoscopic images of the vocal folds while sustaining vowels
  • Noisy TIMIT: TIMIT with added artificial noise
  • CHiME shared task data: noisy read WSJ speech
  • First Year Law Students’ Memoranda: memos to a hypothetical court with annotations
  • IARPA Babel Language Packs: languages include Vietnamese, Haitian Creole, Zulu, Kazakh and Lithuanian
  • BOLT: source, parallel and word-aligned data in all languages
  • RATS Keyword Spotting data set
  • GALE Phases 3 and 4: all tasks and languages    

And don’t forget, MY2016 and MY2015 are still open for joining. MY2015 can be joined through December 31, 2016 and includes data such as RATS Speech Activity Detection and updates to Penn Treebank. MY2016 will remain open through December 31, 2017 and includes data such as BOLT Chinese Discussion Forums, IARPA Babel Language Packs and Multi-Language Conversational Telephone Speech – Slavic Group. For full descriptions of these data sets, visit our Catalog.
Visit Join LDC for details on membership, user accounts and payment.

Spring 2017 LDC Data Scholarship Program - deadline approaching
Students can apply for the Spring 2017 Data Scholarship Program now through January 16, 2017, 11:59PM EST. The LDC Data Scholarship program provides undergraduate and graduate students with access to LDC data at no cost.
For more information on application requirements and program rules, please visit LDC Data Scholarships. Students can email their applications to the LDC Data Scholarships program. Decisions will be sent by email from the same address.

LDC to close for Winter Break
LDC will be closed from Monday, December 26, 2016 through Monday, January 2, 2017 in accordance with the University of Pennsylvania Winter Break Policy. Our offices will reopen on Tuesday, January 3, 2017. Requests received for membership renewals and corpora during the Winter Break will not be processed until the week of January 3.

New Corpora

(1) Bamanankan Lexicon was developed by LDC and contains 5,978 entries of the Bamanankan language presented as a Bamanankan-English lexicon and a Bamanankan-French lexicon. It is the third publication in an LDC project to build an electronic dictionary of three Mandekan languages: Mawukakan, Maninkakan and Bamanankan. These are Eastern Manding languages in the Mande Group of the Niger-Congo language family. LDC released a Mawukakan Lexicon (LDC2005L01) in 2005 and a Maninkakan Lexicon (LDC2013L01) in 2013.

This lexicon is presented using a Latin-based transcription system because the Latin alphabet is familiar to the majority of Mandekan language speakers and it is expected to facilitate the work of researchers interested in this resource.

Bamanankan Lexicon is distributed via web download.

2016 Subscription Members will receive copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) IARPA Babel Tagalog Language Pack IARPA-babel106-v0.2g was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 213 hours of Tagalog conversational and scripted telephone speech collected in 2012 along with corresponding transcripts.

The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

The Tagalog speech in this release represents that spoken in the North, Central and South dialect regions in the Philippines. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 65 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

Transcripts are encoded in UTF-8.

IARPA Babel Tagalog Language Pack IARPA-babel106-v0.2g is distributed via web download.

2016 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*

(3) TAC KBP Spanish Cross-lingual Entity Linking - Comprehensive Training and Evaluation Data 2012-2014 was developed by LDC and contains training and evaluation data produced in support of the TAC KBP Spanish Cross-lingual Entity Linking tasks in 2012, 2013 and 2014. It includes queries and gold standard entity type information, Knowledge Base links, and equivalence class clusters for NIL entities along with the source documents for the queries, specifically, English and Spanish newswire, discussion forum and web data. The corresponding knowledge base is available as TAC KBP Reference Knowledge Base (LDC2014T16).

More information about the TAC KBP Entity Linking task and other TAC KBP evaluations can be found on the NIST TAC website.

TAC KBP Spanish Cross-lingual Entity Linking - Comprehensive Training and Evaluation Data 2012-2014 is distributed via web download.

2016 Subscription Members will receive copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(4) GALE Phase 4 Arabic Newswire Parallel Sentences was developed by LDC and contains Modern Standard Arabic source text and corresponding English translations selected from newswire data collected by LDC in 2008 and translated by LDC or under its direction.

This release includes 393 source-translation document pairs drawn from six distinct newswire sources, comprising 62,669 tokens of Arabic source text and its English translation. Source data and translations are distributed in TDF format. All data is encoded in UTF-8.

GALE Phase 4 Arabic Newswire Parallel Sentences is distributed via web download.

2016 Subscription Members will receive copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.


Tuesday, November 15, 2016

LDC November 2016 Newsletter



In this newsletter:

Join LDC for Membership Year 2017
Commercial use and LDC data
Spring 2017 Data Scholarship Program
LDC closed November 24-25 for US Thanksgiving Holiday

New publications:






Join LDC for Membership Year 2017

Organizations engaged in language-related research, education and technology development are invited to join LDC for Membership Year (MY) 2017. Consortium members enjoy unparalleled access and continuing rights to new data releases and to an archive of close to 700 holdings.

Membership fees have not increased for 2017. In addition, discounts are available for organizations who keep their membership current and for those who join before March 1, 2017.

  • MY2016 members receive a 10% discount if they renew their membership before March 1, 2017. After March 1, MY2016 members receive a 5% discount if they renew their membership any time in 2017.
  • New members and returning former members receive a 5% discount off the membership fee if they join/renew before March 1, 2017.

Plans for MY2017 publications are in progress. Among the expected releases are:

  • 2010 NIST Speaker Recognition Evaluation data set
  • Multilanguage conversational telephone speech: developed to support language identification research in related languages
  • UCLA High Speed Laryngeal Database: audio recordings and high-speed videoendoscopic images of the vocal folds while sustaining vowels
  • Noisy TIMIT: TIMIT with added artificial noise
  • CHiME shared task data: noisy read WSJ speech
  • First Year Law Students’ Memoranda: memos to a hypothetical court with annotations
  • IARPA Babel Language Packs: languages include Vietnamese, Haitian Creole, Zulu, Kazakh and Lithuanian
  • BOLT: source, parallel and word-aligned data in all languages
  • RATS Keyword Spotting data set
  • GALE Phases 3 and 4: all tasks and languages

Visit Join LDC for details on membership, user accounts and payment.

Commercial use and LDC data

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

Spring 2017 Data Scholarship Program

Applications are now being accepted through January 15, 2017 for the Spring 2017 LDC Data Scholarship program which provides university students with no-cost access to LDC data. Consult the LDC Data Scholarship page for further information about program rules and submission requirements.

LDC closed November 24-25 for US Thanksgiving Holiday

LDC will be closed on Thursday, November 24, 2016 and Friday, November 25, 2016 in observance of the US Thanksgiving Holiday. The office will reopen on Monday, November 28, 2016.

New Corpora

(1) JANA: A Human-Human Dialogues Corpus for Egyptian Dialect was developed by researchers at Cairo University. This is a special release in addition to the LDC scheduled corpora for membership year 2016, available under separate terms.

This corpus consists of 82 transcribed dialogues from call center inquiries annotated for dialogue acts. Data was collected from call centers for banks, airlines and mobile network providers in the form of spontaneous spoken telephone dialogues (52) and instant messaging dialogues (30) amounting to over 20,000 words.

Not-for-profit organizations may license this data set for a fee under the LDC Not-for-Profit Membership Agreement or under the LDC User Agreement for Non-Members for use in linguistic research, education and non-commercial technology development. For-profit organizations may license this data for a fee under a commercial license.

(2) Multi-Language Conversational Telephone Speech 2011 – Slavic Group was developed by LDC and is comprised of approximately 60 hours of telephone speech in Polish, Russian and Ukrainian. The data was collected to support research and technology evaluation in automatic language identification, specifically language pair discrimination for closely related languages/dialects.

Calls were made using LDC’s telephone collection infrastructure. Human auditors labeled calls for gender, dialect type and noise. Audio data is presented in FLAC-compressed MS-WAV (RIFF) file format. Each uncompressed file is two channels, recorded at 8000 samples/second with samples stored as 16-bit signed integers.

Multi-Language Conversational Telephone Speech 2011 – Slavic Group is distributed via web download.

2016 Subscription Members will receive copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) IARPA Babel Georgian Language Pack IARPA-babel404b-v1.0a was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 190 hours of Georgian conversational and scripted telephone speech collected in 2014-2015 along with corresponding transcripts.

The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

The Georgian speech in this release represents that spoken in the Eastern and Western dialect regions in Georgia. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 73 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

Transcripts are encoded in UTF-8.

IARPA Babel Georgian Language Pack IARPA-babel404b-v1.0a is distributed via web download.

2016 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(4) GALE Phase 3 and 4 Chinese Newswire Parallel Text was developed by LDC and contains Chinese source text and corresponding English translations selected from newswire data collected by LDC in 2007-2008 and translated by LDC or under its direction.

This release includes 367 source-translation document pairs drawn from five distinct newswire sources, comprising 210,048 tokens of Chinese source text and its English translation. Source data and translations are distributed in TDF format. All data is encoded in UTF-8.

GALE Phase 3 and 4 Chinese Newswire Parallel Text is distributed via web download.

2016 Subscription Members will receive copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.