Friday, April 13, 2018

LDC 2018 April Newsletter


LDC at ICASSP 2018

LDC at the Philadelphia Science Carnival

New Publications:
_____________________________________________________________________
LDC at ICASSP 2018
LDC will be exhibiting at ICASSP 2018, held this year April 15-20 in Calgary, Canada. Stop by booth B2 to learn more about recent developments at the Consortium and new publications.

Also, be on the lookout for the following presentations featuring LDC work:

Enhancement and Analysis of Conversational Speech: JSALT 2017
Tuesday, April 17, 16:00 - 18:00
Session: Speech Analysis

Leveraging LSTM Models for Overlap Detection in Multi-Party Meetings
Wednesday, April 18, 13:30 - 15:30
Session: Speaker Diarization & Identification

A Novel LSTM-based Speech Preprocessor for Speaker Diarization in Realistic Mismatch Conditions
Wednesday, April 18, 13:30 - 15:30
Session: Speaker Diarization & Identification

LDC will post conference updates via our Twitter feed and Facebook page. We hope to see you there!  

LDC at the Philadelphia Science Carnival
LDC will share the fun of language with the community on Saturday, April 28, with a booth at the Philadelphia Science Carnival. Visitors will enjoy three language-oriented educational activities that include a language identification game and Chinese character recognition.

The Philadelphia Science Carnival is an annual event organized by Philadelphia’s Franklin Institute to acquaint children and adults with the joys of science.


New publications:

(1) Concretely Annotated New York Times was developed by Johns Hopkins University's Human Language Technology Center of Excellence. It adds multiple kinds and instances of automatically generated syntactic, semantic, and coreference annotations to The New York Times Annotated Corpus (LDC2008T19). Concrete is a schema for representing structured, hierarchical, and overlapping linguistic annotations. This release provides the outputs of multiple tools that produce the same annotation types, represented as different annotation theories under a shared tokenization. Concretely Annotated New York Times contains all of the 1.8 million articles in The New York Times Annotated Corpus.
Concretely Annotated New York Times is distributed via hard drive.

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Any organization that licensed The New York Times Annotated Corpus (LDC2008T19) may request a copy of Concretely Annotated New York Times (LDC2018T12) for a $250 media fee.  Non-members may license this data for a fee.

*

(2) H2, E2, ERK1 Children's Writing was developed by the Cooperative State University Baden-Württemberg, University of Education. It consists of approximately 2,000 texts written over four months by 173 German school children aged six through eleven. The data in this corpus was collected by elementary schools in Baden-Württemberg, Germany, and digitized at the Cooperative State University during the 2016/2017 school year. Three second, third, and fourth grade classrooms participated in the collection. Texts were written within regular class settings. The students were presented with a picture and were asked to write a story to describe the picture or, if unable to write a text, to list what they saw in the picture.

There were 173 total participants. 100 students were multilingual, and further metadata is available for 166 of the 173 children. The following is included for each text in the database: school week of collection; school type; age; gender; grade/classroom; language spoken at home; and school materials used.

LDC has also released H1 Children's Writing (LDC2016T01).

H2, E2, ERK1 Children's Writing is distributed via web download.

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) TRAD Arabic-French Parallel Text -- Newsgroup was developed by ELDA as part of the PEA-TRAD project. It contains French translations of a subset of approximately 10,000 Arabic words from GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 (LDC2009T03). The PEA-TRAD project (Translation as a Support for Document Analysis) was supported by the French Ministry of Defense (DGA). Its purpose was to develop speech-to-speech translation technology for multiple languages (e.g., Arabic, Chinese, Pashto) from a variety of domains. This release consists of 398 segments (translation units) from 17 documents. The source data is Arabic newsgroup text collected and translated into English by LDC for the DARPA GALE (Global Autonomous Language Exploitation) program.

LDC has also released TRAD Chinese-French Parallel Text -- Blog (LDC2018T02).

TRAD Arabic-French Parallel Text -- Newsgroup is distributed via web download.

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Thursday, March 15, 2018

LDC March 2018 Newsletter


New Publications:
______________________________________________________________________

New publications:

(1) BOLT Arabic Discussion Forums was developed by LDC and consists of 813,080 discussion forum threads in Egyptian Arabic harvested from the Internet using a combination of manual and automatic processes. The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. The material in this release represents the unannotated Arabic source data in the discussion forum genre.

Collection was seeded based on the results of manual data scouting by native speaker annotators. Scouts were instructed to seek content in Egyptian Arabic that was original, interactive and informal. Upon locating an appropriate thread, scouts submitted the URL and some simple judgments about it to a database, via a web browser plug-in. The scale of the collection precluded manual review of all data. Only a small portion of the threads included in this release were manually reviewed, and it is expected that there may be some offensive or otherwise undesired content as well as some threads that contain a large amount of non-Arabic content. It should also be noted that many threads may contain a mixture of Egyptian and other varieties of Arabic, even among the threads that are primarily Arabic.

BOLT Arabic Discussion Forums is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee. 
*

(2) LORELEI Somali Representative Language Pack - Monolingual and Parallel Text was developed by LDC and is comprised of approximately 13 million words of monolingual Somali text, approximately 800,000 of which are translated into English. Another 100,000 words are also translated from English into Somali. The LORELEI (Low Resource Languages for Emergent Incidents) Program is concerned with building Human Language Technology for low resource languages in the context of emergent situations like natural disasters or disease outbreaks.

Data was collected in the following genres: discussion forums, news, reference, social network and weblog. Both monolingual text collection and parallel text creation involved a combination of manual and automatic methods, which are detailed in the included documentation. All harvested content was initially converted from its original HTML form into a relatively uniform XML format. Also included in this release are two tools: one to recreate original source data from the processed XML material and the other to condition text data users download from Twitter.

LORELEI Somali Representative Language Pack - Monolingual and Parallel Text is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
 *

(3) SPADE (Syntactic Phrase Alignment Dataset for Evaluation) consists of annotated parse trees and alignment on English sentential paraphrases extracted from machine translation evaluation corpora and separated into development and test sets.

Reference translations from machine translation evaluation corpora were used as sentential paraphrases. They were sourced from the following data sets released by LDC from the NIST (National Institute of Standards and Technology) open machine translation evaluation series (OpenMT): LDC2010T14, LDC2010T17, LDC2010T21, and LDC2013T03.

Reference translations of 10 to 30 words were randomly extracted for annotation from NIST OpenMT corpora. Gold standard annotations of HPSG (head-driven phrase structure grammar) trees and phrase alignments were performed, resulting in 20,276 phrases extracted from 201 sentential paraphrases and 15,721 paraphrase alignments.

SPADE is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Friday, February 16, 2018

LDC February 2018 Newsletter

Only two weeks left to enjoy 2018 membership discounts
Spring 2018 LDC Data Scholarship recipients
Only two weeks left to enjoy 2018 membership discounts

There is still time to save on 2018 membership fees. Through March 1, all organizations receive a discount on the 2018 membership fee (up to 10%) when they choose to join or renew.

For more information on membership benefits, visit Join LDC.


Spring 2018 LDC Data Scholarship recipients

Congratulations to the recipients of LDC's Spring 2018 Data Scholarship:

Margarida Madaleno: London School of Economics, PhD Economic Geography. Madaleno is awarded a copy of Treebank 3 for her research in emotional well-being.

Gary Munnelly: Trinity College Dublin, PhD Computer Science and Statistics. Munnelly is awarded a copy of the New York Times Annotated Corpus for his research in named entity recognition and disambiguation in cultural heritage data sets.

Barlian Henryanu Prasetio: University of Miyazaki, PhD Environmental Robotics. Prasetio is awarded copies of SUSAS and SUSAS Transcripts for his work in voice stress recognition systems.

For information about the program, visit the Data Scholarship page.


LDC data and commercial technology development

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

(1) Multi-Language Conversational Telephone Speech 2011 – Central Asian was developed by LDC and is comprised of approximately 37 hours of telephone speech in three distinct language varieties of Central Asia: Dari, Farsi and Pashto.

The data were collected primarily to support research and technology evaluation in automatic language identification, and portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation (LRE). Participants were recruited by native speakers who contacted acquaintances in their social network. Those native speakers made one call, up to 15 minutes, to each acquaintance.

LDC has also released the following as part of the Multi-Language Conversational Telephone Speech 2011 series:

Multi-Language Conversational Telephone Speech 2011 – Central Asian is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) LORELEI Amharic Representative Language Pack - Monolingual and Parallel Text was developed by LDC and is comprised of approximately 25 million words of monolingual Amharic text, approximately 600,000 of which are translated into English. Another 80,000 words are also translated from English into Amharic. The LORELEI (Low Resource Languages for Emergent Incidents) Program is concerned with building human language technology for low resource languages in the context of emergent situations like natural disasters or disease outbreaks.

Data was collected in the following genres: discussion forums, news, reference, social network and weblog. Both monolingual text collection and parallel text creation involved a combination of manual and automatic methods, which are detailed in the included documentation. All harvested content was initially converted from its original HTML form into a relatively uniform XML format. Also included in this release are two tools: one to recreate original source data from the processed XML material and the other to condition text data users download from Twitter.

LORELEI Amharic Representative Language Pack - Monolingual and Parallel Text is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) TAC KBP Comprehensive English Source Corpora 2009-2014 was developed by LDC and contains the 3,877,207 English source documents used in support of the TAC KBP tasks from 2009-2014. Text Analysis Conference (TAC) is a series of workshops organized by the National Institute of Standards and Technology (NIST). TAC was developed to encourage research in natural language processing and related applications by providing a large test collection, common evaluation procedures, and a forum for researchers to share their results. Through its various evaluations, the Knowledge Base Population (KBP) track of TAC encourages the development of systems that can match entities mentioned in natural texts with those appearing in a knowledge base and extract novel information about entities from a document collection and add it to a new or existing knowledge base.

The source data consists of newswire, broadcast material, and web text collected by LDC. Documents are released as a collection of zip files for overall compactness, and ease and efficiency of use. When unpacked, the documents are all UTF-8 text files with a basic markup structure.
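
As a rough illustration of working with documents packaged this way, the sketch below walks a zip archive and decodes each member as UTF-8 text using Python's standard zipfile module. The function name and any archive path passed to it are hypothetical; the actual file names and layout of the release are described in its documentation, not here.

```python
import zipfile

def iter_documents(zip_path, encoding="utf-8"):
    """Yield (member_name, text) for each file stored in a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries
            with archive.open(info) as handle:
                # Decode the raw bytes; the release states the files are UTF-8.
                yield info.filename, handle.read().decode(encoding)
```

Reading members directly from the archive avoids unpacking millions of small files to disk, which may be convenient for a collection of this size.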

TAC KBP Comprehensive English Source Corpora 2009-2014 is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*

(4) IARPA Babel Tok Pisin Language Pack IARPA-babel207b-v1.0e was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 200 hours of Tok Pisin conversational and scripted telephone speech collected in 2013 along with corresponding transcripts.

The Tok Pisin speech in this release represents that spoken in the Papuan dialect region of Papua New Guinea. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 65 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Tok Pisin Language Pack IARPA-babel207b-v1.0e is available via web download.

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Tuesday, January 16, 2018

LDC January 2018 Newsletter

Membership Discounts for MY2018 Still Available

New Publications:
___________________________________________________________________________

Membership Discounts for MY2018 Still Available
Join LDC while membership savings are still available. Now through March 1, 2018, renewing MY2017 members will receive a 10% discount off the membership fee. New or non-consecutive member organizations will receive a 5% discount. Membership remains the most economical way to access LDC releases. This year’s planned publications include Multilanguage Conversational Telephone Speech, IARPA Babel Language Packs (telephone speech and transcripts), DIRHA (Distant-speech Interaction for Robust Home Applications), TRAD (Chinese-French and Arabic-French parallel text), data from BOLT, DEFT, LORELEI, RATS and TAC KBP, and more. Browse the Members pages for details on membership options and benefits. 

New publications:

(1) DEFT Spanish Treebank was developed by LDC and the Language and Computation Center (CLiC), University of Barcelona. It contains treebank annotation of international Spanish newswire text and Latin American Spanish discussion forum data created for the DARPA Deep Exploration and Filtering of Text (DEFT) program. DEFT Spanish Treebank supported the program's goal of deep natural language understanding.

Newswire source files were selected from Spanish Gigaword Third Edition (LDC2011T12) and were manually sentence-segmented for DEFT. Discussion forum source files were selected from Spanish discussion forum source data collected by LDC, consisting of continuous multi-posts of 100-1000 words.

This release contains 114 files (54,394 tokens) of newswire data and 60 files (55,307 tokens) of discussion forum data, all of which were annotated with constituents and syntactic functions.

DEFT Spanish Treebank is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) DIRHA English WSJ Audio was developed as part of the Distant-Speech Interaction for Robust Home Applications (DIRHA) Project, which addressed natural spontaneous speech interaction with distant microphones in a domestic environment. It is comprised of approximately 85 hours of real and simulated read speech by six native American English speakers. The target utterances were taken from CSR-I (WSJ0) Complete (LDC93S6A), specifically, the 5,000 word subset of read speech from Wall Street Journal news text.

Speech was collected in a real apartment setting with typical domestic background noise and inter/intra-room reverberation effects. Annotations, speaker metadata and images of the apartment setting are also included.

DIRHA English WSJ Audio is distributed via web download.

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) TRAD Chinese-French Parallel Text -- Blog was developed by ELDA as part of the PEA-TRAD project. It contains French translations of a subset of approximately 10,000 Chinese words from GALE Phase 1 Chinese Blog Parallel Text (LDC2008T06).

The PEA-TRAD project (Translation as a Support for Document Analysis) was supported by the French Ministry of Defense (DGA). Its purpose was to develop speech-to-speech translation technology for multiple languages (e.g., Arabic, Chinese, Pashto) from a variety of domains.

The source data for TRAD Chinese-French Parallel Text is Chinese blog text collected and translated into English by LDC for the DARPA GALE (Global Autonomous Language Exploitation) program. Information about the ELDA translation team, translation guidelines and validation results is contained in the documentation accompanying this release.

TRAD Chinese-French Parallel Text -- Blog is distributed via web download.

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Friday, December 15, 2017

LDC December 2017 Newsletter

Spring 2018 LDC Data Scholarship Program - deadline approaching

Lingo Boingo: a web portal to language games

__________________________________________________________________________

Spring 2018 LDC Data Scholarship Program - deadline approaching
Students can apply for the Spring 2018 Data Scholarship Program now through January 15, 2018. The LDC Data Scholarship program provides students with access to LDC data at no cost. For more information on application requirements and program rules, please visit LDC Data Scholarships.

Lingo Boingo: a web portal to language games
LDC is pleased to announce a new collaborative project, Lingo Boingo (https://lingoboingo.org/), a web portal that brings together new and existing language games that are fun to play and that provide useful annotations and judgments for linguistic research. Gamers and grammar lovers can choose from a list of challenging games, which will continue to expand through the efforts of LDC and external collaborators. For more information, contact jfiumara@ldc.upenn.edu. Start playing today!

Renew your LDC membership today
Membership Year 2018 (MY2018) is open for joining and discounts are available for those who keep their membership current and join early in the year. Through March 1, 2018, current MY2017 members who renew will receive a 10% discount off the membership fee. New or returning organizations will receive a 5% discount through March 1.

In addition to receiving new publications, current year LDC members also enjoy the benefit of licensing older data at reduced costs from our Catalog of over 700 holdings; current year for-profit members may use most data for commercial applications. Visit Join LDC for details on membership, user accounts and payment.

Plans for MY2018 publications are in progress. Among the expected releases are:
  • Multilanguage conversational telephone speech: developed to support language identification research in related languages (Central Asian, Central European language groups)
  • DIRHA (Distant-speech Interaction for Robust Home Applications):  Wall Street Journal read speech with noise and reverberation, suitable for various multi-microphone signal processing and distant speech recognition tasks
  • TRAD corpora: Chinese-French and Arabic-French parallel text (newswire, web data)
  • IARPA Babel Language Packs (telephone speech and transcripts): languages include Cebuano, Guarani, Kazakh, Lithuanian, Telugu, Tok Pisin
  • BOLT: discussion forum, SMS, word-aligned, and tagged data in all languages (Egyptian Arabic, English, Chinese)
  • DEFT: Spanish Treebank (newswire, web data)
  • RATS:  Language Identification data set (Dari, Farsi, Levantine Arabic, Pashto, Urdu; degraded audio signals)
  • TAC KBP: comprehensive English source and entity linked data (broadcast, telephone speech, newswire, web data)
  • German children’s handwriting: longitudinal study of weekly writing in classroom setting with enhanced output for specific spelling patterns

New publications:

(1) CHiME3 was developed as part of The 3rd CHiME Speech Separation and Recognition Challenge and contains approximately 342 hours of English speech and transcripts from noisy environments and 50 hours of noisy environment audio. The CHiME Challenges focus on distant-microphone automatic speech recognition (ASR) in real-world environments. CHiME3 involved two types of data: speech data recorded in very noisy environments (on a bus, in a cafe, in a pedestrian area, and at a street junction) and noisy utterances generated by artificially mixing clean speech data with noisy backgrounds.

Data is divided into training, development and test sets. All data is provided as 16-bit WAV files sampled at 16 kHz. The audio data consists of the background noises, speech data enhanced with the baseline speech enhancement technique, unsegmented noisy speech data, and segmented noisy speech data.
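
For users who want to confirm the stated audio format before processing, a minimal sketch along the following lines reads a WAV header with Python's standard wave module and checks for 16-bit samples at 16 kHz. The function name and any path passed to it are hypothetical and are not part of the CHiME3 release.

```python
import wave

def check_wav_format(path, expected_rate=16000, expected_bits=16):
    """Return (ok, sample_rate, bits_per_sample, channels) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8  # sample width is reported in bytes
        channels = wav.getnchannels()
    ok = (rate == expected_rate) and (bits == expected_bits)
    return ok, rate, bits, channels
```

The channel count is returned but not checked, since the description above does not state how many channels each file contains.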

LDC has also released two CHiME2 corpora -- CHiME2 Grid and CHiME2 WSJ0.

CHiME3 is distributed via USB drive.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*

(2) GALE Phase 4 Chinese Broadcast News Speech was developed by LDC and is comprised of approximately 134 hours of Mandarin Chinese broadcast news speech collected in 2008 by LDC and Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

Corresponding transcripts are released as GALE Phase 4 Chinese Broadcast News Transcripts (LDC2017T18).

The broadcast news recordings in this release feature news broadcasts focusing principally on current events from the following sources: China Central TV (CCTV), a national and international broadcaster in Mainland China; Phoenix TV, a Hong Kong-based satellite television station; and Voice of America (VOA), a U.S. government-funded broadcast programmer.

This release contains 256 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Chinese speaker following Audit Procedure Specification Version 2.0 which is included in this release.

GALE Phase 4 Chinese Broadcast News Speech is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee. 
*
 
(3) GALE Phase 4 Chinese Broadcast News Transcripts was developed by LDC and contains transcriptions of approximately 134 hours of Chinese broadcast news speech collected in 2008 by LDC and Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

Corresponding audio data is released as GALE Phase 4 Chinese Broadcast News Speech (LDC2017S25).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 1,696,879 tokens. The transcripts were created with the LDC tool, XTrans, which supports manual transcription and annotation of audio recordings.
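
Because the transcripts are plain UTF-8 text with tab-separated fields, they can be loaded with standard tools. The sketch below uses Python's csv module; the function name is hypothetical, and the assumption that comment lines begin with ";;" is ours and should be checked against the documentation included with the release. No column names or meanings are assumed.

```python
import csv

def read_tab_delimited(path, comment_prefix=";;"):
    """Return rows (lists of fields) from a UTF-8 tab-delimited file.

    Blank lines and lines starting with comment_prefix are skipped.
    The comment prefix is an assumption, not taken from the release notes.
    """
    rows = []
    with open(path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quotation marks in the transcribed text as literal characters.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for fields in reader:
            if not fields or fields[0].startswith(comment_prefix):
                continue
            rows.append(fields)
    return rows
```

Disabling quote handling is a deliberate choice here: transcribed speech may contain stray quotation marks that a default CSV reader would misinterpret as field delimiters.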

GALE Phase 4 Chinese Broadcast News Transcripts is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.



Friday, November 17, 2017

LDC November 2017 Newsletter

Join LDC for Membership Year 2018

Spring 2018 Data Scholarship Program
Commercial use and LDC data
____________________________________________________________________

Join LDC for Membership Year 2018

Membership Year 2018 (MY2018) is open for joining and discounts are available for those who keep their membership current and join early in the year. Through March 1, 2018, current MY2017 members who renew will receive a 10% discount off the membership fee. New or returning organizations will receive a 5% discount through March 1.

In addition to receiving new publications, current year LDC members also enjoy the benefit of licensing older data at reduced costs from our Catalog of over 700 holdings; current year for-profit members may use most data for commercial applications.

Plans for MY2018 publications are in progress. Among the expected releases are:
  • Multilanguage conversational telephone speech: developed to support language identification research in related languages (Central Asian, Central European language groups)
  • DIRHA (Distant-speech Interaction for Robust Home Applications):  Wall Street Journal read speech with noise and reverberation, suitable for various multi-microphone signal processing and distant speech recognition tasks
  • TRAD corpora: Chinese-French and Arabic-French parallel text (newswire, web data)
  • IARPA Babel Language Packs (telephone speech and transcripts): languages include Cebuano, Guarani, Kazakh, Lithuanian, Telugu, Tok Pisin
  • BOLT: discussion forum, SMS, word-aligned, and tagged data in all languages (Egyptian Arabic, English, Chinese)
  • DEFT: Spanish Treebank (newswire, web data)
  • RATS:  Language Identification data set (Dari, Farsi, Levantine Arabic, Pashto, Urdu; degraded audio signals)
  • TAC KBP: comprehensive English source and entity linked data (broadcast, telephone speech, newswire, web data)
  • German children’s handwriting: longitudinal study of weekly writing in classroom setting with enhanced output for specific spelling patterns

And don’t forget, MY2017 and MY2016 are still open for joining. MY2016 can be joined through December 31, 2017 and includes data such as BOLT Chinese Discussion Forums, IARPA Babel Language Packs in multiple languages and Multi-Language Conversational Telephone Speech – Slavic Group. MY2017 will remain open through December 31, 2018; among the year’s releases are 2010 NIST Speaker Recognition Evaluation Test Set, RATS Keyword Spotting, Noisy TIMIT Speech and BOLT Egyptian Arabic SMS/Chat and Transliteration. For full descriptions of these data sets, browse our Catalog.

Visit Join LDC for details on membership, user accounts and payment.

Spring 2018 Data Scholarship Program
Applications are now being accepted through January 15, 2018 for the Spring 2018 LDC Data Scholarship program, which provides university students with no-cost access to LDC data. Consult the LDC Data Scholarship page for more information about program rules and submission requirements.

Commercial use and LDC data
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

(1) ASpIRE Development and Development Test Sets was developed for the Automatic Speech recognition In Reverberant Environments (ASpIRE) Challenge sponsored by IARPA (the Intelligence Advanced Research Projects Activity). It contains approximately 226 hours of English speech with transcripts and scoring files.

The audio data is a subset of Mixer 6 Speech (LDC2013S03), audio recordings of interviews, transcript readings and conversational telephone speech collected by LDC in 2009 and 2010 from native English speakers local to the Philadelphia area. The transcripts were developed by Appen for the ASpIRE challenge.

Data is divided into development and development test sets.

ASpIRE Development and Development Test Sets is distributed via web download.

2017 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) CIEMPIESS Light (Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social) was developed by the Speech Processing Laboratory of the Faculty of Engineering at the National Autonomous University of Mexico (UNAM) and consists of approximately 18 hours of Mexican Spanish radio and television speech and associated transcripts. The goal of this work was to create acoustic models for automatic speech recognition. For more information and documentation see the CIEMPIESS-UNAM Project website.

CIEMPIESS Light is an updated version of CIEMPIESS, released by LDC as LDC2015S07. This "light" version contains speech and transcripts presented in a revised directory structure that allows for use with the Kaldi toolkit.

The speech recordings were collected from Podcast UNAM, a program created by Radio-IUS, and Mirador Universitario, a TV program broadcast by UNAM. They consist of spontaneous conversations in Mexican Spanish between a moderator and guests.


The audio files are 16 kHz, 16-bit PCM in FLAC format, and transcripts are presented as UTF-8 encoded plain text.

CIEMPIESS Light is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) IARPA Babel Kurmanji Kurdish Language Pack IARPA-babel205b-v1.0a was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 203 hours of Kurmanji Kurdish conversational and scripted telephone speech collected in 2013 and 2014 along with corresponding transcripts.

The Kurmanji Kurdish speech in this release represents that spoken in the southeastern and eastern Anatolian regions of Turkey. The gender distribution among speakers is approximately 37% female and 63% male; speakers' ages range from 16 years to 70 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Kurmanji Kurdish Language Pack IARPA-babel205b-v1.0a is distributed via web download.

2017 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(4) TAC KBP Chinese Cross-lingual Entity Linking - Comprehensive Training and Evaluation Data 2011-2014 was developed by LDC and contains training and evaluation data produced in support of the TAC KBP Chinese Cross-lingual Entity Linking tasks in 2011, 2012, 2013 and 2014. It includes queries and gold standard entity type information, Knowledge Base links, and equivalence class clusters for NIL entities along with the source documents for the queries, specifically, English and Chinese newswire, discussion forum and web data. The corresponding knowledge base is available as TAC KBP Reference Knowledge Base (LDC2014T16).

The goal of TAC KBP’s entity linking track is to measure systems’ ability to determine whether an entity, specified by a query, has a matching node in a reference knowledge base and if so, to create a link between the two. If there is no matching node, entity linking systems are required to cluster the mention together with others referencing the same entity. More information about the TAC KBP Entity Linking task and other TAC KBP evaluations can be found on the NIST TAC website.

TAC KBP Chinese Cross-lingual Entity Linking - Comprehensive Training and Evaluation Data 2011-2014 is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Wednesday, October 18, 2017

LDC October 2017 Newsletter

LDC Awards Fall Data Scholarships

Membership Year 2018 Publication Preview

New Publications:
RATS Keyword Spotting
MWE-Aware English Dependency Corpus Version 2.0
_________________________________________________________________________

LDC Awards Fall Data Scholarships
LDC is pleased to award fifteen data scholarships to students this fall. Recipients are from eight countries and a variety of academic disciplines. Twenty unique data sets are awarded to the students for their work in diverse applications including machine translation, abstractive text summarization using recurrent neural networks, speech recognition for multiple languages, semantic role labeling for social data, text summarization, speaker recognition for forensic applications, and more. Please look to LDC’s social media pages for upcoming announcements highlighting each recipient and their intended research.  Congratulations to all of our recipients! 

Membership Year 2018 Publication Preview
The 2018 Membership Year is just around the corner and plans for next year’s publications are in progress. Among the expected releases are:
  • Multilanguage conversational telephone speech: developed to support language identification research in related languages (Central Asian, Central European language groups)
  • DIRHA (Distant-speech Interaction for Robust Home Applications): Wall Street Journal read speech with noise and reverberation, suitable for various multi-microphone signal processing and distant speech recognition tasks
  • TRAD corpora: Chinese-French and Arabic-French parallel text (newswire, web data)
  • IARPA Babel Language Packs (telephone speech and transcripts): languages include Cebuano, Guarani, Kazakh, Lithuanian, Telugu, Tok Pisin
  • BOLT: discussion forum, SMS, word-aligned, and tagged data in all languages (Egyptian Arabic, English, Chinese)
  • DEFT: Spanish Treebank (newswire, web data)
  • RATS Language Identification data set  (Dari, Farsi, Levantine Arabic, Pashto, Urdu; degraded audio signals)
  • TAC KBP: comprehensive English source and entity linked data (broadcast, telephone speech, newswire, web data)
  • German children’s handwriting (longitudinal study of weekly writing in classroom setting with enhanced output for specific spelling patterns)
Check your inbox in the coming weeks for more information about membership renewal.



New publications:

(1) RATS Keyword Spotting was developed by LDC and consists of approximately 3,100 hours of Levantine Arabic and Farsi conversational telephone speech with automatic and manual annotation of speech segments, transcripts, and keywords generated from transcript content. The corpus was created to provide training, development, and initial test sets for the keyword spotting (KWS) task in the DARPA RATS (Robust Automatic Transcription of Speech) program.

The source audio consists of conversational telephone speech recordings collected by LDC: (1) data collected for the RATS program from Levantine Arabic and Farsi speakers; and (2) material from Levantine Arabic QT Training Data Set 5, Speech (LDC2006S29) and CALLFRIEND Farsi Second Edition Speech (LDC2014S01). Transcripts of calls were either produced or available from the source corpora. Potential target keywords were selected from the transcripts based on word frequencies to fall within a range of target-word likelihood per hour of speech. The selected words were manually reviewed to confirm that each was a regular or multi-word expression of more than three syllables.
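The frequency-based selection step described above can be illustrated with a short sketch. This is not LDC's actual procedure: the function name, the whitespace tokenization, and the threshold values are all hypothetical, chosen only to show how candidate words might be filtered by their rate of occurrence per hour of speech before manual review.

```python
from collections import Counter


def select_keyword_candidates(transcripts, hours_of_speech,
                              min_per_hour, max_per_hour):
    """Return words whose occurrences per hour fall within the given range.

    All parameters are illustrative assumptions; the newsletter does not
    state the thresholds or tokenization used for the corpus.
    """
    # Count every whitespace-separated token across all transcripts.
    counts = Counter(
        word for text in transcripts for word in text.lower().split()
    )
    # Keep words whose hourly rate lies inside the target range.
    return sorted(
        word for word, count in counts.items()
        if min_per_hour <= count / hours_of_speech <= max_per_hour
    )


# Toy data: three word types over two hours of (hypothetical) speech.
toy_transcripts = ["salam salam ketab", "ketab daneshgah salam"]
candidates = select_keyword_candidates(
    toy_transcripts, hours_of_speech=2.0, min_per_hour=0.5, max_per_hour=1.0
)
print(candidates)
```

In this toy example the most frequent word is excluded because it occurs more often per hour than the upper bound allows; the remaining candidates would then go to manual review.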

RATS Keyword Spotting is distributed via hard drive.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) English Web Treebank Propbank was developed by the University of Colorado Boulder - CLEAR (Computational Language and Education Research) and provides predicate-argument structure annotation for English Web Treebank (LDC2012T13).

The goal of Propbank (or proposition bank) annotation is to develop annotations with information about basic semantic propositions. English Web Treebank Propbank provides semantic role annotation and predicate sense disambiguation for roughly 50,000 predicates, corresponding to all verbs, all adjectives in equational clauses, and all nouns considered to be predicative. Mark-up is in the "unified" propbank annotation format, which combines representations in nouns, verbs, and adjectives. The source data consists of weblogs, newsgroups, email, reviews, and questions-answers.

English Web Treebank Propbank is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*
 
(3) Ancient Chinese Corpus was developed at Nanjing Normal University. It contains word-segmented and part-of-speech tagged text from Zuozhuan, an ancient Chinese work believed to date from the Warring States Period (475-221 BC). This release is part of a continuing project to develop a large, part-of-speech tagged ancient Chinese corpus. It consists of 180,000 Chinese characters and 195,000 segment units (including words and punctuation). The part-of-speech tag set was developed by Nanjing Normal University and contains 17 tags. The data is presented as UTF-8 plain text files using traditional Chinese script.

Ancient Chinese Corpus is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(4) MWE-Aware English Dependency Corpus Version 2.0 was developed by the Nara Institute of Science and Technology Computational Linguistics Laboratory and consists of English compound function words annotated in dependency format. The data is derived from OntoNotes Release 5.0 (LDC2013T19).

Version 2.0 adds annotations of named entities (persons, locations, organizations) into dependency trees that are aware of compound function words. Version 1.0 is available from LDC as MWE-Aware English Dependency Corpus (LDC2017T01).

MWEs (multiword expressions) were identified in OntoNotes' phrase structure trees and each MWE was established as a single subtree. Those phrase structure subtrees were then converted to a dependency structure (the Stanford dependencies) in CoNLL format. The data is split into 1,728 phrase structure trees as *.parse files and a single 14-column, tab-separated dependency file in *.conll format. Both file types are encoded as UTF-8.
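For readers who want to load the dependency file, the following is a minimal sketch of reading a tab-separated CoNLL-style file in Python. It assumes that sentences are separated by blank lines, as is conventional for CoNLL formats; the newsletter does not describe the meaning of the 14 columns, so the rows are kept as plain lists of strings, and the sample rows below are placeholders rather than real corpus data.

```python
from typing import Iterable, Iterator, List


def read_conll_sentences(lines: Iterable[str]) -> Iterator[List[List[str]]]:
    """Yield sentences as lists of token rows (each row a list of columns).

    Assumes one token per line, tab-separated columns, and a blank line
    between sentences (the usual CoNLL convention).
    """
    sentence: List[List[str]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(line.split("\t"))
    # Emit the final sentence when the input lacks a trailing blank line.
    if sentence:
        yield sentence


def make_row(index: int, form: str) -> str:
    # Placeholder 14-column row: only the first two columns carry values.
    return "\t".join([str(index), form] + ["_"] * 12)


sample_lines = [make_row(1, "in"), make_row(2, "spite"), "", make_row(1, "of")]
sentences = list(read_conll_sentences(sample_lines))
print(len(sentences), [len(row) for s in sentences for row in s])
```

With a real file, the same function can be applied to an open file handle, for example `open(path, encoding="utf-8")`, where `path` points to the *.conll file in the release.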

MWE-Aware English Dependency Corpus Version 2.0 is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.