Friday, December 15, 2017

LDC December 2017 Newsletter

Spring 2018 LDC Data Scholarship Program - deadline approaching

Lingo Boingo: a web portal to language games

__________________________________________________________________________

Spring 2018 LDC Data Scholarship Program - deadline approaching
Students can apply for the Spring 2018 Data Scholarship Program now through January 15, 2018. The LDC Data Scholarship program provides students with access to LDC data at no cost. For more information on application requirements and program rules, please visit the LDC Data Scholarships page.

Lingo Boingo: a web portal to language games
LDC is pleased to announce a new collaborative project, Lingo Boingo (https://lingoboingo.org/), a web portal that brings together new and existing language games that are fun to play and that provide useful annotations and judgments for linguistic research. Gamers and grammar lovers can choose from a list of challenging games, which will continue to expand through the efforts of LDC and external collaborators. For more information, contact jfiumara@ldc.upenn.edu. Start playing today!

Renew your LDC membership today
Membership Year 2018 (MY2018) is open for joining and discounts are available for those who keep their membership current and join early in the year. Current MY2017 members who renew before March 1, 2018 will receive a 10% discount off the membership fee. New or returning organizations will receive a 5% discount through March 1.

In addition to receiving new publications, current year LDC members also enjoy the benefit of licensing older data at reduced costs from our Catalog of over 700 holdings; current year for-profit members may use most data for commercial applications. Visit Join LDC for details on membership, user accounts and payment.

Plans for MY2018 publications are in progress. Among the expected releases are:
  • Multilanguage conversational telephone speech: developed to support language identification research in related languages (Central Asian, Central European language groups)
  • DIRHA (Distant-speech Interaction for Robust Home Applications):  Wall Street Journal read speech with noise and reverberation, suitable for various multi-microphone signal processing and distant speech recognition tasks
  • TRAD corpora: Chinese-French and Arabic-French parallel text (newswire, web data)
  • IARPA Babel Language Packs (telephone speech and transcripts): languages include Cebuano, Guarani, Kazakh, Lithuanian, Telugu, Tok Pisin
  • BOLT: discussion forum, SMS, word-aligned, and tagged data in all languages (Egyptian Arabic, English, Chinese)
  • DEFT: Spanish Treebank (newswire, web data)
  • RATS:  Language Identification data set (Dari, Farsi, Levantine Arabic, Pashto, Urdu; degraded audio signals)
  • TAC KBP: comprehensive English source and entity linked data (broadcast, telephone speech, newswire, web data)
  • German children’s handwriting: longitudinal study of weekly writing in classroom setting with enhanced output for specific spelling patterns

New publications:

(1) CHiME3 was developed as part of The 3rd CHiME Speech Separation and Recognition Challenge and contains approximately 342 hours of English speech and transcripts from noisy environments and 50 hours of noisy environment audio. The CHiME Challenges focus on distant-microphone automatic speech recognition (ASR) in real-world environments. CHiME3 involved two types of data: speech data recorded in very noisy environments (on a bus, in a cafe, pedestrian area, and street junction) and noisy utterances generated by artificially mixing clean speech data with noisy backgrounds.

Data is divided into training, development and test sets. All data is provided as 16 bit WAV files sampled at 16 kHz. The audio data consists of the background noises, enhanced speech data using the baseline speech enhancement technique, unsegmented noisy speech data, and segmented noisy speech data.
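The stated audio format (16-bit WAV sampled at 16 kHz) can be checked before processing. The following is a minimal sketch using only Python's standard `wave` module; the function names `wav_format` and `is_16bit_16khz` are hypothetical helpers for illustration and are not part of the corpus or of any LDC tooling.

```python
import wave


def wav_format(path):
    """Return basic format information read from a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        return {
            "channels": wav_file.getnchannels(),
            "sample_rate_hz": sample_rate,
            # getsampwidth() reports bytes per sample, so convert to bits.
            "bits_per_sample": wav_file.getsampwidth() * 8,
            "duration_seconds": wav_file.getnframes() / sample_rate,
        }


def is_16bit_16khz(path):
    """Check a file against the format described above (16 bit, 16 kHz)."""
    info = wav_format(path)
    return info["bits_per_sample"] == 16 and info["sample_rate_hz"] == 16000
```

A check like this is mainly useful as a guard at the start of a pipeline, so that files in an unexpected format are reported early rather than silently resampled.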

LDC has also released two CHiME2 corpora -- CHiME2 Grid and CHiME2 WSJ0.

CHiME3 is distributed via USB drive.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*

(2) GALE Phase 4 Chinese Broadcast News Speech was developed by LDC and is comprised of approximately 134 hours of Mandarin Chinese broadcast news speech collected in 2008 by LDC and Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

Corresponding transcripts are released as GALE Phase 4 Chinese Broadcast News Transcripts (LDC2017T18).

The broadcast news recordings in this release feature news broadcasts focusing principally on current events from the following sources: China Central TV (CCTV), a national and international broadcaster in Mainland China; Phoenix TV, a Hong Kong-based satellite television station; and Voice of America (VOA), a U.S. government-funded broadcast programmer.

This release contains 256 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Chinese speaker following Audit Procedure Specification Version 2.0 which is included in this release.

GALE Phase 4 Chinese Broadcast News Speech is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee. 
*
 
(3) GALE Phase 4 Chinese Broadcast News Transcripts was developed by LDC and contains transcriptions of approximately 134 hours of Chinese broadcast news speech collected in 2008 by LDC and Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

Corresponding audio data is released as GALE Phase 4 Chinese Broadcast News Speech (LDC2017S25).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 1,696,879 tokens. The transcripts were created with the LDC tool, XTrans, which supports manual transcription and annotation of audio recordings.

GALE Phase 4 Chinese Broadcast News Transcripts is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.



Friday, November 17, 2017

LDC November 2017 Newsletter

Join LDC for Membership Year 2018

Spring 2018 Data Scholarship Program
Commercial use and LDC data
____________________________________________________________________

Join LDC for Membership Year 2018

Membership Year 2018 (MY2018) is open for joining and discounts are available for those who keep their membership current and join early in the year. Now through March 1, 2018, current MY2017 members who renew before March 1 will receive a 10% discount off the membership fee. New or returning organizations will receive a 5% discount through March 1.

In addition to receiving new publications, current year LDC members also enjoy the benefit of licensing older data at reduced costs from our Catalog of over 700 holdings; current year for-profit members may use most data for commercial applications.

Plans for MY2018 publications are in progress. Among the expected releases are:
  • Multilanguage conversational telephone speech: developed to support language identification research in related languages (Central Asian, Central European language groups)
  • DIRHA (Distant-speech Interaction for Robust Home Applications):  Wall Street Journal read speech with noise and reverberation, suitable for various multi-microphone signal processing and distant speech recognition tasks
  • TRAD corpora: Chinese-French and Arabic-French parallel text (newswire, web data)
  • IARPA Babel Language Packs (telephone speech and transcripts): languages include Cebuano, Guarani, Kazakh, Lithuanian, Telugu, Tok Pisin
  • BOLT: discussion forum, SMS, word-aligned, and tagged data in all languages (Egyptian Arabic, English, Chinese)
  • DEFT: Spanish Treebank (newswire, web data)
  • RATS:  Language Identification data set (Dari, Farsi, Levantine Arabic, Pashto, Urdu; degraded audio signals)
  • TAC KBP: comprehensive English source and entity linked data (broadcast, telephone speech, newswire, web data)
  • German children’s handwriting: longitudinal study of weekly writing in classroom setting with enhanced output for specific spelling patterns
And don’t forget, MY2017 and MY2016 are still open for joining. MY2016 can be joined through December 31, 2017 and includes data such as BOLT Chinese Discussion Forums, IARPA Babel Language Packs in multiple languages and Multi-Language Conversational Telephone Speech – Slavic Group. MY2017 will remain open through December 31, 2018; among the year’s releases are 2010 NIST Speaker Recognition Evaluation Test Set, RATS Keyword Spotting, Noisy TIMIT Speech and BOLT Egyptian Arabic SMS/Chat and Transliteration. For full descriptions of these data sets, browse our Catalog.
Visit Join LDC for details on membership, user accounts and payment.

Spring 2018 Data Scholarship Program
Applications are now being accepted through January 15, 2018 for the Spring 2018 LDC Data Scholarship program, which provides university students with no-cost access to LDC data. Consult the LDC Data Scholarship page for more information about program rules and submission requirements.

Commercial use and LDC data
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

(1) ASpIRE Development and Development Test Sets was developed for the Automatic Speech Recognition In Reverberant Environments (ASpIRE) Challenge sponsored by IARPA (the Intelligence Advanced Research Projects Activity). It contains approximately 226 hours of English speech with transcripts and scoring files.

The audio data is a subset of Mixer 6 Speech (LDC2013S03), audio recordings of interviews, transcript readings and conversational telephone speech collected by LDC in 2009 and 2010 from native English speakers local to the Philadelphia area. The transcripts were developed by Appen for the ASpIRE challenge.

Data is divided into development and development test sets.

ASpIRE Development and Development Test Sets is distributed via web download.

2017 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(2) CIEMPIESS Light (Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social, Light) was developed by the Speech Processing Laboratory of the Faculty of Engineering at the National Autonomous University of Mexico (UNAM) and consists of approximately 18 hours of Mexican Spanish radio and television speech and associated transcripts. The goal of this work was to create acoustic models for automatic speech recognition. For more information and documentation see the CIEMPIESS-UNAM Project website.

CIEMPIESS Light is an updated version of CIEMPIESS, released by LDC as LDC2015S07. This "light" version contains speech and transcripts presented in a revised directory structure that allows for use with the Kaldi toolkit.

The speech recordings were collected from Podcast UNAM, a program created by Radio-IUS, and Mirador Universitario, a TV program broadcast by UNAM. They are comprised of spontaneous conversations in Mexican Spanish between a moderator and guests.


The audio files are in 16 kHz, 16-bit PCM flac format, and transcripts are presented as UTF-8 encoded plain text.

CIEMPIESS Light is distributed via web download.
2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) IARPA Babel Kurmanji Kurdish Language Pack IARPA-babel205b-v1.0a was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 203 hours of Kurmanji Kurdish conversational and scripted telephone speech collected in 2013 and 2014 along with corresponding transcripts.

The Kurmanji Kurdish speech in this release represents that spoken in the southeastern and eastern Anatolian regions of Turkey. The gender distribution among speakers is approximately 37% female and 63% male; speakers' ages range from 16 years to 70 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Kurmanji Kurdish Language Pack IARPA-babel205b-v1.0a is distributed via web download.

2017 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(4) TAC KBP Chinese Cross-lingual Entity Linking - Comprehensive Training and Evaluation Data 2011-2014 was developed by LDC and contains training and evaluation data produced in support of the TAC KBP Chinese Cross-lingual Entity Linking tasks in 2011, 2012, 2013 and 2014. It includes queries and gold standard entity type information, Knowledge Base links, and equivalence class clusters for NIL entities along with the source documents for the queries, specifically, English and Chinese newswire, discussion forum and web data. The corresponding knowledge base is available as TAC KBP Reference Knowledge Base (LDC2014T16).

The goal of TAC KBP’s entity linking track is to measure systems’ ability to determine whether an entity, specified by a query, has a matching node in a reference knowledge base and if so, to create a link between the two. If there is no matching node, entity linking systems are required to cluster the mention together with others referencing the same entity. More information about the TAC KBP Entity Linking task and other TAC KBP evaluations can be found on the NIST TAC website.

TAC KBP Chinese Cross-lingual Entity Linking - Comprehensive Training and Evaluation Data 2011-2014 is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Wednesday, October 18, 2017

LDC October 2017 Newsletter

LDC Awards Fall Data Scholarships

Membership Year 2018 Publication Preview

New Publications:
RATS Keyword Spotting
MWE-Aware English Dependency Corpus Version 2.0
_________________________________________________________________________

LDC Awards Fall Data Scholarships
LDC is pleased to award fifteen data scholarships to students this fall. Recipients are from eight countries and a variety of academic disciplines. Twenty unique data sets are awarded to the students for their work in diverse applications including machine translation, abstractive text summarization using recurrent neural networks, speech recognition for multiple languages, semantic role labeling for social data, text summarization, speaker recognition for forensic applications, and more. Please look to LDC’s social media pages for upcoming announcements highlighting each recipient and their intended research.  Congratulations to all of our recipients! 

Membership Year 2018 Publication Preview
The 2018 Membership Year is just around the corner and plans for next year’s publications are in progress. Among the expected releases are:
  • Multilanguage conversational telephone speech: developed to support language identification research in related languages (Central Asian, Central European language groups)
  • DIRHA (Distant-speech Interaction for Robust Home Applications): Wall Street Journal read speech with noise and reverberation, suitable for various multi-microphone signal processing and distant speech recognition tasks
  • TRAD corpora: Chinese-French and Arabic-French parallel text (newswire, web data)
  • IARPA Babel Language Packs (telephone speech and transcripts): languages include Cebuano, Guarani, Kazakh, Lithuanian, Telugu, Tok Pisin
  • BOLT: discussion forum, SMS, word-aligned, and tagged data in all languages (Egyptian Arabic, English, Chinese)
  • DEFT: Spanish Treebank (newswire, web data)
  • RATS Language Identification data set  (Dari, Farsi, Levantine Arabic, Pashto, Urdu; degraded audio signals)
  • TAC KBP: comprehensive English source and entity linked data (broadcast, telephone speech, newswire, web data)
  • German children’s handwriting (longitudinal study of weekly writing in classroom setting with enhanced output for specific spelling patterns)
Check your inbox in the coming weeks for more information about membership renewal.



New publications:

(1) RATS Keyword Spotting was developed by LDC and is comprised of approximately 3,100 hours of Levantine Arabic and Farsi conversational telephone speech with automatic and manual annotation of speech segments, transcripts, and keywords generated from transcript content. The corpus was created to provide training, development, and initial test sets for the keyword spotting (KWS) task in the DARPA RATS (Robust Automatic Transcription of Speech) program.

The source audio consists of conversational telephone speech recordings collected by LDC: (1) data collected for the RATS program from Levantine Arabic and Farsi speakers; and (2) material from Levantine Arabic QT Training Data Set 5, Speech (LDC2006S29) and CALLFRIEND Farsi Second Edition Speech (LDC2014S01). Transcripts of calls were either produced or available from the source corpora. Potential target keywords were selected from the transcripts based on word frequencies to fall within a range of target-word likelihood per hour of speech. The selected words were manually reviewed to confirm that each was a regular or multi-word expression of more than three syllables.

RATS Keyword Spotting is distributed via hard drive.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) English Web Treebank Propbank was developed by the University of Colorado Boulder - CLEAR (Computational Language and Education Research) and provides predicate-argument structure annotation for English Web Treebank (LDC2012T13).

The goal of Propbank (or proposition bank) annotation is to develop annotations with information about basic semantic propositions. English Web Treebank Propbank provides semantic role annotation and predicate sense disambiguation for roughly 50,000 predicates, corresponding to all verbs, all adjectives in equational clauses, and all nouns considered to be predicative. Mark-up is in the "unified" propbank annotation format, which combines representations in nouns, verbs, and adjectives. The source data consists of weblogs, newsgroups, email, reviews, and questions-answers.

English Web Treebank Propbank is distributed via Web Download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*
 
(3)  Ancient Chinese Corpus was developed at Nanjing Normal University. It contains word-segmented and part-of-speech tagged text from Zuozhuan, an ancient Chinese work believed to date from the Warring States Period (475-221 BC). This release is part of a continuing project to develop a large, part-of-speech tagged ancient Chinese corpus. It consists of 180,000 Chinese characters and 195,000 segment units (including words and punctuation). The part-of-speech tag set was developed by Nanjing Normal University and contains 17 tags. The files are presented in UTF-8 plain text files using traditional Chinese script.

Ancient Chinese Corpus is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(4) MWE-Aware English Dependency Corpus Version 2.0 was developed by the Nara Institute of Science and Technology Computational Linguistics Laboratory and consists of English compound function words annotated in dependency format. The data is derived from OntoNotes Release 5.0 (LDC2013T19).

Version 2.0 adds annotations of named entities (persons, locations, organizations) into dependency trees that are aware of compound function words. Version 1.0 is available from LDC as MWE-Aware English Dependency Corpus (LDC2017T01).

MWEs (multiword expressions) were identified in OntoNotes' phrase structure trees and each MWE was established as a single subtree. Those phrase structure subtrees were then converted to a dependency structure (the Stanford dependencies) in CoNLL format. The data is split into 1,728 phrase structure trees as *.parse files and a single 14-column tab separated dependency as a *.conll file. Both file types are encoded as UTF-8.
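As an illustration of working with a tab-separated dependency file of this kind, the sketch below groups rows into sentences and verifies the 14-column width mentioned above. It assumes the common CoNLL convention that blank lines separate sentences, which the description here does not state explicitly; the function name `read_conll_sentences` is hypothetical, and columns are treated purely by position because their meanings are not listed in this announcement.

```python
from typing import Iterable, Iterator, List


def read_conll_sentences(
    lines: Iterable[str], expected_columns: int = 14
) -> Iterator[List[List[str]]]:
    """Yield sentences as lists of token rows, each row a list of fields.

    Assumption: sentences are separated by blank lines, as in most
    CoNLL-style files. Rows with an unexpected column count raise
    ValueError so that format problems surface early.
    """
    sentence: List[List[str]] = []
    for line_number, raw_line in enumerate(lines, start=1):
        line = raw_line.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        if len(fields) != expected_columns:
            raise ValueError(
                f"line {line_number}: expected {expected_columns} "
                f"columns, found {len(fields)}"
            )
        sentence.append(fields)
    # Emit the final sentence when the file does not end with a blank line.
    if sentence:
        yield sentence
```

The function accepts any iterable of lines, so it can be passed a file object opened with `encoding="utf-8"`, matching the encoding stated for the corpus.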

MWE-Aware English Dependency Corpus Version 2.0 is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Thursday, September 14, 2017

LDC September 2017 Newsletter

New Publications:

________________________________________________________________________

New publications:

(1) 2015-2016 CoNLL Shared Task contains the Chinese and English training, development and test data for the 2015 and 2016 CoNLL (Conference on Computational Natural Language Learning) Shared Task Evaluation which focused on shallow discourse parsing. This release consists of the tokenized, tagged, and parsed tags in English and Chinese. The English train, dev and test data are from Wall Street Journal material in Penn Discourse Treebank Version 2.0 (LDC2008T05); English blind test data are from wikinews. Chinese train, dev and test data are news material from Chinese Discourse Treebank 0.5 (LDC2014T21); Chinese blind test data are from wikinews.

LDC has also released the following CoNLL Shared Task data sets:
  • 2006 CoNLL Shared Task - Ten Languages (LDC2015T11)
  • 2006 CoNLL Shared Task - Arabic & Czech (LDC2015T12)
  • 2008 CoNLL Shared Task Data (LDC2009T12)
  • 2009 CoNLL Shared Task Part 1 (LDC2012T03)
  • 2009 CoNLL Shared Task Part 2 (LDC2012T04)

2015-2016 CoNLL Shared Task is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) IARPA Babel Zulu Language Pack IARPA-babel206b-v0.1e was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 211 hours of Zulu conversational and scripted telephone speech collected in 2012 and 2013 along with corresponding transcripts.

The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

The Zulu speech in this release represents that spoken in the KZN (KwaZulu-Natal)-urban dialect region of South Africa. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 70 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Zulu Language Pack IARPA-babel206b-v0.1e is distributed via web download.

2017 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee. 

*

(3) SRI-FRTIV (Five-way Recorded Toastmaster Intrinsic Variation) was developed by SRI International in 2007-2008 and is comprised of approximately 232 hours of English speech from thirty-four speakers who were members of Toastmaster clubs. Participants were asked to speak at three different levels of effort (low, normal and high) in four different styles (interview, conversation, reading and oration) to study the question of how intrinsic variations -- associated with the speaker rather than the recording environment -- affect text-independent speaker verification.

Participants were native speakers of North American English who were members of local Toastmasters clubs and had experience in public speaking. This release includes demographic information for 30 speakers (15 male, 15 female), including gender, birth year, height, education level, years in Toastmasters, and a self-evaluation of speaking skills.

SRI-FRTIV is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(4) Vehicle City Voices Corpus – Part I was developed at the University of Michigan-Flint and is an ongoing oral history project and survey of English language variation in Flint, Michigan. It contains approximately 16 hours of speech with corresponding transcripts from interviews of Flint residents conducted between 2012 and 2015. The corpus was designed to provide high-quality recordings for acoustic analysis and to examine narrative structure and discursive construction of individual and collective identity in urban spaces.

This release is comprised of 21 interviews by undergraduate and graduate students for civic engagement projects in linguistics courses and by a graduate student research assistant. Participants (11 female, 10 male) were born between 1935 and 1991 and represented a range of ages, genders, and ethnicities. Of the interviewees, 11 were Black/African American, 8 were White/Caucasian, and 2 were biracial/mixed ethnic heritage.

Metadata (where provided by participants) includes information on gender, ethnicity, year of birth, level of education, field of employment, average income, length of time living in Flint and its surrounding areas, as well as interviewer age, gender, and ethnicity. In addition, original interview durations, edited interview durations, interview year, and transcript word counts are also provided in the metadata file.

Vehicle City Voices Corpus – Part I is available as a web download. 

2017 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.



Tuesday, August 15, 2017

LDC August 2017 Newsletter

Fall 2017 LDC Data Scholarship program

LDC at Interspeech 2017

New Publications:
________________________________________________________________
Fall 2017 LDC Data Scholarship program - September 15 deadline approaching

There is still time to apply to the Fall 2017 LDC Data Scholarship program. Applications will be accepted through Friday September 15, 2017. The LDC Data Scholarship program provides university students with access to LDC data at no cost. Students must complete an application which consists of a data use proposal and letter of support from their advisor.

For more information on application requirements and program rules, please visit the LDC Data Scholarship page. 

Applicants can email their materials to the LDC Data Scholarship program.

LDC at Interspeech 2017

LDC will once again be exhibiting at Interspeech, held this year August 20-24 in Stockholm, Sweden. Stop by booth 17 to learn more about recent developments at the Consortium and new publications.

Also, be on the lookout for the following presentations featuring LDC work:

Speaker Comparison for Forensic and Investigative Applications III
LDC Executive Director, Chris Cieri, panelist for Topic A: “Process Map and Standardization”
Special Event Session, Wednesday August 23, 13:30-15:30, Hall B3

Call My Net Corpus: A Multilingual Corpus for Evaluation of Speaker Recognition Technology 
Karen Jones, Stephanie Strassel, Kevin Walker, David Graff, Jonathan Wright
Wednesday, August 23, 17:40-18:00 in the Aula Magna room

LDC will post conference updates via our Twitter feed and Facebook page. We hope to see you there!   

New publications:
(1) Multi-Language Conversational Telephone Speech 2011 -- South Asian was developed by LDC and is comprised of approximately 118 hours of telephone speech in five distinct language varieties of South Asia (i.e. the Indian sub-continent): Bengali, Hindi, Punjabi, Tamil and Urdu. The data were collected primarily to support research and technology evaluation in automatic language identification, and portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation (LRE). LRE 2011 focused on language pair discrimination for 24 languages/dialects, some of which could be considered mutually intelligible or closely related.

Participants were recruited by native speakers who contacted acquaintances in their social network. Those native speakers made one call, up to 15 minutes, to each acquaintance. The data was collected using LDC's telephone collection infrastructure, comprised of three computer telephony systems. Human auditors labeled calls for callee gender, dialect type, and noise. Demographic information about the participants was not collected.

LDC has also released the following as part of the Multi-Language Conversational Telephone Speech 2011 series: Slavic Group (LDC2016S11) and Turkish (LDC2017S09).

Multi-Language Conversational Telephone Speech 2011 -- South Asian is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*
(2) GALE Phase 4 Arabic Broadcast Conversation Speech was developed by LDC and is comprised of approximately 75 hours of Arabic broadcast conversation speech collected in 2008 and 2009 by LDC, MediaNet, Tunis, Tunisia and MTC, Rabat, Morocco during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

Corresponding transcripts are released as GALE Phase 4 Arabic Broadcast Conversation Transcripts (LDC2017T12).

This release contains 83 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Arabic speaker following Audit Procedure Specification Version 2.0 which is included in this release. 

The broadcast conversation recordings in this release feature interviews, call-in programs and roundtable discussions focusing principally on current events from the following sources: Al Alam News Channel, based in Iran; Al Fayhaa, an Iraqi television channel; Al Hiwar, a regional broadcast station based in the United Kingdom; Alhurra, a U.S. government-funded regional broadcaster; Aljazeera, a regional broadcaster located in Doha, Qatar; Al Ordiniyah, a national broadcast station in Jordan; Dubai TV, a broadcast station in the United Arab Emirates; Lebanese Broadcasting Corporation, a Lebanese television station; Saudi TV, a national television station based in Saudi Arabia; Syria TV, the national television station in Syria; and Tunisian National TV, a national television station in Tunisia.

GALE Phase 4 Arabic Broadcast Conversation Speech is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) GALE Phase 4 Arabic Broadcast Conversation Transcripts was developed by LDC and contains transcriptions of approximately 75 hours of Arabic broadcast conversation speech collected in 2008 and 2009 by LDC, MediaNet, Tunis, Tunisia and MTC, Rabat, Morocco during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) program.

Corresponding audio data is released as GALE Phase 4 Arabic Broadcast Conversation Speech (LDC2017S15).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 475,211 tokens. The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and quick rich transcription specification (QRTR), both of which are included in the documentation with this release.
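As a rough illustration of working with tab-delimited, UTF-8 transcript files like these, the Python sketch below counts whitespace-separated tokens in one column. The column position (text_col), the ";;" comment-line convention and the sample rows are assumptions made for the example, not details taken from this release; the corpus documentation defines the actual field layout, and its token count may follow a different tokenization.

```python
from typing import Iterable


def count_tokens(lines: Iterable[str], text_col: int) -> int:
    """Count whitespace-separated tokens in one column of tab-delimited lines."""
    total = 0
    for line in lines:
        line = line.rstrip("\r\n")
        # Assumption: blank lines and lines starting with ";;" carry no transcript text.
        if not line or line.startswith(";;"):
            continue
        fields = line.split("\t")
        if len(fields) <= text_col:
            continue  # row is too short to contain the text column
        total += len(fields[text_col].split())
    return total


# Hypothetical three-column rows: file name, start time, transcript text.
sample = [
    ";; hypothetical header line",
    "f1\t0.0\tمرحبا بكم",
    "f1\t2.5\tشكرا جزيلا لكم",
    "short\trow",
]
print(count_tokens(sample, text_col=2))
```

To run this over a real file, open it with encoding="utf-8" and pass the file object as lines, setting text_col to the transcript column given in the documentation.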

GALE Phase 4 Arabic Broadcast Conversation Transcripts is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Wednesday, July 19, 2017

LDC July 2017 Newsletter


LDC at ACL 2017

Fall 2017 Data Scholarship Program

New corpora:
IARPA Babel Tamil Language Pack IARPA-babel204b-v1.1b
KSUEmotions
Metalogue Multi-Issue Bargaining Dialogue
_________________________________________________________________________

LDC at ACL 2017: July 31-August 2, Vancouver, Canada

ACL has returned to North America and LDC is taking this opportunity to interact with top HLT researchers gathering in Vancouver, Canada. Stop by our exhibition table to learn more about recent developments at the Consortium and new publications.

Fall 2017 Data Scholarship Program

Student applications for the Fall 2017 LDC Data Scholarship program are being accepted now through Friday, September 15, 2017, 11:59PM EST. The LDC Data Scholarship program provides university students with access to LDC data at no cost. Students must complete an application which consists of a data use proposal and letter of support from their advisor.

For more information on application requirements and program rules, please visit the LDC Data Scholarship page.

Applicants can email their materials to the LDC Data Scholarship program.

New corpora

(1) BOLT English Discussion Forums was developed by LDC and consists of 830,440 discussion forum threads in English harvested from the Internet using a combination of manual and automatic processes.

The BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.

The material in this release represents the unannotated English source data in the discussion forum genre. Collection was seeded based on the results of manual data scouting by native speaker annotators. When multiple threads from a forum were submitted, the entire forum was automatically harvested and added to the collection. Only a small portion of the threads included in this release were manually reviewed, and it is expected that there may be some offensive or otherwise undesired content as well as some threads that contain a large amount of non-English content. Language identification was performed on all threads in this corpus (using CLD2).


BOLT English Discussion Forums is distributed via web download.

2017 Subscription Members will automatically receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) IARPA Babel Tamil Language Pack IARPA-babel204b-v1.1b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains 200 hours of Tamil conversational and scripted telephone speech collected in 2012 and 2013 along with corresponding transcripts.

The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

The Tamil speech in this release represents that spoken in the Northern, Central, Southern and Western dialect regions of the Indian state of Tamil Nadu. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 65 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Tamil Language Pack IARPA-babel204b-v1.1b is distributed via web download.

2017 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) KSUEmotions was developed by King Saud University (KSU) and contains approximately five hours of emotional Modern Standard Arabic (MSA) speech from 23 subjects. Speakers were from three countries: Yemen, Saudi Arabia and Syria.

Subjects read MSA sentences from newswire text in the following emotions: neutral, anger, sadness, happiness, surprise, and interrogative (asking a question). Human reviewers then listened to the recordings to identify the emotion they heard. Audio was recorded in each participant's home.


KSUEmotions is distributed via web download.

2017 Subscription Members will automatically receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(4) Metalogue Multi-Issue Bargaining Dialogue was developed by the Metalogue Consortium under the European Community's Seventh Framework Programme for Research and Technological Development. This release consists of approximately 2.5 hours of semantically annotated English dialogue data that includes speech and transcripts.

The goal of the Metalogue project was to develop a dialogue system with flexible dialogue management to enable the system's behavior in setting goals, choosing strategies and monitoring various processes. Six unique subjects (undergraduates between 19 and 25 years of age) were involved in a multi-issue bargaining scenario in which a representative of a city council and a representative of small business owners negotiated the implementation of new anti-smoking regulations. The negotiation involved four issues, each with four or five options. Participants received a preference profile for each scenario and negotiated for an agreement with the highest value based on their preference information. Negotiators were not allowed to accept an agreement with a negative value or to share their preference profiles with other participants.
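The scoring idea behind such a scenario can be sketched in a few lines of Python. This is only an illustration: the issue names, option labels and point values are hypothetical, and the additive scoring rule is an assumption rather than the Metalogue project's documented method. The one rule taken from the description above is that an agreement with a negative value may not be accepted.

```python
from typing import Dict

# A preference profile maps each issue to the value of each of its options.
Profile = Dict[str, Dict[str, int]]


def agreement_value(profile: Profile, agreement: Dict[str, str]) -> int:
    """Sum the profile's values for the option chosen on each issue (assumed additive)."""
    return sum(profile[issue][option] for issue, option in agreement.items())


def may_accept(profile: Profile, agreement: Dict[str, str]) -> bool:
    """A negotiator may not accept an agreement whose total value is negative."""
    return agreement_value(profile, agreement) >= 0


# Hypothetical profile: four issues, each with four or five options.
profile: Profile = {
    "scope": {"A": 4, "B": 2, "C": 0, "D": -3},
    "taxation": {"A": -2, "B": 1, "C": 3, "D": 0},
    "campaign": {"A": 1, "B": 0, "C": -1, "D": 2, "E": -4},
    "enforcement": {"A": -5, "B": -1, "C": 2, "D": 3},
}

good = {"scope": "A", "taxation": "C", "campaign": "D", "enforcement": "D"}
bad = {"scope": "D", "taxation": "A", "campaign": "E", "enforcement": "A"}

print(agreement_value(profile, good), may_accept(profile, good))
print(agreement_value(profile, bad), may_accept(profile, bad))
```

Because each negotiator holds a different profile and may not share it, the same agreement can score well for one side and poorly for the other, which is what drives the bargaining.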

The dialogue speech was captured with two headset microphones and saved in 16kHz, 16-bit mono linear PCM FLAC format. Transcripts were produced semi-automatically, using an automatic speech recognizer followed by manual correction.


Metalogue Multi-Issue Bargaining Dialogue is distributed via web download.

2017 Subscription Members will automatically receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Friday, June 16, 2017

LDC June 2017 Newsletter

New publications:
______________________________________________________

(1) Abstract Meaning Representation (AMR) Annotation Release 2.0 was developed by LDC, SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of over 39,260 English natural language sentences from broadcast conversations, newswire, weblogs and web discussion forums.

AMR captures “who is doing what to whom” in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree-structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independent of its syntax.

LDC also released Abstract Meaning Representation (AMR) Annotation Release 1.0 (LDC2014T12).

Abstract Meaning Representation (AMR) Annotation Release 2.0 is distributed via web download.

2017 Subscription Members will automatically receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) CHiME2 WSJ0 was developed as part of The 2nd CHiME Speech Separation and Recognition Challenge and contains approximately 166 hours of English speech from a noisy living room environment. The CHiME Challenges focus on distant-microphone automatic speech recognition (ASR) in real-world environments.

CHiME2 WSJ0 reflects the medium vocabulary track of the CHiME2 Challenge. The target utterances were taken from CSR-I (WSJ0) Complete (LDC93S6A), specifically, the 5,000-word subset of read speech from Wall Street Journal news text. Data is divided into training, development and test sets and includes baseline scoring, decoding and retraining tools.

CHiME2 WSJ0 is distributed via web download.

2017 Subscription Members will automatically receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) UCLA High-Speed Laryngeal Video and Audio was developed by the UCLA Speech Processing and Auditory Perception Laboratory and comprises high-speed laryngeal video recordings of the vocal folds and synchronized audio recordings from nine subjects collected between April 2012 and April 2013. Speakers were asked to sustain the vowel /i/ for approximately ten seconds while holding voice quality, fundamental frequency, and loudness as steady as possible.

In the field of speech production theory, data such as those contained in this release may be used to study the relationship between vocal fold vibration and the resulting voice quality.

None of the subjects had a history of a voice disorder. There was no native language requirement for recruiting subjects; participants were native speakers of various languages, including English, Mandarin Chinese, Taiwanese Mandarin, Cantonese and German.


UCLA High-Speed Laryngeal Video and Audio is distributed via hard drive.

2017 Subscription Members will automatically receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.