Wednesday, July 19, 2017

LDC July 2017 Newsletter


LDC at ACL 2017

Fall 2017 Data Scholarship Program

New corpora:

BOLT English Discussion Forums
IARPA Babel Tamil Language Pack IARPA-babel204b-v1.1b
KSUEmotions
Metalogue Multi-Issue Bargaining Dialogue
_________________________________________________________________________

LDC at ACL 2017: July 31-August 2, Vancouver, Canada

ACL has returned to North America and LDC is taking this opportunity to interact with top HLT researchers gathering in Vancouver, Canada. Stop by our exhibition table to learn more about recent developments at the Consortium and new publications.

Fall 2017 Data Scholarship Program

Student applications for the Fall 2017 LDC Data Scholarship program are being accepted now through Friday, September 15, 2017, 11:59PM EST. The LDC Data Scholarship program provides university students with access to LDC data at no cost. Students must complete an application which consists of a data use proposal and letter of support from their advisor.

For more information on application requirements and program rules, please visit the LDC Data Scholarship page.

Applicants can email their materials to the LDC Data Scholarship program.

New corpora

(1) BOLT English Discussion Forums was developed by LDC and consists of 830,440 discussion forum threads in English harvested from the Internet using a combination of manual and automatic processes.

The BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.

The material in this release represents the unannotated English source data in the discussion forum genre. Collection was seeded based on the results of manual data scouting by native speaker annotators. When multiple threads from a forum were submitted, the entire forum was automatically harvested and added to the collection. Only a small portion of the threads included in this release were manually reviewed, and it is expected that there may be some offensive or otherwise undesired content as well as some threads that contain a large amount of non-English content. Language identification was performed on all threads in this corpus (using CLD2).

BOLT English Discussion Forums is distributed via web download.

2017 Subscription Members will automatically receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) IARPA Babel Tamil Language Pack IARPA-babel204b-v1.1b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains 200 hours of Tamil conversational and scripted telephone speech collected in 2012 and 2013 along with corresponding transcripts.

The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

The Tamil speech in this release represents that spoken in the Northern, Central, Southern and Western dialect regions of the Indian state of Tamil Nadu. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 65 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Tamil Language Pack IARPA-babel204b-v1.1b is distributed via web download.

2017 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*

(3) KSUEmotions was developed by King Saud University (KSU) and contains approximately five hours of emotional Modern Standard Arabic (MSA) speech from 23 subjects. Speakers were from three countries: Yemen, Saudi Arabia and Syria.

Subjects read MSA sentences from newswire text in the following emotions: neutral, anger, sadness, happiness, surprise, and interrogative (asking a question). Human reviewers then listened to the recordings to identify the emotion they heard. Audio was recorded in each participant's home.

KSUEmotions is distributed via web download.

2017 Subscription Members will automatically receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(4) Metalogue Multi-Issue Bargaining Dialogue was developed by the Metalogue Consortium under the European Community's Seventh Framework Programme for Research and Technological Development. This release consists of approximately 2.5 hours of semantically annotated English dialogue data that includes speech and transcripts.

The goal of the Metalogue project was to develop a dialogue system with flexible dialogue management to enable the system's behavior in setting goals, choosing strategies and monitoring various processes. Six unique subjects (undergraduates between 19 and 25 years of age) were involved in a multi-issue bargaining scenario in which a representative of a city council and a representative of small business owners negotiated the implementation of new anti-smoking regulations. The negotiation involved four issues, each with four or five options. Participants received a preference profile for each scenario and negotiated for an agreement with the highest value based on their preference information. Negotiators were not allowed to accept an agreement with a negative value or to share their preference profiles with other participants.
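To make the negotiation rules above concrete, here is a minimal Python sketch of how an agreement could be scored against a preference profile. It assumes that an agreement's value is the sum of the values of the chosen options, which the description does not state. The issue names, option labels and numbers are invented for illustration and are not taken from the corpus, and the helpers `agreement_value`, `is_acceptable` and `best_agreement` are hypothetical.

```python
from itertools import product

# Hypothetical preference profile for one negotiator:
# four issues, each with four or five options, mapped to invented values.
PROFILE = {
    "issue_1": {"a": 4, "b": 2, "c": 0, "d": -3},
    "issue_2": {"a": -2, "b": 1, "c": 3, "d": 5},
    "issue_3": {"a": 3, "b": 1, "c": -1, "d": -4, "e": -5},
    "issue_4": {"a": -4, "b": -2, "c": 0, "d": 2},
}


def agreement_value(profile, agreement):
    """Sum the values of the chosen options (additive scoring is an assumption)."""
    return sum(profile[issue][option] for issue, option in agreement.items())


def is_acceptable(profile, agreement):
    """Negotiators may not accept an agreement with a negative value."""
    return agreement_value(profile, agreement) >= 0


def best_agreement(profile):
    """Search every combination of options for the highest-valued agreement."""
    issues = list(profile)
    candidates = (
        dict(zip(issues, combo))
        for combo in product(*(profile[issue] for issue in issues))
    )
    return max(candidates, key=lambda agreement: agreement_value(profile, agreement))


if __name__ == "__main__":
    best = best_agreement(PROFILE)
    print(best, agreement_value(PROFILE, best))
```

In a two-party setting, each side would hold its own profile, so an option that scores well for one negotiator may score poorly for the other.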

The dialogue speech was captured with two headset microphones and saved in 16kHz, 16-bit mono linear PCM FLAC format. Transcripts were produced semi-automatically, using an automatic speech recognizer followed by manual correction.

Metalogue Multi-Issue Bargaining Dialogue is distributed via web download.

2017 Subscription Members will automatically receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Friday, June 16, 2017

LDC June 2017 Newsletter

New publications:



______________________________________________________

(1) Abstract Meaning Representation (AMR) Annotation Release 2.0 was developed by LDC, SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of over 39,260 English natural language sentences from broadcast conversations, newswire, weblogs and web discussion forums.

AMR captures “who is doing what to whom” in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree-structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independent of its syntax.

LDC also released Abstract Meaning Representation (AMR) Annotation Release 1.0 (LDC2014T12).

Abstract Meaning Representation (AMR) Annotation Release 2.0 is distributed via web download.

2017 Subscription Members will automatically receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) CHiME2 WSJ0 was developed as part of The 2nd CHiME Speech Separation and Recognition Challenge and contains approximately 166 hours of English speech from a noisy living room environment. The CHiME Challenges focus on distant-microphone automatic speech recognition (ASR) in real-world environments.

CHiME2 WSJ0 reflects the medium vocabulary track of the CHiME2 Challenge. The target utterances were taken from CSR-I (WSJ0) Complete (LDC93S6A), specifically, the 5,000 word subset of read speech from Wall Street Journal news text. Data is divided into training, development and test sets and includes baseline scoring, decoding and retraining tools. 

CHiME2 WSJ0 is distributed via web download.

2017 Subscription Members will automatically receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) UCLA High-Speed Laryngeal Video and Audio was developed by the UCLA Speech Processing and Auditory Perception Laboratory and is comprised of high-speed laryngeal video recordings of the vocal folds and synchronized audio recordings from nine subjects collected between April 2012 and April 2013. Speakers were asked to sustain the vowel /i/ for approximately ten seconds while holding voice quality, fundamental frequency, and loudness as steady as possible.

In the field of speech production theory, data such as that contained in this release may be used to study the relationship between vocal fold vibration and resulting voice quality.

None of the subjects had a history of a voice disorder. There was no native language requirement for recruiting subjects; participants were native speakers of various languages, including English, Mandarin Chinese, Taiwanese Mandarin, Cantonese and German.

UCLA High-Speed Laryngeal Video and Audio is distributed via hard drive.

2017 Subscription Members will automatically receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Monday, May 15, 2017

LDC May 2017 Newsletter

In this newsletter:

Recent Collaborations

New publications:
______________________________________________________________

Recent Collaborations
Collaborations play an important role in many LDC activities. Over the past twenty-five years, LDC has partnered, consulted, and otherwise “collaborated” with a variety of organizations to advance research community goals. Recently, LDC partnered with Oxford Wave Research to integrate its latest speech technology into data collection and annotation processes. LDC also supports the Hearables Challenge sponsored by the National Science Foundation by creating and distributing training and test corpora. Finally, LDC Executive Director Chris Cieri is working with international colleagues to plan LREC2018 as a member of the Conference Programme Committee.
LDC welcomes new collaborations. Let us know what interests you and how we can work together. Contact LDC to begin the conversation.

New publications:

(1) IARPA Babel Lao Language Pack IARPA-babel203b-v3.1a was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 207 hours of Lao conversational and scripted telephone speech collected in 2013 along with corresponding transcripts.

The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

The Lao speech in this release represents that spoken in the Vientiane dialect region in Laos. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 60 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Lao Language Pack IARPA-babel203b-v3.1a is distributed via web download.

2017 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) Multi-Language Conversational Telephone Speech 2011 -- Turkish was developed by LDC and is comprised of approximately 18 hours of telephone speech in Turkish. The data was collected primarily to support research and technology evaluation in automatic language identification, and portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation (LRE).

Participants were recruited by native speakers who contacted acquaintances in their social network. Those native speakers made one call, up to 15 minutes, to each acquaintance. The data was collected using LDC's telephone collection infrastructure, comprised of three computer telephony systems. Human auditors labeled calls for callee gender, dialect type and noise. Demographic information about the participants was not collected.

LDC has also released Multi-Language Conversational Telephone Speech 2011 -- Slavic Group (LDC2016S11).

Multi-Language Conversational Telephone Speech 2011 -- Turkish is distributed via web download.

2017 Subscription Members will automatically receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) Phrase Detectives Corpus was developed by the School of Computer Science and Electronic Engineering at the University of Essex and consists of approximately 19,012 words across 40 documents anaphorically-annotated by the Phrase Detectives game, an online interactive "game-with-a-purpose" designed to collect data about English anaphoric coreference.

The documents in the corpus are taken from Wikipedia articles and from narrative text in Project Gutenberg. Annotations are comprised of a gold standard version created by multiple experts, as well as a set created by a large non-expert crowd (via the Phrase Detectives game).

The data was annotated according to a prevalent linguistically-oriented approach for anaphora used in several tasks, including OntoNotes Release 5.0 (LDC2013T19), SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages (LDC2011T01) and The ARRAU Corpus of Anaphoric Information (LDC2013T22).

Phrase Detectives Corpus is distributed via web download.

2017 Subscription Members will automatically receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data at no cost.

*

(4) The EventStatus Corpus was developed by researchers at Texas A&M University, Stanford University and The University of Utah. It consists of approximately 3,000 English and 1,500 Spanish news articles about civil unrest events annotated with temporal tags.

This corpus was designed to support the study of the temporal and aspectual properties of major events, that is, whether an event has already happened, is currently happening or may happen in the future. Since it focuses on a single domain (civil unrest events), it may be appropriate for tasks such as event extraction and temporal question answering.

The relevant news articles were sourced from English Gigaword Fifth Edition (LDC2011T07) and Spanish Gigaword Third Edition (LDC2011T12). The civil unrest events include protests, demonstrations, marches and strikes.

The EventStatus Corpus is distributed via web download.

2017 Subscription Members will automatically receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee. 

Monday, April 17, 2017

LDC April 2017 Newsletter

LDC celebrates 25 years

LDC data and commercial technology development

New publications:
_________________________________________________________________________

LDC celebrates 25 years
April 2017 marks the beginning of LDC’s 25th year as the leader in language resource development and distribution. Founded in 1992, the Consortium has grown from a data repository to a vibrant data center that creates, shares and archives language resources. The Catalog continues to grow, boasting over 700 titles in more than 90 languages. With the support of members, licensees, sponsors and collaborators, LDC has distributed over 120,000 copies of data to more than 3,500 organizations worldwide. Our heartfelt thanks for your support as we continue our mission to provide large quantities of diverse data, research program support and high quality member services.

LDC data and commercial technology development
Any organization wishing to use LDC data to develop or test products for commercialization, or to use LDC data in any commercial product or for any commercial purpose, must first license the data as a For-Profit Member. Once the data is licensed under the For-Profit Membership, the organization retains perpetual rights to use the data for commercial technology development. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit our Licensing page for more information.

New Corpora
(1) 2010 NIST Speaker Recognition Evaluation Test Set was developed by LDC and NIST (National Institute of Standards and Technology). It contains 2,255 hours of American English telephone speech and interview speech recorded over a microphone channel used as test data in the NIST-sponsored 2010 Speaker Recognition Evaluation (SRE).

The telephone speech segments include two-channel excerpts of approximately 10 seconds and 5 minutes. There are also summed-channel excerpts in the range of 5 minutes. The microphone excerpts are 3-15 minutes in duration. As in prior evaluations, intervals of silence were not removed.

The 2010 evaluation includes not only conversational telephone speech (CTS) recorded over ordinary telephone channels for the core training and test conditions, but also CTS and conversational interview speech recorded over a room microphone channel. Unlike prior evaluations, some of the conversational telephone style speech was collected in a manner to produce particularly high, or particularly low, vocal effort on the part of the speaker of interest. In addition to evaluation data, this package also includes answer keys, trial and train files, development data and evaluation documentation.

2010 NIST Speaker Recognition Evaluation Test Set is distributed via hard drive.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(2) BOLT Egyptian Arabic SMS/Chat and Transliteration was developed by LDC and consists of naturally-occurring Short Message Service (SMS) and Chat (CHT) data collected through data donations and live collection involving native speakers of Egyptian Arabic. The corpus contains 5,691 conversations totaling 1,029,248 words across 262,026 messages. Messages were natively written in either Arabic orthography or romanized Arabizi. A total of 1,856 Arabizi conversations (287,022 words) were transliterated from the original romanized Arabizi script into standard Arabic orthography and then reviewed, corrected and normalized by LDC annotators according to "Conventional Orthography for Dialectal Arabic" (CODA).

The BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.

BOLT Egyptian Arabic SMS/Chat and Transliteration is distributed via web download.
2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(3) CHiME2 Grid was developed as part of The 2nd CHiME Speech Separation and Recognition Challenge and contains approximately 120 hours of English speech from a noisy living room environment. The CHiME Challenges focus on distant-microphone automatic speech recognition (ASR) in real-world environments.

CHiME2 Grid reflects the small vocabulary track of the CHiME2 Challenge. The target utterances were taken from the Grid corpus and consist of 34 speakers reading simple 6-word sequences. The data is divided into training, development and test sets.

CHiME2 Grid is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Friday, March 17, 2017

LDC March 2017 Newsletter


New publications:

__________________________________________________________________

New publications:
(1) BOLT Chinese Discussion Forum Parallel Training Data was developed by LDC and consists of 1,876,799 tokens of Chinese discussion forum data collected for the DARPA BOLT program along with their corresponding English translations.

The BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.

The source data in this release consists of discussion forum threads harvested from the Internet by LDC using a combination of manual and automatic processes. The full source data collection is released as BOLT Chinese Discussion Forums (LDC2016T05). Word-aligned and tagged data is released as BOLT Chinese-English Word Alignment and Tagging - Discussion Forum Training (LDC2016T19).

BOLT Chinese Discussion Forum Parallel Training Data is distributed via web download.

2017 Subscription Members will automatically receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) IARPA Babel Swahili Language Pack IARPA-babel202b-v1.0d was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 200 hours of Swahili conversational and scripted telephone speech collected from 2012-2014 along with corresponding transcripts.

The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

The Swahili speech in this release represents that spoken in the Nairobi dialect region of Kenya. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 65 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

Transcripts are encoded in UTF-8.

IARPA Babel Swahili Language Pack IARPA-babel202b-v1.0d is distributed via web download.

2017 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) Noisy TIMIT Speech was developed by the Florida Institute of Technology and contains approximately 322 hours of speech from the TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1) modified with different additive noise levels. Only the audio has been modified; the original arrangement of the TIMIT corpus is still as described by the TIMIT documentation.

The additive noise types are white, pink, blue, red, violet and babble noise with levels varying in 5 dB (decibel) steps, ranging from 5 to 50 dB. The color noise types were generated artificially using MATLAB. The babble noise was selected from a random segment of recorded babble speech scaled relative to the power of the original TIMIT audio signal.
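As a rough illustration of the noise conditions described above, the following Python sketch enumerates the 5 dB steps and mixes white noise into a signal at a chosen level. It assumes that the stated levels are signal-to-noise ratios, which the description does not confirm. The functions `noise_levels_db`, `power` and `add_white_noise` are hypothetical and are not part of the corpus tooling, which was built with MATLAB.

```python
import math
import random


def noise_levels_db(start=5, stop=50, step=5):
    """Return the noise levels in dB: 5, 10, ..., 50."""
    return list(range(start, stop + 1, step))


def power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def add_white_noise(signal, snr_db, seed=0):
    """Add Gaussian white noise scaled to a target SNR (assumed interpretation)."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    # Scale the noise so that signal power / noise power matches the target SNR.
    target_noise_power = power(signal) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / power(noise))
    return [s + scale * n for s, n in zip(signal, noise)]


if __name__ == "__main__":
    # Toy 440 Hz tone sampled at 16 kHz, standing in for a speech signal.
    tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
    for level in noise_levels_db():
        noisy = add_white_noise(tone, level)
        print(level, "dB ->", len(noisy), "samples")
```

Scaling the noise relative to the power of the clean signal mirrors the way the babble noise is described as being scaled; colored noise would additionally require spectral shaping, which this sketch omits.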

Noisy TIMIT Speech is distributed via web download.

2017 Subscription Members will automatically receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(4) GALE English-Chinese Parallel Aligned Treebank -- Training was developed by LDC and contains 196,123 tokens of word aligned English and Chinese parallel text with treebank annotations. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.

Parallel aligned treebanks are treebanks annotated with morphological and syntactic structures aligned at the sentence level and the sub-sentence level. Such data sets are useful for natural language processing and related fields, including automatic word alignment system training and evaluation, transfer-rule extraction, word sense disambiguation, translation lexicon extraction and cultural heritage and cross-linguistic studies. With respect to machine translation system development, parallel aligned treebanks may improve system performance with enhanced syntactic parsers, better rules and knowledge about language pairs and reduced word error rate.

The English source data was translated into Chinese. Chinese and English treebank annotations were performed independently. The parallel texts were then word aligned. The material in this release corresponds to portions of the treebanked data in OntoNotes 3.0 (LDC2009T24) and OntoNotes 4.0 (LDC2011T03).

This release consists of English source broadcast programming (CNN, NBC/MSNBC) and web data collected by LDC in 2005 and 2006.

GALE English-Chinese Parallel Aligned Treebank -- Training is distributed via web download.

2017 Subscription Members will automatically receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.


Wednesday, February 15, 2017

LDC February 2017 Newsletter

LDC Director Mark Liberman receives the IEEE James L. Flanagan Speech and Audio Processing Award
Only two weeks left to enjoy 2017 membership discounts
Spring 2017 LDC Data Scholarship recipients
New publications:
______________________________________________________________

LDC Director Mark Liberman receives the IEEE James L. Flanagan Speech and Audio Processing Award

LDC Director Mark Liberman is the 2017 recipient of the IEEE James L. Flanagan Speech and Audio Processing Award. Established in 2002, this annual award recognizes an individual for his or her outstanding contribution to the advancement of speech and/or audio processing. Liberman’s pioneering contributions and continued leadership in robust, replicable, and data-driven speech and language science and engineering have fueled the development and advancement of human language technologies including speech and speaker recognition, machine translation, and semantic analysis. As LDC’s founder, Mark has shepherded the Consortium from a small organization to the largest developer of shared language resources, distributing more than 120,000 copies of over 2,000 databases covering 91 different languages to more than 3,600 organizations in over 70 countries. 

Liberman will receive the award at ICASSP 2017 in New Orleans (March 5-9). LDC will be an exhibitor at Booth 43. Please stop by and say hello. We hope to see you there.   

Only two weeks left to enjoy 2017 membership discounts
There is still time to save on 2017 membership fees. Through March 1, all organizations receive a discount on the 2017 membership fee (up to 10%) when they choose to join or renew.  

For more information on membership benefits, visit Join LDC.

Spring 2017 LDC Data Scholarship recipients

Congratulations to the recipients of LDC's Spring 2017 data scholarship:

Umad Ul Hassan and Muhammad Awais Zulfiqar: National University of Sciences and Technology (Pakistan); BS Computer Science. Hassan and Zulfiqar are awarded copies of CSLU: Kids’ Speech Version 1.1 and The CMU Kids Corpus for their research in speech recognition for children with learning difficulties.

For information about the program, visit the Data Scholarship page.

New publications:
(1) First-Year Law Students' Court Memoranda consists of 197 English law student writing samples of legal briefs annotated for certain characteristics along with accompanying survey responses by student writers.

The briefs were created in a law school writing class at two law schools in the US Midwest during the 2011-12 academic year. Students who agreed to participate in this study uploaded their briefs to an online survey instrument and answered questions regarding their age, gender, level of education, most recent writing course and method of learning English. The study's purpose was to apply natural language processing approaches to determine any differences in the briefs' language attributable to the students' self-reported genders.

The samples were imported into the General Architecture for Text Engineering (GATE) and annotated by two human coders who identified large text segments specific to the legal genre in which the students wrote, such as text headings, citations, block quotes and footnotes.

Writing samples are presented as MS Word documents and annotations and survey responses are presented in XML format. The data has been anonymized to remove names and other identifying information about the student participants.

First-Year Law Students' Court Memoranda is distributed via web download.

2017 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*
(2) IARPA Babel Haitian Creole Language Pack IARPA-babel201b-v0.2b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 203 hours of Haitian Creole conversational and scripted telephone speech collected in 2012 and 2013 along with corresponding transcripts.

The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

The Haitian Creole speech in this release represents that spoken in the Northern, Western and Southern dialect regions in Haiti. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 75 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

Transcripts are encoded in UTF-8.

IARPA Babel Haitian Creole Language Pack IARPA-babel201b-v0.2b is distributed via web download.

2017 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*
(3) GALE Phase 3 Arabic Broadcast News Speech Part 2 was developed by LDC and is comprised of approximately 128 hours of Arabic broadcast news speech collected in 2007 by LDC, MediaNet, Tunis, Tunisia and MTC, Rabat, Morocco during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) program.

Corresponding transcripts are released as GALE Phase 3 Arabic Broadcast News Transcripts Part 2 (LDC2017T04).

The recordings in this corpus feature news broadcasts focusing principally on current events from various broadcast programmers including Abu Dhabi TV, Al Alam News Channel, Al Arabiya, Al Iraqiyah, Aljazeera, Al Ordiniyah, Dubai TV, Kuwait TV, Lebanese Broadcast Corporation, Nile TV, Saudi TV and Syria TV.

This release contains 175 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Arabic speaker.

GALE Phase 3 Arabic Broadcast News Speech Part 2 is distributed via web download.

2017 Subscription Members will automatically receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*

(4) GALE Phase 3 Arabic Broadcast News Transcripts Part 2 was developed by LDC and contains transcriptions of approximately 128 hours of Arabic broadcast news speech collected in 2007 by LDC, MediaNet, Tunis, Tunisia and MTC, Rabat, Morocco during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) program.

Corresponding audio data is released as GALE Phase 3 Arabic Broadcast News Speech Part 2 (LDC2017S02). 

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 721,846 tokens. The transcripts were created with the LDC tool, XTrans, which supports manual transcription and annotation of audio recordings.
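
As a rough illustration of working with tab-delimited, UTF-8 transcript files of this kind, the sketch below counts whitespace-separated tokens in one column. It is not the method used to produce the token total above. The column index, the ";;" comment prefix, the sample rows and the function name count_tokens are all assumptions made for the example; consult the corpus documentation for the actual field layout.

```python
import csv
import io


def count_tokens(tdf_stream, transcript_col=7):
    """Count whitespace-separated tokens in one column of a tab-delimited stream.

    transcript_col is a hypothetical column index, not taken from the corpus docs.
    """
    total = 0
    # QUOTE_NONE keeps quotation marks inside transcripts as literal characters.
    reader = csv.reader(tdf_stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines, lines assumed to be comments (";;" prefix), and short rows.
        if not row or row[0].startswith(";;") or len(row) <= transcript_col:
            continue
        total += len(row[transcript_col].split())
    return total


# Hypothetical sample data: one comment line and two transcript rows.
sample = (
    ";; hypothetical comment line\n"
    "f1\t0\t0.0\t1.5\tspk1\tmale\tnative\tمرحبا بكم\n"
    "f1\t0\t1.5\t3.0\tspk2\tfemale\tnative\tنشرة الأخبار اليوم\n"
)

print(count_tokens(io.StringIO(sample)))
```

For a file on disk, the same function can be passed a handle opened with open(path, encoding="utf-8", newline=""), where path is whatever transcript file is being inspected.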

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and quick rich transcription specification (QRTR), both of which are included in the documentation with this release.

GALE Phase 3 Arabic Broadcast News Transcripts Part 2 is distributed via web download.

2017 Subscription Members will automatically receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.



Thursday, January 19, 2017

LDC January 2017 Newsletter

LDC Membership Discounts for MY2017 Still Available

New publications:
___________________________________________________________________
LDC Membership Discounts for MY2017 Still Available
Join LDC now while membership savings are still available. 2016 members receive a 10% discount when renewing before March 1, 2017, or a 5% discount when renewing any time in 2017. Non-consecutive members and new members receive a 5% discount when joining or renewing before March 1, 2017. Membership remains the most economical way to access LDC releases. This year’s planned publications include the 2010 NIST Speaker Recognition Evaluation data set, Multilanguage Conversational Telephone Speech, Noisy TIMIT, IARPA Babel Language Packs, RATS Keyword Spotting, BOLT parallel and word-aligned data in all languages, and more. Browse the Members pages for details on membership options and benefits.

New Corpora

(1) Arabic Speech Recognition Pronunciation Dictionary was developed by the Qatar Computing Research Institute. It contains approximately two million pronunciation entries for 526,000 Modern Standard Arabic words, for an average of 3.84 pronunciations for each grapheme word. The dictionary was developed from news archive resources, including the Arabic news website Aljazeera.net. The selected words were those that occurred more than once in the news collection. 
Arabic Speech Recognition Pronunciation Dictionary is distributed via web download.
2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(2) IARPA Babel Vietnamese Language Pack IARPA-babel107b-v0.7 was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 201 hours of Vietnamese conversational and scripted telephone speech collected in 2012 along with corresponding transcripts.

The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

The Vietnamese speech in this release represents that spoken in the North, North-Central, Central and Southern dialect regions in Vietnam. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 64 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

Transcripts are encoded in UTF-8.

IARPA Babel Vietnamese Language Pack IARPA-babel107b-v0.7 is distributed via web download.

2017 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(3) MWE-Aware English Dependency Corpus was developed by the Nara Institute of Science and Technology Computational Linguistics Laboratory and consists of English compound function words annotated in dependency format. The data is derived from the Wall Street Journal portion of OntoNotes Release 5.0 (LDC2013T19).
Compound function words are a type of multiword expression (MWE). MWEs are groups of tokens that can be treated as a single semantic or syntactic unit. Doing so facilitates natural language processing tasks such as constituency and dependency parsing.
MWE-Aware English Dependency Corpus is distributed via web download.
2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(4) GALE Phase 3 and 4 Chinese Web Parallel Text was developed by LDC and contains Chinese source text and corresponding English translations selected from weblog and newsgroup data collected by LDC and translated by LDC or under its direction.
The data includes 88 source-translation document pairs, comprising 67,514 tokens of Chinese source text and its English translation.
GALE Phase 3 and 4 Chinese Web Parallel Text is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.