Showing posts with label Wall Street Journal. Show all posts

Friday, June 16, 2017

LDC June 2017 Newsletter

New publications:
______________________________________________________

(1) Abstract Meaning Representation (AMR) Annotation Release 2.0 was developed by LDC, SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of over 39,260 English natural language sentences from broadcast conversations, newswire, weblogs and web discussion forums.

AMR captures “who is doing what to whom” in a sentence. Each sentence is paired with a rooted, directed graph that represents its whole-sentence meaning. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independent of its syntax.
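For illustration, the standard example from the AMR guidelines (not drawn from this release) encodes “The boy wants to go” in PENMAN notation; the reused variable b marks the boy as the argument of both want-01 and go-01:

```
(w / want-01
   :ARG0 (b / boy)
   :ARG1 (g / go-01
            :ARG0 b))
```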

LDC also released Abstract Meaning Representation (AMR) Annotation Release 1.0 (LDC2014T12).

Abstract Meaning Representation (AMR) Annotation Release 2.0 is distributed via web download.

2017 Subscription Members will automatically receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) CHiME2 WSJ0 was developed as part of The 2nd CHiME Speech Separation and Recognition Challenge and contains approximately 166 hours of English speech from a noisy living room environment. The CHiME Challenges focus on distant-microphone automatic speech recognition (ASR) in real-world environments.

CHiME2 WSJ0 reflects the medium vocabulary track of the CHiME2 Challenge. The target utterances were taken from CSR-I (WSJ0) Complete (LDC93S6A), specifically, the 5,000 word subset of read speech from Wall Street Journal news text. Data is divided into training, development and test sets and includes baseline scoring, decoding and retraining tools. 

CHiME2 WSJ0 is distributed via web download.

2017 Subscription Members will automatically receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) UCLA High-Speed Laryngeal Video and Audio was developed by the UCLA Speech Processing and Auditory Perception Laboratory and comprises high-speed laryngeal video recordings of the vocal folds and synchronized audio recordings from nine subjects collected between April 2012 and April 2013. Speakers were asked to sustain the vowel /i/ for approximately ten seconds while holding voice quality, fundamental frequency, and loudness as steady as possible.

In the field of speech production theory, data such as that contained in this release may be used to study the relationship between vocal fold vibration and the resulting voice quality.

None of the subjects had a history of a voice disorder. There was no native language requirement for recruiting subjects; participants were native speakers of various languages, including English, Mandarin Chinese, Taiwanese Mandarin, Cantonese and German.


UCLA High-Speed Laryngeal Video and Audio is distributed via hard drive.

2017 Subscription Members will automatically receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Thursday, January 19, 2017

LDC January 2017 Newsletter

LDC Membership Discounts for MY2017 Still Available

New publications:
___________________________________________________________________
LDC Membership Discounts for MY2017 Still Available
Join LDC now while membership savings are still available. 2016 members receive a 10% discount when renewing before March 1, 2017, or a 5% discount when renewing any time in 2017. Non-consecutive members and new members receive a 5% discount when joining before March 1, 2017. Membership remains the most economical way to access LDC releases. This year’s planned publications include the 2010 NIST Speaker Recognition Evaluation data set, Multilanguage Conversational Telephone Speech, Noisy TIMIT, IARPA Babel Language Packs, RATS Keyword Spotting, BOLT parallel and word-aligned data in all languages, and more. Browse the Members pages for details on membership options and benefits.

New Corpora

(1) Arabic Speech Recognition Pronunciation Dictionary was developed by the Qatar Computing Research Institute. It contains approximately two million pronunciation entries for 526,000 Modern Standard Arabic words, for an average of 3.84 pronunciations for each grapheme word. The dictionary was developed from news archive resources, including the Arabic news website Aljazeera.net. The selected words were those that occurred more than once in the news collection. 
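A pronunciation dictionary of this kind is commonly modeled as a mapping from each grapheme word to its list of pronunciation variants; the sketch below uses made-up entries and phone strings (the release's actual phone set and file format may differ) to show how the per-word average is computed:

```python
# Hypothetical lexicon: grapheme word -> list of pronunciation variants.
lexicon = {
    "kitab": ["k i t a b", "k i t aa b"],  # made-up phone strings
    "jadid": ["j a d i d"],
}

def average_pronunciations(lex):
    """Average number of pronunciation variants per grapheme word."""
    return sum(len(variants) for variants in lex.values()) / len(lex)

print(average_pronunciations(lexicon))  # 1.5
```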
Arabic Speech Recognition Pronunciation Dictionary is distributed via web download.
2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(2) IARPA Babel Vietnamese Language Pack IARPA-babel107b-v0.7 was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 201 hours of Vietnamese conversational and scripted telephone speech collected in 2012 along with corresponding transcripts.

The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

The Vietnamese speech in this release represents that spoken in the North, North-Central, Central and Southern dialect regions in Vietnam. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 64 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

Transcripts are encoded in UTF-8.

IARPA Babel Vietnamese Language Pack IARPA-babel107b-v0.7 is distributed via web download.

2017 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(3) MWE-Aware English Dependency Corpus was developed by the Nara Institute of Science and Technology Computational Linguistics Laboratory and consists of English compound function words annotated in dependency format. The data is derived from the Wall Street Journal portion of OntoNotes Release 5.0 (LDC2013T19).
Compound function words are a type of multiword expression (MWE). MWEs are groups of tokens that can be treated as a single semantic or syntactic unit. Doing so facilitates natural language processing tasks such as constituency and dependency parsing.
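As a rough illustration of treating an MWE as a single unit before parsing, the sketch below collapses a compound function word span into one token; this is a hypothetical pre-processing step, not the corpus's actual annotation format:

```python
def merge_mwes(tokens, mwe_spans):
    """Collapse each MWE span (start, end) into one underscore-joined token,
    e.g. ["in", "spite", "of"] -> "in_spite_of"."""
    span_end = {start: end for start, end in mwe_spans}
    merged, i = [], 0
    while i < len(tokens):
        if i in span_end:
            end = span_end[i]
            merged.append("_".join(tokens[i:end]))
            i = end
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = ["He", "came", "in", "spite", "of", "the", "rain"]
print(merge_mwes(tokens, [(2, 5)]))
# ['He', 'came', 'in_spite_of', 'the', 'rain']
```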
MWE-Aware English Dependency Corpus is distributed via web download.
2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(4) GALE Phase 3 and 4 Chinese Web Parallel Text was developed by LDC and contains Chinese source text and corresponding English translations selected from weblog and newsgroup data collected by LDC and translated by LDC or under its direction.
The data includes 88 source-translation document pairs, comprising 67,514 tokens of Chinese source text and its English translation.
GALE Phase 3 and 4 Chinese Web Parallel Text is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Tuesday, April 15, 2014

LDC April 2014 Newsletter



(1) Domain-Specific Hyponym Relations was developed by the Shaanxi Province Key Laboratory of Satellite and Terrestrial Network Technology at Xi’an Jiaotong University, Xi’an, Shaanxi, China. It provides more than 5,000 English hyponym relations in five domains: data mining, computer networks, data structures, Euclidean geometry and microbiology. All hypernym and hyponym words were taken from Wikipedia article titles.

A hyponym relation is a word sense relation that is an IS-A relation. For example, dog is a hyponym of animal and binary tree is a hyponym of tree structure. Among the applications for domain-specific hyponym relations are taxonomy and ontology learning, query result organization in a faceted search and knowledge organization and automated reasoning in knowledge-rich applications.
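Because IS-A relations are transitive, applications such as taxonomy learning and automated reasoning typically compute the closure of the raw pairs; a minimal sketch (the example pairs are illustrative, not drawn from the corpus):

```python
def hyponym_closure(pairs):
    """Transitive closure of (hyponym, hypernym) pairs:
    if A IS-A B and B IS-A C, then A IS-A C."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

pairs = [("beagle", "dog"), ("dog", "animal")]
print(("beagle", "animal") in hyponym_closure(pairs))  # True
```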

The data is presented in XML format, and each file provides hyponym relations in one domain. Within each file, the term, Wikipedia URL, hyponym relation and the names of the hyponym and hypernym words are included. The distribution of terms and relations is set forth in the table below:

Dataset               Terms   Hyponym Relations
Data Mining             278                 364
Computer Network        336                 399
Data Structure          315                 578
Euclidean Geometry      455                 690
Microbiology          1,028               3,533

Domain-Specific Hyponym Relations is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora. This data is made available at no cost to LDC members and non-members under the Creative Commons Attribution-Noncommercial Share Alike 3.0 license.


*

(2) GALE Arabic-English Parallel Aligned Treebank -- Web Training was developed by LDC and contains 69,766 tokens of word aligned Arabic and English parallel text with treebank annotations. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.


Parallel aligned treebanks are treebanks annotated with morphological and syntactic structures aligned at the sentence level and the sub-sentence level. Such data sets are useful for natural language processing and related fields, including automatic word alignment system training and evaluation, transfer-rule extraction, word sense disambiguation, translation lexicon extraction and cultural heritage and cross-linguistic studies. With respect to machine translation system development, parallel aligned treebanks may improve system performance with enhanced syntactic parsers, better rules and knowledge about language pairs and reduced word error rate.

In this release, the source Arabic data was translated into English. Arabic and English treebank annotations were performed independently. The parallel texts were then word aligned.

LDC has previously released other Arabic-English Parallel Aligned Treebanks.

This release consists of Arabic source web data (newsgroups, weblogs) collected by LDC in 2004 and 2005. All data is encoded as UTF-8. A count of files, words, tokens and segments is below.

Language   Files    Words   Tokens   Segments
Arabic       162   46,710   69,766      3,178

Note: Word count is based on the untokenized Arabic source, token count is based on the ATB-tokenized Arabic source.

The purpose of the GALE word alignment task was to find correspondences between words, phrases or groups of words in a set of parallel texts. Arabic-English word alignment annotation consisted of the following tasks:
  • Identifying different types of links: translated (correct or incorrect) and not translated (correct or incorrect)
  • Identifying sentence segments not suitable for annotation, e.g., blank segments, incorrectly-segmented segments, segments with foreign languages
  • Tagging unmatched words attached to other words or phrases
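Downstream tooling often represents such links as (source index, target index, link type) triples; the sketch below is a hypothetical representation, not the GALE annotation format itself, and the tokens are made-up transliterations:

```python
from collections import namedtuple

# Each link pairs a source-token index with a target-token index and a type tag.
AlignLink = namedtuple("AlignLink", ["src", "tgt", "link_type"])

def aligned_pairs(src_tokens, tgt_tokens, links):
    """Resolve index-based links into the word pairs they connect."""
    return [(src_tokens[l.src], tgt_tokens[l.tgt], l.link_type) for l in links]

src = ["kitab", "jadid"]   # made-up transliterated Arabic tokens
tgt = ["new", "book"]
links = [AlignLink(0, 1, "translated_correct"),
         AlignLink(1, 0, "translated_correct")]
print(aligned_pairs(src, tgt, links))
# [('kitab', 'book', 'translated_correct'), ('jadid', 'new', 'translated_correct')]
```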

GALE Arabic-English Parallel Aligned Treebank -- Web Training is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(3) Multi-Channel WSJ Audio (MCWSJ) was developed by the Centre for Speech Technology Research at the University of Edinburgh and contains approximately 100 hours of recorded speech from 45 British English speakers. Participants read Wall Street Journal texts published in 1987-1989 in three recording scenarios: a single stationary speaker, two stationary overlapping speakers and one single moving speaker.

This corpus was designed to address the challenges of speech recognition in meetings, which often occur in rooms with non-ideal acoustic conditions and significant background noise, and may contain large sections of overlapping speech. Using headset microphones represents one approach, but meeting participants may be reluctant to wear them. Microphone arrays are another option. MCWSJ supports research in large vocabulary tasks using microphone arrays. The news sentences read by speakers are taken from WSJCAM0 Cambridge Read News, a corpus originally developed for large vocabulary continuous speech recognition experiments, which in turn was based on CSR-I (WSJ0) Complete, made available by LDC to support large vocabulary continuous speech recognition initiatives.

Speakers reading news text from prompts were recorded using a headset microphone, a lapel microphone and an eight-channel microphone array. In the single speaker scenario, participants read from six fixed positions. Fixed positions were assigned for the entire recording in the overlapping scenario. For the moving scenario, participants moved from one position to the next while reading.

Fifteen speakers were recorded for the single scenario, nine pairs for the overlapping scenario and nine individuals for the moving scenario. Each read approximately 90 sentences.

Multi-Channel WSJ Audio is distributed on two DVD-ROMs.

2014 Subscription Members will receive a copy of this data provided that they have completed the User License Agreement for Multi-Channel WSJ Audio LDC2014S03. 2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.