
Tuesday, August 18, 2015

LDC 2015 August Newsletter

Fall 2015 LDC Data Scholarship program - September 15 deadline approaching 

LDC at Interspeech 2015

2013 Data Pack deadline is September 15

LDC co-organizes LSA2016 Pre-conference Workshop

New publications:

Fall 2015 LDC Data Scholarship program - September 15 deadline approaching
Student applications for the Fall 2015 LDC Data Scholarship program are being accepted now through Tuesday, September 15, 2015, 11:59 PM EST. The LDC Data Scholarship program provides university students with access to LDC data at no cost. The program is open to students pursuing undergraduate or graduate studies at an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay.

Students will need to complete an application that consists of a data use proposal and a letter of support from their adviser. For further information on application materials and program rules, please visit the LDC Data Scholarship page.

Applicants can email their materials to the LDC Data Scholarship program.

LDC at Interspeech 2015
LDC will once again be exhibiting at Interspeech, held this year September 7-10 in Dresden, Germany.  Stop by booth 20 to learn more about recent developments at the Consortium and new publications.

Also, be on the lookout for the following presentations featuring LDC work:

Monday 7 September, Poster Session 3-9, 11:00–13:00
Investigating Consonant Reduction in Mandarin Chinese with Improved Forced Alignment: Jiahong Yuan and Mark Liberman (both LDC) 

Wednesday 9 September, Oral Session 36-5, 17:50–18:10
The Effect of Spectral Slope on Pitch Perception: Jianjing Kuang (UPenn) and Mark Liberman (LDC)

LDC will post conference updates via our Twitter feed and Facebook page. We hope to see you there!  

2013 Data Pack deadline is September 15
One month remains for not-for-profit and government organizations to create a custom data collection of eight corpora from among LDC’s 2013 releases. Selection options include: 1993-2007 United Nations Parallel Text, Chinese Treebank 8.0, CSC Deceptive Speech, GALE Arabic and Chinese speech and text releases, Greybeard, MADCAT training data, NIST 2012 Open Machine Translation (OpenMT) evaluation and progress sets, and more. The 2013 Data Pack is available for a flat rate of $3500 through September 15, 2015.

To license the Data Pack and select eight corpora, log in to or register for an LDC user account and add the 2013 Data Pack and each of the eight data sets to your bin. Follow the check-out procedure, sign all applicable user agreements, and select payment via wire transfer, purchase order or check. LDC will adjust the invoice total to reflect the Data Pack fee.

To pay via credit card, add the 2013 Data Pack to your bin and check out using the system prompts. At the completion of the transaction, send an email to LDC indicating the eight data sets to include in your order. 

LDC co-organizes LSA2016 Pre-conference Workshop
University of Arizona’s Malcah Yeager-Dror and LDC’s Chris Cieri are organizing the upcoming LSA 2016 workshop “Preparing your Corpus for Archival Storage”. The session is sponsored by the National Science Foundation (BCS #1549994) and will be held on Thursday, January 7, 2016 in Washington, DC before the start of the 90th Annual Meeting of the Linguistic Society of America (LSA 2016).

The workshop will examine critical factors that must be considered when preparing data for comparison and sharing, building on the topics discussed in the LSA 2012 workshop, "Coding for Sociolinguistic Archive Preparation". Invited speakers will discuss specific coding conventions for factors such as socioeconomic and educational speaker demographics, language choice, stance and footing.

There will be no additional registration fee to attend the session for those already taking part in the annual meeting. Students who are about to carry out their own fieldwork, or who have begun doing so, are eligible to apply by November 2, 2015 for funding to help defray the extra costs of attending the workshop. For more information about the speakers and topics, visit LDC’s workshop page.

New publications
(1) Arabic Learner Corpus was developed at the University of Leeds and consists of written essays and spoken recordings by Arabic learners collected in Saudi Arabia in 2012 and 2013. The corpus includes 282,732 words across 1,585 materials produced by 942 students of 67 nationalities studying at pre-university and university levels. The average essay length is 178 words.

Two tasks were used to collect the written data, and participants could complete one or both. In each task, learners were asked to write a narrative about a vacation trip and a discussion of their study interests. Those choosing the first task produced a 40-minute timed essay without the use of any language reference materials. In the second task, participants completed the writing as a take-home assignment over two days and were permitted to use language reference materials.

For the audio recordings, students were given a limited amount of time to speak about the same topics without using language reference materials.

The original handwritten essays were transcribed into electronic text. The corpus data consists of three types: (1) handwritten sheets scanned to PDF; (2) audio recordings in MP3 format; and (3) Unicode text data in plain text and XML formats (including transcripts of both the audio recordings and the handwritten essays). The audio files are either 44.1 kHz two-channel or 16 kHz single-channel MP3 files.
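
As a minimal sketch of working with the text portion of the release (the corpus root path and directory layout below are assumptions for illustration, not taken from the corpus documentation), the plain-text and XML files can be read with standard Python libraries:

    # Minimal sketch: iterate over the plain-text and XML material files.
    # The corpus root path is hypothetical; consult the release documentation
    # for the actual directory structure and XML schema.
    from pathlib import Path
    import xml.etree.ElementTree as ET

    corpus_root = Path("arabic_learner_corpus")  # hypothetical location

    # Plain-text files are UTF-8 Unicode text.
    for txt_file in sorted(corpus_root.glob("**/*.txt")):
        words = txt_file.read_text(encoding="utf-8").split()
        print(txt_file.name, len(words), "words")

    # XML files carry the same material with markup; element names vary.
    for xml_file in sorted(corpus_root.glob("**/*.xml")):
        root = ET.parse(xml_file).getroot()
        print(xml_file.name, "root element:", root.tag)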

Arabic Learner Corpus is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus provided that they have completed the license agreement.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*
(2) GALE Phase 3 Arabic Broadcast Conversation Speech Part 1 was developed by LDC and comprises approximately 123 hours of Arabic broadcast conversation speech collected in 2007 by LDC; MediaNet, Tunis, Tunisia; and MTC, Rabat, Morocco during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) program.

Corresponding transcripts are released as GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 1 (LDC2015T16).

Broadcast audio for the GALE program was collected at LDC’s Philadelphia, PA USA facilities and at three remote collection sites. The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources for a total of over 30,000 hours of collected broadcast audio over the life of the program.

The broadcast conversation recordings in this release feature interviews, call-in programs and roundtable discussions focusing principally on current events from the following sources: Abu Dhabi TV, Al Alam News Channel, Al Arabiya, Aljazeera, Al Ordiniyah, Dubai TV, Lebanese Broadcasting Corporation, Oman TV, Saudi TV, and Syria TV.

This release contains 149 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. 
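
For readers working directly with the audio, the FLAC files can be decoded with off-the-shelf tools. Below is a minimal sketch using the soundfile package; the file name is a placeholder, not an actual file from the release:

    # Minimal sketch: decode one FLAC file and confirm the advertised format.
    # The file name is a placeholder; soundfile (pip install soundfile) reads
    # FLAC natively and returns the samples plus the sample rate.
    import soundfile as sf

    audio, rate = sf.read("example_recording.flac")
    channels = 1 if audio.ndim == 1 else audio.shape[1]
    print("sample rate:", rate)    # expected: 16000
    print("channels:", channels)   # expected: 1
    print("duration (s):", round(len(audio) / rate, 2))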

GALE Phase 3 Arabic Broadcast Conversation Speech Part 1 is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*
(3) GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 1 was developed by LDC and contains transcriptions of approximately 123 hours of Arabic broadcast conversation speech collected in 2007 by LDC; MediaNet, Tunis, Tunisia; and MTC, Rabat, Morocco during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) program.

Corresponding audio data is released as GALE Phase 3 Arabic Broadcast Conversation Speech Part 1 (LDC2015S11).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 733,233 tokens. The transcripts were created with the LDC-developed transcription tool, XTrans, a multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings. 

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and quick rich transcription specification (QRTR), both of which are included in the documentation for this release.

GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 1 is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

Monday, April 20, 2015

LDC 2015 April Newsletter

2013 Data Pack available through September 15

LDC supports NSF data management plans

New publications:
________________________________________________________________________ 

2013 Data Pack available through September 15

Not-for-profit and government organizations can now create a custom data collection from among LDC’s 2013 releases. The 2013 Data Pack allows users to license eight corpora published in 2013 for a flat rate of US$3500. Selection options include Greybeard, NIST 2012 Open Machine Translation (OpenMT) evaluation and progress sets, Chinese Treebank 8.0, GALE Arabic and Chinese speech and text releases, 1993-2007 United Nations Parallel Text, MADCAT training data, CSC Deceptive Speech and more. Organizations acquire perpetual rights to the corpora licensed through the pack. The Data Pack is not a membership, and organizations must request all eight data sets at the time of purchase. The 2013 Data Pack is available to not-for-profit and government organizations for a limited time only, through September 15.

To license the Data Pack and select eight corpora, log in to or register for an LDC user account and add the 2013 Data Pack and each of the eight data sets to your bin. Follow the check-out procedure, sign all applicable user agreements, and select payment via wire transfer, purchase order or check. LDC will adjust the invoice total to reflect the Data Pack fee.

To pay via credit card, add the 2013 Data Pack to your bin and check out using the system prompts. At the completion of the transaction, send an email to ldc@ldc.upenn.edu indicating the eight data sets to include in your order.

As always, users can contact ldc@ldc.upenn.edu to facilitate the transaction.   


LDC supports NSF data management plans

This month’s publication of The Subglottal Resonances Database is the latest in a series of releases of data developed with National Science Foundation (NSF) funding. Long before researchers were required to develop data management plans, they deposited their research data at LDC in accordance with NSF’s longstanding desire that data generated with program funds should be readily accessible at a reasonable cost. Well known data sets in the series include The Santa Barbara Corpus of Spoken American English (multiple parts), Propbank and Grassfields Bantu Fieldwork.

NSF now requires researchers to deposit funded data in an accessible, trustworthy archive. LDC’s expertise in data curation, distribution and management and its commitment to the broad accessibility of linguistic data make it the repository of choice for NSF-funded data. Learn more about how LDC can assist in developing and implementing data management plans from the Data Management Plans section on our website or contact LDC Data Management Plans.

The Subglottal Resonances Database was developed with the support of NSF Grant No. 0905250. It is available to LDC members at no cost; non-members may license the data set for a fee of $30 plus shipping. 

New publications

(1) GALE Phase 3 and 4 Arabic Broadcast News Parallel Text includes 86 source-translation document pairs, comprising 325,538 words of Arabic source text and its English translation. Data is drawn from 28 distinct Arabic programs broadcast between 2007 and 2008 from Abu Dhabi TV, Al Alam News Channel, Al Arabiya, Al Baghdadya, Alhurra, Al Iraqiyah, Aljazeera, Al Ordiniyah, Al Sharqiya, Dubai TV, Kuwait TV, Lebanese Broadcasting Corporation, Oman TV, Radio Sawa, Saudi TV, and Syria TV. Broadcast news programming consists of news programs focusing principally on current events.

The files in this release were transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with the Quick Rich Transcription guidelines developed by LDC. Transcribers indicated sentence boundaries in addition to transcribing the text. Data was manually selected for translation according to several criteria, including linguistic features, transcription features and topic features. The transcribed and segmented files were then reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Arabic to English translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations.

Source data and translations are distributed in TDF format. TDF files are tab-delimited files containing one segment of text along with meta information about that segment. Each field in the TDF file is described in TDF_format.txt. All data are encoded in UTF-8.
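
A minimal sketch of reading such a TDF file follows; the file name is a placeholder, and the number and meaning of the columns are defined in TDF_format.txt rather than assumed here:

    # Minimal sketch: read a tab-delimited TDF file, one segment per line.
    # The file name is a placeholder; consult TDF_format.txt in the release
    # for the actual field names and order.
    with open("example_file.tdf", encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            # Each line holds one text segment plus its metadata fields.
            print(len(fields), "fields; first few:", fields[:3])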

GALE Phase 3 and 4 Arabic Broadcast News Parallel Text is distributed via web download.  2015 Subscription Members will automatically receive two copies of this corpus on disc.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(2) Mandarin Chinese Phonetic Segmentation and Tone was developed by LDC and contains 7,849 Mandarin Chinese "utterances" and their phonetic segmentation and tone labels separated into training and test sets. The utterances were derived from 1997 Mandarin Broadcast News Speech and Transcripts (HUB4-NE) (LDC98S73 and LDC98T24, respectively). That collection consists of approximately 30 hours of Chinese broadcast news recordings from Voice of America, China Central TV and KAZN-AM, a commercial radio station based in Los Angeles, CA. The ability to use large speech corpora for research in phonetics, sociolinguistics and psychology, among other fields, depends on the availability of phonetic segmentation and transcriptions. This corpus was developed to investigate the use of phone boundary models on forced alignment in Mandarin Chinese. Using the approach of embedded tone modeling (also used for incorporating tones for automatic speech recognition), the performance on forced alignment between tone-dependent and tone-independent models was compared.

Utterances were defined as the time-stamped between-pause units in the transcribed news recordings. Utterances containing background noise, music, unidentified speakers or accented speakers were excluded. A test set was developed with 300 utterances randomly selected from six speakers (50 utterances per speaker). The remaining 7,549 utterances formed the training set.

The utterances in the test set were manually labeled and segmented into initials and finals in Pinyin, a Roman alphabet system for transcribing Chinese characters. Tones were marked on the finals, including Tone1 through Tone4, and Tone0 for the neutral tone. The Sandhi Tone3 was labeled as Tone2. The training set was automatically segmented and transcribed using the LDC forced aligner, which is a Hidden Markov Model (HMM) aligner trained on the same utterances (Yuan et al. 2014). The aligner achieved 93.1% agreement (of phone boundaries) within 20 ms on the test set compared to manual segmentation. The quality of the phonetic transcription and tone labels of the training set was evaluated by checking 100 utterances randomly selected from it. The 100 utterances contained 1,252 syllables: 15 syllables had mistaken tone transcriptions; two syllables showed mistaken transcriptions of the final, and there were no syllables with transcription errors on the initial.
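
The 93.1% agreement figure counts automatically placed phone boundaries that fall within 20 ms of the corresponding manually placed boundaries. A minimal sketch of that comparison is shown below; the boundary times are illustrative values, not data from the corpus:

    # Minimal sketch: fraction of automatic phone boundaries that fall within
    # 20 ms of the corresponding manual boundaries.  The two lists are
    # illustrative stand-ins; in practice they would be read from the manual
    # and forced-alignment label files and paired one-to-one per utterance.
    manual_boundaries = [0.130, 0.275, 0.410, 0.562, 0.700]  # seconds
    auto_boundaries   = [0.128, 0.290, 0.405, 0.601, 0.698]  # seconds

    TOLERANCE = 0.020  # 20 ms

    within = sum(
        1 for m, a in zip(manual_boundaries, auto_boundaries)
        if abs(m - a) <= TOLERANCE
    )
    agreement = within / len(manual_boundaries)
    print(f"{agreement:.1%} of boundaries within {TOLERANCE * 1000:.0f} ms")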

Each utterance has three associated files: a FLAC-compressed WAV file, a word transcript file, and a phonetic boundary and label file.

Mandarin Chinese Phonetic Segmentation and Tone is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus on disc, provided that they have submitted a completed copy of the user license agreement. 2015 Standard Members may request a copy as part of their 16 free membership corpora. As a members-only release, Mandarin Chinese Phonetic Segmentation and Tone is not available for non-member licensing.

*

(3) The Subglottal Resonances Database was developed by Washington University and University of California Los Angeles and consists of 45 hours of simultaneous microphone and subglottal accelerometer recordings of 25 adult male and 25 adult female speakers of American English between 22 and 25 years of age.

The subglottal system is composed of the airways of the tracheobronchial tree and the surrounding tissues. It powers airflow through the larynx and vocal tract, allowing for the generation of most of the sound sources used in languages around the world. The subglottal resonances (SGRs) are the natural frequencies of the subglottal system. During speech, the subglottal system is acoustically coupled to the vocal tract via the larynx. SGRs can be measured from recordings of the vibration of the skin of the neck during phonation by an accelerometer, much like speech formants are measured through microphone recordings. SGRs have received attention in studies of speech production, perception and technology. They affect voice production, divide vowels and consonants into discrete categories, affect vowel perception and can be useful in automatic speech recognition.

Speakers were recruited by Washington University's Psychology Department. The majority of the participants were Washington University students who represented a wide range of American English dialects, although most were speakers of the mid-American English dialect. The corpus consists of 35 monosyllables in a phonetically neutral carrier phrase (“I said a ____ again”), with 10 repetitions of each word by each speaker, resulting in 17,500 individual microphone (and accelerometer) waveforms. The monosyllables comprised 14 hVd words and 21 CVb words, where C was b, d or g and V included all American English monophthongs and diphthongs. The target vowel in each utterance was hand-labeled to indicate the start, stop and steady-state portions of the vowel. For diphthongs, the steady state refers to the diphthong nucleus, which occurs early in the vowel.

Audio files are presented as single-channel, 16-bit, FLAC-compressed WAV files with sample rates of 48 kHz or 16 kHz. Image files are in bitmap format, and plain text files are UTF-8 encoded.

The Subglottal Resonances Database is distributed on one USB drive.

2015 Subscription Members will automatically receive a copy of this corpus. 2015 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.