Friday, May 15, 2015

LDC 2015 May Newsletter

Early renewing members save again

Commercial use and LDC data

New publications:

Early renewing members save again

LDC's early renewal discount program has resulted in substantial savings for current year members. The 110 organizations that renewed their membership or joined early for Membership Year 2015 (MY2015) saved over US$65,000 on membership fees. MY2014 members are still eligible for a 5% discount when renewing through 2015.

LDC membership benefits include free membership year data as well as discounts on older corpora. For-profit members can use most LDC data for commercial applications. 

Commercial use and LDC data

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit our Licensing page for further information.

New publications

(1) Coordination Annotation for the Penn Treebank is a stand-off annotation for the Wall Street Journal portion of Treebank-3 (PTB3) (LDC99T42) developed by researchers at the University of Düsseldorf and Indiana University. It marks all tokens that have a coordinating function (potentially among other functions).

Coordination is a syntactic structure that links together two or more elements known as conjuncts or conjoins. The presence of coordination is often signaled by the appearance of a coordinator (coordinating conjunction), such as and, or, but in English.

This annotation is presented in a single UTF-8 plain text tsv file with columns as follows:
section: Penn Treebank WSJ section number
file: Number of file within section
sentence: Number of sentence (starting with 0)
token: Number of token (starting with 0)
annotation: "P" if the token is a coordinating punctuation, "O" otherwise

Coordination Annotation for the Penn Treebank is available at no cost to all licensees of PTB3 and appears in their download queue associated with LDC99T42 as penn_coordination_anno_LDC2015T08.tgz.
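
As a minimal sketch of how a file with these five columns might be read, the Python snippet below parses tab-separated rows into named fields and keeps the positions of tokens marked "P". The column order follows the list above; the absence of a header row, the names CoordToken and read_annotations, and the sample rows are assumptions made for illustration and are not confirmed by the release.

```python
import csv
import io
from typing import NamedTuple


class CoordToken(NamedTuple):
    section: str     # WSJ section number, kept as text to preserve any zero padding
    file: str        # number of the file within the section
    sentence: int    # sentence number, starting with 0
    token: int       # token number, starting with 0
    annotation: str  # "P" or "O"


def read_annotations(handle):
    """Yield one CoordToken per non-empty row (assumes no header row)."""
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row:
            continue
        section, file_no, sentence, token, annotation = row
        yield CoordToken(section, file_no, int(sentence), int(token), annotation)


# Made-up sample rows, not taken from the corpus.
sample = "02\t01\t0\t3\tO\n02\t01\t0\t4\tP\n02\t01\t1\t0\tO\n"
tokens = list(read_annotations(io.StringIO(sample)))
coordinating = [(t.sentence, t.token) for t in tokens if t.annotation == "P"]
print(coordinating)
```

For a file on disk, the same function could be given a handle opened with encoding="utf-8" and newline="".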

*

(2) GALE Phase 3 Chinese Broadcast Conversation Speech Part 2 was developed by LDC and is comprised of approximately 112 hours of Mandarin Chinese broadcast conversation speech collected in 2007 and 2008 by LDC and Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) Program. Corresponding transcripts are released as GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 2 (LDC2015T09).

Broadcast audio for the GALE program was collected at LDC’s Philadelphia, PA USA facilities and at three remote collection sites. The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources for a total of over 30,000 hours of collected broadcast audio over the life of the program.

The broadcast conversation recordings in this release feature interviews, call-in programs, and roundtable discussions focusing principally on current events from the following sources: Beijing TV, China Central TV, Hubei TV, Phoenix TV and Voice of America.

This release contains 209 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Chinese speaker following Audit Procedure Specification Version 2.0 which is included in this release. The broadcast auditing process served three principal goals: as a check on the operation of the broadcast collection system equipment by identifying failed, incomplete or faulty recordings, as an indicator of broadcast schedule changes by identifying instances when the incorrect program was recorded, and as a guide for data selection by retaining information about a program’s genre, data type and topic.

GALE Phase 3 Chinese Broadcast Conversation Speech Part 2 is distributed on DVD.  2015 Subscription Members will automatically receive two copies of this corpus.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(3) GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 2 was developed by LDC and contains transcriptions of approximately 112 hours of Chinese broadcast conversation speech collected in 2007 and 2008 by LDC and Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) Program. Corresponding audio data is released as GALE Phase 3 Chinese Broadcast Conversation Speech Part 2 (LDC2015S06).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 1,388,236 tokens. The transcripts were created with the LDC-developed transcription tool, XTrans, a multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings.

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and quick rich transcription specification (QRTR) both of which are included in the documentation with this release. QTR transcription consists of quick (near-) verbatim, time-aligned transcripts plus speaker identification with minimal additional mark-up. It does not include sentence unit annotation. QRTR annotation adds structural information such as topic boundaries and manual sentence unit annotation to the core components of a quick transcript. Files with QTR as part of the filename were developed using QTR transcription. Files with QRTR in the filename indicate QRTR transcription.
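
The filename convention described above lends itself to a simple check. The sketch below is one possible way to sort files by transcription type; the function name, the case-insensitive matching and the example filenames are all assumptions, since the exact naming pattern is documented only in the release itself.

```python
def transcription_type(filename: str) -> str:
    """Classify a transcript file as QRTR, QTR or unknown from its name."""
    name = filename.upper()  # assumption: the marker may appear in either case
    if "QRTR" in name:
        return "QRTR"
    if "QTR" in name:
        return "QTR"
    return "unknown"


# Hypothetical filenames, for illustration only.
for example in ("PROGRAM_A_20070101.qtr.tdf", "PROGRAM_B_20080101.qrtr.tdf"):
    print(example, transcription_type(example))
```

Note that "QTR" is not a substring of "QRTR", so the two markers cannot be confused whichever check runs first.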

GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 2 is distributed via web download.  2015 Subscription Members will automatically receive two copies of this corpus on disc.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(4) SenSem (Sentence Semantics) Lexicons was developed by GRIAL, the Linguistic Applications Inter-University Research Group that includes the following Spanish institutions: the Universitat Autònoma de Barcelona, the Universitat de Barcelona, the Universitat de Lleida and the Universitat Oberta de Catalunya. It contains feature descriptions for approximately 1,300 Spanish verbs and 1,300 Catalan verbs in the SenSem Databank (LDC2015T02). GRIAL's work focuses on resources for applied linguistics, including lexicography, translation and natural language processing.

The verb features for each language consist of two groups: those codified manually, including definition, WordNet synset, Aktionsart, arguments and semantic functions; and those extracted automatically from the SenSem Databank. Among the latter are verb frequency, semantic construction, syntactic categories and constituent order. The verbs analyzed correspond to the 250 most frequent verbs in Spanish and 320 lemmas in Catalan. Further information about the SenSem project can be obtained from the GRIAL website. Data is presented in a single XML file per language.

SenSem Lexicons is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus on disc.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.  This data is made available to LDC not-for-profit members and all non-members under the Creative Commons Attribution-Noncommercial Share Alike 3.0 license and to LDC for-profit members under the terms of the For-Profit Membership Agreement.


Monday, April 20, 2015

LDC 2015 April Newsletter

2013 Data Pack available through September 15

LDC supports NSF data management plans

New publications:
________________________________________________________________________ 

2013 Data Pack available through September 15

Not-for-profit and government organizations can now create a custom data collection from among LDC’s 2013 releases. The 2013 Data Pack allows users to license eight corpora published in 2013 for a flat rate of US$3500. Selection options include Greybeard, NIST 2012 Open Machine Translation (OpenMT) evaluation and progress sets, Chinese Treebank 8.0, GALE Arabic and Chinese speech and text releases, 1993-2007 United Nations Parallel Text, MADCAT training data, CSC Deceptive Speech and more. Organizations acquire perpetual rights to the corpora licensed through the pack. The Data Pack is not a membership, and organizations must request all eight data sets at the time of purchase. The 2013 Data Pack is available to not-for-profit and government organizations for a limited time only, through September 15.

To license the Data Pack and select eight corpora, log in or register for an LDC user account and add the 2013 Data Pack and each of the eight data sets to your bin. Follow the check-out procedure, sign all applicable user agreements and select payment via wire transfer, purchase order or check. LDC will adjust the invoice total to reflect the data pack fee.

To pay via credit card, add the 2013 Data Pack to your bin and check out using the system prompts. At the completion of the transaction, send an email to ldc@ldc.upenn.edu indicating the eight data sets to include in your order.

As always, users can contact ldc@ldc.upenn.edu to facilitate the transaction.   


LDC supports NSF data management plans

This month’s publication of The Subglottal Resonances Database is the latest in a series of releases of data developed with National Science Foundation (NSF) funding. Long before researchers were required to develop data management plans, they deposited their research data at LDC in accordance with NSF’s longstanding desire that data generated with program funds should be readily accessible at a reasonable cost. Well-known data sets in the series include The Santa Barbara Corpus of Spoken American English (multiple parts), Propbank and Grassfields Bantu Fieldwork.

NSF now requires researchers to deposit funded data in an accessible, trustworthy archive. LDC’s expertise in data curation, distribution and management and its commitment to the broad accessibility of linguistic data make it the repository of choice for NSF-funded data. Learn more about how LDC can assist in developing and implementing data management plans from the Data Management Plans section on our website or contact LDC Data Management Plans.

The Subglottal Resonances Database was developed with the support of NSF Grant No. 0905250. It is available to LDC members at no cost; non-members may license the data set for a fee of $30 plus shipping. 

New publications

(1) GALE Phase 3 and 4 Arabic Broadcast News Parallel Text includes 86 source-translation document pairs, comprising 325,538 words of Arabic source text and its English translation. Data is drawn from 28 distinct Arabic programs broadcast between 2007 and 2008 from Abu Dhabi TV, Al Alam News Channel, Al Arabiya, Al Baghdadya, Alhurra, Al Iraqiyah, Aljazeera, Al Ordiniyah, Al Sharqiya, Dubai TV, Kuwait TV, Lebanese Broadcasting Corporation, Oman TV, Radio Sawa, Saudi TV and Syria TV. Broadcast news programming consists of news programs focusing principally on current events.

The files in this release were transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with the Quick Rich Transcription guidelines developed by LDC. Transcribers indicated sentence boundaries in addition to transcribing the text. Data was manually selected for translation according to several criteria, including linguistic features, transcription features and topic features. The transcribed and segmented files were then reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Arabic to English translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations.

Source data and translations are distributed in TDF format. TDF files are tab-delimited files containing one segment of text along with meta information about that segment. Each field in the TDF file is described in TDF_format.txt. All data are encoded in UTF-8.

GALE Phase 3 and 4 Arabic Broadcast News Parallel Text is distributed via web download.  2015 Subscription Members will automatically receive two copies of this corpus on disc.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(2) Mandarin Chinese Phonetic Segmentation and Tone was developed by LDC and contains 7,849 Mandarin Chinese "utterances" and their phonetic segmentation and tone labels separated into training and test sets. The utterances were derived from 1997 Mandarin Broadcast News Speech and Transcripts (HUB4-NE) (LDC98S73 and LDC98T24, respectively). That collection consists of approximately 30 hours of Chinese broadcast news recordings from Voice of America, China Central TV and KAZN-AM, a commercial radio station based in Los Angeles, CA. The ability to use large speech corpora for research in phonetics, sociolinguistics and psychology, among other fields, depends on the availability of phonetic segmentation and transcriptions. This corpus was developed to investigate the use of phone boundary models on forced alignment in Mandarin Chinese. Using the approach of embedded tone modeling (also used for incorporating tones for automatic speech recognition), the performance on forced alignment between tone-dependent and tone-independent models was compared.

Utterances were considered as the time-stamped between-pause units in the transcribed news recordings. Those with background noise, music, unidentified speakers and accented speakers were excluded. A test set was developed with 300 utterances randomly selected from six speakers (50 utterances for each speaker). The remaining 7,549 utterances formed a training set.
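
The split described above (50 randomly selected utterances from each of six speakers, with everything else used for training) can be sketched as follows. This is an illustration under stated assumptions, not LDC's actual procedure: the function name, the seed, the synthetic utterance IDs and the pool of 20 hypothetical speakers are invented for the example.

```python
import random
from collections import defaultdict


def split_by_speaker(utterances, test_speakers, per_speaker=50, seed=0):
    """Return (test_ids, train_ids) from a list of (utterance_id, speaker_id) pairs."""
    rng = random.Random(seed)
    by_speaker = defaultdict(list)
    for utt_id, speaker in utterances:
        by_speaker[speaker].append(utt_id)

    test_ids = set()
    for speaker in test_speakers:
        # Draw without replacement from this speaker's utterances.
        test_ids.update(rng.sample(by_speaker[speaker], per_speaker))

    train_ids = [utt_id for utt_id, _ in utterances if utt_id not in test_ids]
    return sorted(test_ids), train_ids


# Synthetic stand-in data: 7,849 utterance IDs spread over 20 hypothetical speakers.
utterances = [(f"utt{i:04d}", f"spk{i % 20:02d}") for i in range(7849)]
test_ids, train_ids = split_by_speaker(utterances, [f"spk{n:02d}" for n in range(6)])
print(len(test_ids), len(train_ids))  # 300 7549
```

The sizes match the figures given above: 6 x 50 = 300 test utterances, leaving 7,849 - 300 = 7,549 for training.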

The utterances in the test set were manually labeled and segmented into initials and finals in Pinyin, a Roman alphabet system for transcribing Chinese characters. Tones were marked on the finals, including Tone1 through Tone4, and Tone0 for the neutral tone. The Sandhi Tone3 was labeled as Tone2. The training set was automatically segmented and transcribed using the LDC forced aligner, which is a Hidden Markov Model (HMM) aligner trained on the same utterances (Yuan et al. 2014). The aligner achieved 93.1% agreement (of phone boundaries) within 20 ms on the test set compared to manual segmentation. The quality of the phonetic transcription and tone labels of the training set was evaluated by checking 100 utterances randomly selected from it. The 100 utterances contained 1,252 syllables: 15 syllables had mistaken tone transcriptions; two syllables showed mistaken transcriptions of the final, and there were no syllables with transcription errors on the initial.
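
To make the evaluation figures above concrete, the sketch below shows one way to compute boundary agreement within a tolerance, together with the error rates implied by the 100-utterance spot check. It assumes that manual and automatic boundaries pair up one-to-one and that "within 20 ms" is inclusive; the function name and the sample boundary times are made up.

```python
def boundary_agreement(manual_ms, aligned_ms, tolerance_ms=20):
    """Fraction of aligned boundaries within tolerance_ms of the manual ones."""
    if len(manual_ms) != len(aligned_ms) or not manual_ms:
        raise ValueError("expected two non-empty boundary lists of equal length")
    hits = sum(abs(m - a) <= tolerance_ms for m, a in zip(manual_ms, aligned_ms))
    return hits / len(manual_ms)


# Made-up boundary times in milliseconds; differences are 5, 25, 10 and 18 ms.
manual_ms = [120, 250, 410, 600]
aligned_ms = [125, 275, 400, 618]
agreement = boundary_agreement(manual_ms, aligned_ms)
print(agreement)  # 0.75

# Error rates from the spot check reported above (1,252 syllables).
tone_error_rate = 15 / 1252   # about 1.2%
final_error_rate = 2 / 1252   # about 0.16%
print(f"{tone_error_rate:.2%} {final_error_rate:.2%}")
```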

Each utterance has three associated files: a flac compressed wav file, a word transcript file, and a phonetic boundaries and label file.

Mandarin Chinese Phonetic Segmentation and Tone is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus on disc, provided that they have submitted a completed copy of the user license agreement. 2015 Standard Members may request a copy as part of their 16 free membership corpora. As a members-only release, Mandarin Chinese Phonetic Segmentation and Tone is not available for non-member licensing.

*

(3) The Subglottal Resonances Database was developed by Washington University and University of California Los Angeles and consists of 45 hours of simultaneous microphone and subglottal accelerometer recordings of 25 adult male and 25 adult female speakers of American English between 22 and 25 years of age.

The subglottal system is composed of the airways of the tracheobronchial tree and the surrounding tissues. It powers airflow through the larynx and vocal tract, allowing for the generation of most of the sound sources used in languages around the world. The subglottal resonances (SGRs) are the natural frequencies of the subglottal system. During speech, the subglottal system is acoustically coupled to the vocal tract via the larynx. SGRs can be measured from recordings of the vibration of the skin of the neck during phonation by an accelerometer, much like speech formants are measured through microphone recordings. SGRs have received attention in studies of speech production, perception and technology. They affect voice production, divide vowels and consonants into discrete categories, affect vowel perception and can be useful in automatic speech recognition.

Speakers were recruited by Washington University's Psychology Department. The majority of the participants were Washington University students who represented a wide range of American English dialects, although most were speakers of the mid-American English dialect. The corpus consists of 35 monosyllables in a phonetically neutral carrier phrase (“I said a ____ again”), with 10 repetitions of each word by each speaker, resulting in 17,500 individual microphone (and accelerometer) waveforms. The monosyllables were comprised of 14 hVd words and 21 CVb words where C was b, d or g and V included all AE monophthongs and diphthongs. The target vowel in each utterance was hand-labeled to indicate the start, stop, and steady-state parts of the vowel. For diphthongs, the steady-state refers to the diphthong nucleus which occurs early in the vowel.

Audio files are presented as single channel 16-bit flac compressed wav files with sample rates of 48kHz or 16kHz. Image files are bitmap image files and plain text is UTF-8.

The Subglottal Resonances Database is distributed on one USB drive.

2015 Subscription Members will automatically receive a copy of this corpus. 2015 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Monday, March 16, 2015

LDC 2015 March Newsletter

Spring 2015 LDC Data Scholarship recipients

2001 HUB5 English Evaluation update

New publications:
_________________________________________________________________________

Spring 2015 LDC Data Scholarship recipients

Congratulations to the recipients of LDC's Spring 2015 data scholarships:

Christopher Kotfila ~ State University of New York, Albany (USA), PhD Candidate, Informatics. Christopher has been awarded copies of Message Understanding Conference and ACE 2005 SpatialML for his work in named entity extraction.  
  
Ilia Markov ~ National Polytechnic University (Mexico), PhD candidate, Computer Science. Ilia has been awarded a copy of the ETS Corpus of Non-Native Written English for his work in native language identification.

Matthew Nelson ~ Georgia State University (USA), MA candidate, Applied Linguistics. Matthew has been awarded a copy of TIMIT and Nationwide Speech for his work in speaker perception.   

Meladianos Polykarpos ~ Athens University of Economics and Business (Greece), PhD candidate, Informatics. Meladianos has been awarded a copy of TDT5 Text and Topics/Annotations for his work in information retrieval.  

Benjamin Schloss ~ Pennsylvania State University (USA), PhD candidate, Psychology. Benjamin has been awarded a copy of the ETS Corpus of Non-Native Written English for his work in semantics.

For program information visit the Data Scholarship page.

2001 HUB5 English Evaluation update

2001 HUB5 English Evaluation (LDC2002S13) now includes corresponding transcriptions. The transcripts are available as part of the web download for this data. Additionally, all HUB5 English catalog entries have been updated to reflect LDC's current standards for documentation and metadata.

New publications:

(1) GALE Chinese-English Parallel Aligned Treebank -- Training was developed by LDC and contains 229,249 tokens of word aligned Chinese and English parallel text with treebank annotations. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.

The Chinese source data was translated into English. Chinese and English treebank annotations were performed independently. The parallel texts were then word aligned. The material in this release corresponds to portions of the Chinese treebanked data in Chinese Treebank 6.0 (LDC2007T36) (CTB), OntoNotes 3.0 (LDC2009T24) and OntoNotes 4.0 (LDC2011T03).

This release consists of Chinese source broadcast programming (China Central TV, Phoenix TV), newswire (Xinhua News Agency) and web data collected by LDC. The distribution by genre, words, character tokens, treebank tokens and segments appears below:

Genre   Files     Words   CharTokens   CTBTokens   Segments
bc         10    57,571       86,356      60,270      3,328
nw        172    64,337       96,505      57,722      2,092
wb         86    30,925       46,388      31,240      1,321
Total     268   152,833      229,249     149,232      6,741

Note that all token counts are based on the Chinese data only. One token is equivalent to one character and one word is equivalent to 1.5 characters.
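
As a quick consistency check on the 1.5-characters-per-word convention, the sketch below recomputes the word counts from the character token counts in the table above. Rounding to the nearest whole word is an assumption; it happens to reproduce the reported figures for this release.

```python
CHARS_PER_WORD = 1.5

# Character token and word counts per genre, copied from the table above.
char_tokens = {"bc": 86_356, "nw": 96_505, "wb": 46_388}
reported_words = {"bc": 57_571, "nw": 64_337, "wb": 30_925}

estimated_words = {
    genre: round(count / CHARS_PER_WORD) for genre, count in char_tokens.items()
}

for genre, estimate in estimated_words.items():
    print(genre, estimate, reported_words[genre])

print(sum(char_tokens.values()), sum(reported_words.values()))  # 229249 152833
```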

The Chinese word alignment task consisted of the following components:
  • Identifying, aligning, and tagging eight different types of links
  • Identifying, attaching, and tagging local-level unmatched words
  • Identifying and tagging sentence/discourse-level unmatched words
  • Identifying and tagging all instances of Chinese 的 (DE) except when they were a part of a semantic link

This release contains nine types of files: Chinese raw source files, English raw translation files, Chinese character tokenized files, Chinese CTB tokenized files, English tokenized files, Chinese treebank files, English treebank files, character-based word alignment files, and CTB-based word alignment files.

GALE Chinese-English Parallel Aligned Treebank -- Training is distributed via web download.  2015 Subscription Members will automatically receive two copies of this corpus on disc.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(2) GALE Phase 3 and 4 Arabic Broadcast Conversation Parallel Text was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phases 3 and 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Modern Standard Arabic source text and corresponding English translations selected from broadcast conversation data collected by LDC between 2006 and 2008 and transcribed and translated by LDC or under its direction.

GALE Phase 3 and 4 Arabic Broadcast Conversation Parallel Text includes 55 source-translation document pairs, comprising 280,535 words of Arabic source text and its English translation. Data is drawn from 22 distinct Arabic programs broadcast between 2006 and 2008. Broadcast conversation programming is generally more interactive than traditional news broadcasts and includes talk shows, interviews, call-in programs and roundtables.

The files in this release were transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with the Quick Rich Transcription guidelines developed by LDC. Transcribers indicated sentence boundaries in addition to transcribing the text. The transcribed and segmented files were reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Arabic to English translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations.

GALE Phase 3 and 4 Arabic Broadcast Conversation Parallel Text is distributed via web download.  2015 Subscription Members will automatically receive two copies of this corpus on disc.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(3) Mandarin-English Code-Switching in South-East Asia was developed by Nanyang Technological University and Universiti Sains Malaysia in Singapore and Malaysia, respectively. It is comprised of approximately 192 hours of Mandarin-English code-switching speech from 156 speakers with associated transcripts.

Code-switching refers to the practice of shifting between languages or language varieties during conversation. This corpus focuses on the shift between Mandarin and English by Malaysian and Singaporean speakers. Speakers engaged in unscripted conversations and interviews. In the conversational speech segments, two speakers conversed freely with each other. The interviews consisted of questions from an interviewer and answers from an interviewee; only the interviewee's speech was recorded. Topics discussed include hobbies, friends and daily activities.

The speakers were gender-balanced (49.7% female, 50.3% male) and between 19 and 33 years of age. Over 60% of the speakers were Singaporean; the rest were Malaysian.

The speech recordings were conducted in a quiet room using several microphones and recording devices. Details about the recording conditions are contained in the documentation provided with this release. The audio files in this corpus are 16 kHz, 16-bit recordings in flac compressed wav format between 20 and 120 minutes in length.

Selected segments of the audio recordings were transcribed. Most of those segments contain code-switching utterances.

Mandarin-English Code-Switching in South-East Asia is distributed on two DVD-ROMs. 2015 Subscription Members will automatically receive two copies of this corpus. 2015 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Tuesday, February 17, 2015

LDC 2015 February Newsletter


Only two weeks left to enjoy 2015 membership savings 

New publications:
Avocado Research Email Collection
GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 3
RATS Speech Activity Detection
_________________________________________________________________________

Only two weeks left to enjoy 2015 membership savings 

There’s still time to save on 2015 membership fees. Now through March 2, all organizations will receive a 5% discount when they join for MY2015. MY2014 members are eligible for an additional 5% off the fee when they renew before March 2.  

Don’t miss this savings opportunity. Secure your membership today for access to new corpora as well as discounts on our existing catalog of over 600 holdings. 2015 publications include the following:

  • CIEMPIESS - Mexican Spanish radio broadcast audio and transcripts
  • GALE Phase 3 and 4 data - all tasks and languages
  • Mandarin Chinese Phonetic Segmentation and Tone Corpus - phonetic segmentation and tone labels
  • RATS Speech Activity Detection - multilanguage audio for robust speech detection and language identification
  • SEAME - Mandarin-English code-switching speech

To join, create or sign into your LDC user account, select your preferred membership type from the Catalog, add the item to your bin and follow the check-out process. The Membership Office will apply any discounts. Alternatively, if you have already received a renewal invoice from LDC, you can simply pay against that.

For more information on the benefits of membership, visit Join LDC.


New publications

(1) Avocado Research Email Collection consists of emails and attachments taken from 279 accounts of a defunct information technology company referred to as "Avocado". Most of the accounts are those of Avocado employees; the remainder represent shared accounts such as "Leads", or system accounts such as "Conference Room Upper Canada".

The collection consists of the processed personal folders of these accounts with metadata describing folder structure, email characteristics and contacts, among others. It is expected to be useful for social network analysis, e-discovery and related fields.


The source data for the collection consisted of Personal Storage Table (PST) files for 282 accounts. A PST file is used by MS Outlook to store emails, calendar entries, contact details, and related information. Data was extracted from the PST files using libpst version 0.6.54. Three files produced no output and are not included in the collection. Each account is referred to as a "custodian" although some of the accounts do not correspond to humans.

The collection is divided into metadata and text. The metadata is represented in XML, with a single top-level XML file listing the custodians, and then one XML file per custodian listing all items extracted from that custodian's PST files. The full XML tree can be read by loading the top-level file with an XML parser that handles directives. All XML metadata files are encoded in UTF-8. The text contains the extracted text of the items in the custodians' folders, with the extracted text for each item being held in a separate file. The text files are then zipped into a zip file per custodian.

Avocado Research Email Collection is distributed on 1 DVD-ROM. 2015 Subscription Members will automatically receive two copies of this corpus provided that they have completed the license agreement. 2015 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 3 was developed by LDC and contains 242,020 tokens of word aligned Chinese and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program. Some approaches to statistical machine translation include the incorporation of linguistic knowledge in word aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. 

Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. A set of word tags and alignment link tags are designed in the tagging scheme to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.

This release consists of Chinese source broadcast conversation (BC) and broadcast news (BN) programming collected by LDC in 2008 and 2009. The distribution by genre, words, character tokens and segments appears below:



Language   Genre   Files     Words   CharTokens   Segments
Chinese    BC         92    67,354      101,032      2,714
Chinese    BN         34    93,992      140,988      3,314
Total                126   161,346      242,020      6,028


Note that all token counts are based on the Chinese data only. One token is equivalent to one character and one word is equivalent to 1.5 characters.

The Chinese word alignment tasks consisted of the following components:

  • Identifying, aligning, and tagging eight different types of links
  • Identifying, attaching, and tagging local-level unmatched words
  • Identifying and tagging sentence/discourse-level unmatched words
  • Identifying and tagging all instances of Chinese 的 (DE) except when they were a part of a semantic link

GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 3 is distributed via web download. 2015 Subscription Members will automatically receive two copies of this corpus. 2015 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) RATS Speech Activity Detection was developed by LDC and is comprised of approximately 3,000 hours of Levantine Arabic, English, Farsi, Pashto, and Urdu conversational telephone speech with automatic and manual annotation of speech segments. The corpus was created to provide training, development and initial test sets for the Speech Activity Detection (SAD) task in the DARPA RATS (Robust Automatic Transcription of Speech) program.

The goal of the RATS program was to develop human language technology systems capable of performing speech detection, language identification, speaker identification and keyword spotting on the severely degraded audio signals that are typical of various radio communication channels, especially those employing various types of handheld portable transceiver systems. To support that goal, LDC assembled a system for the transmission, reception and digital capture of audio data that allowed a single source audio signal to be distributed and recorded over eight distinct transceiver configurations simultaneously. 

Those configurations included three frequencies -- high, very high and ultra high -- variously combined with amplitude modulation, frequency hopping spread spectrum, narrow-band frequency modulation, single-side-band or wide-band frequency modulation. Annotations on the clear source audio signal, e.g., time boundaries for the duration of speech activity, were projected onto the corresponding eight channels recorded from the radio receivers.

The source audio consists of conversational telephone speech recordings collected by LDC: (1) data collected for the RATS program from Levantine Arabic, Farsi, Pashto and Urdu speakers; and (2) material from the Fisher English (LDC2004S13, LDC2005S13), and Fisher Levantine Arabic telephone studies (LDC2007S02), as well as from CALLFRIEND Farsi (LDC2014S01).

Annotation was performed in three steps. LDC's automatic speech activity detector was run against the audio data to produce a speech segmentation for each file. Manual first pass annotation was then performed as a quick correction of the automatic speech activity detection output. Finally, in a manual second pass annotation step, annotators reviewed first pass output and made adjustments to segments as needed.

All audio files are presented as single-channel, 16-bit PCM, 16000 samples per second; lossless FLAC compression is used on all files; when uncompressed, the files have typical "MS-WAV" (RIFF) file headers.
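For readers who want to confirm the stated format after decompression, here is a minimal Python sketch using only the standard-library `wave` module. It assumes the FLAC file has already been decoded to a RIFF/WAV file with a decoder of your choice (not shown). The function names `matches_release_format` and `write_silence` are hypothetical and are not part of the corpus.

```python
import wave


def matches_release_format(path: str) -> bool:
    """Return True if a WAV file is single-channel, 16-bit PCM at 16000 samples per second."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getnchannels() == 1
            and wav.getsampwidth() == 2       # 2 bytes per sample = 16 bits
            and wav.getframerate() == 16000
        )


def write_silence(path: str, seconds: float = 0.1, rate: int = 16000) -> None:
    """Write a short silent mono 16-bit WAV file; used only to try out the check."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
```

Calling `write_silence("demo.wav")` and then `matches_release_format("demo.wav")` should return `True`, while a file written with `rate=8000` should return `False`.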

RATS Speech Activity Detection is distributed on 1 hard drive.  2015 Subscription Members will automatically receive one copy of this corpus.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

Tuesday, January 20, 2015

LDC 2015 January Newsletter

LDC Membership Discounts for MY 2015 Still Available

New publications:


LDC Membership Discounts for MY 2015 Still Available
If you are considering joining LDC for Membership Year 2015 (MY2015), there is still time to save on membership fees. Any organization which joins or renews membership for 2015 through Monday, March 2, 2015, is entitled to a 5% discount on membership fees.  Organizations which held membership for MY2014 can receive a 10% discount on fees provided they renew prior to March 2, 2015.  For further information on planned publications for MY2015, please visit or contact LDC.

New publications

GALE Phase 2 Arabic Broadcast News Speech Part 2 was developed by LDC and is comprised of approximately 170 hours of Arabic broadcast news speech collected in 2007 by LDC, MediaNet, Tunis, Tunisia and MTC, Rabat, Morocco during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. Corresponding transcripts are released as GALE Phase 2 Arabic Broadcast News Transcripts Part 2 (LDC2015T01).

Broadcast audio for the GALE program was collected at LDC’s Philadelphia, PA USA facilities and at three remote collection sites: Hong Kong University of Science and Technology, Hong Kong (Chinese), Medianet (Tunis, Tunisia) (Arabic), and MTC (Rabat, Morocco) (Arabic). The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources for a total of over 30,000 hours of collected broadcast audio over the life of the program.

The broadcast recordings in this release feature news programs focusing principally on current events from the following sources: Abu Dhabi TV, a television station based in Abu Dhabi, United Arab Emirates; Al Alam News Channel, based in Iran; Aljazeera, a regional broadcaster located in Doha, Qatar; Al Ordiniyah, a national broadcast station in Jordan; Dubai TV, based in Dubai, United Arab Emirates; Al Iraqiyah, a television network based in Iraq; Kuwait TV, a national television station based in Kuwait; Lebanese Broadcasting Corporation, a Lebanese television station; Nile TV, a broadcast programmer based in Egypt; Saudi TV, a national television station based in Saudi Arabia; and Syria TV, the national television station in Syria.

This release contains 204 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Arabic speaker following Audit Procedure Specification Version 2.0 which is included in this release. The broadcast auditing process served three principal goals: as a check on the operation of the broadcast collection system equipment by identifying failed, incomplete or faulty recordings; as an indicator of broadcast schedule changes by identifying instances when the incorrect program was recorded; and as a guide for data selection by retaining information about a program’s genre, data type and topic.

GALE Phase 2 Arabic Broadcast News Speech Part 2 is distributed on 3 DVD-ROM.

2015 Subscription Members will automatically receive two copies of this corpus.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

GALE Phase 2 Arabic Broadcast News Transcripts Part 2 was developed by LDC and contains transcriptions of approximately 170 hours of Arabic broadcast news speech collected in 2007 by LDC, MediaNet, Tunis, Tunisia and MTC, Rabat, Morocco during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) program. Corresponding audio data is released as GALE Phase 2 Arabic Broadcast News Speech Part 2 (LDC2015S01).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 920,730 tokens. The transcripts were created with the LDC-developed transcription tool, XTrans, a multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings.

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and quick rich transcription specification (QRTR) both of which are included in the documentation with this release. QTR transcription consists of quick (near-)verbatim, time-aligned transcripts plus speaker identification with minimal additional mark-up. It does not include sentence unit annotation. QRTR annotation adds structural information such as topic boundaries and manual sentence unit annotation to the core components of a quick transcript. Files with QTR as part of the filename were developed using QTR transcription. Files with QRTR in the filename indicate QRTR transcription.
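The sketch below shows one way to read tab-delimited UTF-8 text and to tell QTR from QRTR files by filename, as described above. It is only an illustration: the function names are hypothetical, the case-insensitive filename match is an assumption, and the index of the column holding the transcript text is left as a parameter because the column layout is defined in the release documentation rather than here. Whitespace tokenization is also an assumption and may not reproduce the published token count.

```python
import csv
import io
from typing import List, Optional


def transcription_style(filename: str) -> Optional[str]:
    """Classify a file as 'QRTR' or 'QTR' from its name (case-insensitive, assumed)."""
    name = filename.upper()
    if "QRTR" in name:
        return "QRTR"
    if "QTR" in name:
        return "QTR"
    return None


def read_tab_delimited(text: str) -> List[List[str]]:
    """Split tab-delimited text into rows of fields, skipping empty lines."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row]


def count_tokens(rows: List[List[str]], text_column: int) -> int:
    """Count whitespace-separated tokens in the chosen column (tokenization assumed)."""
    return sum(len(row[text_column].split()) for row in rows if len(row) > text_column)


# Made-up sample with four columns; real files should be opened with encoding="utf-8".
sample = "prog_a\t0.0\t1.5\thello world\nprog_a\t1.5\t2.0\tgood evening everyone\n"
rows = read_tab_delimited(sample)
print(count_tokens(rows, text_column=3))  # 5
```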

GALE Phase 2 Arabic Broadcast News Transcripts Part 2 is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

SenSem (Sentence Semantics) Databank was developed by GRIAL, the Linguistic Applications Inter-University Research Group that includes the following Spanish institutions: the Universitat Autonoma de Barcelona, the Universitat de Barcelona, the Universitat de Lleida and the Universitat Oberta de Catalunya. It contains syntactic and semantic annotation for over 35,000 sentences, approximately one million words of Spanish and approximately 700,000 words of Catalan translated from the Spanish. GRIAL's work focuses on resources for applied linguistics, including lexicography, translation and natural language processing.

Each sentence in SenSem Databank was labeled according to the verb sense it exemplifies, the type of complement it takes (arguments or adjuncts) and the syntactic category and function. Each argument was also labeled with a semantic role. Further information about the SenSem project can be obtained from the GRIAL website.

The Spanish source data includes texts from news journals (30,000 sentences) and novels (5,299 sentences). Those sentences represent around 1,000 different verb meanings that correspond to the 250 most frequent Spanish verbs. Verb frequencies were retrieved from a quantitative analysis of around 13 million words.

The Catalan corpus was developed by translating the news journal portion of the Spanish data set, resulting in a resource of over 700,000 sentences from which 391,267 sentences were annotated. Sentences were automatically translated and manually post-edited; some were re-annotated for sentence complements. Semantic information was the same for both languages. The Catalan sentences represent close to 1,300 different verbs.

SenSem Databank is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus on disc.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee. This data is made available to LDC not-for-profit members and all non-members under the Creative Commons Attribution-Noncommercial Share Alike 3.0 license and to LDC for-profit members under the terms of the For-Profit Membership Agreement.

Monday, December 15, 2014

LDC 2014 December Newsletter

Renew your LDC membership today

Spring 2015 LDC Data Scholarship Program - deadline approaching

Reduced fees for Treebank-2 and Treebank-3 

LDC to close for Winter Break

New publications:

Renew your LDC membership today

Membership Year 2015 (MY2015) discounts are available for those who keep their membership current and join early in the year. Check here for further information including our planned publications for MY2015.

Now is also a good time to consider joining LDC for the current and open membership years, MY2014 and MY2013. MY2014 offers members an impressive 37 publications which include UN speech data, 2009 NIST LRE test set, 2007 ACE multilingual data, and multi-channel WSJ audio. MY2013 remains open through the end of the 2014 calendar year and its publications include Mixer 6 speech, Greybeard, UN parallel text and CSC Deceptive Speech as well as updates to Chinese Treebank and Chinese Proposition Bank. For full descriptions of these data sets, visit our Catalog.

Spring 2015 LDC Data Scholarship Program - deadline approaching
The deadline for the Spring 2015 LDC Data Scholarship Program is right around the corner! Student applications are being accepted now through January 15, 2015, 11:59PM EST. The LDC Data Scholarship program provides university students with access to LDC data at no cost. This program is open to students pursuing both undergraduate and graduate studies in an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay.

Students will need to complete an application which consists of a data use proposal and letter of support from their adviser. For further information on application materials and program rules, please visit the LDC Data Scholarship page.

Students can email their applications to the LDC Data Scholarships program. Decisions will be sent by email from the same address.

Reduced fees for Treebank-2 and Treebank-3
Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42) are now available to non-members at reduced fees, US$1500 for Treebank-2 and US$1700 for Treebank-3, reductions of 52% and 47%, respectively.

LDC to close for Winter Break
LDC will be closed from December 25, 2014 through January 2, 2015 in accordance with the University of Pennsylvania Winter Break Policy. Our offices will reopen on January 5, 2015. Requests received for membership renewals and corpora during the Winter Break will be processed at that time.

Best wishes for a relaxing holiday season!

New publications

(1) Benchmarks for Open Relation Extraction was developed by the University of Alberta and contains annotations for approximately 14,000 sentences from The New York Times Annotated Corpus (LDC2008T19) and Treebank-3 (LDC99T42). This corpus was designed to contain benchmarks for the task of open relation extraction (ORE), along with sample extractions from ORE methods and evaluation scripts for computing a method's precision and recall.

ORE attempts to extract as many of the relations described in a corpus as possible without relying on relation-specific training data. The traditional approach to relation extraction requires substantial training effort for each relation of interest. That can be impractical for massive collections such as those found on the web. Open relation extraction offers an alternative by extracting unseen relations as they come. It does not require training data for any particular relation, making it suitable for applications that require a large (or even unknown) number of relations. Results published in ORE literature are often not comparable due to the lack of reusable annotations and differences in evaluation methodology. The goal of this benchmark data set is to provide annotations that are flexible and can be used to evaluate a wide range of methods.

Binary and n-ary relations were extracted from the text sources. Sentences were annotated for binary relations manually and automatically. In the manual sentence annotation, two entities and a trigger (a single token indicating a relation) were identified for the relation between them, if one existed. A window of tokens allowed to be in a relation was specified; that included modifiers of the trigger and prepositions connecting triggers to their arguments. For each sentence annotated with two entities, a system must extract a string representing the relation between them. The evaluation method deemed an extraction as correct if it contained the trigger and allowed tokens only. The automatic annotator identified pairs of entities and a trigger of the relation between them; the evaluation script for that experiment deemed an extraction correct if it contained the annotated trigger. For n-ary relations, sentences were annotated with one relation trigger and all of its arguments. An extracted argument was deemed correct if it was annotated in the sentence.
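The correctness rule for manually annotated binary relations can be sketched in a few lines of Python. This is an illustration only and is not the evaluation script shipped with the corpus: the function names, whitespace tokenization, exact string matching and the example tokens are all assumptions.

```python
from typing import Iterable, Tuple


def is_correct_extraction(extraction: str, trigger: str, allowed: Iterable[str]) -> bool:
    """An extraction counts as correct if it contains the trigger and
    every other token is in the allowed window (assumed exact matching)."""
    tokens = extraction.split()
    permitted = set(allowed) | {trigger}
    return trigger in tokens and all(tok in permitted for tok in tokens)


def precision_recall(n_correct: int, n_extracted: int, n_annotated: int) -> Tuple[float, float]:
    """Standard precision and recall, returning 0.0 when a denominator is zero."""
    precision = n_correct / n_extracted if n_extracted else 0.0
    recall = n_correct / n_annotated if n_annotated else 0.0
    return precision, recall


# Hypothetical annotation: trigger "founder", allowed window {"is", "the", "of"}.
allowed = {"is", "the", "of"}
print(is_correct_extraction("is the founder of", "founder", allowed))        # True
print(is_correct_extraction("of", "founder", allowed))                       # False
print(is_correct_extraction("founder and chairman of", "founder", allowed))  # False
```

The second call fails because the trigger is missing, and the third because "and" and "chairman" fall outside the allowed window.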

Benchmarks for Open Relation Extraction is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data provided they have completed a copy of the user agreement.  2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(2) Fisher and CALLHOME Spanish--English Speech Translation was developed at Johns Hopkins University and contains English reference translations and speech recognizer output (in various forms) that complement the LDC Fisher Spanish (LDC2010T04) and CALLHOME Spanish audio and transcript releases (LDC96T17). Together, they make a four-way parallel text dataset representing approximately 38 hours of speech, with defined training, development, and held-out test sets.

The source data are the Fisher Spanish and CALLHOME Spanish corpora developed by LDC, comprising transcribed telephone conversations between (mostly native) Spanish speakers in a variety of dialects. The Fisher Spanish data set consists of 819 transcribed conversations on an assortment of provided topics primarily between strangers, resulting in approximately 160 hours of speech aligned at the utterance level, with 1.5 million tokens. The CALLHOME Spanish corpus comprises 120 transcripts of spontaneous conversations primarily between friends and family members, resulting in approximately 20 hours of speech aligned at the utterance level, with just over 200,000 words (tokens) of transcribed text.

Translations were obtained by crowdsourcing using Amazon's Mechanical Turk, after which the data was split into training, development, and test sets. The CALLHOME data set defines its own data splits, organized into train, devtest, and evltest, which were retained here. For the Fisher material, four data splits were produced: a large training section and three test sets. These test sets correspond to portions of the data where four translations exist.

Fisher and CALLHOME Spanish--English Speech Translation is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc.  2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(3) GALE Phase 3 Chinese Broadcast Conversation Speech Part 1 was developed by LDC and is comprised of approximately 126 hours of Mandarin Chinese broadcast conversation speech collected in 2007 by LDC and Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

Corresponding transcripts are released as GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 1 (LDC2014T28).

Broadcast audio for the GALE program was collected at LDC’s Philadelphia, PA USA facilities and at three remote collection sites: HKUST (Chinese), Medianet (Tunis, Tunisia) (Arabic), and MTC (Rabat, Morocco) (Arabic). The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources for a total of over 30,000 hours of collected broadcast audio over the life of the program. HKUST collected Chinese broadcast programming using its internal recording system and a portable broadcast collection platform designed by LDC and installed at HKUST in 2006.

The broadcast conversation recordings in this release feature interviews, call-in programs, and roundtable discussions focusing principally on current events from the following sources: Anhui TV, a regional television station in Anhui Province, China; Beijing TV, a national television station in China; China Central TV (CCTV), a Chinese national and international broadcaster; Hubei TV, a regional broadcaster in Hubei Province, China; and Phoenix TV, a Hong Kong-based satellite television station.

This release contains 217 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Chinese speaker following Audit Procedure Specification Version 2.0 which is included in this release. The broadcast auditing process served three principal goals: as a check on the operation of the broadcast collection system equipment by identifying failed, incomplete or faulty recordings, as an indicator of broadcast schedule changes by identifying instances when the incorrect program was recorded, and as a guide for data selection by retaining information about a program’s genre, data type and topic.

GALE Phase 3 Chinese Broadcast Conversation Speech Part 1 is distributed on 2 DVD-ROM.

2014 Subscription Members will automatically receive two copies of this data.  2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(4) GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 1 was developed by LDC and contains transcriptions of approximately 126 hours of Chinese broadcast conversation speech collected in 2007 by LDC and Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

Corresponding audio data is released as GALE Phase 3 Chinese Broadcast Conversation Speech Part 1 (LDC2014S09).

The source broadcast conversation recordings feature interviews, call-in programs and roundtable discussions focusing principally on current events from the following sources: Anhui TV, a regional television station in Anhui Province, China; Beijing TV, a national television station in China; China Central TV (CCTV), a Chinese national and international broadcaster; Hubei TV, a regional television station in Hubei Province, China; and Phoenix TV, a Hong Kong-based satellite television station.

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 1,556,904 tokens. The transcripts were created with the LDC-developed transcription tool, XTrans, a multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings. XTrans is available from the following link, https://www.ldc.upenn.edu/language-resources/tools/xtrans .

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and quick rich transcription specification (QRTR) both of which are included in the documentation with this release. QTR transcription consists of quick (near-)verbatim, time-aligned transcripts plus speaker identification with minimal additional mark-up. It does not include sentence unit annotation. QRTR annotation adds structural information such as topic boundaries and manual sentence unit annotation to the core components of a quick transcript. Files with QTR as part of the filename were developed using QTR transcription. Files with QRTR in the filename indicate QRTR transcription.

GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 1 is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc.  2014 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.