Tuesday, April 15, 2014

LDC April 2014 Newsletter



(1) Domain-Specific Hyponym Relations was developed by the Shaanxi Province Key Laboratory of Satellite and Terrestrial Network Technology at Xi’an Jiaotong University, Xi’an, Shaanxi, China. It provides more than 5,000 English hyponym relations in five domains: data mining, computer networks, data structures, Euclidean geometry and microbiology. All hypernym and hyponym words were taken from Wikipedia article titles.

A hyponym relation is a word sense relation expressing an IS-A relationship. For example, dog is a hyponym of animal, and binary tree is a hyponym of tree structure. Applications for domain-specific hyponym relations include taxonomy and ontology learning, query result organization in faceted search, and knowledge organization and automated reasoning in knowledge-rich applications.
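
Concretely, a set of hyponym relations can be treated as a directed IS-A graph in which every edge points from a hyponym to its hypernym. A minimal sketch in Python (the terms and chain below are illustrative, not taken from the corpus):

```python
# Hyponym (IS-A) relations as a directed graph: each edge points from a
# hyponym up to its hypernym. Terms here are illustrative examples only.
hypernym_of = {
    "binary tree": "tree structure",
    "tree structure": "data structure",
    "dog": "animal",
}

def ancestors(term):
    """Walk IS-A edges upward, collecting every transitive hypernym."""
    out = []
    while term in hypernym_of:
        term = hypernym_of[term]
        out.append(term)
    return out

print(ancestors("binary tree"))  # ['tree structure', 'data structure']
```

Transitive closure of this kind is what taxonomy- and ontology-learning applications typically compute over such relation sets.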

The data is presented in XML format, and each file provides hyponym relations in one domain. Within each file, the term, Wikipedia URL, hyponym relation and the names of the hyponym and hypernym words are included. The distribution of terms and relations is set forth in the table below:

Dataset              Terms    Hyponym Relations
Data Mining          278      364
Computer Network     336      399
Data Structure       315      578
Euclidean Geometry   455      690
Microbiology         1,028    3,533
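
Since each domain file is XML, reading one reduces to standard parsing. The element and attribute names in this sketch are hypothetical stand-ins; consult the corpus documentation for the actual schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical file layout -- the real element/attribute names may differ.
sample = """<domain name="data_structure">
  <relation>
    <hypernym term="tree structure" url="https://en.wikipedia.org/wiki/Tree_structure"/>
    <hyponym term="binary tree" url="https://en.wikipedia.org/wiki/Binary_tree"/>
  </relation>
</domain>"""

root = ET.fromstring(sample)
pairs = [(rel.find("hyponym").get("term"), rel.find("hypernym").get("term"))
         for rel in root.iter("relation")]
print(pairs)  # [('binary tree', 'tree structure')]
```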

Domain-Specific Hyponym Relations is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora. This data is made available at no cost to LDC members and non-members under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 license.


*

(2) GALE Arabic-English Parallel Aligned Treebank -- Web Training was developed by LDC and contains 69,766 tokens of word aligned Arabic and English parallel text with treebank annotations. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.


Parallel aligned treebanks are treebanks annotated with morphological and syntactic structures aligned at the sentence level and the sub-sentence level. Such data sets are useful for natural language processing and related fields, including automatic word alignment system training and evaluation, transfer-rule extraction, word sense disambiguation, translation lexicon extraction and cultural heritage and cross-linguistic studies. With respect to machine translation system development, parallel aligned treebanks may improve system performance with enhanced syntactic parsers, better rules and knowledge about language pairs and reduced word error rate.

In this release, the source Arabic data was translated into English. Arabic and English treebank annotations were performed independently. The parallel texts were then word aligned.

LDC has previously released several Arabic-English Parallel Aligned Treebank corpora.

This release consists of Arabic source web data (newsgroups, weblogs) collected by LDC in 2004 and 2005. All data is encoded as UTF-8. A count of files, words, tokens and segments is below.

Language   Files   Words    Tokens   Segments
Arabic     162     46,710   69,766   3,178

Note: the word count is based on the untokenized Arabic source; the token count is based on the ATB-tokenized Arabic source.

The purpose of the GALE word alignment task was to find correspondences between words, phrases or groups of words in a set of parallel texts. Arabic-English word alignment annotation consisted of the following tasks:
  • Identifying different types of links: translated (correct or incorrect) and not translated (correct or incorrect)
  • Identifying sentence segments not suitable for annotation, e.g., blank segments, incorrectly-segmented segments, segments with foreign languages
  • Tagging unmatched words attached to other words or phrases
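
These link annotations can be pictured as typed many-to-many links between source and target token indices. The following sketch is a hypothetical in-memory representation, not the GALE file format:

```python
# Hypothetical representation of typed word-alignment links.
# "src"/"tgt" hold token indices; an empty "tgt" marks an untranslated word.
links = [
    {"src": [0], "tgt": [0], "type": "translated-correct"},
    {"src": [1, 2], "tgt": [1], "type": "translated-correct"},   # phrase link
    {"src": [3], "tgt": [], "type": "not-translated-correct"},   # unmatched
]

def aligned_pairs(links):
    """Expand many-to-many links into (source, target) index pairs."""
    return [(s, t) for link in links for s in link["src"] for t in link["tgt"]]

print(aligned_pairs(links))  # [(0, 0), (1, 1), (2, 1)]
```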

GALE Arabic-English Parallel Aligned Treebank -- Web Training is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(3) Multi-Channel WSJ Audio (MCWSJ) was developed by the Centre for Speech Technology Research at the University of Edinburgh and contains approximately 100 hours of recorded speech from 45 British English speakers. Participants read Wall Street Journal texts published in 1987-1989 in three recording scenarios: a single stationary speaker, two stationary overlapping speakers and one single moving speaker.

This corpus was designed to address the challenges of speech recognition in meetings, which often occur in rooms with non-ideal acoustic conditions and significant background noise, and may contain large sections of overlapping speech. Using headset microphones represents one approach, but meeting participants may be reluctant to wear them. Microphone arrays are another option. MCWSJ supports research in large vocabulary tasks using microphone arrays. The news sentences read by speakers are taken from WSJCAM0 Cambridge Read News, a corpus originally developed for large vocabulary continuous speech recognition experiments, which in turn was based on CSR-I (WSJ0) Complete, made available by LDC to support large vocabulary continuous speech recognition initiatives.

Speakers reading news text from prompts were recorded using a headset microphone, a lapel microphone and an eight-channel microphone array. In the single speaker scenario, participants read from six fixed positions. Fixed positions were assigned for the entire recording in the overlapping scenario. For the moving scenario, participants moved from one position to the next while reading.

Fifteen speakers were recorded for the single scenario, nine pairs for the overlapping scenario and nine individuals for the moving scenario. Each read approximately 90 sentences.

Multi-Channel WSJ Audio is distributed on 2 DVD-ROM.

2014 Subscription Members will receive a copy of this data provided that they have completed the User License Agreement for Multi-Channel WSJ Audio LDC2014S03. 2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

Monday, March 17, 2014

LDC March 2014 Newsletter

New publications:

GALE Arabic-English Word Alignment Training Part 1 -- Newswire and Web
GALE Phase 2 Chinese Broadcast News Parallel Text Part 1
USC-SFI MALACH Interviews and Transcripts Czech


(1) GALE Arabic-English Word Alignment Training Part 1 -- Newswire and Web was developed by LDC and contains 344,680 tokens of word aligned Arabic and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.

Some approaches to statistical machine translation include the incorporation of linguistic knowledge in word aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. A set of word tags and alignment link tags are designed in the tagging scheme to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.

This release consists of Arabic source newswire and web data collected by LDC between 2006 and 2008. The distribution by language, genre, documents, words, character tokens and segments appears below:



Language   Genre   Docs   Words     CharTokens   Segments
Arabic     WB      119    59,696    81,620       4,383
Arabic     NW      717    198,621   263,060      8,423

Note that word count is based on the untokenized Arabic source, and token count is based on the tokenized Arabic source.

The Arabic word alignment tasks consisted of the following components:
  • Normalizing tokens as needed
  • Identifying different types of links
  • Identifying sentence segments not suitable for annotation
  • Tagging unmatched words attached to other words or phrases

GALE Arabic-English Word Alignment Training Part 1 -- Newswire and Web is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(2) GALE Phase 2 Chinese Broadcast News Parallel Text Part 1 was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Chinese source text and corresponding English translations selected from broadcast news (BN) data collected by LDC between 2005 and 2007 and transcribed by LDC or under its direction.

This release includes 30 source-translation document pairs, comprising 198,350 characters of translated material. The data is drawn from 11 distinct Chinese BN sources. The broadcast news recordings in this release focus principally on current events.

The data was transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with Quick Rich Transcription guidelines developed by LDC. Transcribers indicated sentence boundaries in addition to transcribing the text. Data was manually selected for translation according to several criteria, including linguistic features, transcription features and topic features. The transcribed and segmented files were then reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC’s Chinese to English translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations.

Source data and translations are distributed in TDF format. TDF files are tab-delimited files in which each line contains one segment of text along with metadata about that segment. Each field in the TDF file is described in TDF_format.text. All data are encoded in UTF-8.
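
A file of this shape can be consumed with ordinary tab splitting. The column names below are illustrative placeholders; the authoritative field list is the one given in TDF_format.text:

```python
# Illustrative TDF-style reader. FIELDS is a placeholder column list;
# the real inventory is documented in TDF_format.text.
FIELDS = ["file", "channel", "start", "end", "speaker", "speaker_type",
          "dialect", "transcript", "section", "turn", "segment"]

def parse_tdf_line(line):
    """Split one tab-delimited line into a field-name -> value mapping."""
    return dict(zip(FIELDS, line.rstrip("\n").split("\t")))

row = parse_tdf_line("bn_001\t0\t12.3\t15.9\tanchor\tmale\tnative\t...\t1\t1\t4")
print(row["start"], row["speaker"])  # 12.3 anchor
```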

GALE Phase 2 Chinese Broadcast News Parallel Text Part 1 is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*
(3) USC-SFI MALACH Interviews and Transcripts Czech was developed by The University of Southern California Shoah Foundation Institute (USC-SFI) and the University of West Bohemia as part of the MALACH (Multilingual Access to Large Spoken ArCHives) Project. It contains approximately 229 hours of interviews from 420 interviewees along with transcripts and other documentation.

Inspired by his experience making Schindler’s List, Steven Spielberg established the Survivors of the Shoah Visual History Foundation in 1994 to gather video testimonies from survivors and other witnesses of the Holocaust. Within several years, the Foundation’s Visual History Archive held nearly 52,000 video testimonies in 32 languages representing 56 countries. It is the largest archive of its kind in the world. In 2006, the Foundation became part of the Dana and David Dornsife College of Letters, Arts and Sciences at the University of Southern California in Los Angeles and was renamed as the USC Shoah Foundation Institute for Visual History and Education.

The goal of the MALACH project was to develop methods for improved access to large multinational spoken archives. The focus was advancing the state of the art of automatic speech recognition and information retrieval. The characteristics of the USC-SFI collection -- unconstrained, natural speech filled with disfluencies, heavy accents, age-related coarticulations, un-cued speaker and language switching and emotional speech -- were considered well-suited for that task. The work centered on five languages: English, Czech, Russian, Polish and Slovak.


LDC has also released USC-SFI MALACH Interviews and Transcripts English (LDC2012S05).

The speech data in this release was collected beginning in 1994 under a wide variety of conditions ranging from quiet to noisy. Original interviews were recorded on Sony Beta SP tapes, then digitized into a 3 MB/s MPEG-1 stream with 128 kb/s (44 kHz) stereo audio. The sound files in this release are single channel FLAC compressed PCM WAV format at a sampling frequency of 16 kHz.

Approximately 570 of the USC-SFI interviews are in Czech, averaging approximately 2.25 hours each. The interview sessions in this release are divided into a training set (400 interviews) and a test set (20 interviews). The first fifteen minutes of the second tape from each training interview (approximately 30 total minutes of speech) were transcribed in .trs format using Transcriber 1.5.1. The test interviews were transcribed completely. Thus the corpus consists of 229 hours of speech (186 hours of training material plus 43 hours of test data), of which 143 hours are transcribed (100 hours of training material plus 43 hours of test data). Certain interviews include speech from family members in addition to that of the subject and the interviewer. Accordingly, the corpus contains speech from more than 420 speakers, who are more or less equally distributed between males and females.

USC-SFI MALACH Interviews and Transcripts Czech is distributed on four DVD-ROM.

2014 Subscription Members will automatically receive two copies of this data provided that they have submitted a completed copy of the User License Agreement for USC-SFI MALACH Interviews and Transcripts Czech (LDC2014S04). 2014 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Tuesday, February 18, 2014

LDC February 2014 Newsletter

Spring 2014 LDC Data Scholarship recipients
Membership fee savings and publications pipeline
New LDC website enhancements coming soon

New publications:

Spring 2014 LDC Data Scholarship recipients
LDC is pleased to announce the student recipients of the Spring 2014 LDC Data Scholarship program!  This program provides university students with access to LDC data at no cost. Students were asked to complete an application which consisted of a proposal describing their intended use of the data, as well as a letter of support from their thesis adviser. We received many solid applications and have chosen two proposals to support. The following students will receive no-cost copies of LDC data:
  • Skye Anderson ~ Tulane University (USA), BA candidate, Linguistics.  Skye has been awarded a copy of LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1 for her work in author profiling.

  • Hao Liu ~ University College London (UK), PhD candidate, Speech, Hearing and Phonetic Sciences.  Hao has been awarded a copy of Switchboard-1 Release 2, and NXT Switchboard Annotations for his work in prosody modeling.

Membership fee savings and publications pipeline
Members can still save on 2014 membership fees, but time is running out. Any organization which joins or renews membership for 2014 through Monday, March 3, 2014, is entitled to a 5% discount. Organizations which held membership for MY2013 can receive a 10% discount on fees provided they renew prior to March 3, 2014.

Planned publications for this year include:
  • 2009 NIST Language Recognition Evaluation ~  development data from VOA broadcast and CTS telephone speech in target and non-target languages.
  • ETS Corpus of Non-Native Written English ~ contains 1,100 essays written for a college-entrance test, sampled from eight prompts (i.e., topics) with score levels (low/medium/high) for each essay.
  • GALE data ~ including Word Alignment, Broadcast Speech & Transcripts, Parallel Text, Parallel Aligned Treebanks in Arabic, Chinese, and English.

  • Hispanic Accented English ~ contains approximately 30 hours of spontaneous speech and read utterances from non-native speakers of English with corresponding transcripts.
  • Multi-Channel Wall Street Journal Audio-Visual Corpus (MC-WSJ-AV) ~  re-recording of parts of the WSJCAM0 using a number of microphones as well as three recording conditions resulting in 18-20 channels of audio per recording.
  • TAC KBP Reference Knowledge Base ~ TAC KBP aims to develop and evaluate technologies for building and populating knowledge bases (KBs) about named entities from unstructured text. KBP systems must either populate an existing reference KB or else build a KB from scratch. The reference KB is based on a snapshot of English Wikipedia from October 2008 and contains a set of entities, each with a canonical name and title for the Wikipedia page, an entity type, an automatically parsed version of the data from the infobox in the entity's Wikipedia article, and a stripped version of the text of the Wikipedia article.
  • USC-SFI MALACH Interviews and Transcripts Czech ~ developed by The University of Southern California's Shoah Foundation Institute (USC-SFI) and the University of West Bohemia as part of the MALACH (Multilingual Access to Large Spoken ArCHives) Project. It contains approximately 143 hours of interviews from 420 interviewees along with transcripts and other documentation.

New LDC website enhancements coming soon
Look for LDC’s new website enhancements in the coming weeks. We've revamped our membership services to make it easier than ever for you to manage your membership and access data more quickly.


New publications
(1) GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 2 was developed by LDC and contains 141,058 tokens of word aligned Arabic and English parallel text with treebank annotations. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.

Parallel aligned treebanks are treebanks annotated with morphological and syntactic structures aligned at the sentence level and the sub-sentence level. Such data sets are useful for natural language processing and related fields, including automatic word alignment system training and evaluation, transfer-rule extraction, word sense disambiguation, translation lexicon extraction and cultural heritage and cross-linguistic studies. With respect to machine translation system development, parallel aligned treebanks may improve system performance with enhanced syntactic parsers, better rules and knowledge about language pairs and reduced word error rate.

In this release, the source Arabic data was translated into English. Arabic and English treebank annotations were performed independently. The parallel texts were then word aligned. The material in this corpus corresponds to a portion of the Arabic treebanked data in Arabic Treebank - Broadcast News v1.0 (LDC2012T07).

The source data consists of Arabic broadcast news programming collected by LDC in 2007 and 2008. All data is encoded as UTF-8. A count of files, words, tokens and segments is below.

Language   Files   Words     Tokens    Segments
Arabic     31      110,690   141,058   7,102

The purpose of the GALE word alignment task was to find correspondences between words, phrases or groups of words in a set of parallel texts. Arabic-English word alignment annotation consisted of the following tasks:
  • Identifying different types of links: translated (correct or incorrect) and not translated (correct or incorrect)
  • Identifying sentence segments not suitable for annotation, e.g., blank segments, incorrectly-segmented segments, segments with foreign languages
  • Tagging unmatched words attached to other words or phrases

GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 2 is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*
(2) King Saud University Arabic Speech Database was developed by King Saud University and contains 590 hours of recorded Arabic speech from male and female speakers. The utterances include read and spontaneous speech. The recordings were conducted in varied environments representing quiet and noisy settings.

The corpus was designed principally for speaker recognition research. The speech sources are sentences, word lists, prose and question and answer sessions. Read speech text includes the following:
  • Sets of sentences devised to cover allophones of each phoneme, phonetic balance, and differentiation of accents.
  • Word lists developed to minimize missing phonemes and to represent nasals, fricatives, commonly used words, and numbers.
  • Two paragraphs, one from the Quran and another from a book, selected because they included all letters of the alphabet and were easy to read.
Spontaneous speech was captured through question and answer sessions between participants and project team members. Speakers responded to questions on general topics such as the weather and food.

Each speaker was recorded in three different environments: a sound proof room, an office, and a cafeteria. The recordings were collected via microphone and mobile phone and averaged between 16 and 19 minutes in length. The data was verified for missing recordings, problems with the recording system or errors in the recording process.

King Saud University Arabic Speech Database is distributed on one hard disk.
2014 Subscription Members will receive a copy of this data provided that they have completed the User License Agreement. 2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(3) NIST 2012 Open Machine Translation (OpenMT) Progress Test Five Language Source was developed by the NIST Multimodal Information Group. This release contains the evaluation sets (source data and human reference translations), DTD, scoring software, and evaluation plan for the OpenMT 2012 test for Arabic, Chinese, Dari, Farsi, and Korean to English on a parallel data set. The set is based on a subset of the Arabic-to-English and Chinese-to-English progress tests from the OpenMT 2008, 2009 and 2012 evaluations, with new source data created by humans based on the English reference translations. The package was compiled, and scoring software was developed, at NIST, making use of newswire and web data and reference translations developed by the Linguistic Data Consortium and the Defense Language Institute Foreign Language Center.

The objective of the OpenMT evaluation series is to support research in, and help advance the state of the art of, machine translation (MT) technologies -- technologies that translate text between human languages. Input may include all forms of text. The goal is for the output to be an adequate and fluent translation of the original. The 2012 task included the evaluation of five language pairs: Arabic-to-English, Chinese-to-English, Dari-to-English, Farsi-to-English and Korean-to-English in two source data styles. For general information about the NIST OpenMT evaluations, refer to the NIST OpenMT website.

This evaluation kit includes a single Perl script (mteval-v13a.pl) that may be used to produce a translation quality score for one (or more) MT systems. The script works by comparing the system output translation with a set of (expert) reference translations of the same source text. Comparison is based on finding sequences of words in the reference translations that match word sequences in the system output translation.
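
The word-sequence comparison the script performs is, at its core, clipped n-gram matching of the kind used by BLEU-style metrics. The function below is a simplified single-reference sketch of that idea, not a reimplementation of mteval-v13a.pl:

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision: the fraction of candidate n-grams that
    also occur in the reference, with repeat matches capped by the
    reference counts. Simplified: one reference, no brevity penalty."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    matched = sum(min(c, ref[g]) for g, c in cand.items())
    return matched / max(sum(cand.values()), 1)

cand = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
print(ngram_precision(cand, ref, 1))  # 5 of 6 unigrams match
```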

This release consists of 20 files, four for each of the five languages, presented in XML with an included DTD. The four files are source and reference data in the following two styles:
  • English-true: an English-oriented translation; the text must read well and must not carry over idiomatic expressions from the foreign language to convey meaning, unless absolutely necessary.
  • Foreign-true: a translation as close as possible to the foreign language, as if the text had originated in that language.
NIST 2012 Open Machine Translation (OpenMT) Progress Test Five Language Source is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

Wednesday, January 15, 2014

LDC January 2014 Newsletter


New publications:


LDC Membership Discounts for MY 2014 Still Available

If you are considering joining LDC for Membership Year 2014 (MY2014), there is still time to save on membership fees. Any organization which joins or renews membership for 2014 through Monday, March 3, 2014, is entitled to a 5% discount on membership fees.  Organizations which held membership for MY2013 can receive a 10% discount on fees provided they renew prior to March 3, 2014.  For further information on pricing, please view our Invitation to Join for Membership Year 2014 announcement or contact LDC.

New Publications

(1) CALLFRIEND Farsi Second Edition Speech was developed by LDC and consists of approximately 42 hours of telephone conversation (100 recordings) among native Farsi speakers. The calls were recorded in 1995 and 1996 as part of the CALLFRIEND collection, a project designed primarily to support research in automatic language identification. One hundred native Farsi speakers living in the continental United States each made a single telephone call, lasting up to 30 minutes, to a family member or friend living in the United States.

This release represents all calls from the collection. LDC released recordings from 60 calls without transcripts in 1996 as CALLFRIEND Farsi (LDC96S50) after 20 of those calls were used as evaluation data in the first NIST Language Recognition Evaluation (LRE).

Corresponding transcripts are available in CALLFRIEND Farsi Second Edition Transcripts (LDC2014T01).

All recordings involved domestic calls routed through LDC’s automated telephone collection platform and were stored as 2-channel (4-wire), 8-kHz mu-law samples taken directly from the public telephone network via a T-1 circuit. Each audio file is in FLAC-compressed MS-WAV (RIFF) format containing 2-channel, 8-kHz, 16-bit PCM sample data.

This release includes speaker information, including gender, the number of speakers on each channel and call duration.

CALLFRIEND Farsi Second Edition Speech is distributed on one DVD-ROM.

2014 Subscription Members will automatically receive two copies of this data. 2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(2) CALLFRIEND Farsi Second Edition Transcripts was developed by LDC and consists of transcripts for approximately 42 hours of telephone conversation (100 recordings) among native Farsi speakers. The calls were recorded in 1995 and 1996 as part of the CALLFRIEND collection, a project designed primarily to support research in automatic language identification. One hundred native Farsi speakers living in the continental United States made a single telephone call, lasting up to 30 minutes, to a family member or friend living in the United States.

Corresponding speech data is available as CALLFRIEND Farsi Second Edition Speech (LDC2014S01).

Transcripts are presented in three formats: romanized transcripts (*asc.txt), Arabic-script transcripts (*ntv.txt) and both romanized and Arabic forms in a simple XML format (*.xml). For the *.txt files, the four main fields on each line (start-offset, end-offset, speaker-label, transcript-text) are separated by tabs. Each file begins with a single comment line containing the file_id string. This is followed immediately by the list of time-stamped segments, in order according to their start-offset values, with no blank lines. The XML form of the transcripts contains both Arabicized and romanized forms for Farsi words.
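
The *.txt layout described above can be parsed line by line. In this sketch the file_id value and the "#" comment prefix are assumptions for illustration; the transcripts' actual comment-line syntax may differ:

```python
def parse_transcript(text):
    """Parse a transcript: one comment line carrying the file_id, then
    tab-separated (start-offset, end-offset, speaker-label, text) segments."""
    lines = text.splitlines()
    file_id = lines[0].lstrip("# ").strip()   # comment prefix assumed
    segments = []
    for line in lines[1:]:
        start, end, speaker, words = line.split("\t", 3)
        segments.append((float(start), float(end), speaker, words))
    return file_id, segments

demo = "# fla_0850\n0.00\t2.41\tA\tsalam\n2.41\t5.10\tB\tchetori?"
fid, segs = parse_transcript(demo)
print(fid, segs[0])  # fla_0850 (0.0, 2.41, 'A', 'salam')
```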

CALLFRIEND Farsi Second Edition Transcripts is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee. 

Tuesday, December 17, 2013

LDC December 2013 Newsletter


Spring 2014 LDC Data Scholarship Program - deadline approaching
LDC to close for Winter Break

New publications:




Spring 2014 LDC Data Scholarship Program - deadline approaching 


The deadline for the Spring 2014 LDC Data Scholarship Program is right around the corner. Student applications are being accepted now through January 15, 2014, 11:59PM EST. The LDC Data Scholarship program provides university students with access to LDC data at no cost. This program is open to students pursuing both undergraduate and graduate studies in an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay.

Students will need to complete an application which consists of a data use proposal and letter of support from their adviser.  For further information on application materials and program rules, please visit the LDC Data Scholarship page.


Students can email their applications to the LDC Data Scholarships program. Decisions will be sent by email from the same address.



LDC to close for Winter Break

LDC will be closed from Wednesday, December 25, 2013 through Wednesday, January 1, 2014 in accordance with the University of Pennsylvania Winter Break Policy. Our offices will reopen on Thursday, January 2, 2014. Requests received for membership renewals and corpora during the Winter Break will be processed at that time.
Best wishes for a happy holiday season!


New publications


GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 1 was developed by LDC and contains 179,842 tokens of word aligned Chinese and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.


Some approaches to statistical machine translation include the incorporation of linguistic knowledge in word aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. A set of word tags and alignment link tags are designed in the tagging scheme to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation. 


This release consists of Chinese source broadcast conversation (BC) and broadcast news (BN) programming collected by LDC in 2005 - 2007. 


The Chinese word alignment tasks consisted of the following components:

  • Identifying, aligning, and tagging 8 different types of links
  • Identifying, attaching, and tagging local-level unmatched words
  • Identifying and tagging sentence/discourse-level unmatched words
  • Identifying and tagging all instances of Chinese 的(DE) except when they were a part of a semantic link.

GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 1 is distributed via web download. 2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

Maninkakan Lexicon was developed by LDC and contains 5,834 entries of the Maninkakan language presented as a Maninkakan-English lexicon and a Maninkakan-French lexicon. It is the second publication in an ongoing LDC project to build an electronic dictionary of four Mandekan languages: Mawukakan, Maninkakan, Bambara and Jula. These are Eastern Manding languages in the Mande Group of the Niger-Congo language family. LDC released a Mawukakan Lexicon (LDC2005L01) in 2005.


More information about LDC’s work in the languages of West Africa and the challenges those languages present for language resource development can be found here.


Maninkakan is written using Latin script, Arabic script and the NKo alphabet. This lexicon is presented using a Latin-based transcription system because the Latin alphabet is familiar to the majority of Mandekan language speakers and because it is expected to facilitate the work of researchers interested in this resource.


The dictionary is provided in two formats, Toolbox and XML. Toolbox is a version of the widely used SIL Shoebox program adapted to display Unicode.  The Toolbox files are provided in two fonts, Arial and Doulos SIL. The Arial files should display using the Arial font, which is standard on most operating systems. Doulos SIL, available as a free download, is a robust font that should display all characters without issue. Users should launch Toolbox using the *.prj files in the Arial or Doulos_SIL folders.


Maninkakan Lexicon is distributed via web download. 2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*

The ARRAU (Anaphora Resolution and Underspecification) Corpus of Anaphoric Information was developed by the University of Essex and the University of Trento. It contains annotations of multi-genre English texts for anaphoric relations with information about agreement and explicit representation of multiple antecedents for ambiguous anaphoric expressions and discourse antecedents for expressions which refer to abstract entities such as events, actions and plans. 


The source texts in this release include task-oriented dialogues from the TRAINS-91 and TRAINS-93 corpora (the latter released through LDC, TRAINS Spoken Dialog Corpus LDC95S25), narratives from the English Pear Stories, articles from the Wall Street Journal portions of the Penn Treebank (Treebank-2 LDC95T7) and the RST Discourse Treebank LDC2002T07, and the Vieira/Poesio Corpus, which consists of training and test files from Treebank-2 and RST Discourse Treebank.


The texts were annotated using the ARRAU guidelines which treat all noun phrases (NPs) as markables. Different semantic roles are recognized by distinguishing between referring expressions (that update or refer to a discourse model), and non-referring ones (including expletives, predicative expressions, quantifiers, and coordination). A variety of linguistic features were also annotated, including morphosyntactic agreement, grammatical function, semantic type (person, animate, concrete, action, time, other abstract) and genericity. The annotation was carried out using the MMAX2 annotation tool which allows text units to be marked at different levels. 


The files in MMAX format have been organized so that they can be visualized using the MMAX2 tool or directly used as input/output for the BART toolkit which performs automatic coreference resolution including all necessary preprocessing steps.


The ARRAU Corpus of Anaphoric Information is distributed via web download.

2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.
 


Monday, November 18, 2013

LDC November 2013 Newsletter



Invitation to Join for Membership Year 2014 
Spring 2014 LDC Data Scholarship Program
LDC to Close for Thanksgiving Break


        New publications:

Chinese Treebank 8.0 
CSC Deceptive Speech 



Invitation to Join for Membership Year (MY) 2014
 
Membership Year (MY) 2014 is open for joining. We would like to invite all current and previous members of LDC to renew their membership as well as welcome new organizations to join the Consortium. For MY2014, LDC is pleased to maintain membership fees at last year’s rates – membership fees will not increase.  Additionally, LDC will extend discounts on membership fees to members who keep their membership current and who join early in the year.

The details of our early renewal discounts for MY2014 are as follows:

·   Organizations that joined for MY2013 will receive a 5% discount when renewing. This discount will apply throughout 2014, regardless of time of renewal. MY2013 members renewing before Monday, March 3, 2014 will receive an additional 5% discount, for a total 10% discount off the membership fee.

·    New members, as well as organizations that did not join for MY2013 but held membership in any of the previous MYs (1993-2012), will also be eligible for a 5% discount provided that they join/renew before March 3, 2014.

Not-for-Profit/US Government

Standard US$2400 (MY 2014 Fee)
              US$2280 (with 5% discount)*
              US$2160 (with 10% discount)**

Subscription US$3850 (MY 2014 Fee)
                    US$3658 (with 5% discount)*
                    US$3465 (with 10% discount)**

For-Profit
Standard US$24000 (MY 2014 Fee)
               US$22800 (with 5% discount)*
               US$21600 (with 10% discount)**


Subscription US$27500 (MY 2014 Fee)
                    US$26125 (with 5% discount)*
                    US$24750 (with 10% discount)**

*  For new members, MY2013 Members renewing for MY2014, and any previous year Member who renews before March 3, 2014

** For MY2013 Members renewing before March 3, 2014

Publications for MY2014 are still being planned; here are the working titles of data sets we intend to provide:


2009 NIST Language Recognition Evaluation
Callfriend Farsi Speech and Transcripts
GALE data -- all phases and genres
Hispanic-English Speech
MADCAT Phase 4 Training
MALACH Czech ASR
NIST OpenMT Five Language Progress Set

In addition to receiving new publications, current year members of LDC also enjoy the benefit of licensing older data at reduced costs; current year for-profit members may use most data for commercial applications.


Spring 2014 LDC Data Scholarship Program

Applications are now being accepted through Wednesday, January 15, 2014, 11:59PM EST for the Spring 2014 LDC Data Scholarship program! The LDC Data Scholarship program provides university students with access to LDC data at no cost. During previous program cycles, LDC has awarded no-cost copies of LDC data to over 35 individual students and student research groups.

This program is open to students pursuing both undergraduate and graduate studies in an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay. The selection process is highly competitive.

The application consists of two parts:

(1) Data Use Proposal. Applicants must submit a proposal describing their intended use of the data. The proposal should state which data the student plans to use and how the data will benefit their research project as well as information on the proposed methodology or algorithm.

Applicants should consult the LDC Catalog for a complete list of data distributed by LDC. Due to certain restrictions, a handful of LDC corpora are available only to members of the Consortium. Applicants are advised to select no more than two datasets; students may apply for additional datasets during the following cycle once they have completed processing of the initial datasets and published or presented the work in a juried venue.

(2) Letter of Support. Applicants must submit one letter of support from their thesis adviser or department chair. The letter must verify the student's need for data and confirm that the department or university lacks the funding to pay the full Non-member Fee for the data or to join the Consortium.

For further information on application materials and program rules, please visit the LDC Data Scholarship page.

Students can email their applications to the LDC Data Scholarship program. Decisions will be sent by email from the same address.

The deadline for the Spring 2014 program cycle is January 15, 2014, 11:59PM EST.

LDC to Close for Thanksgiving Break

LDC will be closed on Thursday, November 28, 2013 and Friday, November 29, 2013 in observance of the US Thanksgiving Holiday.  Our offices will reopen on Monday, December 2, 2013.

New publications

Chinese Treebank 8.0 consists of approximately 1.5 million words of annotated and parsed text from Chinese newswire, government documents, magazine articles, various broadcast news and broadcast conversation programs, web newsgroups and weblogs.

The Chinese Treebank project began at the University of Pennsylvania in 1998, continued at the University of Colorado and then moved to Brandeis University. The project’s goal is to provide a large, part-of-speech tagged and fully bracketed Chinese language corpus. The first delivery, Chinese Treebank 1.0, contained 100,000 syntactically annotated words from Xinhua News Agency newswire. It was later corrected and released in 2001 as Chinese Treebank 2.0 (LDC2001T11) and consisted of approximately 100,000 words. The LDC released Chinese Treebank 4.0 (LDC2004T05), an updated version containing roughly 400,000 words, in 2004. A year later, LDC published the 500,000 word Chinese Treebank 5.0 (LDC2005T01). Chinese Treebank 6.0 (LDC2007T36), released in 2007, consisted of 780,000 words. Chinese Treebank 7.0 (LDC2010T08), released in 2010, added new annotated newswire data, broadcast material and web text to the approximate total of one million words. Chinese Treebank 8.0 adds new annotated data from newswire, magazine articles and government documents.

There are 3,007 text files in this release, containing 71,369 sentences, 1,620,561 words and 2,589,848 characters (hanzi or foreign). The data is provided in UTF-8 encoding, and the annotation has Penn Treebank-style labeled brackets. Details of the annotation standard can be found in the segmentation, POS-tagging and bracketing guidelines included in the release. The data is provided in four different formats: raw text, word segmented, POS-tagged, and syntactically bracketed. All files were automatically verified and manually checked.
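The syntactically bracketed format uses Penn Treebank-style labeled brackets, which can be read with a small recursive parser. Below is a minimal sketch; the sample tree is an invented toy example, not text from the corpus, and the real files carry fuller label inventories than shown here.

```python
def parse_tree(s):
    """Parse a Penn Treebank-style bracketed string into nested lists."""
    # Pad parentheses so a plain split() tokenizes the bracketing.
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos):
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]          # constituent label, e.g. IP, NP, VP
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                child, pos = read(pos)   # recurse into a sub-constituent
                children.append(child)
            else:
                children.append(tokens[pos])  # a terminal (word)
                pos += 1
        return [label] + children, pos + 1   # skip the closing bracket

    tree, _ = read(0)
    return tree

# Toy example: "China develops" with hypothetical bracketing
tree = parse_tree("(IP (NP (NN 中国)) (VP (VV 发展)))")
```

The same reader works for any of the Treebank's bracketed sentences, since the format is a uniform s-expression of labels and terminals.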

Chinese Treebank 8.0 is distributed via web download. 2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.


*


CSC Deceptive Speech was developed by Columbia University, SRI International and University of Colorado Boulder. It consists of 32 hours of audio interviews from 32 native speakers of Standard American English (16 male, 16 female) recruited from the Columbia University student population and the community. The purpose of the study was to distinguish deceptive speech from non-deceptive speech using machine learning techniques on features extracted from the corpus.

The participants were told that they were participating in a communication experiment which sought to identify people who fit the profile of the top entrepreneurs in America. To this end, the participants performed tasks and answered questions in six areas. They were later told that they had received low scores in some of those areas and did not fit the profile. The subjects then participated in an interview where they were told to convince the interviewer that they had actually achieved high scores in all areas and that they did indeed fit the profile. The task of the interviewer was to determine how he thought the subjects had actually performed, and he was allowed to ask them any questions other than those that were part of the performed tasks. For each question from the interviewer, subjects were asked to indicate whether the reply was true or contained any false information by pressing one of two pedals hidden from the interviewer under a table.

Interviews were conducted in a double-walled sound booth and recorded to digital audio tape on two channels using Crown CM311A Differoid headworn close-talking microphones, then downsampled to 16kHz before processing.

The interviews were orthographically transcribed by hand using the NIST EARS transcription guidelines. Labels for local lies were obtained automatically from the pedal-press data and hand-corrected for alignment, and labels for global lies were annotated during transcription based on the known scores of the subjects versus their reported scores. The orthographic transcription was force-aligned using the SRI telephone speech recognizer adapted for full-bandwidth recordings. There are several segmentations associated with the corpus: the implicit segmentation of the pedal presses, semi-automatically derived sentence-like units (EARS SLASH-UNITS, or SUs) which were hand-labeled, intonational phrase units and the units corresponding to each topic of the interview.

CSC Deceptive Speech is distributed on 1 DVD-ROM. 2013 Subscription Members will automatically receive two copies of this data provided they have completed and returned the User License Agreement for CSC Deceptive Speech (LDC2013S09). 2013 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.