Tuesday, September 17, 2013

LDC September 2013 Newsletter


New LDC Website Coming Soon
LDC Spoken Language Sampler - 2nd Release

New publications:

GALE Phase 2 Arabic Broadcast Conversation Speech Part 2
GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 2
Semantic Textual Similarity (STS) 2013 Machine Translation



New LDC Website Coming Soon

Look for LDC's new website in the coming weeks. We've revamped the design and site plan to make it easier than ever to find what you're looking for. The features you use the most -- the catalog, new corpus releases and user login -- will be a short click away. We expect the LDC website to be intermittently unavailable for a few days at the end of September while we make the switch, and we thank you in advance for your understanding.
LDC Spoken Language Sampler - 2nd Release

The LDC Spoken Language Sampler – 2nd Release is now available.  It contains speech and transcript samples from recent releases and is available at no cost.  Follow the link above to the catalog page, download and browse.

New publications:

(1) GALE Phase 2 Arabic Broadcast Conversation Speech Part 2 was developed by LDC and is comprised of approximately 128 hours of Arabic broadcast conversation speech collected in 2007 by LDC as part of the DARPA GALE (Global Autonomous Language Exploitation) Program. The data was collected at LDC’s Philadelphia, PA USA facilities and at three remote collection sites. The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources for a total of over 30,000 hours of collected broadcast audio over the life of the program. 

LDC's local broadcast collection system is highly automated, easily extensible and robust and capable of collecting, processing and evaluating hundreds of hours of content from several dozen sources per day. The broadcast material is served to the system by a set of free-to-air (FTA) satellite receivers, commercial direct satellite systems (DSS) such as DirecTV, direct broadcast satellite (DBS) receivers, and cable television (CATV) feeds. The mapping between receivers and recorders is dynamic and modular; all signal routing is performed under computer control, using a 256x64 A/V matrix switch. Programs are recorded in a high bandwidth A/V format and are then processed to extract audio, to generate keyframes and compressed audio/video, to produce time-synchronized closed captions (in the case of North American English) and to generate automatic speech recognition (ASR) output. 

The broadcast conversation recordings in this release feature interviews, call-in programs and round table discussions focusing principally on current events from several sources. This release contains 141 audio files presented in .wav, 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Arabic speaker following Audit Procedure Specification Version 2.0 which is included in this release.
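For anyone working with the audio programmatically, the stated format (16000 Hz, single-channel, 16-bit PCM .wav) can be verified with Python's standard library. This is just a convenience sketch; the function name is our own and the path argument is whatever file you point it at:

```python
import wave

def is_expected_format(path):
    """Return True if a .wav file is 16000 Hz, single-channel, 16-bit PCM."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == 16000 and
                w.getnchannels() == 1 and
                w.getsampwidth() == 2)  # 2 bytes per sample = 16-bit
```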
GALE Phase 2 Arabic Broadcast Conversation Speech Part 2 is distributed on two DVD-ROMs.

2013 Subscription Members will automatically receive two copies of this data.  2013 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(2) GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 2 was developed by LDC and contains transcriptions of approximately 128 hours of Arabic broadcast conversation speech collected in 2007 by LDC, MediaNet, Tunis, Tunisia and MTC, Rabat, Morocco during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) program. The source broadcast conversation recordings feature interviews, call-in programs and round table discussions focusing principally on current events from several sources. 

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 763,945 tokens. The transcripts were created with the LDC-developed transcription tool, XTrans, a multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings. 
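Because TDF is plain tab-delimited UTF-8 text, the transcripts are easy to process with standard tools. The sketch below tallies a token count; the transcript column index is an assumption, so check it against the column header line shipped with the release:

```python
import csv

# Sketch: token count over a TDF transcript file. TDF is plain
# tab-delimited UTF-8; the transcript column index used here (7) is
# an assumption -- verify it against the release documentation.
def count_tokens(tdf_path, transcript_col=7):
    total = 0
    with open(tdf_path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            # Skip blank lines and ";;"-style metadata/comment lines.
            if not row or row[0].startswith(";;"):
                continue
            if len(row) > transcript_col:
                total += len(row[transcript_col].split())
    return total
```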

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC’s quick transcription guidelines (QTR) and quick rich transcription specification (QRTR) both of which are included in the documentation with this release. QTR transcription consists of quick (near-)verbatim, time-aligned transcripts plus speaker identification with minimal additional mark-up. It does not include sentence unit annotation. QRTR annotation adds structural information such as topic boundaries and manual sentence unit annotation to the core components of a quick transcript.
GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 2 is distributed via web download.
2013 Subscription Members will automatically receive two copies of this data on disc.  2013 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.



*

(3) Semantic Textual Similarity (STS) 2013 Machine Translation was developed as part of the STS 2013 Shared Task, which was held in conjunction with *SEM 2013, the second joint conference on lexical and computational semantics organized by the ACL (Association for Computational Linguistics) special interest groups SIGLEX and SIGSEM. It is comprised of one text file containing 750 English sentence pairs translated from Arabic and Chinese newswire and web data sources.

The goal of the Semantic Textual Similarity (STS) task was to create a unified framework for the evaluation of semantic textual similarity modules and to characterize their impact on natural language processing (NLP) applications. STS measures the degree of semantic equivalence between two texts. Historically, the semantic components involved have tended to be evaluated independently and without characterization of their impact on NLP applications; the STS task was proposed to allow an extrinsic evaluation of multiple such components within a single framework. More information is available at the STS 2013 Shared Task homepage.

The source data is Arabic and Chinese newswire and web data collected by LDC that was translated and used in the DARPA GALE (Global Autonomous Language Exploitation) program and in several NIST Open Machine Translation evaluations. Of the 750 sentence pairs, 150 pairs are from the GALE Phase 5 collection and 600 pairs are from NIST 2008-2012 Open Machine Translation (OpenMT) Progress Test Sets (LDC2013T07).

The data was built to identify semantic textual similarity between two short text passages. The corpus is comprised of two tab delimited sentences per line. The first sentence is a translation and the second sentence is a post-edited translation. Post-editing is a process to improve machine translation with a minimum of manual labor. The gold standard similarity values and other STS datasets can be obtained from the STS homepage, linked above. 
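As an illustration of how such a file might be consumed, the sketch below reads the tab-delimited pairs and scores each with a toy token-overlap (Jaccard) similarity. The function names are our own, and this baseline is emphatically not the official STS metric; gold standard scores come from the STS homepage:

```python
def read_pairs(path):
    """Yield (translation, post_edited) pairs, one per tab-delimited line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line:
                first, _, second = line.partition("\t")
                yield first, second

def jaccard(s1, s2):
    """Toy token-overlap similarity in [0, 1] -- not the official STS score."""
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return len(a & b) / len(a | b) if (a or b) else 0.0
```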

Semantic Textual Similarity (STS) 2013 Machine Translation is distributed via web download.
2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may request this data by submitting a signed copy of the LDC User Agreement for Non-members.  This data is available at no cost.

Monday, August 19, 2013

LDC August 2013 Newsletter

Mixer 6 now available
Fall 2013 LDC Data Scholarship Program - deadline approaching! 
LDC at Interspeech 2013, Lyon, France

New publications: 

GALE Phase 2 Chinese Broadcast Conversation Parallel Text Part 2
MADCAT Phase 3 Training Set    
Mixer 6 Speech  


Mixer 6 now available!


The release of Mixer 6 Speech this month marks the first time in close to a decade that LDC has made available a large-scale speech training data collection. Representing more than 15,000 hours of speech from over 500 speakers, Mixer 6 follows in the footsteps of the Switchboard and Fisher studies by providing a large database of rich telephone conversations with the addition of subject interviews and transcript readings. Participants were native American English speakers local to the Philadelphia area, providing further scope for a variety of research tasks. Mixer 6 Speech is a members-only release and a great reason to join the consortium. In addition to this substantial resource, members enjoy rights to other data released in 2013 and can license older publications at reduced fees. 

Please see the full description of Mixer 6 Speech.

Fall 2013 LDC Data Scholarship Program - deadline approaching!

The deadline for the Fall 2013 LDC Data Scholarship Program is one month away! Student applications are being accepted now through September 16, 2013, 11:59PM EST.  The LDC Data Scholarship program provides university students with access to LDC data at no cost. This program is open to students pursuing both undergraduate and graduate studies in an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay. 

Students will need to complete an application which consists of a data use proposal and letter of support from their adviser. For further information on application materials and program rules, please visit the LDC Data Scholarship page.  

Students can email their applications to the LDC Data Scholarship program. Decisions will be sent by email from the same address.

LDC at Interspeech 2013, Lyon, France

LDC will once again be exhibiting at Interspeech, held this year August 25-29 in Lyon. Please stop by LDC's booth to learn about recent developments at the Consortium, including new publications.

Also, be on the lookout for the following presentations:
  • Speech Activity Detection on YouTube Using Deep Neural Networks
    • Neville Ryant, Mark Liberman, Jiahong Yuan (all LDC)
    • Monday 26 August, Poster 6,  16.00 – 18.00
    • Room: Forum 6
  • The Spectral Dynamics of Vowels in Mandarin Chinese 
    • Jiahong Yuan (LDC)
    • Tuesday 27 August, Oral 17, 14.00 – 16.00 
    • Room: Gratte-Ciel 3 
  • Automatic Phonetic Segmentation using Boundary Models
    • Jiahong Yuan (LDC), Neville Ryant (LDC), Mark Liberman (LDC), Andreas Stolcke, Vikramjit Mitre, Wen Wang
    • Wednesday 28 August, Oral 32, 14.00 – 16.00
    • Room: Gratte-Ciel 3
LDC will continue to post conference updates via our Facebook page. We hope to see you there!

New publications:


(1) GALE Phase 2 Chinese Broadcast Conversation Parallel Text Part 2 was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Chinese source text and corresponding English translations selected from broadcast conversation (BC) data collected by LDC in 2005-2007 and transcribed by LDC or under its direction.

This release includes 20 source-translation document pairs, comprising 152,894 characters of Chinese source text and its English translation. Data is drawn from six distinct Chinese programs broadcast in 2005-2007 from Phoenix TV, a Hong Kong-based satellite television station. Broadcast conversation programming is generally more interactive than traditional news broadcasts and includes talk shows, interviews, call-in programs and roundtable discussions. The programs in this release focus on current events topics.

The data was transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with Quick Rich Transcription guidelines developed by LDC. Transcribers indicated sentence boundaries in addition to transcribing the text. Data was manually selected for translation according to several criteria, including linguistic features, transcription features and topic features. The transcribed and segmented files were then reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Chinese to English translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations.

GALE Phase 2 Chinese Broadcast Conversation Parallel Text Part 2 is distributed via web download. 2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) MADCAT (Multilingual Automatic Document Classification Analysis and Translation) Phase 3 Training Set contains all training data created by LDC to support Phase 3 of the DARPA MADCAT Program. The data in this release consists of handwritten Arabic documents, scanned at high resolution and annotated for the physical coordinates of each line and token. Digital transcripts and English translations of each document are also provided, with the various content and annotation layers integrated in a single MADCAT XML output. 

The goal of the MADCAT program is to automatically convert foreign text images into English transcripts. MADCAT Phase 3 data was collected from Arabic source documents in three genres: newswire, weblog and newsgroup text. Arabic speaking scribes copied documents by hand, following specific instructions on writing style (fast, normal, careful), writing implement (pen, pencil) and paper (lined, unlined). Prior to assignment, source documents were processed to optimize their appearance for the handwriting task, which resulted in some original source documents being broken into multiple pages for handwriting. Each resulting handwritten page was assigned to up to five independent scribes, using different writing conditions. 

The handwritten, transcribed documents were next checked for quality and completeness, then each page was scanned at a high resolution (600 dpi, greyscale) to create a digital version of the handwritten document. The scanned images were then annotated to indicate the physical coordinates of each line and token. Explicit reading order was also labeled, along with any errors produced by the scribes when copying the text.

The final step was to produce a unified data format that takes multiple data streams and generates a single MADCAT XML output file which contains all required information. The resulting madcat.xml file contains distinct components: a text layer that consists of the source text, tokenization and sentence segmentation; an image layer that consists of bounding boxes; a scribe demographic layer that consists of scribe ID and partition (train/test); and a document metadata layer. This release includes 4,540 annotation files in both GEDI XML and MADCAT XML formats (gedi.xml and madcat.xml) along with their corresponding scanned image files in TIFF format.
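Consumers of the release typically walk these layers with an ordinary XML parser. The sketch below pulls per-token outlines from a madcat.xml file; note that the element and attribute names ("token", "point", "x", "y", "id") are placeholders chosen to illustrate the idea, and the actual schema is defined by the DTDs included with the release:

```python
import xml.etree.ElementTree as ET

# Sketch: collect the outline points of each token in a MADCAT-style
# XML file. Element/attribute names here are illustrative assumptions;
# consult the DTDs shipped with the corpus for the real schema.
def token_outlines(madcat_xml_path):
    tree = ET.parse(madcat_xml_path)
    outlines = {}
    for tok in tree.getroot().iter("token"):
        pts = [(int(p.get("x")), int(p.get("y")))
               for p in tok.iter("point")]
        outlines[tok.get("id")] = pts
    return outlines
```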

MADCAT Phase 3 Training Set is distributed on one DVD-ROM. 2013 Subscription Members will automatically receive two copies of this data. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) Mixer 6 Speech was developed by LDC and is comprised of 15,863 hours of telephone speech, interviews and transcript readings from 594 distinct native English speakers. This material was collected by LDC in 2009 and 2010 as part of the Mixer project, specifically phase 6, the focus of which was on native American English speakers local to the Philadelphia area. 

The speech data in this release was collected by LDC at its Human Subjects Collection facilities in Philadelphia. The telephone collection protocol was similar to other LDC telephone studies (e.g., Switchboard-2 Phase III Audio - LDC2002S06): recruited speakers were connected through a robot operator to carry on casual conversations lasting up to 10 minutes, usually about a daily topic announced by the robot operator at the start of the call. The raw digital audio content for each call side was captured as a separate channel, and each full conversation was presented as a 2-channel interleaved audio file, with 8000 samples/second and u-law sample encoding. Each speaker was asked to complete 15 calls.
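For readers handling the raw u-law audio directly, the standard G.711 u-law expansion and a simple de-interleave can be sketched as follows (the helper names are our own; library routines such as those in SoX accomplish the same thing):

```python
def mulaw_to_linear(byte):
    """Decode one 8-bit u-law sample to a linear 16-bit value (G.711)."""
    byte = ~byte & 0xFF
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample

def split_channels(raw):
    """De-interleave a 2-channel u-law byte stream into two mono streams."""
    return raw[0::2], raw[1::2]
```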

The multi-microphone portion of the collection utilized 14 distinct microphones installed identically in two multi-channel audio recording rooms at LDC. Each session was guided by collection staff using prompting and recording software to conduct the following activities: (1) repeat questions (less than one minute), (2) informal conversation (typically 15 minutes), (3) transcript reading (approximately 15 minutes) and (4) telephone call (generally 10 minutes). Speakers recorded up to three 45-minute sessions on distinct days. The 14 channels were recorded synchronously into separate single-channel files, using 16-bit PCM sample encoding at 16000 samples/second.

The recordings in this corpus were used in NIST Speaker Recognition Evaluation (SRE) test sets for 2010 and 2012. Researchers interested in applying those benchmark test sets should consult the respective NIST Evaluation Plans for guidelines on allowable training data for those tests. The collection contains 4,410 recordings made via the public telephone network and 1,425 sessions of multiple microphone recordings in office-room settings. The telephone recordings are presented as 8 kHz 2-channel NIST SPHERE files, and the microphone recordings are 16 kHz 1-channel flac/ms-wav files.
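A SPHERE file carries a plain-ASCII header that can be inspected without special tools: the file begins with the line "NIST_1A", a line giving the header size in bytes, then "name -type value" fields up to "end_head". The sketch below parses that header with the standard library only; NIST's SPHERE utilities (e.g., sph2pipe, which LDC distributes) remain the authoritative readers:

```python
def sphere_header(path):
    """Parse the ASCII header of a NIST SPHERE file into a dict of strings."""
    with open(path, "rb") as f:
        magic = f.read(8)
        assert magic == b"NIST_1A\n", "not a SPHERE file"
        header_size = int(f.read(8).decode("ascii").split()[0])
        f.seek(0)
        text = f.read(header_size).decode("ascii", errors="replace")
    fields = {}
    for line in text.splitlines()[2:]:   # skip "NIST_1A" and size lines
        if line.strip() == "end_head":
            break
        parts = line.split(None, 2)      # e.g. "sample_rate -i 8000"
        if len(parts) == 3:
            fields[parts[0]] = parts[2]
    return fields
```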

Mixer 6 Speech is distributed on one hard drive. 2013 Subscription Members will automatically receive one copy of this data on hard drive. 2013 Standard Members may request a copy as part of their 16 free membership corpora. As a Members-Only release, Mixer 6 Speech is not available for non-member licensing.

Monday, July 15, 2013

LDC July 2013 Newsletter



Fall 2013 Data Scholarship Program


Applications are now being accepted through September 16, 2013, 11:59PM EST for the Fall 2013 LDC Data Scholarship program! The LDC Data Scholarship program provides university students access to LDC data at no cost.

This program is open to students pursuing both undergraduate and graduate studies in an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay. The selection process is highly competitive.

The application consists of two parts:

(1) Data Use Proposal. Applicants must submit a proposal describing their intended use of the data. The proposal should state which data the student plans to use and how the data will benefit their research project as well as information on the proposed methodology or algorithm.

Applicants should consult the LDC Corpus Catalog for a complete list of data distributed by LDC. Due to licensing restrictions, a handful of LDC corpora are available only to members of the Consortium. Applicants are advised to select a maximum of one to two databases.

(2) Letter of Support. Applicants must submit one letter of support from their thesis adviser or department chair. The letter must confirm that the department or university lacks the funding to pay the full Non-member Fee for the data and verify the student's need for data.

For further information on application materials and program rules, please visit the LDC Data Scholarship page.

Students can email their applications to the LDC Data Scholarship program. Decisions will be sent by email from the same address.

The deadline for the Fall 2013 program is Monday, September 16, 2013, 11:59PM EST.

New publications:

(1) Chinese Proposition Bank 3.0 is a continuation of the Chinese Proposition Bank project which aims to create a corpus of text annotated with information about basic semantic propositions. Chinese Proposition Bank 3.0 adds predicate-argument annotation on 187,731 words from Chinese Treebank 7.0 (LDC2010T07). The data sources are comprised of newswire, magazine articles, various broadcast news and broadcast conversation programming, web newsgroups and weblogs. LDC has also released Chinese Proposition Bank 1.0 (LDC2005T23) and Chinese Proposition Bank 2.0 (LDC2008T07).

This release contains the predicate-argument annotation of 173,206 verb instances and 14,525 noun instances. The annotation of nouns is limited to nominalizations that have a corresponding verb. The general annotation guidelines and the lexical guidelines (called frame files) for each verbal and nominal predicate are also included in this release. Below are some statistics about the corpus.
  • Total propositions for verbs - 173,206
  • Total propositions for nouns - 14,525
  • Total verbs framed - 24,642
  • Total framesets - 26,467
  • Verbs with multiple framesets - 1,337
  • Average framesets per verb - 1.07
  • Total nouns framed - 1,421
  • Total noun framesets - 1,528
  • Nouns with multiple framesets - 48
  • Average framesets per noun - 1.08
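The statistics above are internally consistent; the reported averages follow directly from the totals, as this quick arithmetic check shows:

```python
# Sanity check on the frameset statistics listed above: the reported
# averages are the total framesets divided by the predicates framed,
# rounded to two decimal places.
verb_avg = 26467 / 24642   # total framesets / total verbs framed
noun_avg = 1528 / 1421     # noun framesets / nouns framed
assert round(verb_avg, 2) == 1.07
assert round(noun_avg, 2) == 1.08
```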
Chinese Proposition Bank 3.0 is distributed via web download. 2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 1 was developed by LDC and contains 115,826 tokens of word aligned Arabic and English parallel text with treebank annotations. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.

Parallel aligned treebanks are treebanks annotated with morphological and syntactic structures aligned at the sentence level and the sub-sentence level. Such data sets are useful for natural language processing and related fields, including automatic word alignment system training and evaluation, transfer-rule extraction, word sense disambiguation, translation lexicon extraction and cultural heritage and cross-linguistic studies. With respect to machine translation system development, parallel aligned treebanks may improve system performance with enhanced syntactic parsers, better rules and knowledge about language pairs and reduced word error rate.

In this release, the source Arabic data was translated into English. Arabic and English treebank annotations were performed independently. The parallel texts were then word aligned. The material in this corpus corresponds to a portion of the Arabic treebanked data in Arabic Treebank - Broadcast News v1.0 (LDC2012T07).

The source data consists of Arabic broadcast news programming collected by LDC in 2005 and 2006 from Alhurra, Aljazeera and Dubai TV. All data is encoded as UTF-8. A count of files, words, tokens and segments is below.

Language    Files    Words     Tokens     Segments
Arabic      28       89,213    115,826    4,824

Note: The word count is based on the untokenized Arabic source; the token count is based on the ATB-tokenized Arabic source.

The purpose of the GALE word alignment task was to find correspondences between words, phrases or groups of words in a set of parallel texts. Arabic-English word alignment annotation consisted of the following tasks:
  • Identifying different types of links: translated (correct or incorrect) and not translated (correct or incorrect)
  • Identifying sentence segments not suitable for annotation, e.g., blank segments, incorrectly-segmented segments, segments with foreign languages
  • Tagging unmatched words attached to other words or phrases
GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 1 is distributed via web download. 2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Monday, June 17, 2013

LDC June 2013 Newsletter

High School students use LDC data


High School students use LDC data

A team of students at Thomas Jefferson High School for Science and Technology in Alexandria, VA, USA, have used an LDC database for the development of a device to help autistic children recognize emotions. This team was funded by a grant from the Lemelson-MIT InvenTeam Initiative Program. InvenTeams are groups of high school students, teachers, and mentors that receive grants up to US$10,000 each to invent technological solutions to real-world problems.

The team set out to invent an emotive aid in the form of a bracelet that uses a computational algorithm to extract emotional signatures from speech and display expressed emotions in real-time during a conversation. Potential beneficiaries include children with autism, Asperger’s syndrome, or similar conditions that impair the ability to detect emotion. The algorithm employed machine learning and neural network-based techniques to improve accuracy and efficiency relative to current methods.

The students used speech samples from the LDC database Emotional Prosody Speech and Transcripts (LDC2002S28), as well as the Berlin Database of Emotional Speech, for training and testing their algorithm. Although the samples proved to be too small to produce an algorithm with a high degree of accuracy, the team's algorithm did demonstrate some degree of success. The students will present their results at EurekaFest at MIT in June.

LDC thanks the InvenTeam’s teacher, Mark Hannum, and group leader, Suhas Gondi, for contributing to this article.
  
New publications

(1) GALE Phase 2 Chinese Broadcast Conversation Parallel Text Part 1 was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Chinese source text and corresponding English translations selected from broadcast conversation (BC) data collected by LDC in 2006 and 2007 and transcribed by LDC or under its direction.

This release includes 21 source-translation document pairs, comprising 146,082 characters of Chinese source text and its English translation. Data is drawn from seven distinct Chinese programs broadcast in 2006 and 2007 from the following sources -- China Central TV, a national and international broadcaster in Mainland China and Phoenix TV, a Hong Kong-based satellite television station. Broadcast conversation programming is generally more interactive than traditional news broadcasts and includes talk shows, interviews, call-in programs and roundtable discussions. The programs in this release focus on current events topics.

The data was transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with Quick Rich Transcription guidelines developed by LDC. Transcribers indicated sentence boundaries in addition to transcribing the text. Data was manually selected for translation according to several criteria, including linguistic features, transcription features and topic features. The transcribed and segmented files were then reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Chinese to English translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations.

GALE Phase 2 Chinese Broadcast Conversation Parallel Text Part 1 is distributed via web download. 2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) Greybeard was developed by LDC and is comprised of approximately 590 hours of English telephone conversation speech collected by LDC in October and November 2008. The goal was to record new telephone conversations among subjects who had participated in one or more previous LDC telephone collections, from Switchboard-1 (1991) through the Mixer studies (2006).

A total of 172 subjects were enrolled in the Greybeard collection, all of whom had participated in one of the following:
  • Switchboard-1 (LDC97S62) 1991-1992: 2 subjects
  • Switchboard-2 (LDC98S75, LDC99S79, LDC2002S06) 1996-1997: 16 subjects
  • Mixer 1 and 2 2003-2005: 103 subjects
  • Mixer 3 2006: 51 subjects
Most Greybeard participants completed 12 calls. Some subjects completed up to 24 calls. Calls were made or received via an automatic operator system at LDC which connected two participants and announced a topic for discussion. 

This release consists of 4,680 calls -- the complete set of calls recorded during the Greybeard collection (1,098 calls) as well as all calls from the legacy collections that involved the Greybeard speakers.

The audio from each call was captured digitally by the operator system and stored in a separate file as raw mu-law sample data. As the recordings were uploaded daily from the robot operator to network disk storage, automated processes reformatted the audio into a 2-channel SPHERE-format file for each conversation and queued the recordings for manual audit to verify speaker identification and to check other aspects of the recording. 

Auditors provided impressionistic judgments on overall audio quality, presence of background noise and cross-channel echo and any other technical difficulty with the call, in addition to confirming the speaker-ID on each channel.

Greybeard is distributed on five DVDs. 2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) Manually Annotated Sub-Corpus Third Release (MASC) was developed as part of The American National Corpus project and consists of approximately 500,000 words of contemporary American English written and spoken data annotated for a wide variety of linguistic phenomena. 

The MASC project was established to address, to the extent possible, many of the obstacles to the creation of large-scale, robust, multiply-annotated corpora of English covering a wide range of genres of written and spoken language data. The project provides appropriate data and annotations to serve as the base for a community-wide annotation effort, together with an infrastructure that enables the incorporation of contributed annotations into a single, usable format that can then be analyzed as it is or transduced to any of a variety of other formats. Further information about the project is available at the MASC website.

The source texts were drawn from the open portion of the American National Corpus Second Release, and from the Language Understanding Annotation Corpus.  MASC Third Release includes the contents of MASC First Release (LDC2010T22) (82,000 words) which is also available from LDC. There is no second release.

All data in this release was annotated for logical structure (paragraph, headings, etc.), token and sentence boundaries, part of speech and lemma, shallow parse (noun and verb chunks) and named entities (person, organization, location and date). Portions of the corpus were also annotated for FrameNet frames (40k full text), Penn Treebank syntax (82k) and opinion (50k). 

Manually Annotated Sub-Corpus Third Release is distributed via web download.
2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may request this data by submitting a signed copy of the LDC User Agreement for Non-members. This data is available at no cost.

Thursday, May 16, 2013

LDC May 2013 Newsletter

 

LDC at ICASSP 2013

LDC will be at ICASSP 2013, the world’s largest and most comprehensive technical conference focused on signal processing and its applications. The event will be held May 26-31, and we look forward to interacting with members of this community at our exhibit table and during our poster and paper presentations:
  • ARTICULATORY TRAJECTORIES FOR LARGE-VOCABULARY SPEECH RECOGNITION
    • Authors: Vikramjit Mitra, Wen Wang, Andreas Stolcke, Hosung Nam, Colleen Richey, Jiahong Yuan (LDC), Mark Liberman (LDC)
    • Tuesday, May 28, 15:30 - 17:30, Poster Area D
  • SCALE-SPACE EXPANSION OF ACOUSTIC FEATURES IMPROVES SPEECH EVENT DETECTION
    • Authors: Neville Ryant, Jiahong Yuan, Mark Liberman (all LDC)
    • Tuesday, May 28, 16:30 - 16:50, Room 2011
  • USING MULTIPLE VERSIONS OF SPEECH INPUT IN PHONE RECOGNITION
    • Authors: Mark Liberman (LDC), Jiahong Yuan (LDC), Andreas Stolcke, Wen Wang, Vikramjit Mitra
    • Wednesday, May 29, 15:20 - 17:20, Poster Area D
Please look for LDC’s exhibition at Booth #53 in the Vancouver Convention Centre. We hope to see you there!


Early renewing members save on fees

To date just over 100 organizations have joined for Membership Year (MY) 2013.   For the sixth straight year, LDC's early renewal discount program has resulted in significant savings for our members.  Organizations that renewed membership or joined early for MY2013 saved over US$50,000! MY 2012 members are still eligible for a 5% discount when renewing for MY2013. This discount will apply throughout 2013.

Organizations joining LDC can take advantage of membership benefits including free membership year data as well as discounts on older LDC corpora.  For-profit members can use most LDC data for commercial applications.  Please visit our Members FAQ for further information.

Commercial use and LDC data

Has your company obtained an LDC database as a non-member?  For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases.  Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose.  LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. In the case of a small group of corpora such as American National Corpus (ANC) Second Release (LDC2005T35), Buckwalter Arabic Morphological Analyzer Version 2.0 (LDC2004L02), CELEX2 (LDC96L14) and all CSLU corpora, commercial licenses must be obtained separately from the owners of the data even if an organization is a for-profit member.

New publications

(1) GALE Arabic-English Parallel Aligned Treebank -- Newswire (LDC2013T10) was developed by LDC and contains 267,520 tokens of word aligned Arabic and English parallel text with treebank annotations. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program. Parallel aligned treebanks are treebanks annotated with morphological and syntactic structures aligned at the sentence level and the sub-sentence level. Such data sets are useful for natural language processing and related fields, including automatic word alignment system training and evaluation, transfer-rule extraction, word sense disambiguation, translation lexicon extraction and cultural heritage and cross-linguistic studies. With respect to machine translation system development, parallel aligned treebanks may improve system performance with enhanced syntactic parsers, better rules and knowledge about language pairs and reduced word error rate.

In this release, the source Arabic data was translated into English. Arabic and English treebank annotations were performed independently. The parallel texts were then word aligned. The material in this corpus corresponds to the Arabic treebanked data appearing in Arabic Treebank: Part 3 v 3.2 (LDC2010T08) (ATB) and to the English treebanked data in English Translation Treebank: An-Nahar Newswire (LDC2012T02).

The source data consists of Arabic newswire from the Lebanese publication An Nahar collected by LDC in 2002. All data is encoded as UTF-8. A count of files, words, tokens and segments is below.

  Language   Files   Words     Tokens    Segments
  Arabic     364     182,351   267,520   7,711

Note: Word count is based on the untokenized Arabic source and token count is based on the ATB-tokenized Arabic source.

The purpose of the GALE word alignment task was to find correspondences between words, phrases or groups of words in a set of parallel texts. Arabic-English word alignment annotation consisted of the following tasks:
Identifying different types of links: translated (correct or incorrect) and not translated (correct or incorrect)
Identifying sentence segments not suitable for annotation, e.g., blank segments, incorrectly-segmented segments, segments with foreign languages
Tagging unmatched words attached to other words or phrases
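Alignment links of this kind are commonly stored as pairs of source/target token indices; the 0-based "Pharaoh" style shown below is a widely used convention for illustration and is not necessarily the format used in this corpus, and the sentence pair itself is a made-up placeholder.

```python
# Hypothetical Arabic-English sentence pair with Pharaoh-style links "src-tgt".
source = ["ktb", "AlwId", "AlrsAlp"]  # placeholder Buckwalter-style tokens
target = ["the", "boy", "wrote", "the", "letter"]
links = "0-2 1-0 1-1 2-3 2-4"

# Parse each "i-j" pair into (source index, target index).
alignments = [tuple(map(int, pair.split("-"))) for pair in links.split()]

for s, t in alignments:
    print(f"{source[s]} <-> {target[t]}")
# One source token may link to several target tokens and vice versa;
# a token with no link at all falls into the "not translated" category.
```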
GALE Arabic-English Parallel Aligned Treebank -- Newswire is distributed via web download. 2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) MADCAT Phase 2 Training Set (LDC2013T09) contains all training data created by LDC to support Phase 2 of the DARPA MADCAT (Multilingual Automatic Document Classification Analysis and Translation) Program. The data in this release consists of handwritten Arabic documents, scanned at high resolution and annotated for the physical coordinates of each line and token. Digital transcripts and English translations of each document are also provided, with the various content and annotation layers integrated in a single MADCAT XML output.

The goal of the MADCAT program is to automatically convert foreign text images into English transcripts. MADCAT Phase 2 data was collected from Arabic source documents in three genres: newswire, weblog and newsgroup text. Arabic speaking scribes copied documents by hand, following specific instructions on writing style (fast, normal, careful), writing implement (pen, pencil) and paper (lined, unlined). Prior to assignment, source documents were processed to optimize their appearance for the handwriting task, which resulted in some original source documents being broken into multiple pages for handwriting. Each resulting handwritten page was assigned to up to five independent scribes, using different writing conditions. 

The handwritten, transcribed documents were checked for quality and completeness, then each page was scanned at a high resolution (600 dpi, greyscale) to create a digital version of the handwritten document. The scanned images were then annotated to indicate the physical coordinates of each line and token. Explicit reading order was also labeled, along with any errors produced by the scribes when copying the text. The annotation results in GEDI XML output files (gedi.xml), which include ground truth annotations and source transcripts.

The final step was to produce a unified data format that takes multiple data streams and generates a single MADCAT XML output file with all required information. The resulting madcat.xml file has these distinct components: (1) a text layer that consists of the source text, tokenization and sentence segmentation, (2) an image layer that consists of bounding boxes, (3) a scribe demographic layer that consists of scribe ID and partition (train/test) and (4) a document metadata layer.

This release includes 27,814 annotation files in both GEDI XML and MADCAT XML formats (gedi.xml and madcat.xml) along with their corresponding scanned image files in TIFF format.
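A consumer of such a file would typically read each of the four layers and join the image-layer bounding boxes to tokens in the text layer. The fragment below is purely schematic: the element and attribute names are invented to mirror the layers described above and are not the actual MADCAT XML schema.

```python
import xml.etree.ElementTree as ET

# Schematic stand-in for a madcat.xml file; names are illustrative only.
doc = """
<madcat>
  <metadata source="newswire" pages="1"/>
  <scribe id="scribe_042" partition="train"/>
  <text>
    <segment id="s1">
      <token id="t1">ktb</token>
    </segment>
  </text>
  <image>
    <zone token="t1" x="120" y="340" w="85" h="30"/>
  </image>
</madcat>
"""

root = ET.fromstring(doc)

# Scribe layer: which partition this page belongs to.
partition = root.find("scribe").get("partition")

# Image layer: map token IDs to their pixel bounding boxes.
boxes = {z.get("token"): tuple(int(z.get(k)) for k in ("x", "y", "w", "h"))
         for z in root.iter("zone")}

print(partition)    # train
print(boxes["t1"])  # (120, 340, 85, 30)
```

Linking the layers through shared token IDs like this is what lets one file carry ground truth for both handwriting recognition (image coordinates) and translation (text and segmentation).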

MADCAT Phase 2 Training Set is distributed on six DVD-ROMs. 2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.