Friday, April 20, 2012

LDC April 2012 Newsletter

 



LDC Timeline – Two Decades of Milestones
April 15 marks the “official” 20th anniversary of LDC’s founding. We’ll be featuring highlights from the last two decades in upcoming newsletters, on the web and elsewhere.  For a start, here’s a brief timeline of significant milestones.
  • 1992: The University of Pennsylvania is chosen as the host site for LDC in response to a call for proposals issued by DARPA; the mission of the new consortium is to operate as a specialized data publisher and archive guaranteeing widespread, long-term availability of language resources. DARPA provides seed money with the stipulation that LDC become self-sustaining within five years. Mark Liberman assumes duties as LDC’s Director with a staff that grows to four, including Jack Godfrey, the Consortium’s first Executive Director.
  • 1993: LDC’s catalog debuts. Early releases include benchmark data sets such as TIMIT, TIPSTER, CSR and Switchboard, shortly followed by the Penn Treebank. 
  • 1994: LDC and NIST (the National Institute of Standards and Technology) enter into a Cooperative R&D Agreement that provides the framework for the continued collaboration between the two organizations.
  • 1995: Collection and transcription of conversational telephone speech and broadcast programming commences. LDC begins its long and continued support for NIST common task evaluations by providing custom data sets for participants. Membership and data license fees prove sufficient to support LDC operations, satisfying the requirement that the Consortium be self-sustaining.
  • 1997: LDC announces LDC Online, a searchable index of newswire and speech data with associated tools to compute n-gram models, mutual information and other analyses.
  • 1998: LDC adds annotation to its task portfolio. Christopher Cieri joins LDC as Executive Director and develops the annotation operation.
  • 1999: Steven Bird joins LDC; the organization begins to develop tools and best practices for general use. The Annotation Graph Toolkit results from this effort.
  • 2000: LDC expands its support of common task evaluations from providing corpora to coordinating language resources across the program. Early examples include the DARPA TIDES, EARS and GALE programs.
  • 2001: The Arabic treebank project begins.
  • 2002: LDC moves to its current facilities at 3600 Market Street, Philadelphia with a full-time staff of approximately 40 persons.
  • 2004: LDC introduces the Standard and Subscription membership options, allowing members to choose whether to receive all or a subset of the data sets released in a membership year.
  • 2005: LDC makes task specifications and guidelines available through its projects web pages.
  • 2008: LDC introduces programs that provide discounts for continuing members and those who renew early in the year.
  • 2010: LDC inaugurates the Data Scholarship program for students with a demonstrable need for data.
  • 2012: LDC’s full-time staff of 50 and 196 part-time staff support ongoing projects and operations which include collecting, developing and archiving data, data annotation, tool development, sponsored-project support and multiple collaborations with various partners. The general catalog contains over 500 holdings in more than 50 languages. Over 85,000 copies of more than 1300 titles have been distributed to 3200 organizations in 70 countries. 

New Publications

(1) 2005 NIST/USF Evaluation Resources for the VACE Program - Broadcast News was developed by researchers at the Department of Computer Science and Engineering, University of South Florida (USF), Tampa, Florida and the Multimodal Information Group at the National Institute of Standards and Technology (NIST). It contains approximately 60 hours of English broadcast news video data collected by LDC in 1998 and annotated for the 2005 VACE (Video Analysis and Content Extraction) tasks. The tasks covered by the broadcast news domain were human face tracking (FDT), text string detection and tracking (TDT), which targets glyphs rendered within the video image, and word-level text string recognition (TDT_Word_Level), the videotext OCR task.

The VACE program was established to develop novel algorithms for automatic video content extraction, multi-modal fusion, and event understanding. During VACE Phases I and II, the program made significant progress in the automated detection and tracking of moving objects including faces, hands, people, vehicles and text in four primary video domains: broadcast news, meetings, street surveillance, and unmanned aerial vehicle motion imagery. Initial results were also obtained on automatic analysis of human activities and understanding of video sequences. 

Three performance evaluations were conducted under the auspices of the VACE program between 2004 and 2007. The 2005 evaluation was administered by USF in collaboration with NIST and guided by an advisory forum including the evaluation participants.

The broadcast news recordings were collected by LDC in 1998 from CNN Headline News (CNN-HDL) and ABC World News Tonight (ABC-WNT). CNN-HDL is a 24-hour/day cable-TV broadcast which presents top news stories continuously throughout the day. ABC-WNT is a daily 30-minute news broadcast that typically covers about a dozen different news items. Each day, the ABC-WNT broadcast and up to four 30-minute sections of CNN-HDL were recorded. The CNN segments were drawn from that portion of the daily schedule that happened to include closed captioning.

2005 NIST/USF Evaluation Resources for the VACE Program - Broadcast News is distributed on one hard drive. 2012 Subscription Members will automatically receive one copy of this data. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$6000.
*

(2) 2009 CoNLL Shared Task Part 1 contains the Catalan, Czech, German and Spanish trial corpora, training corpora, development and test data for the 2009 CoNLL (Conference on Computational Natural Language Learning) Shared Task Evaluation. The 2009 Shared Task developed syntactic dependency annotations, including semantic dependencies that model the roles of both verbal and nominal predicates.

The Conference on Computational Natural Language Learning (CoNLL) is accompanied every year by a shared task intended to promote natural language processing applications and evaluate them in a standard setting. In 2008, the shared task focused on English and employed a unified dependency-based formalism and merged the task of syntactic dependency parsing and the task of identifying semantic arguments and labeling them with semantic roles; that data has been released by LDC as 2008 CoNLL Shared Task Data (LDC2009T12). The 2009 task extended the 2008 task to several languages (English plus Catalan, Chinese, Czech, German, Japanese and Spanish). Among the new features were comparison of time and space complexity based on participants' input, and learning curve comparison for languages with large datasets.
The 2009 shared task was divided into two subtasks:

(1) parsing syntactic dependencies

(2) identification of arguments and assignment of semantic roles for each predicate
The materials in this release consist of excerpts from the following corpora:
  • Ancora (Spanish + Catalan): 500,000 words each of annotated news text developed by the University of Barcelona, Polytechnic University of Catalonia, the University of Alicante and the University of the Basque Country
  • Prague Dependency Treebank 2.0 (Czech): approximately 2 million words of annotated news, journal and magazine text developed by Charles University; also available through LDC, LDC2006T01
  • TIGER Treebank + SALSA Corpus (German): approximately 900,000 words of annotated news text and FrameNet annotation developed by the University of Potsdam, Saarland University and the University of Stuttgart
2009 CoNLL Shared Task Part 1 is distributed on one DVD. 2012 Subscription Members will automatically receive two copies of this data. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$200.  

*

(3) 2009 CoNLL Shared Task Part 2 contains the Chinese and English trial corpora, training corpora, development and test data for the 2009 CoNLL (Conference on Computational Natural Language Learning) Shared Task Evaluation. The 2009 Shared Task developed syntactic dependency annotations, including semantic dependencies that model the roles of both verbal and nominal predicates.

The materials in this release consist of excerpts from the following corpora:
  • Penn Treebank II (LDC95T7) (English): over one million words of annotated English newswire and other text developed by the University of Pennsylvania
  • PropBank (LDC2004T14) (English): semantic annotation of newswire text from Treebank-2 developed by the University of Pennsylvania
  • NomBank (LDC2008T23) (English): argument structure for instances of common nouns in Treebank-2 and Treebank-3 (LDC99T42) texts developed by New York University
  • Chinese Treebank 6.0 (LDC2007T36)(Chinese): 780,000 words (over 1.28 million characters) of annotated Chinese newswire, magazine and administrative texts and transcripts from various broadcast news programs developed by the University of Pennsylvania and the University of Colorado
  • Chinese Proposition Bank 2.0 (LDC2008T07) (Chinese): predicate-argument annotation on 500,000 words from Chinese Treebank 6.0 developed by the University of Pennsylvania and the University of Colorado
2009 CoNLL Shared Task Part 2 is distributed on one CD. 2012 Subscription Members will automatically receive two copies of this data. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$850.

*
(4) USC-SFI MALACH Interviews and Transcripts English was developed by The University of Southern California's Shoah Foundation Institute (USC-SFI), the University of Maryland, IBM and Johns Hopkins University as part of the MALACH (Multilingual Access to Large Spoken ArCHives) Project. It contains approximately 375 hours of interviews from 784 interviewees along with transcripts and other documentation.

Inspired by his experience making Schindler's List, Steven Spielberg established the Survivors of the Shoah Visual History Foundation in 1994 to gather video testimonies from survivors and other witnesses of the Holocaust. While most of those who gave testimony were Jewish survivors, the Foundation also interviewed homosexual survivors, Jehovah's Witness survivors, liberators and liberation witnesses, political prisoners, rescuers and aid providers, Roma and Sinti (Gypsy) survivors, survivors of eugenics policies, and war crimes trials participants.  In 2006, the Foundation became part of the Dana and David Dornsife College of Letters, Arts and Sciences at the University of Southern California in Los Angeles and was renamed as the USC Shoah Foundation Institute for Visual History and Education. 

The goal of the MALACH project was to develop methods for improved access to large multilingual spoken archives; the focus was advancing the state of the art of automatic speech recognition (ASR) and information retrieval. The characteristics of the USC-SFI collection -- unconstrained, natural speech filled with disfluencies, heavy accents, age-related co-articulations, un-cued speaker and language switching and emotional speech -- were considered well-suited for that task. The work centered on five languages: English, Czech, Russian, Polish and Slovak. USC-SFI MALACH Interviews and Transcripts English was developed for the English speech recognition experiments.

The speech data in this release was collected beginning in 1994 under a wide variety of conditions ranging from quiet to noisy (e.g., airplane over-flights, wind noise, background conversations and highway noise). Approximately 25,000 of all USC-SFI collected interviews are in English and average approximately 2.5 hours each. The 784 interviews included in this release are each a 30-minute section of the corresponding larger interview. The interviews include accented speech over a wide range (e.g., Hungarian, Italian, Yiddish, German and Polish).

This release includes transcripts of the first 15 minutes of each interview. The transcripts were created using Transcriber 1.5.1 and later modified.

USC-SFI MALACH Interviews and Transcripts English is distributed on five DVDs. 2012 Subscription Members will automatically receive two copies of this data provided that they have submitted a completed copy of the User License Agreement for USC-SFI MALACH Interviews and Transcripts English (LDC2012S05). 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$2000.

Tuesday, March 20, 2012

LDC March 2012 Newsletter

2012 LDC Survey Responses and Benefit Winner
Thanks to all who participated in the 2012 LDC Survey. Your responses were thoughtful and informative. We’re now analyzing the results; stay tuned for an announcement on the survey findings.
In the meantime, please join us in congratulating Todor Ganchev from the University of Patras, Wire Communications Laboratory (WCL) for winning the survey participation benefit! As a reminder, one $500 benefit was awarded to a blindly-selected participant whose response was received by February 7, 2012.
LDC at ICASSP 2012
LDC will be traveling across the globe to exhibit at its first IEEE-hosted event. The 37th International Conference on Acoustics, Speech, and Signal Processing (ICASSP) will be held at the Kyoto International Conference Center in Kyoto, Japan, on March 25 - 30, 2012.
The ICASSP meeting is the world’s largest and most comprehensive technical conference focused on signal processing and its applications, and LDC is looking forward to interacting with members of this community. Please look for LDC’s exhibition at Booth #14 in the Annex Hall. We hope to see you there!
New Publications
(1) English Translation Treebank: An Nahar Newswire was developed by LDC and consists of 599 distinct newswire stories from the Lebanese publication An Nahar translated from Arabic to English and annotated for part-of-speech and syntactic structure.
This corpus is part of an ongoing effort at LDC to produce parallel Arabic and English treebanks. The guidelines followed for both part-of-speech and syntactic annotation are Penn Treebank II style, with changes in the tokenization of hyphenated words, part-of-speech and tree changes necessitated by those tokenization changes and revisions to the syntactic annotation to comply with the updated annotation guidelines (including the "Treebank-PropBank merge" or "Treebank IIa" and "treebank c" changes). The original Penn Treebank II guidelines, addenda describing changes to the guidelines and the tokenization specifications can be found on LDC's website.
The data consists of 461,489 tokens in 599 individual files. The news stories in this release were published in An Nahar in 2002.
The English source files (translated from the Arabic) were automatically tokenized, part-of-speech tagged and parsed; the tokens, tags and parses were manually corrected. The quality control process consisted of a series of specific searches for over 100 types of potential inconsistency and parse or annotation error. Any errors found in those searches were manually corrected.
Annotations are in the following two formats:
  • Penn Style Trees
    • Bracketed tree files following the basic form (NODE (TAG token)). Each sentence is surrounded by a pair of empty parentheses.
  • AG xml
    • TreeEditor .xml stand-off annotation files. These files contain the POS and Treebank annotation and reference the source files by character offset. DTD files for the AG xml files were moved from their original location indicated in the readme to be more consistent with LDC publications.
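As a rough illustration of the Penn Style Trees format described above, the following Python sketch reads one bracketed sentence, including the surrounding pair of empty parentheses, and lists its (TAG, token) pairs. The sample sentence and the function names parse_tree and tagged_tokens are hypothetical and are not taken from the corpus or from any LDC tool; the sketch also assumes that literal parentheses in the text are escaped (for example as -LRB- and -RRB-), as is usual in Penn Treebank style files.

```python
# Hypothetical sentence in the (NODE (TAG token)) form; not taken from the corpus.
SAMPLE = "( (S (NP (DT The) (NN newspaper)) (VP (VBD reported) (NP (DT the) (NN story))) (. .)) )"


def parse_tree(text):
    """Parse one bracketed sentence into nested lists: [label, child, ...]."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening parenthesis"
        pos += 1
        node = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read_node())  # nested constituent
            else:
                node.append(tokens[pos])  # label, tag or token
                pos += 1
        pos += 1  # consume the closing parenthesis
        return node

    return read_node()


def tagged_tokens(node):
    """Yield (tag, token) pairs from the leaves, left to right."""
    if len(node) == 2 and all(isinstance(part, str) for part in node):
        yield node[0], node[1]
        return
    for child in node:
        if isinstance(child, list):
            yield from tagged_tokens(child)


pairs = list(tagged_tokens(parse_tree(SAMPLE)))
```

The outer empty parentheses parse to an unlabeled wrapper node, so the same traversal handles them without special casing.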
English Translation Treebank: An Nahar Newswire is distributed via web download. 2012 Subscription Members will automatically receive two copies of this corpus on disc. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$4500.
*
(2) Malto Speech and Transcripts was developed by Masato Kobayashi, Associate Professor in Linguistics at the University of Tokyo (Japan), and Bablu Tirkey, research scholar at the Tribal and Regional Languages Department, Ranchi University (India). It contains approximately 8 hours of Malto speech data collected between 2005 and 2009 from 27 speakers (22 males, 5 females). Also included are accompanying transcripts, English translations and glosses for 6 hours of the collection. Speakers were asked to talk about themselves, their lives, rituals and folklore; elicitation interviews were then conducted. The goal of the work was to present the current state and dialectal variation of Malto.
Malto is a Dravidian language spoken in northeastern India (principally the states of Bihar, Jharkhand and West Bengal) and Bangladesh by people called the Pahariyas. Indian census data places the number of Malto speakers between 100,000 and 200,000. Most Malto speakers live in the three northeastern districts of Jharkhand, i.e., Sahebganj, Godda and Pakur; the fieldwork that resulted in this corpus was conducted in those districts. Of the Pahariyas in that area, three subtribes, the Sawriya Pahariyas, the Mal Pahariyas and the Kumarbhag Pahariyas, primarily speak Malto.
The transcribed data accounts for 6 hours of the collection and contains 21 speakers (17 male, 4 female). The untranscribed data accounts for 2 hours of the collection and contains 10 speakers (9 male, 1 female). Four of the male speakers are present in both groups.
All audio is presented in .wav format. Each audio file name includes a subject number, village name, speaker name and the topic discussed. The transcripts and glossary are UTF-8 text files. Because of ambiguities that occur when writing Malto in Devanagari script, the transcripts were developed using Roman script with symbols adapted from the International Phonetic Alphabet (IPA) but are not considered phonetic transcripts.
Malto Speech and Transcripts is distributed on 1 DVD. 2012 Subscription Members will automatically receive two copies of this corpus. 2012 Standard Members may request a copy as part of their 16 free membership corpora. The first 100 copies distributed to non-member organizations are available at no charge. Shipping and handling fees apply.

Wednesday, February 15, 2012

LDC February 2012 Newsletter

Spring 2012 LDC Data Scholarship Recipients!

Membership Fee Savings and Publications Pipeline for MY2012

New publications:

LDC2012S03 - Digital Archive of Southern Speech (DASS)

LDC2012T01 - ModeS TimeBank 1.0



Spring 2012 LDC Data Scholarship Recipients!

LDC is pleased to announce the student recipients of the Spring 2012 LDC Data Scholarship program! This program provides university students with access to LDC data at no cost. Students were asked to complete an application which consisted of a proposal describing their intended use of the data, as well as a letter of support from their thesis adviser. We received many solid applications and have chosen six proposals to support. The following students will receive no-cost copies of LDC data:

Zainab Ali Khalaf – University of Science, Malaysia (Malaysia), graduate student, Computer Science. Zainab has been awarded a copy of 1996 English Broadcast News Transcripts (HUB4) (LDC97T22) for her work in spoken document retrieval.

Daniel Jettka – Trinity College Dublin (Ireland), graduate student, Centre for Language & Communication Studies. Daniel has been awarded copies of Penn Discourse Treebank Version 2.0 (LDC2008T05) and RST Discourse Treebank (LDC2002T07) for his work in anaphora resolution.

Olga Nickolaevna Ladoshko – National Technical University of Ukraine “KPI” (Ukraine), graduate student, Acoustics and Acoustoelectronics. Olga has been awarded copies of NTIMT (LDC93S2) and STC-TIMIT 1.0 (LDC2008S03) for her research in automatic speech recognition for Ukrainian.

Ming Yang, Xiaoxiao Ma, and Jiajia Huang – Wuhan University (China), graduate students, Computer Science. Ming, Xiaoxiao, and Jiajia have been awarded copies of ACE Time Normalization (TERN) 2004 English Training Data v 1.0 (LDC2005T07) and GALE Phase 1 Chinese Broadcast News Parallel Text – Part 1 (LDC2007T23) for their work in summarization and data mining.

Daria Vazhenina – University of Aizu (Japan), graduate student, Human Interface Lab. Daria has been awarded a copy of 2005 Spring NIST Rich Transcription (RT-05S) Evaluation Set (LDC2011S06) for her work in speaker diarization.

Tanina Zappone – University of Rome “La Sapienza” (Italy), graduate student, Oriental Studies. Tanina has been awarded a copy of Chinese Treebank 7.0 (LDC2010T07) for her work in China’s political communications.

Please join us in congratulating our student recipients! The next LDC Data Scholarship program is scheduled for the Fall 2012 semester.

Membership Fee Savings and Publications Pipeline for MY2012

Time is quickly running out to save on membership fees for MY2012! Any organization which joins or renews membership for 2012 through Thursday, March 1, 2012, is entitled to a 5% discount on membership fees. Organizations which held membership for MY2011 can receive a 10% discount on fees provided they renew prior to March 1, 2012.

Many publications for MY2012 are still in development. The planned publications for the upcoming months include:

ARRAU (Anaphor Resolution and Underspecification) ~ data annotated for anaphoric relations, with information about agreement and explicit representation of multiple antecedents for ambiguous anaphoric expressions and discourse antecedents for expressions which refer to abstract entities such as events, actions and plans. The corpus contains texts from various genres: task-oriented dialogues from the TRAINS project, narratives from the English Pear Stories, and newspaper articles from the Wall Street Journal portion of the Penn Treebank.

MALACH English ~ over 300 hours of English audio recordings of interviews conducted under the auspices of the USC Shoah Foundation Institute for Visual History and Education and associated transcripts produced as part of the Multilingual Access to Large Spoken ArCHives (MALACH) project.

Malto Speech and Transcripts ~ speech files of Malto narratives recorded by Masato Kobayashi and Bablu Tirkey with associated transcripts. Malto is a Dravidian language spoken in northeastern India and Bangladesh.

NIST/USF Evaluation Resources for the VACE Program – Broadcast News ~ English broadcast news video annotated for the VACE (Video Analysis and Content Extraction) 2005 face, text and text word detection and tracking tasks.

OntoNotes 5.0 ~ multiple genres of English, Chinese, and Arabic text annotated for syntax, predicate argument structure and shallow semantics.

2012 Subscription Members are automatically sent all MY2012 data as it is released. 2012 Standard Members are entitled to request 16 corpora for free from MY2012. Non-members may license most data for research use.

New publications

(1) Digital Archive of Southern Speech (DASS) was developed by the University of Georgia. It is a subset of the Linguistic Atlas of the Gulf States (LAGS), which is in turn part of the Linguistic Atlas Project (LAP). DASS contains approximately 370 hours of English speech data from 30 female speakers and 34 male speakers in .wav format and in .mp3 format, along with associated metadata about the speakers and the recordings and maps in .jpeg format relating to the recording locations.

LAP consists of a set of survey research projects about the words and pronunciation of everyday American English, the largest project of its kind in the United States. Interviews with thousands of native speakers across the country have been carried out since 1929. LAGS surveyed the everyday speech of Georgia, Tennessee, Florida, Alabama, Mississippi, Arkansas, Louisiana, and Texas in a series of 914 audio-taped interviews conducted from 1968-1983. Interviews average approximately six hours in length; the systematic LAGS tape archive amounts to 5500 hours of sound recordings. DASS is a collection of 64 interviews from LAGS selected to cover a range of speech across the region and to represent multiple education levels and ethnic backgrounds.

Also included in this release is a version of the LICHEN software developed at the University of Oulu, Finland. LICHEN allows users to browse and search through the audio data in a more advanced fashion using a graphical interface.

Digital Archive of Southern Speech (DASS) is distributed on one hard disc drive. 2012 Subscription Not-for-Profit/US Government Members will automatically receive one copy of this data. 2012 For-Profit Members will receive a copy provided that they have submitted a completed copy of the User License Agreement for Digital Archive of Southern Speech (LDC2012S03). 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$250.

*

(2) ModeS TimeBank 1.0 was developed by researchers at Technical University of Madrid and Barcelona Media and is a corpus of Modern Spanish (17th and 18th centuries) annotated with temporal and event information according to TimeML mark-up and annotated with spatial information following the SpatialML scheme.

TimeML (Pustejovsky et al., 2005) is a specification language for annotating eventualities and time expressions in natural language as well as the temporal relations among them, thus facilitating the task of extraction, representation and exchange of temporal information. SpatialML (Mani et al., 2008) is a specification language for annotating and normalizing spatial expressions by means of geographic coordinates.

ModeS TimeBank 1.0 contains 102 documents reporting a sea-crossing cruise by a ship called La Princesa, which took place from December 1768 to April 1769. There exist copious logbooks from that period that not only provide information about shipping routes, but also contain valuable data concerning information flows, commercial agents and social networks.

All text is encoded in UTF-8. The data in ModeS TimeBank 1.0 has been tokenized, POS-tagged, and annotated with space, time and event information according to the TimeML and SpatialML specification schemes.

ModeS TimeBank 1.0 is distributed via web download. 2012 Subscription Members will automatically receive two copies of this corpus on disc. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may request this data by completing a copy of the LDC User Agreement for Non-Members. The agreement can be faxed to +1 215 573 2175 or scanned and emailed to LDC. This data is available at no charge.

Friday, January 20, 2012

LDC January 2012 Newsletter

LDC Celebrates its 20th Anniversary!
2012 marks LDC’s 20th Anniversary year – officially on April 15 – but this is cause for a yearlong celebration! From our founding in 1992 as a data repository and language resource distribution center, our online catalog has grown to include over 500 databases in 60 languages that have been licensed by over 3000 organizations from 80 different nations. This data has been made available through donations, funded projects at LDC or elsewhere, community initiatives, and from LDC resources, an indication of the collective strength of this consortium. LDC has evolved from an organization that shares language resources to one that also is at the forefront of language technology research that includes the development of new data resources, software tools, and standards and best practices.
As we celebrate throughout the year, look for announcements and special features in our newsletter and on our Facebook page.
2012 LDC Survey – Be on the Lookout!
It’s been four years since our last survey of LDC members and data licensees and we would like to again ask you to share your views on LDC and its language resources as well as your thoughts about data distribution in general and the impact of social media on language-related research and technology development. These topics are particularly timely as LDC enters its 20th anniversary year.
The 2012 LDC Survey will be sent to every person and organization that licensed LDC data and/or joined LDC as a Member during the period from 2009 through 2011. Those who complete the survey on or before February 7, 2012 will make their organization eligible for a $500 benefit to be applied to any corpus or membership purchase in 2012. LDC will conduct a blind drawing and one lucky winner will be chosen from the pool of respondents.
Many thanks for your continued support and for your participation in the 2012 Survey!
Membership Discounts for MY 2012 Still Available
If you are considering joining for Membership Year 2012 (MY2012), there is still time to save on membership fees. Any organization which joins or renews membership for 2012 through Thursday, March 1, 2012, is entitled to a 5% discount on membership fees. Organizations that held membership for MY2011 can receive a 10% discount on fees provided they renew prior to March 1, 2012. For further information on pricing, please consult our Announcements page or contact LDC.
New Publications
(1) 2006 NIST Speaker Recognition Evaluation Test Set Part 2 was developed by LDC and the National Institute of Standards and Technology (NIST). It contains 568 hours of conversational telephone and microphone speech in English, Arabic, Bengali, Chinese, Farsi, Hindi, Korean, Russian, Spanish, Thai and Urdu and associated English transcripts used as test data in the NIST-sponsored 2006 Speaker Recognition Evaluation (SRE).
The task of the 2006 SRE evaluation was speaker detection, that is, to determine whether a specified speaker is speaking during a given segment of conversational telephone speech. The task was divided into 15 distinct and separate tests involving one of five training conditions and one of four test conditions. Further information about the test conditions and additional documentation is available at the NIST web site for the 2006 SRE and within the 2006 SRE Evaluation Plan.
The speech data in this release was collected by LDC as part of the Mixer project, in particular Mixer Phases 1, 2 and 3. The Mixer project supports the development of robust speaker recognition technology by providing carefully collected and audited speech from a large pool of speakers recorded simultaneously across numerous microphones and in different communicative situations and/or in multiple languages. The data is mostly English speech, but includes some speech in Arabic, Bengali, Chinese, Farsi, Hindi, Korean, Russian, Spanish, Thai and Urdu.
The telephone speech segments are multi-channel data collected simultaneously from a number of auxiliary microphones. The files are organized into four types: two-channel excerpts of approximately 10 seconds, two-channel conversations of approximately 5 minutes, summed-channel conversations also of approximately 5 minutes and a two-channel conversation with the usual telephone speech replaced by auxiliary microphone data in the putative target speaker channel. The auxiliary microphone conversations are also of approximately five minutes in length. English language transcripts in .ctm format were produced using an automatic speech recognition (ASR) system.
2006 NIST Speaker Recognition Evaluation Test Set Part 2 is distributed on seven DVDs. 2012 Subscription Members will automatically receive two copies of this corpus. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$2000.
*
(2) TORGO Database of Dysarthric Articulation was developed by the University of Toronto's departments of Computer Science and Speech Language Pathology in collaboration with the Holland-Bloorview Kids Rehabilitation Hospital in Toronto, Canada. It contains approximately 23 hours of English speech data, accompanying transcripts and documentation from 8 speakers (5 males, 3 females) with cerebral palsy (CP) or amyotrophic lateral sclerosis (ALS) and from 7 speakers (4 males, 3 females) from a non-dysarthric control group.
CP and ALS are examples of dysarthria, which is caused by disruptions in the neuro-motor interface that distort motor commands to the vocal articulators, resulting in atypical and relatively unintelligible speech in most cases. The TORGO database is primarily a resource for developing advanced automatic speech recognition (ASR) models suited to the needs of people with dysarthria, but it is also applicable to non-dysarthric speech. The inability of modern ASR to effectively understand dysarthric speech is a problem since the more general physical disabilities often associated with the condition can make other forms of computer input, such as computer keyboards or touch screens, difficult to use.
The data consists of aligned acoustics and measured 3D articulatory features from the speakers carried out using the 3D AG500 electro-magnetic articulograph (EMA) system (Carstens Medizinelektronik GmbH, Lenglern, Germany) with fully-automated calibration. This system allows for 3D recordings of articulatory movements inside and outside the vocal tract, thus providing a detailed window on the nature and direction of speech-related activity.
All subjects read text consisting of non-words, short words and restricted sentences from a 19-inch LCD screen. The restricted sentences included 162 sentences from the sentence intelligibility section of Assessment of intelligibility of dysarthric speech (Yorkston & Beukelman, 1981) and 460 sentences derived from the TIMIT database. The unrestricted sentences were elicited by asking participants to spontaneously describe 30 images in interesting situations taken randomly from Webber Photo Cards - Story Starters (Webber, 2005), designed to prompt students to tell or write a story.
Data is organized by speaker and by the session in which each speaker recorded data. Each speaker's directory contains 'Session' directories which encapsulate data recorded in the respective visit and occasionally, a 'Notes' directory which can include Frenchay assessments (test for the measurement, description and diagnosis of dysarthria), notes about sessions (e.g., sensor errors), and other relevant notes.
TORGO Database of Dysarthric Articulation is distributed on 4 DVDs. 2012 Subscription Members will automatically receive two copies of this corpus. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$1200.

Thursday, December 15, 2011

LDC December 2011 Newsletter

Spring 2012 LDC Data Scholarship Program - deadline fast approaching!
The deadline for the Spring 2012 LDC Data Scholarship Program is less than a month away! Applications are being accepted through January 15, 2012. The LDC Data Scholarship program provides university students with access to LDC data at no cost. This program is open to students pursuing both undergraduate and graduate studies in an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay.

Students will need to complete an application which consists of a data use proposal and letter of support from their adviser. For further information on application materials and program rules, please visit the LDC Data Scholarship page.
Students can email their applications to the LDC Data Scholarship program. Decisions will be sent by email from the same address.
LDC Exhibiting at LSA 2012 Annual Meeting
LDC looks forward to mingling with linguists and language specialists when we exhibit at the 86th Annual Meeting of the Linguistic Society of America (LSA). The main conference will be held January 5-8, 2012 at the Portland, OR Hilton and Executive Tower, and the exhibit hall will be open January 6-8 (limited hours on Sunday the 8th). Please stop by our display for news on what 2012 will hold for LDC and to receive some of our conference giveaways.
LSA 2012 will feature plenary talks on the following topics:
  • Patrice Speeter Beddor (University of Michigan): "The Dynamics of Speech Perception: Constancy, Variation, and Change"
  • Dan Jurafsky (Stanford University): "Computing Meaning: Learning and Extracting Meaning from Text"
  • Ted Supalla (University of Rochester): "Rethinking the Emergence of Grammatical Structure in Signed Languages: New Evidence from Variation and Historical Change in American Sign Language"
For further information visit the LSA Annual Meeting website. If you would like to learn more about LDC’s conference preparations, please ‘like’ our Facebook page.
We hope to see you there!

LDC Hosts Satellite Workshop at LSA 2012
LDC will co-host a satellite workshop entitled 'Sociolinguistic Archival Preparation' on January 4-5, 2012 in conjunction with the LSA 2012 Annual Meeting. This two-day workshop will focus on techniques to permit the archiving of data, for cross-community sharing of corpora as well as for subsequent 'panel' studies. Recent discussions within the field have concluded that present protocols need to be expanded to permit adequate archiving. Specifically:
  • Institutional Review Board (IRB) paperwork needs to be adapted to provide protection for interviewees while permitting their speech data to be more generally sharable (and therefore archiveable);
  • Demographic, situational, and attitudinal protocols are needed to provide a unified resource serving multiple research communities as well as the contributing researchers.
The sooner IRB forms and research protocols are aligned with each other, the sooner sharable, archiveable corpora will become available, permitting intergroup comparison and interdisciplinary collaboration.
LDC's Executive Director, Christopher Cieri, and LDC consultant and University of Arizona scholar, Malcah Yaeger-Dror, are the workshop organizers. This workshop is funded in part by the National Science Foundation (BCS#1144480). Further information about the workshop is available on the LSA Annual Meeting website.
LDC to Close for Winter Break
LDC will be closed from Monday, December 26, 2011 through Monday, January 2, 2012 in accordance with the University of Pennsylvania Winter Break Policy. Our offices will reopen on Tuesday, January 3, 2012. Requests received for membership renewals and corpora during the Winter Break will be processed at that time.
Best wishes for a happy and safe holiday season!
New Publications
(1) 2006 NIST Speaker Recognition Evaluation Test Set Part 1 was developed by LDC and National Institute of Standards and Technology (NIST). It contains 437 hours of conversational telephone and microphone speech in English, Arabic, Bengali, Chinese, Farsi, Hindi, Korean, Russian, Spanish, Thai and Urdu and associated English transcripts used as test data in the NIST-sponsored 2006 Speaker Recognition Evaluation (SRE).
The ongoing series of yearly SRE evaluations conducted by NIST is intended to be of interest to researchers working on the general problem of text-independent speaker recognition. The task of the 2006 SRE evaluation was speaker detection, that is, to determine whether a specified speaker is speaking during a given segment of conversational telephone speech. The task was divided into 15 distinct and separate tests involving one of five training conditions and one of four test conditions. Further information about the test conditions and additional documentation is available at the NIST web site for the 2006 SRE and within the 2006 SRE Evaluation Plan.
The speech data in this release was collected by LDC as part of the Mixer project, in particular Mixer Phases 1, 2 and 3. The Mixer project supports the development of robust speaker recognition technology by providing carefully collected and audited speech from a large pool of speakers recorded simultaneously across numerous microphones and in different communicative situations and/or in multiple languages. The data is mostly English speech, but includes some speech in Arabic, Bengali, Chinese, Farsi, Hindi, Korean, Russian, Spanish, Thai and Urdu.
The telephone speech segments are multi-channel data collected simultaneously from a number of auxiliary microphones. The files are organized into four types: two-channel excerpts of approximately 10 seconds, two-channel conversations of approximately 5 minutes, summed-channel conversations also of approximately 5 minutes and a two-channel conversation with the usual telephone speech replaced by auxiliary microphone data in the putative target speaker channel. The auxiliary microphone conversations are also of approximately five minutes in length.
English language transcripts in .ctm format were produced using an automatic speech recognition (ASR) system.
2006 NIST Speaker Recognition Evaluation Test Set Part 1 is distributed on five DVD-ROMs. 2011 Subscription Members will automatically receive two copies of this corpus. 2011 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$2000.
*
(2) 2008 NIST Speaker Recognition Evaluation Supplemental Set was developed by LDC and National Institute of Standards and Technology (NIST) and contains additional data distributed after the main 2008 Speaker Recognition Evaluation (SRE). Specifically, the corpus consists of 770 hours of English microphone speech along with transcripts and other materials used as supplemental data in the 2008 NIST Speaker Recognition Evaluation (SRE) and in a follow-up evaluation to SRE08.
The 2008 evaluation was distinguished from prior evaluations by including not only conversational telephone speech data but also conversational speech data of comparable duration recorded over a microphone channel involving an interview scenario. The follow-up evaluation focused on speaker detection in the context of conversational interview type speech and was designed to measure the performance of SRE08 systems in previously unexposed test segment channel conditions.
The speech data in this release was collected in 2007 by LDC at its Human Subjects Data Collection Laboratories in Philadelphia and by the International Computer Science Institute (ICSI) at the University of California, Berkeley. This collection was part of the Mixer 5 project, which was designed to support the development of robust speaker recognition technology by providing carefully collected and audited speech from a large pool of speakers recorded simultaneously across numerous microphones and in different communicative situations and/or in multiple languages. Mixer participants were native English and bilingual English speakers. The microphone speech in this corpus is in English and consists of interview excerpts of approximately 3 minutes and approximately 30 minutes.
This supplemental data is split into four different parts which provide:
  • new training data distributed to 2008 SRE participants
  • additional data distributed to participants in the 2008 SRE follow-up evaluation
  • interviewer channel files for the 2008 SRE main test (released after the evaluations)
  • supplemental training data (released after the evaluations)
English language transcripts in .cfm format were produced using an automatic speech recognition (ASR) system and are included for some, but not all, speech data.
2008 NIST Speaker Recognition Evaluation Supplemental Set is distributed on five DVD-ROMs. 2011 Subscription Members will automatically receive two copies of this corpus. 2011 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$2000.

Wednesday, November 16, 2011

LDC November 2011 Newsletter


Spring 2012 LDC Data Scholarship Program

Applications are now being accepted through January 15, 2012 for the Spring 2012 LDC Data Scholarship program! The LDC Data Scholarship program provides university students with access to LDC data at no cost. This program is open to students pursuing both undergraduate and graduate studies in an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay. The selection process is highly competitive.

The application consists of two parts:

(1) Data Use Proposal. Applicants must submit a proposal describing their intended use of the data. The proposal must contain the applicant's name, university and field of study. The proposal should state which data the student plans to use and contain a description of their research project.

Applicants should consult the LDC Corpus Catalog for a complete list of data distributed by LDC. Due to certain restrictions, a handful of LDC corpora are available only to members of the Consortium. Applicants are advised to select a maximum of one to two data sets; students may apply for additional data sets during the following cycle once they have completed processing of the initial data sets and have published or presented work in some juried venue.

(2) Letter of Support. Applicants must submit one letter of support from their thesis adviser or department chair. The letter must confirm that the department or university lacks the funding to pay the full Non-member Fee for the data and verify the student's need for data.

For further information on application materials and program rules, please visit the LDC Data Scholarship page. Students can email their applications to the LDC Data Scholarship program. Decisions will be sent by email from the same address.
Invitation to Join for Membership Year (MY) 2012
Membership Year (MY) 2012, our 20th Anniversary Year, is open for joining! We would like to invite all current and previous members of LDC to renew their membership as well as welcome new organizations to join the consortium. For MY2012, LDC is pleased to maintain membership fees at last year’s rates – membership fees will not increase. Additionally, LDC will extend discounts on membership fees to members who keep their membership current and who join early in the year.

The details of our early renewal discounts for MY2012 are as follows:
· Organizations who joined for MY2011 will receive a 5% discount when renewing. This discount will apply throughout 2012, regardless of time of renewal. MY2011 members renewing before March 1, 2012 will receive an additional 5% discount, for a total 10% discount off the membership fee.
· New members as well as organizations who did not join for MY2011, but who held membership in any of the previous MYs (1993-2010), will also be eligible for a 5% discount provided that they join/renew before March 1, 2012.
The following table provides exact pricing information.

                                 MY2012 Fee    MY2012 Fee             MY2012 Fee
                                               with 5% Discount*      with 10% Discount**
Not-for-Profit / US Government
  Standard                       US$2400       US$2280                US$2160
  Subscription                   US$3850       US$3658                US$3465
For-Profit
  Standard                       US$24000      US$22800               US$21600
  Subscription                   US$27500      US$26125               US$24750
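The discounted fees in the pricing table can be checked with a few lines of arithmetic. The following is a minimal sketch in Python, not an official LDC calculator: the function name `discounted_fee` and the half-up rounding to whole dollars are assumptions, chosen because they reproduce the published figures (for example, 5% off US$3850 is US$3657.50, listed as US$3658).

```python
from decimal import Decimal, ROUND_HALF_UP

# Base MY2012 fees in US dollars, taken from the pricing table.
BASE_FEES = {
    ("Not-for-Profit / US Government", "Standard"): 2400,
    ("Not-for-Profit / US Government", "Subscription"): 3850,
    ("For-Profit", "Standard"): 24000,
    ("For-Profit", "Subscription"): 27500,
}


def discounted_fee(base_fee: int, discount_percent: int) -> int:
    """Return the fee after a percentage discount, in whole dollars.

    Rounding half-up is an assumption that matches the published table.
    """
    fee = Decimal(base_fee) * Decimal(100 - discount_percent) / Decimal(100)
    return int(fee.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


if __name__ == "__main__":
    for (category, level), base in BASE_FEES.items():
        five = discounted_fee(base, 5)
        ten = discounted_fee(base, 10)
        print(f"{category} / {level}: US${base}, US${five}, US${ten}")
```

A MY2011 member renewing before March 1, 2012 would use the 10% figure; a returning or new member joining before that date would use the 5% figure.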
Publications for MY2012 are still being planned; here are the working titles of data sets we intend to provide:
· ARRAU 1.2 (Anaphor Resolution and Underspecification)
· TORGO Dysarthric Speech
· Arabic Treebank BN (broadcast news)
· GALE data – all phases and tasks
· Digital Archive of Southern Speech
· Chinese Dependency Treebank
In addition to receiving new publications, current year members of the LDC also enjoy the benefit of licensing older data at reduced costs; current year for-profit members may use most data for commercial applications.

This past year, LDC members who joined early or kept their membership current saved almost US$70,000 on membership fees. Be sure to keep an eye on your mail as all previous and current LDC members will be sent an invitation to join letter and renewal invoice for MY2012. Renew early for MY2012 to save today!
Why Become an LDC Member?
LDC is offering early renewal discounts on membership fees for Membership Year 2012, making now a good time to consider joining or renewing membership. LDC membership has the following advantages:


· LDC membership provides cost-effective access to an extensive and growing catalog that spans 20 years and includes over 500 multilingual speech, text, and video resources. Even if your organization only needs a few datasets from a given membership year, membership is often the most economical way to obtain current corpora. Additionally, the generous discounts that member organizations receive on older corpora reduce the cost of acquiring such datasets.

· All members enjoy unlimited use of LDC data within their organizations. For universities, there is no difference in cost between a departmental membership and one that is university-wide. Departments can therefore combine resources and establish one LDC membership for use by the entire university community. Likewise, for-profit members with multiple branches can maintain one membership for use by their entire organization.
For-profit organizations are reminded that an LDC membership is a pre-requisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations, including commercial restrictions, on the use of certain corpora. In the case of a small group of corpora, commercial licenses must be obtained separately from the owners of the data.
LDC to Close for Thanksgiving Break
LDC would like to inform our customers that we will be closed on Thursday, November 24, 2011 and Friday, November 25, 2011 in observance of the US Thanksgiving Holiday. Our offices will reopen on Monday, November 28, 2011.
New Publications
(1) 2006 NIST Speaker Recognition Evaluation Training Set was developed by LDC and NIST (National Institute of Standards and Technology). It contains 595 hours of conversational telephone speech in English, Arabic, Bengali, Chinese, Hindi, Korean, Russian, Thai and Urdu and associated English transcripts used as training data in the NIST-sponsored 2006 Speaker Recognition Evaluation (SRE). The ongoing series of yearly SRE evaluations conducted by NIST is intended to be of interest to researchers working on the general problem of text-independent speaker recognition.
The task of the 2006 SRE evaluation was speaker detection, that is, to determine whether a specified speaker is speaking during a given segment of conversational telephone speech. The task was divided into 15 distinct and separate tests involving one of five training conditions and one of four test conditions. Further information about the test conditions and additional documentation is available at the NIST web site for the 2006 SRE and within the 2006 SRE Evaluation Plan.
The speech data in this release was collected by LDC as part of the Mixer project, in particular Mixer Phases 1, 2 and 3. The Mixer project supports the development of robust speaker recognition technology by providing carefully collected and audited speech from a large pool of speakers recorded simultaneously across numerous microphones and in different communicative situations and/or in multiple languages. The data is mostly English speech, but includes some speech in the above languages.
The telephone speech segments are multi-channel data collected simultaneously from a number of auxiliary microphones. The files are organized into three types: two-channel excerpts of approximately 10 seconds, two-channel conversations of approximately 5 minutes and summed-channel conversations also of approximately 5 minutes.

English language transcripts in .ctm format were produced using an automatic speech recognition (ASR) system. 2006 NIST Speaker Recognition Evaluation Training Set is distributed on seven DVDs. 2011 Subscription Members will automatically receive two copies of this corpus. 2011 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$2000.

*
(2) 2006 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 2 was developed by researchers at the Department of Computer Science and Engineering, University of South Florida (USF), Tampa, Florida and the Multimodal Information Group at the National Institute of Standards and Technology (NIST). It contains approximately twenty hours of meeting room video data collected in 2005 and 2006 and annotated for the VACE (Video Analysis and Content Extraction) 2006 face and person tracking tasks.
The VACE program was established to develop novel algorithms for automatic video content extraction, multi-modal fusion, and event understanding. During VACE Phases I and II, the program made significant progress in the automated detection and tracking of moving objects including faces, hands, people, vehicles and text in four primary video domains: broadcast news, meetings, street surveillance, and unmanned aerial vehicle motion imagery. Initial results were also obtained on automatic analysis of human activities and understanding of video sequences.
Three performance evaluations were conducted under the auspices of the VACE program between 2004 and 2007. In 2006, the VACE program and the European Union's Computers in the Human Interaction Loop (CHIL) collaborated to hold the Classification of Events, Activities and Relationships (CLEAR) Evaluation. This was an international effort to evaluate systems designed to analyze people, their identities, activities, interactions and relationships in human-human interaction scenarios, as well as related scenarios. The VACE program contributed the evaluation infrastructure (e.g., data, scoring, tools) for a specific set of tasks, and the CHIL consortium, coordinated by the Karlsruhe Institute of Technology, contributed a separate set of evaluation infrastructure. To the extent possible, the VACE and CHIL programs harmonized their evaluation protocols and metrics.
The meeting room data used for the 2006 test set was collected by the following sites in 2005 and 2006: Carnegie Mellon University (USA), University of Edinburgh (Scotland), IDIAP Research Institute (Switzerland), NIST (USA), Netherlands Organization for Applied Scientific Research (Netherlands) and Virginia Polytechnic Institute and State University (USA).
2006 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 2 is distributed on ten DVDs. 2011 Subscription Members will automatically receive two copies of this corpus. 2011 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$2500.
*
(3) Chinese Gigaword Fifth Edition was produced by LDC. It is a comprehensive archive of newswire text data that has been acquired from Chinese news sources by LDC at the University of Pennsylvania. Chinese Gigaword Fifth Edition includes all of the content of the fourth edition of Chinese Gigaword (LDC2009T27) plus new data covering the period from January 2009 through December 2010.
Eight distinct sources of Chinese newswire are represented here:
  • Agence France Presse (afp_cmn)
  • Central News Agency, Taiwan (cna_cmn)
  • Central News Service (cns_cmn)
  • Guangming Daily (gmw_cmn)
  • People's Daily (pda_cmn)
  • People's Liberation Army Daily (pla_cmn)
  • Xinhua News Agency (xin_cmn)
  • Zaobao Newspaper (zbn_cmn)
The seven-letter codes in the parentheses above are used for the directory names and data files for each source. Articles covering the period from January 2009 through December 2010 have been added to the Agence France Presse, Central News Agency (CNA), Central News Service, Guangming Daily, People's Liberation Army Daily and Xinhua News Agency data sets. The data from People's Daily covers the period from late June 2009 through December 2010. No new data from Zaobao has been added.
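For readers who script against the corpus, the source codes listed above can be kept in a small lookup table. This is a minimal sketch in Python: the names `SOURCES`, `NO_NEW_DATA` and `sources_with_new_data` are illustrative, not part of the corpus, and nothing is assumed about file or directory layout beyond the codes themselves.

```python
# Source codes and names as listed in the newsletter.
SOURCES = {
    "afp_cmn": "Agence France Presse",
    "cna_cmn": "Central News Agency, Taiwan",
    "cns_cmn": "Central News Service",
    "gmw_cmn": "Guangming Daily",
    "pda_cmn": "People's Daily",
    "pla_cmn": "People's Liberation Army Daily",
    "xin_cmn": "Xinhua News Agency",
    "zbn_cmn": "Zaobao Newspaper",
}

# Per the newsletter, no new Zaobao data was added in the fifth edition.
NO_NEW_DATA = {"zbn_cmn"}


def sources_with_new_data() -> list[str]:
    """Return the codes of sources that gained articles in this edition."""
    return sorted(code for code in SOURCES if code not in NO_NEW_DATA)


if __name__ == "__main__":
    for code in sources_with_new_data():
        print(f"{code}: {SOURCES[code]}")
```

Note that the People's Daily additions start in late June 2009 rather than January 2009, so a date-based filter would need to treat `pda_cmn` separately.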
Chinese Gigaword Fifth Edition is distributed on one DVD-ROM. 2011 Subscription Members will automatically receive two copies of this corpus. 2011 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$6000.