Wednesday, May 16, 2012

LDC May 2012 Newsletter

 
To date almost 100 organizations have joined for Membership Year (MY) 2012, our 20th anniversary year.   Once again LDC's early renewal discount program has resulted in significant savings for our members. Organizations that renewed membership or joined early for MY2012 saved almost US$60,000! MY 2011 members are still eligible for a 5% discount when renewing for MY2012. This discount will apply throughout 2012, regardless of time of renewal.

Organizations joining LDC can take advantage of membership benefits including free membership year data as well as discounts on older LDC corpora. Please visit our Members FAQ for further information.

New Publications
(1) Chinese Dependency Treebank 1.0 was developed by the Harbin Institute of Technology's Research Center for Social Computing and Information Retrieval (HIT-SCIR). It contains 49,996 Chinese sentences (902,191 words) randomly selected from People's Daily newswire stories published between 1992 and 1996 and annotated with syntactic dependency structures. Ill-formed or short sentences were eliminated from the randomly-selected sentences prior to annotation. The data was segmented and annotated for part of speech (POS), syntactic structures, verb subclasses and noun compounds. Word segmentation and POS tagging were accomplished automatically using statistical models trained on a larger, annotated corpus of People's Daily newswire stories. Humans manually annotated the syntactic structures and corrected word segmentation errors. POS tags were not corrected.

The data is provided in the format of CoNLL-X and in UTF-8. Chinese Dependency Treebank 1.0 is distributed via web download. 2012 Subscription Members will automatically receive one copy of this data on disc. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$300.
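
For readers unfamiliar with the CoNLL-X layout mentioned above, the following is a minimal reading sketch, not part of the release. It assumes the standard ten tab-separated CoNLL-X columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) with a blank line between sentences. The `Token` type, the `read_conllx` function and the two-token sample (including its tags and relation labels) are invented for illustration and are not taken from the corpus.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    idx: int      # ID: 1-based position in the sentence
    form: str     # FORM: the word itself
    cpos: str     # CPOSTAG: coarse part-of-speech tag
    pos: str      # POSTAG: fine-grained part-of-speech tag
    head: int     # HEAD: ID of the governing token, 0 for the root
    deprel: str   # DEPREL: dependency relation to the head


def read_conllx(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of tokens per sentence; blank lines separate sentences."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        sentence.append(
            Token(int(cols[0]), cols[1], cols[3], cols[4], int(cols[6]), cols[7])
        )
    if sentence:  # last sentence when the input lacks a trailing blank line
        yield sentence


# Hypothetical sample; the tags and labels are placeholders, not corpus values.
SAMPLE = (
    "1\t我\t_\tr\tr\t_\t2\tSUBJ\t_\t_\n"
    "2\t来\t_\tv\tv\t_\t0\tROOT\t_\t_\n"
    "\n"
)

sentences = list(read_conllx(SAMPLE.splitlines()))
root = [tok for tok in sentences[0] if tok.head == 0][0]
```

To read from disk, pass an open file instead of the sample, for example `open(path, encoding="utf-8")`, since the data is distributed in UTF-8.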
*

(2) GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 was developed by LDC. Along with other corpora, the parallel text in this release comprised machine translation training data for Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Modern Standard Arabic source text and corresponding English translations selected from broadcast conversation (BC) data collected by LDC between 2004 and 2007 and transcribed by LDC or under its direction. 

GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 includes 36 source-translation document pairs, comprising 169,109 words of Arabic source text and its English translation. Data is drawn from thirteen distinct Arabic programs broadcast between 2004 and 2007 from the following sources: Al Alam News Channel, Aljazeera, Dubai TV, Oman TV, and Radio Sawa. Broadcast conversation programming is generally more interactive than traditional news broadcasts and includes talk shows, interviews, call-in programs and roundtable discussions. The programs in this release focus on current events topics. 

The files in this release were transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with Quick Rich Transcription guidelines developed by LDC. Transcribers indicated sentence boundaries in addition to transcribing the text. Data was manually selected for translation according to several criteria, including linguistic features, transcription features and topic features. The transcribed and segmented files were then reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Arabic to English translation guidelines which are included with this release. Bilingual LDC staff performed quality control procedures on the completed translations.

Source data and translations are distributed in TDF format. All data are encoded in UTF-8. GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 is distributed via web download. 2012 Subscription Members will automatically receive one copy of this data on disc. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$1750.
*

(3) Turkish Broadcast News Speech and Transcripts was developed by Boğaziçi University, Istanbul, Turkey and contains approximately 130 hours of Voice of America (VOA) Turkish radio broadcasts and corresponding transcripts. This is part of a larger corpus of Turkish broadcast news data collected and transcribed with the goal of facilitating research in Turkish automatic speech recognition and its applications, such as speech retrieval. 

The VOA material was collected between December 2006 and June 2009 using a PC and TV/radio card setup. The data collected during the period 2006-2008 was recorded from analog FM radio; the 2009 broadcasts were recorded from digital satellite transmissions. A quick manual segmentation and transcription approach was followed.

The data was recorded at 32 kHz and resampled to 16 kHz. After screening for recording quality, the files were segmented, transcribed, and verified. The segmentation occurred in two steps: an initial automatic segmentation, followed by manual correction and annotation which included information such as background conditions and speaker boundaries. 

The transcription guidelines were adapted from the LDC HUB4 and quick transcription guidelines. An English version of the adapted guidelines is provided with the data. Manual segmentation and transcripts were created by native Turkish speakers at Boğaziçi University using Transcriber. The transcriptions are provided in the ISO-8859-9 (Latin5) character set.
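
Because the transcripts use ISO-8859-9 rather than UTF-8, they need to be decoded with the right codec before being mixed with other text. The sketch below shows one way to do that in Python using the standard `iso8859_9` codec. The helper name `latin5_to_utf8` and the sample string are illustrative assumptions, not part of the release.

```python
def latin5_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-9 (Latin5) bytes as UTF-8."""
    return data.decode("iso8859_9").encode("utf-8")


# Illustrative sample: Turkish text as it would be stored in Latin5.
original_text = "Boğaziçi Üniversitesi"
latin5_bytes = original_text.encode("iso8859_9")

utf8_bytes = latin5_to_utf8(latin5_bytes)

# Decoding the same bytes as Latin-1 silently corrupts Turkish letters
# such as ğ, which is why the codec must be named explicitly.
misread = latin5_bytes.decode("latin-1")
```

When reading a transcript file directly, `open(path, encoding="iso8859_9")` achieves the same decoding step.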

Turkish Broadcast News Speech and Transcripts is distributed on four DVDs. 2012 Subscription Members will automatically receive one copy of this data. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$2000.

Friday, April 20, 2012

LDC April 2012 Newsletter

 



LDC Timeline – Two Decades of Milestones
April 15 marks the “official” 20th anniversary of LDC’s founding. We’ll be featuring highlights from the last two decades in upcoming newsletters, on the web and elsewhere.  For a start, here’s a brief timeline of significant milestones.
  • 1992: The University of Pennsylvania is chosen as the host site for LDC in response to a call for proposals issued by DARPA; the mission of the new consortium is to operate as a specialized data publisher and archive guaranteeing widespread, long-term availability of language resources. DARPA provides seed money with the stipulation that LDC become self-sustaining within five years. Mark Liberman assumes duties as LDC’s Director with a staff that grows to four, including Jack Godfrey, the Consortium’s first Executive Director.
  • 1993: LDC’s catalog debuts. Early releases include benchmark data sets such as TIMIT, TIPSTER, CSR and Switchboard, shortly followed by the Penn Treebank. 
  • 1994: LDC and NIST (the National Institute of Standards and Technology) enter into a Cooperative R&D Agreement that provides the framework for the continued collaboration between the two organizations.
  • 1995: Collection of conversational telephone speech and broadcast programming and transcription commences. LDC begins its long and continued support for NIST common task evaluations by providing custom data sets for participants. Membership and data license fees prove sufficient to support LDC operations, satisfying the requirement that the Consortium be self-sustaining.
  • 1997: LDC announces LDC Online, a searchable index of newswire and speech data with associated tools to compute n-gram models, mutual information and other analyses.
  • 1998: LDC adds annotation to its task portfolio. Christopher Cieri joins LDC as Executive Director and develops the annotation operation.
  • 1999: Steven Bird joins LDC; the organization begins to develop tools and best practices for general use. The Annotation Graph Toolkit results from this effort.
  • 2000: LDC expands its support of common task evaluations from providing corpora to coordinating language resources across the program. Early examples include the DARPA TIDES, EARS and GALE programs.
  • 2001: The Arabic treebank project begins.
  • 2002: LDC moves to its current facilities at 3600 Market Street, Philadelphia with a full-time staff of approximately 40 persons.
  • 2004: LDC introduces the Standard and Subscription membership options, allowing members to choose whether to receive all or a subset of the data sets released in a membership year.
  • 2005: LDC makes task specifications and guidelines available through its projects web pages.
  • 2008: LDC introduces programs that provide discounts for continuing members and those who renew early in the year.
  • 2010: LDC inaugurates the Data Scholarship program for students with a demonstrable need for data.
  • 2012: LDC’s full-time staff of 50 and 196 part-time staff support ongoing projects and operations which include collecting, developing and archiving data, data annotation, tool development, sponsored-project support and multiple collaborations with various partners. The general catalog contains over 500 holdings in more than 50 languages. Over 85,000 copies of more than 1300 titles have been distributed to 3200 organizations in 70 countries. 

New Publications

(1) 2005 NIST/USF Evaluation Resources for the VACE Program - Broadcast News was developed by researchers at the Department of Computer Science and Engineering, University of South Florida (USF), Tampa, Florida and the Multimodal Information Group at the National Institute of Standards and Technology (NIST). It contains approximately 60 hours of English broadcast news video data collected by LDC in 1998 and annotated for the 2005 VACE (Video Analysis and Content Extraction) tasks. The tasks covered by the broadcast news domain were human face (FDT) tracking, text strings (TDT) (glyphs rendered within the video image for the text object detection and tracking task) and word level text strings (TDT_Word_Level) (videotext OCR task). 

The VACE program was established to develop novel algorithms for automatic video content extraction, multi-modal fusion, and event understanding. During VACE Phases I and II, the program made significant progress in the automated detection and tracking of moving objects including faces, hands, people, vehicles and text in four primary video domains: broadcast news, meetings, street surveillance, and unmanned aerial vehicle motion imagery. Initial results were also obtained on automatic analysis of human activities and understanding of video sequences. 

Three performance evaluations were conducted under the auspices of the VACE program between 2004 and 2007. The 2005 evaluation was administered by USF in collaboration with NIST and guided by an advisory forum including the evaluation participants.

The broadcast news recordings were collected by LDC in 1998 from CNN Headline News (CNN-HDL) and ABC World News Tonight (ABC-WNT). CNN-HDL is a 24-hour/day cable-TV broadcast which presents top news stories continuously throughout the day. ABC-WNT is a daily 30-minute news broadcast that typically covers about a dozen different news items. Each day, the ABC-WNT broadcast and up to four 30-minute sections of CNN-HDL were recorded. The CNN segments were drawn from that portion of the daily schedule that happened to include closed captioning. 

2005 NIST/USF Evaluation Resources for the VACE Program - Broadcast News is distributed on one hard drive. 2012 Subscription Members will automatically receive one copy of this data. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$6000.
*

(2) 2009 CoNLL Shared Task Part 1 contains the Catalan, Czech, German and Spanish trial corpora, training corpora, development and test data for the 2009 CoNLL (Conference on Computational Natural Language Learning) Shared Task Evaluation. The 2009 Shared Task developed syntactic dependency annotations, including semantic dependencies that model the roles of both verbal and nominal predicates. 

The Conference on Computational Natural Language Learning (CoNLL) is accompanied every year by a shared task intended to promote natural language processing applications and evaluate them in a standard setting. In 2008, the shared task focused on English and employed a unified dependency-based formalism and merged the task of syntactic dependency parsing and the task of identifying semantic arguments and labeling them with semantic roles; that data has been released by LDC as 2008 CoNLL Shared Task Data (LDC2009T12). The 2009 task extended the 2008 task to several languages (English plus Catalan, Chinese, Czech, German, Japanese and Spanish). Among the new features were comparison of time and space complexity based on participants' input, and learning curve comparison for languages with large datasets.
The 2009 shared task was divided into two subtasks:

(1) parsing syntactic dependencies

(2) identification of arguments and assignment of semantic roles for each predicate
The materials in this release consist of excerpts from the following corpora:
  • Ancora (Spanish + Catalan): 500,000 words each of annotated news text developed by the University of Barcelona, Polytechnic University of Catalonia, the University of Alicante and the University of the Basque Country
  • Prague Dependency Treebank 2.0 (Czech): approximately 2 million words of annotated news, journal and magazine text developed by Charles University; also available through LDC, LDC2006T01
  • TIGER Treebank + SALSA Corpus (German): approximately 900,000 words of annotated news text and FrameNet annotation developed by the University of Potsdam, Saarland University and the University of Stuttgart
2009 CoNLL Shared Task Part 1 is distributed on one DVD. 2012 Subscription Members will automatically receive two copies of this data. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$200.  

*

(3) 2009 CoNLL Shared Task Part 2 contains the Chinese and English trial corpora, training corpora, development and test data for the 2009 CoNLL (Conference on Computational Natural Language Learning) Shared Task Evaluation. The 2009 Shared Task developed syntactic dependency annotations, including semantic dependencies that model the roles of both verbal and nominal predicates. 

The materials in this release consist of excerpts from the following corpora:
  • Penn Treebank II (LDC95T7) (English): over one million words of annotated English newswire and other text developed by the University of Pennsylvania
  • PropBank (LDC2004T14) (English): semantic annotation of newswire text from Treebank-2 developed by the University of Pennsylvania
  • NomBank (LDC2008T23) (English): argument structure for instances of common nouns in Treebank-2 and Treebank-3 (LDC99T42) texts developed by New York University
  • Chinese Treebank 6.0 (LDC2007T36) (Chinese): 780,000 words (over 1.28 million characters) of annotated Chinese newswire, magazine and administrative texts and transcripts from various broadcast news programs developed by the University of Pennsylvania and the University of Colorado
  • Chinese Proposition Bank 2.0 (LDC2008T07) (Chinese): predicate-argument annotation on 500,000 words from Chinese Treebank 6.0 developed by the University of Pennsylvania and the University of Colorado
2009 CoNLL Shared Task Part 2 is distributed on one CD. 2012 Subscription Members will automatically receive two copies of this data. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$850.

*
(4) USC-SFI MALACH Interviews and Transcripts English was developed by The University of Southern California's Shoah Foundation Institute (USC-SFI), the University of Maryland, IBM and Johns Hopkins University as part of the MALACH (Multilingual Access to Large Spoken ArCHives) Project. It contains approximately 375 hours of interviews from 784 interviewees along with transcripts and other documentation.

Inspired by his experience making Schindler's List, Steven Spielberg established the Survivors of the Shoah Visual History Foundation in 1994 to gather video testimonies from survivors and other witnesses of the Holocaust. While most of those who gave testimony were Jewish survivors, the Foundation also interviewed homosexual survivors, Jehovah's Witness survivors, liberators and liberation witnesses, political prisoners, rescuers and aid providers, Roma and Sinti (Gypsy) survivors, survivors of eugenics policies, and war crimes trials participants.  In 2006, the Foundation became part of the Dana and David Dornsife College of Letters, Arts and Sciences at the University of Southern California in Los Angeles and was renamed as the USC Shoah Foundation Institute for Visual History and Education. 

The goal of the MALACH project was to develop methods for improved access to large multinational spoken archives; the focus was advancing the state of the art of automatic speech recognition (ASR) and information retrieval. The characteristics of the USC-SFI collection -- unconstrained, natural speech filled with disfluencies, heavy accents, age-related co-articulations, un-cued speaker and language switching and emotional speech -- were considered well-suited for that task. The work centered on five languages: English, Czech, Russian, Polish and Slovak. USC-SFI MALACH Interviews and Transcripts English was developed for the English speech recognition experiments. 

The speech data in this release was collected beginning in 1994 under a wide variety of conditions ranging from quiet to noisy (e.g., airplane over-flights, wind noise, background conversations and highway noise). Approximately 25,000 of all USC-SFI collected interviews are in English and average approximately 2.5 hours each. Each of the 784 interviews included in this release is a 30-minute section of the corresponding larger interview. The interviews include accented speech over a wide range (e.g., Hungarian, Italian, Yiddish, German and Polish). 

This release includes transcripts of the first 15 minutes of each interview. The transcripts were created using Transcriber 1.5.1 and later modified.

USC-SFI MALACH Interviews and Transcripts English is distributed on five DVDs. 2012 Subscription Members will automatically receive two copies of this data provided that they have submitted a completed copy of the User License Agreement for USC-SFI MALACH Interviews and Transcripts English (LDC2012S05). 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$2000.

Tuesday, March 20, 2012

LDC March 2012 Newsletter

2012 LDC Survey Responses and Benefit Winner

LDC at ICASSP 2012

New publications:

LDC2012T02
- English Translation Treebank: An Nahar Newswire -

LDC2012S04
- Malto Speech and Transcripts -


2012 LDC Survey Responses and Benefit Winner

Thanks to all who participated in the 2012 LDC Survey. Your responses were thoughtful and informative. We’re now analyzing the results; stay tuned for an announcement on the survey findings.

In the meantime, please join us in congratulating Todor Ganchev from the University of Patras, Wire Communications Laboratory (WCL) for winning the survey participation benefit! As a reminder, one $500 benefit was awarded to a blindly-selected participant whose response was received by February 7, 2012.

LDC at ICASSP 2012

LDC will be traveling across the globe to exhibit at its first IEEE-hosted event. The 37th International Conference on Acoustics, Speech, and Signal Processing (ICASSP) will be held at the Kyoto International Conference Center in Kyoto, Japan, on March 25 - 30, 2012.

The ICASSP meeting is the world’s largest and most comprehensive technical conference focused on signal processing and its applications, and LDC is looking forward to interacting with members of this community. Please look for LDC’s exhibition at Booth #14 in the Annex Hall. We hope to see you there!

New Publications

(1) English Translation Treebank: An Nahar Newswire was developed by LDC and consists of 599 distinct newswire stories from the Lebanese publication An Nahar translated from Arabic to English and annotated for part-of-speech and syntactic structure.

This corpus is part of an ongoing effort at LDC to produce parallel Arabic and English treebanks. The guidelines followed for both part-of-speech and syntactic annotation are Penn Treebank II style, with changes in the tokenization of hyphenated words, part-of-speech and tree changes necessitated by those tokenization changes and revisions to the syntactic annotation to comply with the updated annotation guidelines (including the "Treebank-PropBank merge" or "Treebank IIa" and "treebank c" changes). The original Penn Treebank II guidelines, addenda describing changes to the guidelines and the tokenization specifications can be found on LDC's website.

The data consists of 461,489 tokens in 599 individual files. The news stories in this release were published in An Nahar in 2002.

The English source files (translated from the Arabic) were automatically tokenized, part-of-speech tagged and parsed; the tokens, tags and parses were manually corrected. The quality control process consisted of a series of specific searches for over 100 types of potential inconsistency and parse or annotation error. Any errors found in those searches were manually corrected.

Annotations are in the following two formats:

  • Penn Style Trees
    • Bracketed tree files following the basic form (NODE (TAG token)). Each sentence is surrounded by a pair of empty parentheses.
  • AG xml
    • TreeEditor .xml stand-off annotation files. These files contain the POS and Treebank annotation and reference the source files by character offset. DTD files for the AG xml files were moved from their original location indicated in the readme to be more consistent with LDC publications.
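
As an illustration of the bracketed Penn-style form described above, the sketch below parses a tree written as (NODE (TAG token)) and wrapped in the outer pair of empty parentheses, then lists its (tag, token) pairs. The function names and the sample sentence are invented for illustration; this is a minimal reader under those assumptions, not LDC tooling.

```python
import re


def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = ""  # the outer wrapper has no label
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a leaf token
                pos += 1
        pos += 1  # consume the closing parenthesis
        return (label, children)

    return node()


def tagged_tokens(tree):
    """Collect (tag, token) pairs from the preterminal nodes, left to right."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(label, children[0])]
    pairs = []
    for child in children:
        if not isinstance(child, str):
            pairs.extend(tagged_tokens(child))
    return pairs


# Hypothetical sentence, not taken from the corpus.
SAMPLE_TREE = "( (S (NP (DT The) (NN cat)) (VP (VBD sat))) )"

tree = parse_tree(SAMPLE_TREE)
pairs = tagged_tokens(tree)
```

The nested-tuple representation keeps the sketch dependency-free; a treebank library would normally be a better choice for real work.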

English Translation Treebank: An Nahar Newswire is distributed via web download. 2012 Subscription Members will automatically receive two copies of this corpus on disc. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$4500.

*

(2) Malto Speech and Transcripts was developed by Masato Kobayashi, Associate Professor in Linguistics at the University of Tokyo (Japan), and Bablu Tirkey, research scholar at the Tribal and Regional Languages Department, Ranchi University (India). It contains approximately 8 hours of Malto speech data collected between 2005 and 2009 from 27 speakers (22 males, 5 females). Also included are accompanying transcripts, English translations and glosses for 6 hours of the collection. Speakers were asked to talk about themselves, their lives, rituals and folklore; elicitation interviews were then conducted. The goal of the work was to present the current state and dialectal variation of Malto.

Malto is a Dravidian language spoken in northeastern India (principally the states of Bihar, Jharkhand and West Bengal) and Bangladesh by people called the Pahariyas. Indian census data places the number of Malto speakers between 100,000 and 200,000. Most Malto speakers live in the three northeastern districts of Jharkhand, i.e., Sahebganj, Godda and Pakur; the fieldwork that resulted in this corpus was conducted in those districts. Of the Pahariyas in that area, three subtribes, the Sawriya Pahariyas, the Mal Pahariyas and the Kumarbhag Pahariyas, primarily speak Malto.

The transcribed data accounts for 6 hours of the collection and contains 21 speakers (17 male, 4 female). The untranscribed data accounts for 2 hours of the collection and contains 10 speakers (9 male, 1 female). Four of the male speakers are present in both groups.
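
The two groups overlap, so their sizes do not simply add up. The short check below reconciles the figures with the 27 speakers (22 male, 5 female) reported for the whole collection. It assumes that no female speaker appears in both groups, which the text implies but does not state outright; the variable names are illustrative.

```python
# Speaker counts per group, as given in the description above.
transcribed = {"male": 17, "female": 4}
untranscribed = {"male": 9, "female": 1}

# Speakers present in both groups. The female count of 0 is an assumption
# inferred from the overall totals, not a figure stated in the text.
in_both = {"male": 4, "female": 0}

# Inclusion-exclusion: count each speaker once.
distinct = {
    sex: transcribed[sex] + untranscribed[sex] - in_both[sex]
    for sex in transcribed
}
total_speakers = sum(distinct.values())
```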

All audio is presented in .wav format. Each audio file name includes a subject number, village name, speaker name and the topic discussed. The transcripts and glossary are UTF-8 text files. Because of ambiguities that occur when writing Malto in Devanagari script, the transcripts were developed using Roman script with symbols adapted from the International Phonetic Alphabet (IPA) but are not considered phonetic transcripts.

Malto Speech and Transcripts is distributed on 1 DVD. 2012 Subscription Members will automatically receive two copies of this corpus. 2012 Standard Members may request a copy as part of their 16 free membership corpora. The first 100 copies distributed to non-member organizations are available at no charge. Shipping and handling fees apply.

Wednesday, February 15, 2012

LDC February 2012 Newsletter

Spring 2012 LDC Data Scholarship Recipients!

Membership Fee Savings and Publications Pipeline for MY2012

New publications:

LDC2012S03
- Digital Archive of Southern Speech (DASS)

LDC2012T01
- ModeS TimeBank 1.0



Spring 2012 LDC Data Scholarship Recipients!

LDC is pleased to announce the student recipients of the Spring 2012 LDC Data Scholarship program! This program provides university students with access to LDC data at no cost. Students were asked to complete an application consisting of a proposal describing their intended use of the data and a letter of support from their thesis adviser. We received many solid applications and have chosen six proposals to support. The following students will receive no-cost copies of LDC data:

Zainab Ali Khalaf – University of Science, Malaysia (Malaysia), graduate student, Computer Science. Zainab has been awarded a copy of 1996 English Broadcast News Transcripts (HUB4) (LDC97T22) for her work in spoken document retrieval.

Daniel Jettka – Trinity College Dublin (Ireland), graduate student, Centre for Language & Communication Studies. Daniel has been awarded copies of Penn Discourse Treebank Version 2.0 (LDC2008T05) and RST Discourse Treebank (LDC2002T07) for his work in anaphora resolution.

Olga Nickolaevna Ladoshko - National Technical University of Ukraine “KPI” (Ukraine), graduate student, Acoustics and Acoustoelectronics. Olga has been awarded copies of NTIMIT (LDC93S2) and STC-TIMIT 1.0 (LDC2008S03) for her research in automatic speech recognition for Ukrainian.

Ming Yang, Xiaoxiao Ma, and Jiajia Huang – Wuhan University (China), graduate students, Computer Science. Ming, Xiaoxiao, and Jiajia have been awarded copies of ACE Time Normalization (TERN) 2004 English Training Data v 1.0 (LDC2005T07) and GALE Phase 1 Chinese Broadcast News Parallel Text – Part 1 (LDC2007T23) for their work in summarization and data mining.

Daria Vazhenina – University of Aizu (Japan), graduate student, Human Interface Lab. Daria has been awarded a copy of 2005 Spring NIST Rich Transcription (RT-05S) Evaluation Set (LDC2011S06) for her work in speaker diarization.

Tanina Zappone - University of Rome “La Sapienza” (Italy), graduate student, Oriental Studies. Tanina has been awarded a copy of Chinese Treebank 7.0 (LDC2010T07) for her work in China’s political communications.

Please join us in congratulating our student recipients! The next LDC Data Scholarship program is scheduled for the Fall 2012 semester.

Membership Fee Savings and Publications Pipeline for MY2012

Time is quickly running out to save on membership fees for MY2012! Any organization which joins or renews membership for 2012 through Thursday, March 1, 2012, is entitled to a 5% discount on membership fees. Organizations which held membership for MY2011 can receive a 10% discount on fees provided they renew prior to March 1, 2012.

Many publications for MY2012 are still in development. The planned publications for the upcoming months include:

ARRAU (Anaphor Resolution and Underspecification) ~ data annotated for anaphoric relations, with information about agreement and explicit representation of multiple antecedents for ambiguous anaphoric expressions and discourse antecedents for expressions which refer to abstract entities such as events, actions and plans. The corpus contains texts from various genres: task-oriented dialogues from the TRAINS project, narratives from the English Pear Stories, and newspaper articles from the Wall Street Journal portion of the Penn Treebank.

MALACH English ~ over 300 hours of English audio recordings of interviews conducted under the auspices of the USC Shoah Foundation Institute for Visual History and Education and associated transcripts produced as part of the Multilingual Access to Large Spoken ArCHives (MALACH) project.

Malto Speech and Transcripts ~ speech files of Malto narratives recorded by Masato Kobayashi and Bablu Tirkey with associated transcripts. Malto is a Dravidian language spoken in northeastern India and Bangladesh.

NIST/USF Evaluation Resources for the VACE Program – Broadcast News ~ English broadcast news video annotated for the VACE (Video Analysis and Content Extraction) 2005 face, text and text word detection and tracking tasks.

OntoNotes 5.0 ~ multiple genres of English, Chinese, and Arabic text annotated for syntax, predicate argument structure and shallow semantics.

2012 Subscription Members are automatically sent all MY2012 data as it is released. 2012 Standard Members are entitled to request 16 corpora for free from MY2012. Non-members may license most data for research use.

New publications

(1) Digital Archive of Southern Speech (DASS) was developed by the University of Georgia. It is a subset of the Linguistic Atlas of the Gulf States (LAGS), which is in turn part of the Linguistic Atlas Project (LAP). DASS contains approximately 370 hours of English speech data from 30 female speakers and 34 male speakers in .wav format and in .mp3 format, along with associated metadata about the speakers and the recordings and maps in .jpeg format relating to the recording locations.

LAP consists of a set of survey research projects about the words and pronunciation of everyday American English, the largest project of its kind in the United States. Interviews with thousands of native speakers across the country have been carried out since 1929. LAGS surveyed the everyday speech of Georgia, Tennessee, Florida, Alabama, Mississippi, Arkansas, Louisiana, and Texas in a series of 914 audio-taped interviews conducted from 1968 to 1983. Interviews average approximately six hours in length; the systematic LAGS tape archive amounts to 5500 hours of sound recordings. DASS is a collection of 64 interviews from LAGS selected to cover a range of speech across the region and to represent multiple education levels and ethnic backgrounds.

Also included in this release is a version of the LICHEN software developed at the University of Oulu, Finland. LICHEN allows users to browse and search through the audio data in a more advanced fashion using a graphical interface.

Digital Archive of Southern Speech (DASS) is distributed on one hard disc drive. 2012 Subscription Not-for-Profit/US Government Members will automatically receive one copy of this data. 2012 For-Profit Members will receive a copy provided that they have submitted a completed copy of the User License Agreement for Digital Archive of Southern Speech (LDC2012S03). 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$250.

*

(2) ModeS TimeBank 1.0 was developed by researchers at Technical University of Madrid and Barcelona Media and is a corpus of Modern Spanish (17th and 18th centuries) annotated with temporal and event information according to TimeML mark-up and annotated with spatial information following the SpatialML scheme.

TimeML (Pustejovsky et al., 2005) is a specification language for annotating eventualities and time expressions in natural language as well as the temporal relations among them, thus facilitating the task of extraction, representation and exchange of temporal information. SpatialML (Mani et al., 2008) is a specification language for annotating and normalizing spatial expressions by means of geographic coordinates.

ModeS TimeBank 1.0 contains 102 documents reporting a sea-crossing cruise by a ship called La Princesa, which took place from December 1768 to April 1769. There exist copious logbooks from that period that not only provide information about shipping routes, but also contain valuable data concerning information flows, commercial agents and social networks.

All text is encoded in UTF-8. The data in ModeS TimeBank 1.0 has been tokenized, POS-tagged, and annotated with space, time and event information according to the TimeML and SpatialML specification schemes.
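As a rough illustration of what TimeML-style inline annotation looks like, the sketch below parses a tiny hand-written fragment using Python's standard library. The sentence is hypothetical and in English for readability (the corpus itself is Spanish), and the assumption that the annotation is stored as inline XML is ours; the corpus documentation is the authority on the actual file layout. `EVENT` and `TIMEX3` are standard TimeML tags.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment written for illustration only; not taken from the corpus.
sample = (
    '<TimeML>The ship <EVENT eid="e1" class="OCCURRENCE">departed</EVENT> in '
    '<TIMEX3 tid="t1" type="DATE" value="1768-12">December 1768</TIMEX3>.</TimeML>'
)

root = ET.fromstring(sample)

# Collect (id, surface text) for each annotated event.
events = [(e.get("eid"), e.text) for e in root.iter("EVENT")]

# Collect (id, normalized value, surface text) for each time expression.
timexes = [(t.get("tid"), t.get("value"), t.text) for t in root.iter("TIMEX3")]

print(events)
print(timexes)
```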

ModeS TimeBank 1.0 is distributed via web download. 2012 Subscription Members will automatically receive two copies of this corpus on disc. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may request this data by completing a copy of the LDC User Agreement for Non-Members. The agreement can be faxed to +1 215 573 2175 or scanned and emailed to this address. This data is available at no charge.

Friday, January 20, 2012

LDC January 2012 Newsletter

LDC Celebrates its 20th Anniversary!

2012 LDC Survey – Be on the Lookout!

Membership Discounts for MY 2012 Still Available

New publications:

LDC2012S01
- 2006 NIST Speaker Recognition Evaluation Test Set Part 2 -

LDC2012S02
- TORGO Database of Dysarthric Articulation -


LDC Celebrates its 20th Anniversary!

2012 marks LDC’s 20th Anniversary year – officially on April 15 – but this is cause for a yearlong celebration! From our founding in 1992 as a data repository and language resource distribution center, our online catalog has grown to include over 500 databases in 60 languages that have been licensed by over 3000 organizations from 80 different nations. This data has been made available through donations, funded projects at LDC or elsewhere, community initiatives, and from LDC resources, an indication of the collective strength of this consortium. LDC has evolved from an organization that shares language resources to one that also is at the forefront of language technology research that includes the development of new data resources, software tools, and standards and best practices.

As we celebrate throughout the year, look for announcements and special features in our newsletter and on our Facebook page.

2012 LDC Survey – Be on the Lookout!

It’s been four years since our last survey of LDC members and data licensees and we would like to again ask you to share your views on LDC and its language resources as well as your thoughts about data distribution in general and the impact of social media on language-related research and technology development. These topics are particularly timely as LDC enters its 20th anniversary year.

The 2012 LDC Survey will be sent to every person and organization that licensed LDC data and/or joined LDC as a Member during the period from 2009 through 2011. Those who complete the survey on or before February 7, 2012 will make their organization eligible for a $500 benefit to be applied to any corpus or membership purchase in 2012. LDC will conduct a blind drawing and one lucky winner will be chosen from the pool of respondents.

Many thanks for your continued support and for your participation in the 2012 Survey!

Membership Discounts for MY 2012 Still Available

If you are considering joining for Membership Year 2012 (MY2012), there is still time to save on membership fees. Any organization which joins or renews membership for 2012 through Thursday, March 1, 2012, is entitled to a 5% discount on membership fees. Organizations that held membership for MY2011 can receive a 10% discount on fees provided they renew prior to March 1, 2012. For further information on pricing, please consult our Announcements page or contact LDC.

New Publications

(1) 2006 NIST Speaker Recognition Evaluation Test Set Part 2 was developed by LDC and National Institute of Standards and Technology (NIST). It contains 568 hours of conversational telephone and microphone speech in English, Arabic, Bengali, Chinese, Farsi, Hindi, Korean, Russian, Spanish, Thai and Urdu and associated English transcripts used as test data in the NIST-sponsored 2006 Speaker Recognition Evaluation (SRE).

The task of the 2006 SRE evaluation was speaker detection, that is, to determine whether a specified speaker is speaking during a given segment of conversational telephone speech. The task was divided into 15 distinct and separate tests involving one of five training conditions and one of four test conditions. Further information about the test conditions and additional documentation is available at the NIST web site for the 2006 SRE and within the 2006 SRE Evaluation Plan.

LDC has previously published 2006 NIST Speaker Recognition Evaluation Training Set and 2006 NIST Speaker Recognition Evaluation Test Set Part 1.

The speech data in this release was collected by LDC as part of the Mixer project, in particular Mixer Phases 1, 2 and 3. The Mixer project supports the development of robust speaker recognition technology by providing carefully collected and audited speech from a large pool of speakers recorded simultaneously across numerous microphones and in different communicative situations and/or in multiple languages. The data is mostly English speech, but includes some speech in Arabic, Bengali, Chinese, Farsi, Hindi, Korean, Russian, Spanish, Thai and Urdu.

The telephone speech segments are multi-channel data collected simultaneously from a number of auxiliary microphones. The files are organized into four types: two-channel excerpts of approximately 10 seconds, two-channel conversations of approximately 5 minutes, summed-channel conversations also of approximately 5 minutes and a two-channel conversation with the usual telephone speech replaced by auxiliary microphone data in the putative target speaker channel. The auxiliary microphone conversations are also of approximately five minutes in length. English language transcripts in .ctm format were produced using an automatic speech recognition (ASR) system.
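For readers unfamiliar with .ctm transcripts, the following is a minimal sketch of a reader for the general NIST CTM layout: one token per line with recording name, channel, start time, duration, word, and an optional confidence score, with `;;` marking comment lines. The sample lines and the names `CtmToken` and `parse_ctm` are hypothetical; the files in this release may differ in detail, so check the corpus documentation.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class CtmToken:
    recording: str
    channel: str
    start: float      # seconds from the start of the recording
    duration: float   # seconds
    word: str
    confidence: Optional[float] = None


def parse_ctm(lines: Iterable[str]) -> Iterator[CtmToken]:
    """Yield one CtmToken per non-comment, non-empty CTM line."""
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(";;"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 5:
            raise ValueError(f"malformed CTM line: {raw!r}")
        confidence = float(fields[5]) if len(fields) > 5 else None
        yield CtmToken(fields[0], fields[1], float(fields[2]),
                       float(fields[3]), fields[4], confidence)


# Hypothetical sample lines, not taken from the corpus.
sample = [
    ";; recording channel start duration word [confidence]",
    "xyz_file A 0.50 0.30 hello 0.91",
    "xyz_file A 0.80 0.45 world",
]
tokens = list(parse_ctm(sample))
for tok in tokens:
    print(tok.word, tok.start, tok.start + tok.duration)
```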

2006 NIST Speaker Recognition Evaluation Test Set Part 2 is distributed on seven DVDs. 2012 Subscription Members will automatically receive two copies of this corpus. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$2000.

*

(2) TORGO Database of Dysarthric Articulation was developed by the University of Toronto's departments of Computer Science and Speech Language Pathology in collaboration with the Holland-Bloorview Kids Rehabilitation Hospital in Toronto, Canada. It contains approximately 23 hours of English speech data, accompanying transcripts and documentation from 8 speakers (5 males, 3 females) with cerebral palsy (CP) or amyotrophic lateral sclerosis (ALS) and from 7 speakers (4 males, 3 females) from a non-dysarthric control group.

CP and ALS are examples of dysarthria, which is caused by disruptions in the neuro-motor interface that distort motor commands to the vocal articulators, resulting in atypical and relatively unintelligible speech in most cases. The TORGO database is primarily a resource for developing advanced automatic speech recognition (ASR) models suited to the needs of people with dysarthria, but it is also applicable to non-dysarthric speech. The inability of modern ASR to effectively understand dysarthric speech is a problem since the more general physical disabilities often associated with the condition can make other forms of computer input, such as computer keyboards or touch screens, difficult to use.

The data consists of aligned acoustics and measured 3D articulatory features from the speakers carried out using the 3D AG500 electro-magnetic articulograph (EMA) system (Carstens Medizinelektronik GmbH, Lenglern, Germany) with fully-automated calibration. This system allows for 3D recordings of articulatory movements inside and outside the vocal tract, thus providing a detailed window on the nature and direction of speech-related activity.

All subjects read text consisting of non-words, short words and restricted sentences from a 19-inch LCD screen. The restricted sentences included 162 sentences from the sentence intelligibility section of Assessment of intelligibility of dysarthric speech (Yorkston & Beukelman, 1981) and 460 sentences derived from the TIMIT database. The unrestricted sentences were elicited by asking participants to spontaneously describe 30 images in interesting situations taken randomly from Webber Photo Cards - Story Starters (Webber, 2005), designed to prompt students to tell or write a story.

Data is organized by speaker and by the session in which each speaker recorded data. Each speaker's directory contains 'Session' directories which encapsulate data recorded in the respective visit and occasionally, a 'Notes' directory which can include Frenchay assessments (test for the measurement, description and diagnosis of dysarthria), notes about sessions (e.g., sensor errors), and other relevant notes.
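The layout described above can be explored with a short script. This is a sketch under assumptions: the speaker directory names (`speakerA`, `speakerB`) and the numbered session names are invented for the demonstration, and only the 'Session' prefix and the optional 'Notes' directory come from the description. The helper `list_sessions` is hypothetical.

```python
import tempfile
from pathlib import Path


def list_sessions(corpus_root):
    """Map each speaker directory to its Session directories and a Notes flag."""
    layout = {}
    for speaker_dir in sorted(Path(corpus_root).iterdir()):
        if not speaker_dir.is_dir():
            continue
        children = [p for p in sorted(speaker_dir.iterdir()) if p.is_dir()]
        layout[speaker_dir.name] = {
            "sessions": [p.name for p in children if p.name.startswith("Session")],
            "has_notes": any(p.name == "Notes" for p in children),
        }
    return layout


# Build a small hypothetical tree so the example runs without the corpus.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "speakerA" / "Session1").mkdir(parents=True)
    (root / "speakerA" / "Session2").mkdir()
    (root / "speakerA" / "Notes").mkdir()
    (root / "speakerB" / "Session1").mkdir(parents=True)
    layout = list_sessions(root)

print(layout)
```

Pointing `list_sessions` at the root of an actual copy of the data would give a quick inventory of which speakers have a 'Notes' directory, provided the real directory names follow the pattern assumed here.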

TORGO Database of Dysarthric Articulation is distributed on 4 DVDs. 2012 Subscription Members will automatically receive two copies of this corpus. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$1200.

Thursday, December 15, 2011

LDC December 2011 Newsletter

Spring 2012 LDC Data Scholarship Program - deadline approaching!

LDC Exhibiting at LSA 2012 Annual Meeting

LDC Hosts Satellite Workshop at LSA 2012

LDC to Close for Winter Break

New publications

LDC2011S10
- 2006 NIST Speaker Recognition Evaluation Test Set Part 1 -

LDC2011S11
- 2008 NIST Speaker Recognition Evaluation Supplemental Set -


Spring 2012 LDC Data Scholarship Program - deadline fast approaching!

The deadline for the Spring 2012 LDC Data Scholarship Program is less than a month away! Applications are being accepted through January 15, 2012. The LDC Data Scholarship program provides university students with access to LDC data at no cost. This program is open to students pursuing both undergraduate and graduate studies in an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay.

Students will need to complete an application which consists of a data use proposal and letter of support from their adviser. For further information on application materials and program rules, please visit the LDC Data Scholarship page.

Students can email their applications to the LDC Data Scholarship program. Decisions will be sent by email from the same address.

LDC Exhibiting at LSA 2012 Annual Meeting

LDC looks forward to mingling with linguists and language specialists when we exhibit at the 86th Annual Meeting of the Linguistic Society of America (LSA). The main conference will be held over January 5-8, 2012 at the Portland, OR Hilton and Executive Tower and the exhibit hall will be open from January 6-8th (limited hours on Sunday the 8th). Please stop by our display for news on what 2012 will hold for LDC and to receive some of our conference giveaways.

LSA 2012 will feature plenary talks on the following topics:

  • Patrice Speeter Beddor (University of Michigan): "The Dynamics of Speech Perception: Constancy, Variation, and Change"
  • Dan Jurafsky (Stanford University): "Computing Meaning: Learning and Extracting Meaning from Text"
  • Ted Supalla (University of Rochester): "Rethinking the Emergence of Grammatical Structure in Signed Languages: New Evidence from Variation and Historical Change in American Sign Language"

For further information visit the LSA Annual Meeting website. If you would like to learn more about LDC’s conference preparations, please ‘like’ our Facebook page.

We hope to see you there!


LDC Hosts Satellite Workshop at LSA 2012

LDC will co-host a satellite workshop entitled 'Sociolinguistic Archival Preparation' on January 4-5, 2012 in conjunction with the LSA 2012 Annual Meeting. This two-day workshop will focus on techniques to permit the archiving of data, for cross-community sharing of corpora as well as for subsequent 'panel' studies. Recent discussions within the field have concluded that present protocols need to be expanded to permit adequate archiving. Specifically:

  • Institutional Review Board (IRB) paperwork needs to be adapted to provide protection for interviewees while permitting their speech data to be more generally sharable (and therefore archiveable);
  • Demographic, situational, and attitudinal protocols are needed to provide a unified resource serving multiple research communities as well as the contributing researchers.

The sooner IRB forms and research protocols are aligned with each other, the sooner sharable, archiveable corpora will become available, permitting intergroup comparison and interdisciplinary collaboration.

LDC's Executive Director, Christopher Cieri, and LDC consultant and University of Arizona scholar, Malcah Yaeger-Dror, are the workshop organizers. This workshop is funded in part by the National Science Foundation (BCS#1144480). Further information about the workshop is available on the LSA Annual Meeting website.

LDC to Close for Winter Break

LDC will be closed from Monday, December 26, 2011 through Monday, January 2, 2012 in accordance with the University of Pennsylvania Winter Break Policy. Our offices will reopen on Tuesday, January 3, 2012. Requests received for membership renewals and corpora during the Winter Break will be processed at that time.

Best wishes for a happy and safe holiday season!

New Publications

(1) 2006 NIST Speaker Recognition Evaluation Test Set Part 1 was developed by LDC and National Institute of Standards and Technology (NIST). It contains 437 hours of conversational telephone and microphone speech in English, Arabic, Bengali, Chinese, Farsi, Hindi, Korean, Russian, Spanish, Thai and Urdu and associated English transcripts used as test data in the NIST-sponsored 2006 Speaker Recognition Evaluation (SRE).

The ongoing series of yearly SRE evaluations conducted by NIST is intended to be of interest to researchers working on the general problem of text-independent speaker recognition. The task of the 2006 SRE evaluation was speaker detection, that is, to determine whether a specified speaker is speaking during a given segment of conversational telephone speech. The task was divided into 15 distinct and separate tests involving one of five training conditions and one of four test conditions. Further information about the test conditions and additional documentation is available at the NIST web site for the 2006 SRE and within the 2006 SRE Evaluation Plan.

The speech data in this release was collected by LDC as part of the Mixer project, in particular Mixer Phases 1, 2 and 3. The Mixer project supports the development of robust speaker recognition technology by providing carefully collected and audited speech from a large pool of speakers recorded simultaneously across numerous microphones and in different communicative situations and/or in multiple languages. The data is mostly English speech, but includes some speech in Arabic, Bengali, Chinese, Farsi, Hindi, Korean, Russian, Spanish, Thai and Urdu.

The telephone speech segments are multi-channel data collected simultaneously from a number of auxiliary microphones. The files are organized into four types: two-channel excerpts of approximately 10 seconds, two-channel conversations of approximately 5 minutes, summed-channel conversations also of approximately 5 minutes and a two-channel conversation with the usual telephone speech replaced by auxiliary microphone data in the putative target speaker channel. The auxiliary microphone conversations are also of approximately five minutes in length.

English language transcripts in .ctm format were produced using an automatic speech recognition (ASR) system.

2006 NIST Speaker Recognition Evaluation Test Set Part 1 is distributed on five DVD-ROMs. 2011 Subscription Members will automatically receive two copies of this corpus. 2011 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$2000.

*

(2) 2008 NIST Speaker Recognition Evaluation Supplemental Set was developed by LDC and National Institute of Standards and Technology (NIST) and contains additional data distributed after the main 2008 Speaker Recognition Evaluation (SRE). Specifically, the corpus consists of 770 hours of English microphone speech along with transcripts and other materials used as supplemental data in the 2008 NIST Speaker Recognition Evaluation (SRE) and in a follow-up evaluation to SRE08.

The 2008 evaluation was distinguished from prior evaluations by including not only conversational telephone speech data but also conversational speech data of comparable duration recorded over a microphone channel involving an interview scenario. The follow-up evaluation focused on speaker detection in the context of conversational interview type speech and was designed to measure the performance of SRE08 systems in previously unexposed test segment channel conditions.

LDC previously released the main 2008 NIST SRE Evaluation in three parts as 2008 NIST Speaker Recognition Evaluation Training Set Part 1 LDC2011S05, 2008 NIST Speaker Recognition Evaluation Training Set Part 2 LDC2011S07 and 2008 NIST Speaker Recognition Evaluation Test Set LDC2011S08.

The speech data in this release was collected in 2007 by LDC at its Human Subjects Data Collection Laboratories in Philadelphia and by the International Computer Science Institute (ICSI) at the University of California, Berkeley. This collection was part of the Mixer 5 project, which was designed to support the development of robust speaker recognition technology by providing carefully collected and audited speech from a large pool of speakers recorded simultaneously across numerous microphones and in different communicative situations and/or in multiple languages. Mixer participants were native English and bilingual English speakers. The microphone speech in this corpus is in English and consists of approximately 3 minute and 30 minute interview excerpts.

This supplemental data is split into four different parts which provide:

  • new training data distributed to 2008 SRE participants
  • additional data distributed to participants in the 2008 SRE follow-up evaluation
  • interviewer channel files for the 2008 SRE main test (released after the evaluations)
  • supplemental training data (released after the evaluations)

English language transcripts in .ctm format were produced using an automatic speech recognition (ASR) system and are included for some, but not all, speech data.

2008 NIST Speaker Recognition Evaluation Supplemental Set is distributed on five DVD-ROMs. 2011 Subscription Members will automatically receive two copies of this corpus. 2011 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$2000.