Wednesday, July 18, 2012

LDC July 2012 Newsletter

 
New publications:



LDC2012T10
Catalan TimeBank 1.0  -

LDC 20th Anniversary Workshop 

LDC announces its 20th Anniversary Workshop on Language Resources, to be held in Philadelphia on September 6-7, 2012. The event will commemorate our anniversary, reflect on the beginning of language data centers and address the future of language resources. 

Workshop themes will include: the developments in human language technologies and associated resources that have brought us to our current state; the language resources required by the technical approaches taken and the impact of these resources on HLT progress; the applications of HLT and resources to other disciplines including law, medicine, economics, the political sciences and psychology; the impact of HLTs and related technologies on linguistic analysis and novel approaches in fields as widespread as phonetics, semantics, language documentation, sociolinguistics and dialect geography; and finally, the impact of any of these developments on the ways in which language resources are created, shared and exploited and on the specific resources required.

Stay tuned for further details.
New publications 

(1) American English Nickname Collection was developed by Intelius, Inc. and is a compilation of American English nickname-to-given-name mappings based on information in US government records, public web profiles and financial and property reports. This corpus is intended as a tool for the quantitative study of nickname usage in the United States, for example in demographic and sociological studies.

The American English Nickname Collection contains 331,237 distinct mappings encompassing millions of names. The data was collected and processed through a record linkage pipeline. The steps in the pipeline were (1) data cleaning, (2) blocking, (3) pair-wise linkage and (4) clustering. In the cleaning step, material was categorized, processed to remove junk and spam records and normalized to an approximately common representation. The blocking process utilized an algorithm to group records by shared properties for determining which record pairs should be examined by the pairwise linker as potential duplicates. The linkage step assigned a score to record pairs using a supervised pairwise-based machine learning model. The clustering step combined record pairs into connected components and further partitioned each connected component to remove inconsistent pairwise links. The result is that input records were partitioned into disjoint sets called profiles, where each profile corresponded to a single person.
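The blocking, pairwise linkage and clustering steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the record fields, the `toy_score` function (a hand-written stand-in for the supervised model, which is not described in detail here) and the 0.66 threshold are all hypothetical, and the further partitioning of connected components to remove inconsistent links is left out.

```python
from collections import defaultdict
from itertools import combinations


def block_records(records, key):
    """Blocking: group record ids by a shared property."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[key(rec)].append(rec_id)
    return blocks


def candidate_pairs(blocks):
    """Only records that share a block are compared pairwise."""
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            yield a, b


def toy_score(r1, r2):
    """Hypothetical stand-in for the learned model: share of matching fields."""
    fields = ("last", "city", "birth_year")
    return sum(r1[f] == r2[f] for f in fields) / len(fields)


def cluster(record_ids, links):
    """Clustering: connected components over accepted links (union-find)."""
    parent = {r: r for r in record_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    profiles = defaultdict(set)
    for r in record_ids:
        profiles[find(r)].add(r)
    return sorted(sorted(p) for p in profiles.values())


def link_records(records, threshold=0.66):
    """Run blocking, pairwise scoring and clustering; return disjoint profiles."""
    blocks = block_records(records, key=lambda r: r["last"])
    links = [
        (a, b)
        for a, b in candidate_pairs(blocks)
        if toy_score(records[a], records[b]) >= threshold
    ]
    return cluster(records.keys(), links)


# Made-up records for illustration only.
RECORDS = {
    1: {"first": "William", "last": "Smith", "city": "Boston", "birth_year": 1970},
    2: {"first": "Bill", "last": "Smith", "city": "Boston", "birth_year": 1970},
    3: {"first": "Ann", "last": "Smith", "city": "Austin", "birth_year": 1985},
    4: {"first": "Robert", "last": "Jones", "city": "Denver", "birth_year": 1962},
}
PROFILES = link_records(RECORDS)
```

In this toy run, records 1 and 2 end up in one profile, which is the kind of grouping from which a William/Bill mapping could be read off.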

The material is presented in the form of a comma-delimited text file. Each line contains a first name, a nickname or alias, its conditional probability and its frequency. The conditional probability for each nickname is derived from the base data using an algorithm which calculates both the probability that an alias refers to a given name and a threshold below which the mapping is most likely an error. This threshold eliminates typographic errors and other noise from the data.
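A file laid out this way could be read along the following lines. This is a minimal sketch: the column order is taken from the description above, while the sample rows, the probability values and the extra `min_prob` filter are invented for illustration and are not drawn from the corpus.

```python
import csv
import io
from collections import defaultdict

# Made-up rows: given name, nickname, conditional probability, frequency.
SAMPLE = """\
WILLIAM,BILL,0.45,1200
WILLIAM,WILL,0.30,800
ROBERT,BOB,0.60,1500
ROBERT,BILL,0.01,3
"""


def load_mappings(handle, min_prob=0.0):
    """Index candidate given names by nickname, most probable first."""
    by_nickname = defaultdict(list)
    for row in csv.reader(handle):
        if not row:
            continue  # tolerate blank lines
        given, nick, prob, freq = row
        prob = float(prob)
        if prob >= min_prob:
            by_nickname[nick].append((given, prob, int(freq)))
    for candidates in by_nickname.values():
        candidates.sort(key=lambda c: c[1], reverse=True)
    return dict(by_nickname)


MAPPINGS = load_mappings(io.StringIO(SAMPLE), min_prob=0.05)
BEST_FOR_BILL = MAPPINGS["BILL"][0][0]
```

With a real copy of the data, `io.StringIO(SAMPLE)` would be replaced by an open file handle; the encoding and the presence or absence of a header row should be checked against the corpus documentation.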

American English Nickname Collection is distributed via web download. 2012 Subscription Members will receive two copies of this data on disc provided that they have submitted a completed copy of the User License Agreement for American English Nickname Collection (LDC2012T11). 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data by completing the User License Agreement for American English Nickname Collection (LDC2012T11). The agreement can be faxed to +1 215 573 2175 or scanned and emailed to ldc @ ldc . upenn . edu. The collection is being made available at no charge.

*

(2) Arabic Treebank - Broadcast News v1.0 was developed at LDC. It consists of 120 transcribed Arabic broadcast news stories with part-of-speech, morphology, gloss and syntactic tree annotation in accordance with the Penn Arabic Treebank (PATB) Morphological and Syntactic Annotation Guidelines. The ongoing PATB project supports research in Arabic-language natural language processing and human language technology development. 

This release contains 432,976 source tokens before clitics were split, and 517,080 tree tokens after clitics were separated for treebank annotation. The source materials are Arabic broadcast news stories collected by LDC during the period 2005-2008 from the following sources: Abu Dhabi TV, Al Alam News Channel, Al Arabiya, Al Baghdadya TV, Al Fayha, Alhurra, Al Iraqiyah, Aljazeera, Al Ordiniyah, Al Sharqiyah, Dubai TV, Kuwait TV, Lebanese Broadcasting Corp., Oman TV, Radio Sawa, Saudi TV and Syria TV. The transcripts were produced by LDC.

Arabic Treebank - Broadcast News v1.0 is distributed via web download. 2012 Subscription Members will receive two copies of this data on disc. 2012 Standard Members may request a copy as part of their 16 free membership corpora.
*

(3) Catalan TimeBank 1.0 was developed by researchers at Barcelona Media and consists of Catalan texts in the AnCora corpus annotated with temporal and event information according to the TimeML specification language.

TimeML is a schema for annotating eventualities and time expressions in natural language as well as the temporal relations among them, thus facilitating the task of extraction, representation and exchange of temporal information. Catalan TimeBank 1.0 is annotated at three levels, marking events, time expressions and event metadata. The TimeML annotation scheme was tailored to the specifics of the Catalan language. Temporal relations in Catalan present distinctions of verbal mood (e.g., indicative, subjunctive, conditional) and grammatical aspect (e.g., imperfective) which are absent in English.

Catalan TimeBank 1.0 contains stand-off annotations for 210 documents with over 75,800 tokens including punctuation marks, or 68,000 tokens excluding punctuation. The source documents are from the EFE news agency, the ACN Catalan news agency and the Catalan version of the El Periódico newspaper, and span the period from January to December 2000.

The AnCora corpus is the largest multilayer annotated corpus of Spanish and Catalan. AnCora contains 400,000 words in Spanish and 275,000 words in Catalan. The AnCora documents are annotated on many linguistic levels including structure, syntax, dependencies, semantics and pragmatics. That information is not included in this release, but it can be mapped to the present annotations. The corpus is freely available from the Centre de Llenguatge i Computació (CLiC).

Catalan TimeBank 1.0 is distributed by web download. 2012 Subscription Members will receive two copies of this data on disc. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data by completing the LDC User Agreement for Non-members.  The agreement can be faxed to +1 215 573 2175 or scanned and emailed to  ldc @ ldc . upenn . edu. The collection is being made available at no charge.

Monday, June 18, 2012

LDC June 2012 Newsletter


LDC at LREC 2012

LDC attended the 8th Language Resources and Evaluation Conference (LREC 2012), hosted by ELRA, the European Language Resources Association. The conference was held in Istanbul, Turkey and featured a broad range of sessions on language resources and human language technologies research. Fourteen LDC staff members presented current work on a wide range of topics, including handwriting recognition, word alignment, treebanks, machine translation and information retrieval as well as initiatives for synchronizing metadata practices in sociolinguistic data collection.

The LDC Papers page now includes research papers presented at LREC 2012. Most papers are available for download in PDF format; presentation slides and posters are available for several papers as well. On the Papers page, you can read about LDC's role in resource creation to support handwriting recognition and translation technology (Song et al., 2012). LDC is developing resources to support two research programs: Multilingual Automatic Document Classification, Analysis and Translation (MADCAT) and Open Handwriting Recognition and Translation (OpenHaRT). To support these programs, LDC is collecting handwritten samples of pre-processed Arabic and Chinese data that had previously been translated into English. To date, LDC has collected and annotated over 225,000 handwriting images.

Additionally, you can learn about LDC's efforts to collect and annotate very large corpora of user-contributed content in multiple languages (Garland et al., 2012). For the Broad Operational Language Translation (BOLT) program, LDC is developing resources to support genre-independent machine translation and information retrieval systems. In the current phase of BOLT, LDC is collecting and annotating threaded posts from online discussion forums, targeting at least 500 million words each in three languages: English, Chinese and Egyptian Arabic. A portion of the data undergoes manual, multi-layered linguistic annotation.

As we mark LDC's 20th anniversary, we will feature the work behind these LREC papers as well as other ongoing research in upcoming newsletters.

New publications

(1) Arabic-Dialect/English Parallel Text was developed by Raytheon BBN Technologies (BBN), LDC and Sakhr Software and contains approximately 3.5 million tokens of Arabic dialect sentences and their English translations. 

The data in this corpus consists of Arabic web text as follows:

1. Filtered automatically from large Arabic text corpora harvested from the web by LDC. The LDC corpora consisted largely of weblogs and online user groups and amounted to around 350 million Arabic words. Documents that contained a large percentage of non-Arabic or Modern Standard Arabic (MSA) words were eliminated. A list of dialect words was manually selected by culling through the Levantine Fisher (LDC2005S07, LDC2005T03, LDC2007S02 and LDC2007T04) and Egyptian CALLHOME speech corpora (LDC97S45, LDC2002S37, LDC97T19 and LDC2002T38) distributed by LDC. That list was then used to retain documents that contained a certain number of matches. The resulting subset of the web corpora contained around four million words. Documents were automatically segmented into passages using formatting information from the raw data.

2. Manually harvested by Sakhr Software from Arabic dialect web sites.
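The word-list filter mentioned in item 1 can be pictured with a small sketch like the one below. Everything in it is an assumption made for illustration: the five dialect words, the whitespace tokenization and the `min_hits` value are placeholders, since the actual word list and match threshold are not given here.

```python
# Hypothetical word list; the real list was culled manually from speech corpora.
DIALECT_WORDS = {"مش", "عايز", "ازاي", "شو", "هلق"}


def dialect_hits(text, dialect_words):
    """Count whitespace-separated tokens that appear in the dialect word list."""
    return sum(1 for token in text.split() if token in dialect_words)


def keep_document(text, dialect_words, min_hits=2):
    """Retain a document only if it contains enough dialect-word matches."""
    return dialect_hits(text, dialect_words) >= min_hits


# Two made-up one-line "documents": one dialectal, one MSA.
DIALECT_DOC = "انا مش عايز اروح"
MSA_DOC = "ذهب الرئيس إلى المدينة"

KEPT = [doc for doc in (DIALECT_DOC, MSA_DOC) if keep_document(doc, DIALECT_WORDS)]
```

A production filter would also need the earlier step that discards documents with a high share of non-Arabic or MSA words, which this sketch does not attempt.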

Dialect classification and sentence segmentation, as needed, and translation into English were performed by BBN through Amazon's Mechanical Turk. Arabic annotators from Mechanical Turk classified filtered passages as being either MSA or one of four regional dialects: Egyptian, Levantine, Gulf/Iraqi or Maghrebi. An additional "General" dialect option was allowed for ambiguous passages. The classification was applied to whole passages rather than individual sentences. Only the passages labeled Levantine and Egyptian were further processed. The segmented Levantine and Egyptian sentences were then translated. Annotators were instructed to translate completely and accurately and to transliterate Arabic names. They were also provided with examples. All segments of a passage were presented in the same translation task to provide context.

Arabic-Dialect/English Parallel Text is distributed via web download. 2012 Subscription Members will automatically receive two copies of this data on disc. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$2250.
*

(2) Prague Czech-English Dependency Treebank (PCEDT) 2.0 was developed by the Institute of Formal and Applied Linguistics at Charles University in Prague, Czech Republic. It is a corpus of Czech-English parallel resources translated, aligned and manually annotated for dependency structure, semantic labeling, argument structure, ellipsis and anaphora resolution. This release updates Prague Czech-English Dependency Treebank 1.0 (LDC2004T25) by adding English newswire texts so that it now contains over two million words in close to 100,000 sentences. 

The principal new material in PCEDT 2.0 is the inclusion of the entire Wall Street Journal data from Treebank-3 (LDC99T42). Not included from PCEDT 1.0 are the Reader's Digest material, the Czech monolingual corpus and the English-Czech dictionary. Each section is enhanced with a comprehensive manual linguistic annotation in the Prague Dependency Treebank style (Prague Dependency Treebank 2.0, LDC2006T01). The main features of this annotation style are:
  • dependency structure of the content words and coordinating and similar structures (function words are attached as their attribute values)
  • semantic labeling of content words and types of coordinating structures
  • argument structure, including an argument structure ("valency") lexicon for both languages
  • ellipsis and anaphora resolution

This annotation style is called tectogrammatical annotation, and it constitutes the tectogrammatical layer in the corpus. Please consult the PCEDT website for more information and documentation.

Prague Czech-English Dependency Treebank (PCEDT) 2.0 is distributed on one DVD. 2012 Subscription Members will automatically receive two copies of this data. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$100.

Wednesday, May 16, 2012

LDC May 2012 Newsletter

 

To date almost 100 organizations have joined for Membership Year (MY) 2012, our 20th anniversary year. Once again LDC's early renewal discount program has resulted in significant savings for our members. Organizations that renewed membership or joined early for MY2012 saved almost US$60,000! MY2011 members are still eligible for a 5% discount when renewing for MY2012. This discount will apply throughout 2012, regardless of time of renewal.

Organizations joining LDC can take advantage of membership benefits including free membership year data as well as discounts on older LDC corpora. Please visit our Members FAQ for further information.

New Publications
(1) Chinese Dependency Treebank 1.0 was developed by the Harbin Institute of Technology's Research Center for Social Computing and Information Retrieval (HIT-SCIR). It contains 49,996 Chinese sentences (902,191 words) randomly selected from People's Daily newswire stories published between 1992 and 1996 and annotated with syntactic dependency structures. Ill-formed or short sentences were eliminated from the randomly-selected sentences prior to annotation. The data was segmented and annotated for part of speech (POS), syntactic structures, verb subclasses and noun compounds. Word segmentation and POS tagging were accomplished automatically using statistical models trained on a larger, annotated corpus of People's Daily newswire stories. Humans manually annotated the syntactic structures and corrected word segmentation errors. POS tags were not corrected.

The data is provided in the format of CoNLL-X and in UTF-8. Chinese Dependency Treebank 1.0 is distributed via web download. 2012 Subscription Members will automatically receive one copy of this data on disc. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$300.
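For readers unfamiliar with CoNLL-X, the format stores one token per line in ten tab-separated columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), with a blank line between sentences. The sketch below reads that layout; the three-token sample sentence and its tag and relation labels are made up for illustration and are not taken from the treebank.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    id: int
    form: str
    postag: str
    head: int  # 0 marks the root of the sentence
    deprel: str


def read_conllx(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of tokens per sentence from CoNLL-X formatted lines."""
    sentence: List[Token] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        sentence.append(
            Token(int(cols[0]), cols[1], cols[4], int(cols[6]), cols[7])
        )
    if sentence:
        yield sentence


# Illustrative rows only; labels are placeholders.
ROWS = [
    ("1", "我", "_", "r", "r", "_", "2", "SBV", "_", "_"),
    ("2", "爱", "_", "v", "v", "_", "0", "HED", "_", "_"),
    ("3", "北京", "_", "ns", "ns", "_", "2", "VOB", "_", "_"),
]
SAMPLE = "\n".join("\t".join(row) for row in ROWS) + "\n\n"

SENTENCES = list(read_conllx(SAMPLE.splitlines()))
ROOTS = [tok.form for tok in SENTENCES[0] if tok.head == 0]
```

To read an actual file, pass an open handle instead, for example `open(path, encoding="utf-8")`, where `path` is wherever the data has been unpacked.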
*

(2) GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 was developed by LDC. Along with other corpora, the parallel text in this release comprised machine translation training data for Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Modern Standard Arabic source text and corresponding English translations selected from broadcast conversation (BC) data collected by LDC between 2004 and 2007 and transcribed by LDC or under its direction. 

GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 includes 36 source-translation document pairs, comprising 169,109 words of Arabic source text and its English translation. Data is drawn from thirteen distinct Arabic programs broadcast between 2004 and 2007 from the following sources: Al Alam News Channel, Aljazeera, Dubai TV, Oman TV, and Radio Sawa. Broadcast conversation programming is generally more interactive than traditional news broadcasts and includes talk shows, interviews, call-in programs and roundtable discussions. The programs in this release focus on current events topics. 

The files in this release were transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with Quick Rich Transcription guidelines developed by LDC. Transcribers indicated sentence boundaries in addition to transcribing the text. Data was manually selected for translation according to several criteria, including linguistic features, transcription features and topic features. The transcribed and segmented files were then reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Arabic to English translation guidelines which are included with this release. Bilingual LDC staff performed quality control procedures on the completed translations.

Source data and translations are distributed in TDF format. All data are encoded in UTF-8. GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 is distributed via web download. 2012 Subscription Members will automatically receive one copy of this data on disc. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$1750.
*

(3) Turkish Broadcast News Speech and Transcripts was developed by Boğaziçi University, Istanbul, Turkey and contains approximately 130 hours of Voice of America (VOA) Turkish radio broadcasts and corresponding transcripts. This is part of a larger corpus of Turkish broadcast news data collected and transcribed with the goal of facilitating research in Turkish automatic speech recognition and its applications, such as speech retrieval.

The VOA material was collected between December 2006 and June 2009 using a PC and TV/radio card setup. The data collected during the period 2006-2008 was recorded from analog FM radio; the 2009 broadcasts were recorded from digital satellite transmissions. A quick manual segmentation and transcription approach was followed.

The data was recorded at 32 kHz and re-sampled at 16 kHz. After screening for recording quality, the files were segmented, transcribed, and verified. The segmentation occurred in two steps, an initial automatic segmentation followed by manual correction and annotation which included information such as background conditions and speaker boundaries. 
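As a rough picture of what going from 32 kHz to 16 kHz involves, the sketch below low-pass filters a signal and keeps every second sample. It is not the procedure used for this corpus, which is not described here; the filter design (a 31-tap Hamming-windowed sinc) and all function names are assumptions chosen to keep the example dependency-free.

```python
import math


def lowpass_taps(num_taps=31, cutoff=0.25):
    """Hamming-windowed sinc low-pass; cutoff is a fraction of the input rate.

    A cutoff of 0.25 corresponds to 8 kHz for 32 kHz input, the Nyquist
    limit of the 16 kHz output.
    """
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            sinc = 2 * cutoff
        else:
            sinc = math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]  # unity gain at DC


def downsample_by_two(samples, taps):
    """Filter, then keep every second sample (zero padding at the edges)."""
    half = len(taps) // 2
    out = []
    for center in range(0, len(samples), 2):
        acc = 0.0
        for k, tap in enumerate(taps):
            idx = center + k - half
            if 0 <= idx < len(samples):
                acc += tap * samples[idx]
        out.append(acc)
    return out


TAPS = lowpass_taps()
# One second of a 440 Hz tone at 32 kHz becomes one second at 16 kHz.
TONE_32K = [math.sin(2 * math.pi * 440 * i / 32000) for i in range(32000)]
TONE_16K = downsample_by_two(TONE_32K, TAPS)
```

The filtering step matters because content above 8 kHz in the 32 kHz recording would otherwise alias into the 16 kHz output.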

The transcription guidelines were adapted from the LDC HUB4 and quick transcription guidelines. An English version of the adapted guidelines is provided with the data. Manual segmentation and transcripts were created by native Turkish speakers at Boğaziçi University using Transcriber. The transcriptions are provided in the ISO-8859-9 (Latin5) character set.
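Because the transcripts use ISO-8859-9 rather than UTF-8, they need to be decoded with the right codec before Turkish letters such as ğ, ş and ı come out correctly. The short sketch below uses Python's built-in `iso8859_9` codec on a hand-built byte string; the helper names are hypothetical and the sample bytes are not from the corpus.

```python
# "Boğaziçi Üniversitesi" encoded in ISO-8859-9 (Latin5): 0xF0 is ğ, 0xE7 is ç.
RAW = b"Bo\xf0azi\xe7i \xdcniversitesi"


def latin5_to_text(raw: bytes) -> str:
    """Decode ISO-8859-9 bytes into a Python string."""
    return raw.decode("iso8859_9")


def latin5_to_utf8(raw: bytes) -> bytes:
    """Re-encode ISO-8859-9 bytes as UTF-8, e.g. before writing to disk."""
    return latin5_to_text(raw).encode("utf-8")


DECODED = latin5_to_text(RAW)
# Decoding the same bytes as Latin-1 silently yields the wrong letter (ð for ğ).
MISREAD = RAW.decode("latin-1")
```

When reading files directly, the same effect is obtained by passing `encoding="iso8859_9"` to `open`.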

Turkish Broadcast News Speech and Transcripts is distributed on four DVDs. 2012 Subscription Members will automatically receive one copy of this data. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$2000.

Friday, April 20, 2012

LDC April 2012 Newsletter

 



LDC Timeline – Two Decades of Milestones
April 15 marks the “official” 20th anniversary of LDC’s founding. We’ll be featuring highlights from the last two decades in upcoming newsletters, on the web and elsewhere.  For a start, here’s a brief timeline of significant milestones.
  • 1992: The University of Pennsylvania is chosen as the host site for LDC in response to a call for proposals issued by DARPA; the mission of the new consortium is to operate as a specialized data publisher and archive guaranteeing widespread, long-term availability of language resources. DARPA provides seed money with the stipulation that LDC become self-sustaining within five years. Mark Liberman assumes duties as LDC’s Director with a staff that grows to four, including Jack Godfrey, the Consortium’s first Executive Director.
  • 1993: LDC’s catalog debuts. Early releases include benchmark data sets such as TIMIT, TIPSTER, CSR and Switchboard, shortly followed by the Penn Treebank. 
  • 1994: LDC and NIST (the National Institute of Standards and Technology) enter into a Cooperative R&D Agreement that provides the framework for the continued collaboration between the two organizations.
  • 1995: Collection of conversational telephone speech and broadcast programming and transcription commences. LDC begins its long and continued support for NIST common task evaluations by providing custom data sets for participants. Membership and data license fees prove sufficient to support LDC operations, satisfying the requirement that the Consortium be self-sustaining.
  • 1997: LDC announces LDC Online, a searchable index of newswire and speech data with associated tools to compute n-gram models, mutual information and other analyses.
  • 1998: LDC adds annotation to its task portfolio. Christopher Cieri joins LDC as Executive Director and develops the annotation operation.
  • 1999: Steven Bird joins LDC; the organization begins to develop tools and best practices for general use. The Annotation Graph Toolkit results from this effort.
  • 2000: LDC expands its support of common task evaluations from providing corpora to coordinating language resources across the program. Early examples include the DARPA TIDES, EARS and GALE programs.
  • 2001: The Arabic treebank project begins.
  • 2002: LDC moves to its current facilities at 3600 Market Street, Philadelphia with a full-time staff of approximately 40 persons.
  • 2004: LDC introduces the Standard and Subscription membership options, allowing members to choose whether to receive all or a subset of the data sets released in a membership year.
  • 2005: LDC makes task specifications and guidelines available through its projects web pages.
  • 2008: LDC introduces programs that provide discounts for continuing members and those who renew early in the year.
  • 2010: LDC inaugurates the Data Scholarship program for students with a demonstrable need for data.
  • 2012: LDC’s full-time staff of 50 and 196 part-time staff support ongoing projects and operations which include collecting, developing and archiving data, data annotation, tool development, sponsored-project support and multiple collaborations with various partners. The general catalog contains over 500 holdings in more than 50 languages. Over 85,000 copies of more than 1300 titles have been distributed to 3200 organizations in 70 countries. 

New Publications

(1) 2005 NIST/USF Evaluation Resources for the VACE Program - Broadcast News was developed by researchers at the Department of Computer Science and Engineering, University of South Florida (USF), Tampa, Florida and the Multimodal Information Group at the National Institute of Standards and Technology (NIST). It contains approximately 60 hours of English broadcast news video data collected by LDC in 1998 and annotated for the 2005 VACE (Video Analysis and Content Extraction) tasks. The tasks covered by the broadcast news domain were human face (FDT) tracking, text strings (TDT) (glyphs rendered within the video image for the text object detection and tracking task) and word level text strings (TDT_Word_Level) (videotext OCR task). 

The VACE program was established to develop novel algorithms for automatic video content extraction, multi-modal fusion, and event understanding. During VACE Phases I and II, the program made significant progress in the automated detection and tracking of moving objects including faces, hands, people, vehicles and text in four primary video domains: broadcast news, meetings, street surveillance, and unmanned aerial vehicle motion imagery. Initial results were also obtained on automatic analysis of human activities and understanding of video sequences. 

Three performance evaluations were conducted under the auspices of the VACE program between 2004 and 2007. The 2005 evaluation was administered by USF in collaboration with NIST and guided by an advisory forum including the evaluation participants.

The broadcast news recordings were collected by LDC in 1998 from CNN Headline News (CNN-HDL) and ABC World News Tonight (ABC-WNT). CNN-HDL is a 24-hour/day cable-TV broadcast which presents top news stories continuously throughout the day. ABC-WNT is a daily 30-minute news broadcast that typically covers about a dozen different news items. Each daily ABC-WNT broadcast and up to four 30-minute sections of CNN-HDL were recorded each day. The CNN segments were drawn from that portion of the daily schedule that happened to include closed captioning.

2005 NIST/USF Evaluation Resources for the VACE Program - Broadcast News is distributed on one hard drive. 2012 Subscription Members will automatically receive one copy of this data. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$6000.
*

(2) 2009 CoNLL Shared Task Part 1 contains the Catalan, Czech, German and Spanish trial corpora, training corpora, development and test data for the 2009 CoNLL (Conference on Computational Natural Language Learning) Shared Task Evaluation. The 2009 Shared Task developed syntactic dependency annotations, including semantic dependencies that model the roles of both verbal and nominal predicates.

The Conference on Computational Natural Language Learning (CoNLL) is accompanied every year by a shared task intended to promote natural language processing applications and evaluate them in a standard setting. In 2008, the shared task focused on English and employed a unified dependency-based formalism and merged the task of syntactic dependency parsing and the task of identifying semantic arguments and labeling them with semantic roles; that data has been released by LDC as 2008 CoNLL Shared Task Data (LDC2009T12). The 2009 task extended the 2008 task to several languages (English plus Catalan, Chinese, Czech, German, Japanese and Spanish). Among the new features were comparison of time and space complexity based on participants' input, and learning curve comparison for languages with large datasets.

The 2009 shared task was divided into two subtasks:

(1) parsing syntactic dependencies

(2) identification of arguments and assignment of semantic roles for each predicate

The materials in this release consist of excerpts from the following corpora:
  • AnCora (Spanish + Catalan): 500,000 words each of annotated news text developed by the University of Barcelona, Polytechnic University of Catalonia, the University of Alicante and the University of the Basque Country
  • Prague Dependency Treebank 2.0 (Czech): approximately 2 million words of annotated news, journal and magazine text developed by Charles University; also available through LDC, LDC2006T01
  • TIGER Treebank + SALSA Corpus (German): approximately 900,000 words of annotated news text and FrameNet annotation developed by the University of Potsdam, Saarland University and the University of Stuttgart

2009 CoNLL Shared Task Part 1 is distributed on one DVD. 2012 Subscription Members will automatically receive two copies of this data. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$200.

*

(3) 2009 CoNLL Shared Task Part 2 contains the Chinese and English trial corpora, training corpora, development and test data for the 2009 CoNLL (Conference on Computational Natural Language Learning) Shared Task Evaluation. The 2009 Shared Task developed syntactic dependency annotations, including semantic dependencies that model the roles of both verbal and nominal predicates.

The materials in this release consist of excerpts from the following corpora:
  • Penn Treebank II (LDC95T7) (English): over one million words of annotated English newswire and other text developed by the University of Pennsylvania
  • PropBank (LDC2004T14) (English): semantic annotation of newswire text from Treebank-2 developed by the University of Pennsylvania
  • NomBank (LDC2008T23) (English): argument structure for instances of common nouns in Treebank-2 and Treebank-3 (LDC99T42) texts developed by New York University
  • Chinese Treebank 6.0 (LDC2007T36) (Chinese): 780,000 words (over 1.28 million characters) of annotated Chinese newswire, magazine and administrative texts and transcripts from various broadcast news programs developed by the University of Pennsylvania and the University of Colorado
  • Chinese Proposition Bank 2.0 (LDC2008T07) (Chinese): predicate-argument annotation on 500,000 words from Chinese Treebank 6.0 developed by the University of Pennsylvania and the University of Colorado

2009 CoNLL Shared Task Part 2 is distributed on one CD. 2012 Subscription Members will automatically receive two copies of this data. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$850.

*
(4) USC-SFI MALACH Interviews and Transcripts English was developed by The University of Southern California's Shoah Foundation Institute (USC-SFI), the University of Maryland, IBM and Johns Hopkins University as part of the MALACH (Multilingual Access to Large Spoken ArCHives) Project. It contains approximately 375 hours of interviews from 784 interviewees along with transcripts and other documentation.

Inspired by his experience making Schindler's List, Steven Spielberg established the Survivors of the Shoah Visual History Foundation in 1994 to gather video testimonies from survivors and other witnesses of the Holocaust. While most of those who gave testimony were Jewish survivors, the Foundation also interviewed homosexual survivors, Jehovah's Witness survivors, liberators and liberation witnesses, political prisoners, rescuers and aid providers, Roma and Sinti (Gypsy) survivors, survivors of eugenics policies, and war crimes trials participants.  In 2006, the Foundation became part of the Dana and David Dornsife College of Letters, Arts and Sciences at the University of Southern California in Los Angeles and was renamed as the USC Shoah Foundation Institute for Visual History and Education. 

The goal of the MALACH project was to develop methods for improved access to large multinational spoken archives; the focus was advancing the state of the art of automatic speech recognition (ASR) and information retrieval. The characteristics of the USC-SFI collection -- unconstrained, natural speech filled with disfluencies, heavy accents, age-related co-articulations, un-cued speaker and language switching and emotional speech -- were considered well-suited for that task. The work centered on five languages: English, Czech, Russian, Polish and Slovak. USC-SFI MALACH Interviews and Transcripts English was developed for the English speech recognition experiments. 

The speech data in this release was collected beginning in 1994 under a wide variety of conditions ranging from quiet to noisy (e.g., airplane over-flights, wind noise, background conversations and highway noise). Approximately 25,000 of the interviews collected by USC-SFI are in English and average approximately 2.5 hours each. The 784 interviews included in this release are each a 30-minute section of the corresponding larger interview. The interviews include accented speech over a wide range (e.g., Hungarian, Italian, Yiddish, German and Polish).

This release includes transcripts of the first 15 minutes of each interview. The transcripts were created using Transcriber 1.5.1 and later modified.

USC-SFI MALACH Interviews and Transcripts English is distributed on five DVDs. 2012 Subscription Members will automatically receive two copies of this data provided that they have submitted a completed copy of the User License Agreement for USC-SFI MALACH Interviews and Transcripts English (LDC2012S05). 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$2000.

Tuesday, March 20, 2012

LDC March 2012 Newsletter

2012 LDC Survey Responses and Benefit Winner
Thanks to all who participated in the 2012 LDC Survey. Your responses were thoughtful and informative. We’re now analyzing the results; stay tuned for an announcement on the survey findings.
In the meantime, please join us in congratulating Todor Ganchev from the University of Patras, Wire Communications Laboratory (WCL) for winning the survey participation benefit! As a reminder, one $500 benefit was awarded to a blindly selected participant whose response was received by February 7, 2012.
LDC at ICASSP 2012
LDC will be traveling across the globe to exhibit at its first IEEE-hosted event. The 37th International Conference on Acoustics, Speech, and Signal Processing (ICASSP) will be held at the Kyoto International Conference Center in Kyoto, Japan, on March 25 - 30, 2012.
The ICASSP meeting is the world’s largest and most comprehensive technical conference focused on signal processing and its applications, and LDC is looking forward to interacting with members of this community. Please look for LDC’s exhibition at Booth #14 in the Annex Hall. We hope to see you there!
New Publications
(1) English Translation Treebank: An Nahar Newswire was developed by LDC and consists of 599 distinct newswire stories from the Lebanese publication An Nahar translated from Arabic to English and annotated for part-of-speech and syntactic structure.
This corpus is part of an ongoing effort at LDC to produce parallel Arabic and English treebanks. The guidelines followed for both part-of-speech and syntactic annotation are Penn Treebank II style, with changes in the tokenization of hyphenated words, part-of-speech and tree changes necessitated by those tokenization changes and revisions to the syntactic annotation to comply with the updated annotation guidelines (including the "Treebank-PropBank merge" or "Treebank IIa" and "treebank c" changes). The original Penn Treebank II guidelines, addenda describing changes to the guidelines and the tokenization specifications can be found on LDC's website.
The data consists of 461,489 tokens in 599 individual files. The news stories in this release were published in An Nahar in 2002.
The English source files (translated from the Arabic) were automatically tokenized, part-of-speech tagged and parsed; the tokens, tags and parses were then manually corrected. The quality control process consisted of a series of specific searches for over 100 types of potential inconsistency and parse or annotation error. Any errors found in those searches were manually corrected.
Annotations are in the following two formats:
  • Penn Style Trees
    • Bracketed tree files following the basic form (NODE (TAG token)). Each sentence is surrounded by a pair of empty parentheses.
  • AG xml
    • TreeEditor .xml stand-off annotation files. These files contain the POS and Treebank annotation and reference the source files by character offset. DTD files for the AG xml files were moved from their original location indicated in the readme to be more consistent with LDC publications.
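As a rough illustration of the Penn-style bracketed format described above, the following Python sketch reads one tree of the form (NODE (TAG token)) wrapped in the empty outer parentheses and lists its (tag, token) pairs. The sample sentence and the function names parse_tree and leaves are invented for this example and are not taken from the corpus or from any LDC tool; the sketch also assumes that literal parentheses in the text are escaped, as is usual in Penn Treebank files.

```python
def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        # The outer wrapper has no label: its first token is another "(".
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return node()


def leaves(tree):
    """Return (tag, token) pairs in sentence order."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(label, children[0])]
    pairs = []
    for child in children:
        if not isinstance(child, str):
            pairs.extend(leaves(child))
    return pairs


# Hypothetical sentence, not from the corpus.
sample = "( (S (NP (DT The) (NN story)) (VP (VBD appeared))) )"
tree = parse_tree(sample)
print(leaves(tree))
```

The unlabeled outer node comes back with an empty label, so a caller can tell the sentence wrapper apart from real constituents.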
English Translation Treebank: An Nahar Newswire is distributed via web download. 2012 Subscription Members will automatically receive two copies of this corpus on disc. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$4500.
*
(2) Malto Speech and Transcripts was developed by Masato Kobayashi, Associate Professor in Linguistics at the University of Tokyo (Japan), and Bablu Tirkey, research scholar at the Tribal and Regional Languages Department, Ranchi University (India). It contains approximately 8 hours of Malto speech data collected between 2005 and 2009 from 27 speakers (22 males, 5 females). Also included are accompanying transcripts, English translations and glosses for 6 hours of the collection. Speakers were asked to talk about themselves, their lives, rituals and folklore; elicitation interviews were then conducted. The goal of the work was to present the current state and dialectal variation of Malto.
Malto is a Dravidian language spoken in northeastern India (principally the states of Bihar, Jharkhand and West Bengal) and Bangladesh by people called the Pahariyas. Indian census data places the number of Malto speakers between 100,000 and 200,000. Most Malto speakers live in the three northeastern districts of Jharkhand, i.e., Sahebganj, Godda and Pakur; the fieldwork that resulted in this corpus was conducted in those districts. Of the Pahariyas in that area, three subtribes, the Sawriya Pahariyas, the Mal Pahariyas and the Kumarbhag Pahariyas, primarily speak Malto.
The transcribed data accounts for 6 hours of the collection and contains 21 speakers (17 male, 4 female). The untranscribed data accounts for 2 hours of the collection and contains 10 speakers (9 male, 1 female). Four of the male speakers are present in both groups.
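The speaker counts above can be reconciled with the overall figure of 27 speakers (22 male, 5 female) by subtracting the speakers who appear in both groups. The short Python check below does that arithmetic; it assumes that no female speaker appears in both groups, which the description implies but does not state.

```python
# Counts as given in the corpus description.
transcribed = {"male": 17, "female": 4}
untranscribed = {"male": 9, "female": 1}
# Assumption: only the four male speakers overlap; no female overlap is reported.
in_both = {"male": 4, "female": 0}

# Distinct speakers = transcribed + untranscribed - counted twice.
distinct = {
    sex: transcribed[sex] + untranscribed[sex] - in_both[sex]
    for sex in ("male", "female")
}
total = sum(distinct.values())
print(distinct, total)
```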
All audio is presented in .wav format. Each audio file name includes a subject number, village name, speaker name and the topic discussed. The transcripts and glossary are UTF-8 text files. Because of ambiguities that occur when writing Malto in Devanagari script, the transcripts were developed using Roman script with symbols adapted from the International Phonetic Alphabet (IPA) but are not considered phonetic transcripts.
Malto Speech and Transcripts is distributed on 1 DVD. 2012 Subscription Members will automatically receive two copies of this corpus. 2012 Standard Members may request a copy as part of their 16 free membership corpora. The first 100 copies distributed to non-member organizations are available at no charge. Shipping and handling fees apply.

Wednesday, February 15, 2012

LDC February 2012 Newsletter

Spring 2012 LDC Data Scholarship Recipients!

Membership Fee Savings and Publications Pipeline for MY2012

New publications:

LDC2012S03 - Digital Archive of Southern Speech (DASS)

LDC2012T01 - ModeS TimeBank 1.0

Spring 2012 LDC Data Scholarship Recipients!

LDC is pleased to announce the student recipients of the Spring 2012 LDC Data Scholarship program! This program provides university students with access to LDC data at no cost. Students were asked to complete an application consisting of a proposal describing their intended use of the data and a letter of support from their thesis adviser. We received many solid applications and have chosen six proposals to support. The following students will receive no-cost copies of LDC data:

Zainab Ali Khalaf – University of Science, Malaysia (Malaysia), graduate student, Computer Science. Zainab has been awarded a copy of 1996 English Broadcast News Transcripts (HUB4) (LDC97T22) for her work in spoken document retrieval.

Daniel Jettka – Trinity College Dublin (Ireland), graduate student, Centre for Language & Communication Studies. Daniel has been awarded copies of Penn Discourse Treebank Version 2.0 (LDC2008T05) and RST Discourse Treebank (LDC2002T07) for his work in anaphora resolution.

Olga Nickolaevna Ladoshko – National Technical University of Ukraine “KPI” (Ukraine), graduate student, Acoustics and Acoustoelectronics. Olga has been awarded copies of NTIMIT (LDC93S2) and STC-TIMIT 1.0 (LDC2008S03) for her research in automatic speech recognition for Ukrainian.

Ming Yang, Xiaoxiao Ma, and Jiajia Huang – Wuhan University (China), graduate students, Computer Science. Ming, Xiaoxiao, and Jiajia have been awarded copies of ACE Time Normalization (TERN) 2004 English Training Data v 1.0 (LDC2005T07) and GALE Phase 1 Chinese Broadcast News Parallel Text – Part 1 (LDC2007T23) for their work in summarization and data mining.

Daria Vazhenina – University of Aizu (Japan), graduate student, Human Interface Lab. Daria has been awarded a copy of 2005 Spring NIST Rich Transcription (RT-05S) Evaluation Set (LDC2011S06) for her work in speaker diarization.

Tanina Zappone – University of Rome “La Sapienza” (Italy), graduate student, Oriental Studies. Tanina has been awarded a copy of Chinese Treebank 7.0 (LDC2010T07) for her work on China’s political communications.

Please join us in congratulating our student recipients! The next LDC Data Scholarship program is scheduled for the Fall 2012 semester.

Membership Fee Savings and Publications Pipeline for MY2012

Time is quickly running out to save on membership fees for MY2012! Any organization which joins or renews membership for 2012 through Thursday, March 1, 2012, is entitled to a 5% discount on membership fees. Organizations which held membership for MY2011 can receive a 10% discount on fees provided they renew prior to March 1, 2012.

Many publications for MY2012 are still in development. The planned publications for the upcoming months include:

ARRAU (Anaphor Resolution and Underspecification) ~ data annotated for anaphoric relations, with information about agreement and explicit representation of multiple antecedents for ambiguous anaphoric expressions and discourse antecedents for expressions which refer to abstract entities such as events, actions and plans. The corpus contains texts from various genres: task-oriented dialogues from the TRAINS project, narratives from the English Pear Stories, and newspaper articles from the Wall Street Journal portion of the Penn Treebank.

MALACH English ~ over 300 hours of English audio recordings of interviews conducted under the auspices of the USC Shoah Foundation Institute for Visual History and Education and associated transcripts produced as part of the Multilingual Access to Large Spoken ArCHives (MALACH) project.

Malto Speech and Transcripts ~ speech files of Malto narratives recorded by Masato Kobayashi and Bablu Tirkey with associated transcripts. Malto is a Dravidian language spoken in northeastern India and Bangladesh.

NIST/USF Evaluation Resources for the VACE Program – Broadcast News ~ English broadcast news video annotated for the VACE (Video Analysis and Content Extraction) 2005 face, text and text word detection and tracking tasks.

OntoNotes 5.0 ~ multiple genres of English, Chinese, and Arabic text annotated for syntax, predicate argument structure and shallow semantics.

2012 Subscription Members are automatically sent all MY2012 data as it is released. 2012 Standard Members are entitled to request 16 corpora for free from MY2012. Non-members may license most data for research use.

New publications

(1) Digital Archive of Southern Speech (DASS) was developed by the University of Georgia. It is a subset of the Linguistic Atlas of the Gulf States (LAGS), which is in turn part of the Linguistic Atlas Project (LAP). DASS contains approximately 370 hours of English speech data from 30 female speakers and 34 male speakers in .wav format and in .mp3 format, along with associated metadata about the speakers and the recordings and maps in .jpeg format relating to the recording locations.

LAP consists of a set of survey research projects about the words and pronunciation of everyday American English, the largest project of its kind in the United States. Interviews with thousands of native speakers across the country have been carried out since 1929. LAGS surveyed the everyday speech of Georgia, Tennessee, Florida, Alabama, Mississippi, Arkansas, Louisiana, and Texas in a series of 914 audio-taped interviews conducted from 1968 to 1983. Interviews average approximately six hours in length; the systematic LAGS tape archive amounts to 5500 hours of sound recordings. DASS is a collection of 64 interviews from LAGS selected to cover a range of speech across the region and to represent multiple education levels and ethnic backgrounds.

Also included in this release is a version of the LICHEN software developed at the University of Oulu, Finland. LICHEN allows users to browse and search through the audio data in a more advanced fashion using a graphical interface.

Digital Archive of Southern Speech (DASS) is distributed on one hard disc drive. 2012 Subscription Not-for-Profit/US Government Members will automatically receive one copy of this data. 2012 For-Profit Members will receive a copy provided that they have submitted a completed copy of the User License Agreement for Digital Archive of Southern Speech (LDC2012S03). 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$250.

*

(2) ModeS TimeBank 1.0 was developed by researchers at Technical University of Madrid and Barcelona Media and is a corpus of Modern Spanish (17th and 18th centuries) annotated with temporal and event information according to TimeML mark-up and annotated with spatial information following the SpatialML scheme.

TimeML (Pustejovsky et al., 2005) is a specification language for annotating eventualities and time expressions in natural language as well as the temporal relations among them, thus facilitating the task of extraction, representation and exchange of temporal information. SpatialML (Mani et al., 2008) is a specification language for annotating and normalizing spatial expressions by means of geographic coordinates.
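To give a feel for this kind of inline mark-up, here is a minimal Python sketch that reads a small TimeML-style fragment and pulls out the event and time expressions. EVENT and TIMEX3 are TimeML tag names, but the sentence, the attribute values and the enclosing s element are invented for illustration; they are not drawn from ModeS TimeBank, whose text is Spanish and whose exact file layout may differ. TimeML link tags for temporal relations and SpatialML tags are omitted to keep the sketch short.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment; not taken from the corpus.
fragment = (
    '<s>The ship <EVENT eid="e1" class="OCCURRENCE">sailed</EVENT> in '
    '<TIMEX3 tid="t1" type="DATE" value="1768-12">December 1768</TIMEX3>.</s>'
)

root = ET.fromstring(fragment)

# Collect (id, class, text) for events and (id, normalized value, text) for times.
events = [(e.get("eid"), e.get("class"), e.text) for e in root.iter("EVENT")]
times = [(t.get("tid"), t.get("value"), t.text) for t in root.iter("TIMEX3")]

print(events)
print(times)
```

The value attribute carries the normalized form of the time expression, which is what makes dates written in different ways comparable across documents.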

ModeS TimeBank 1.0 contains 102 documents reporting a sea-crossing cruise by a ship called La Princesa, which took place from December 1768 to April 1769. There exist copious logbooks from that period that not only provide information about shipping routes, but also contain valuable data concerning information flows, commercial agents and social networks.

All text is encoded in UTF-8. The data in ModeS TimeBank 1.0 has been tokenized, POS-tagged, and annotated with space, time and event information according to the TimeML and SpatialML specification schemes.

ModeS TimeBank 1.0 is distributed via web download. 2012 Subscription Members will automatically receive two copies of this corpus on disc. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may request this data by completing a copy of the LDC User Agreement for Non-Members. The agreement can be faxed to +1 215 573 2175 or scanned and emailed to this address. This data is available at no charge.

Friday, January 20, 2012

LDC January 2012 Newsletter

LDC Celebrates its 20th Anniversary!
2012 marks LDC’s 20th Anniversary year – officially on April 15 – but this is cause for a yearlong celebration! From our founding in 1992 as a data repository and language resource distribution center, our online catalog has grown to include over 500 databases in 60 languages that have been licensed by over 3000 organizations from 80 different nations. This data has been made available through donations, funded projects at LDC or elsewhere, community initiatives, and from LDC resources, an indication of the collective strength of this consortium. LDC has evolved from an organization that shares language resources to one that also is at the forefront of language technology research that includes the development of new data resources, software tools, and standards and best practices.
As we celebrate throughout the year, look for announcements and special features in our newsletter and on our Facebook page.
2012 LDC Survey – Be on the Lookout!
It’s been four years since our last survey of LDC members and data licensees and we would like to again ask you to share your views on LDC and its language resources as well as your thoughts about data distribution in general and the impact of social media on language-related research and technology development. These topics are particularly timely as LDC enters its 20th anniversary year.
The 2012 LDC Survey will be sent to every person and organization that licensed LDC data and/or joined LDC as a Member during the period from 2009 through 2011. Those who complete the survey on or before February 7, 2012 will make their organization eligible for a $500 benefit to be applied to any corpus or membership purchase in 2012. LDC will conduct a blind drawing and one lucky winner will be chosen from the pool of respondents.
Many thanks for your continued support and for your participation in the 2012 Survey!
Membership Discounts for MY 2012 Still Available
If you are considering joining for Membership Year 2012 (MY2012), there is still time to save on membership fees. Any organization which joins or renews membership for 2012 through Thursday, March 1, 2012, is entitled to a 5% discount on membership fees. Organizations that held membership for MY2011 can receive a 10% discount on fees provided they renew prior to March 1, 2012. For further information on pricing, please consult our Announcements page or contact LDC.
New Publications
(1) 2006 NIST Speaker Recognition Evaluation Test Set Part 2 was developed by LDC and the National Institute of Standards and Technology (NIST). It contains 568 hours of conversational telephone and microphone speech in English, Arabic, Bengali, Chinese, Farsi, Hindi, Korean, Russian, Spanish, Thai and Urdu and associated English transcripts used as test data in the NIST-sponsored 2006 Speaker Recognition Evaluation (SRE).
The task of the 2006 SRE was speaker detection, that is, to determine whether a specified speaker is speaking during a given segment of conversational telephone speech. The task was divided into 15 distinct tests, each involving one of five training conditions and one of four test conditions. Further information about the test conditions and additional documentation is available at the NIST web site for the 2006 SRE and within the 2006 SRE Evaluation Plan.
The speech data in this release was collected by LDC as part of the Mixer project, in particular Mixer Phases 1, 2 and 3. The Mixer project supports the development of robust speaker recognition technology by providing carefully collected and audited speech from a large pool of speakers recorded simultaneously across numerous microphones and in different communicative situations and/or in multiple languages. The data is mostly English speech, but includes some speech in Arabic, Bengali, Chinese, Farsi, Hindi, Korean, Russian, Spanish, Thai and Urdu.
The telephone speech segments are multi-channel data collected simultaneously from a number of auxiliary microphones. The files are organized into four types: two-channel excerpts of approximately 10 seconds, two-channel conversations of approximately 5 minutes, summed-channel conversations also of approximately 5 minutes and a two-channel conversation with the usual telephone speech replaced by auxiliary microphone data in the putative target speaker channel. The auxiliary microphone conversations are also of approximately five minutes in length. English language transcripts in .ctm format were produced using an automatic speech recognition (ASR) system.
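For readers unfamiliar with .ctm files, the Python sketch below parses time-marked words under the assumption that the transcripts follow the usual NIST CTM layout: recording identifier, channel, start time in seconds, duration in seconds, word, and an optional confidence score, with lines beginning ";;" treated as comments. The sample lines, the CtmWord type and the parse_ctm function are hypothetical; the corpus documentation should be checked for the exact fields used in this release.

```python
from typing import List, NamedTuple, Optional


class CtmWord(NamedTuple):
    recording: str
    channel: str
    start: float      # seconds from the start of the recording
    duration: float   # seconds
    word: str
    confidence: Optional[float]


def parse_ctm(text: str) -> List[CtmWord]:
    """Parse CTM-style lines, skipping blank lines and ';;' comments."""
    words = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";;"):
            continue
        fields = line.split()
        # The confidence score is optional in the assumed layout.
        conf = float(fields[5]) if len(fields) > 5 else None
        words.append(
            CtmWord(fields[0], fields[1], float(fields[2]),
                    float(fields[3]), fields[4], conf)
        )
    return words


# Hypothetical excerpt, not taken from the corpus.
sample = """\
;; recording channel start duration word confidence
xaaa A 0.50 0.32 hello 0.98
xaaa A 0.82 0.41 there
"""
words = parse_ctm(sample)
for w in words:
    print(w.word, round(w.start + w.duration, 2))
```

Because these transcripts were produced by an ASR system, the word timings and any confidence values are system output, not hand-checked alignments.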
2006 NIST Speaker Recognition Evaluation Test Set Part 2 is distributed on seven DVDs. 2012 Subscription Members will automatically receive two copies of this corpus. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$2000.
*
(2) TORGO Database of Dysarthric Articulation was developed by the University of Toronto's departments of Computer Science and Speech Language Pathology in collaboration with the Holland-Bloorview Kids Rehabilitation Hospital in Toronto, Canada. It contains approximately 23 hours of English speech data, accompanying transcripts and documentation from 8 speakers (5 males, 3 females) with cerebral palsy (CP) or amyotrophic lateral sclerosis (ALS) and from 7 speakers (4 males, 3 females) from a non-dysarthric control group.
CP and ALS are among the conditions that cause dysarthria, which results from disruptions in the neuro-motor interface that distort motor commands to the vocal articulators, producing atypical and relatively unintelligible speech in most cases. The TORGO database is primarily a resource for developing advanced automatic speech recognition (ASR) models suited to the needs of people with dysarthria, but it is also applicable to non-dysarthric speech. The inability of modern ASR to effectively understand dysarthric speech is a problem since the more general physical disabilities often associated with the condition can make other forms of computer input, such as computer keyboards or touch screens, difficult to use.
The data consists of aligned acoustics and measured 3D articulatory features recorded from the speakers using the 3D AG500 electro-magnetic articulograph (EMA) system (Carstens Medizinelektronik GmbH, Lenglern, Germany) with fully-automated calibration. This system allows for 3D recordings of articulatory movements inside and outside the vocal tract, thus providing a detailed window on the nature and direction of speech-related activity.
All subjects read text consisting of non-words, short words and restricted sentences from a 19-inch LCD screen. The restricted sentences included 162 sentences from the sentence intelligibility section of Assessment of Intelligibility of Dysarthric Speech (Yorkston & Beukelman, 1981) and 460 sentences derived from the TIMIT database. The unrestricted sentences were elicited by asking participants to spontaneously describe 30 images in interesting situations taken randomly from Webber Photo Cards - Story Starters (Webber, 2005), designed to prompt students to tell or write a story.
Data is organized by speaker and by the session in which each speaker recorded data. Each speaker's directory contains 'Session' directories which encapsulate data recorded in the respective visit and occasionally, a 'Notes' directory which can include Frenchay assessments (test for the measurement, description and diagnosis of dysarthria), notes about sessions (e.g., sensor errors), and other relevant notes.
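A directory layout like the one described above can be surveyed with a few lines of Python. The sketch below assumes only what the description states: one directory per speaker, containing directories whose names begin with "Session" and, occasionally, a "Notes" directory. The speaker and session names in the demonstration are placeholders, not actual corpus paths, and list_sessions is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def list_sessions(corpus_root):
    """Map each speaker directory to its session names and a has-notes flag."""
    layout = {}
    for speaker_dir in sorted(Path(corpus_root).iterdir()):
        if not speaker_dir.is_dir():
            continue
        sessions = sorted(
            d.name for d in speaker_dir.iterdir()
            if d.is_dir() and d.name.startswith("Session")
        )
        has_notes = (speaker_dir / "Notes").is_dir()
        layout[speaker_dir.name] = {"sessions": sessions, "has_notes": has_notes}
    return layout


# Build a throwaway tree with placeholder names to demonstrate the function.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "speaker_a" / "Session1").mkdir(parents=True)
    (Path(tmp) / "speaker_a" / "Notes").mkdir()
    (Path(tmp) / "speaker_b" / "Session1").mkdir(parents=True)
    (Path(tmp) / "speaker_b" / "Session2").mkdir()
    demo = list_sessions(tmp)

print(demo)
```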
TORGO Database of Dysarthric Articulation is distributed on 4 DVDs. 2012 Subscription Members will automatically receive two copies of this corpus. 2012 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$1200.