New publications:
LDC2011V05
- 2006 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 1
LDC2011S07
- 2008 NIST Speaker Recognition Evaluation Training Set Part 2
LDC2011T10
- French Gigaword Third Edition
   
 
Cataloging the communication of Asian Elephants
LDC distributes a broad selection of databases, the majority of which are used for human language research and technology development. Our corpus catalog also includes the vocalizations of other animal species. We'd like to highlight the intriguing work behind one such animal communication corpus, Asian Elephant Vocalizations, LDC2010S05.
Asian Elephant Vocalizations contains audio recordings of vocalizations by Asian Elephants (Elephas maximus) in Uda Walawe National Park, Sri Lanka. The data was collected by Shermin de Silva as part of her doctoral thesis at the University of Pennsylvania. Recordings were made using a Fostex field recorder with a Sennheiser 'shot-gun' microphone. In addition, de Silva utilized a second dictation microphone that allows observers to narrate what's happening without talking over the elephant recording. The digital files were then downloaded and visualized using the Praat TextGrid Editor, a tool originally developed for studying human speech which has since been adopted by elephant researchers. With Praat, trained annotators are able to characterize call types and extract particular segments for later analysis.
Until recently, the majority of research on the behavior of wild elephants focused on one species - the African savannah elephant. There has been comparatively less study of communication in Asian elephants, primarily because the habitat in which Asian elephants typically live makes them more difficult to study than African forest elephants. Asian and African elephants diverged from one another approximately six million years ago and evolved separately in very distinct environments. de Silva's work has shown that Asian elephants have highly dynamic social lives that are markedly different from those of African elephants. Asian elephants tend to form smaller, fragmented groups on a day-to-day basis but maintain long-term pools of companions over many years. Because communication in elephants appears to be largely socially motivated, differences in social behavior and ecology may also be a source of differences in their vocal behavior and repertoire.
de Silva and her colleagues study elephant communication as an opportunity to understand the evolution of social behavior and communication in a system that is very different from our own primate experience. Human language is only one manifestation of communication in the natural world. Perhaps this is why it is fitting to place animal vocalizations side-by-side with human speech in LDC's catalog. In this way, we can better understand how human language relates to the communicative capabilities of other species.
For further information on Shermin de Silva's current research at the Elephant Forest and Environment Conservation Trust, visit:
Web: http://elephantresearch.net
Blog: http://elephantresearch.net/fieldnotes/
(1) 2006 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 1 was developed by researchers at the Department of Computer Science and Engineering, University of South Florida (USF), Tampa, Florida and the Multimodal Information Group at the National Institute of Standards and Technology (NIST). It contains approximately fifteen hours of meeting room video data collected in 2005 and 2006 and annotated for the VACE (Video Analysis and Content Extraction) 2006 face and person tracking tasks.
The VACE program was established to develop novel algorithms for automatic video content extraction, multi-modal fusion, and event understanding. During VACE Phases I and II, the program made significant progress in the automated detection and tracking of moving objects including faces, hands, people, vehicles and text in four primary video domains: broadcast news, meetings, street surveillance, and unmanned aerial vehicle motion imagery. Initial results were also obtained on automatic analysis of human activities and understanding of video sequences.
Three performance evaluations were conducted under the auspices of the VACE program between 2004 and 2007. In 2006, the VACE program and the European Union's Computers in the Human Interaction Loop (CHIL) collaborated to hold the CLassification of Events, Activities and Relationships (CLEAR) Evaluation. This was an international effort to evaluate systems designed to analyze people, their identities, activities, interactions and relationships in human-human interaction scenarios, as well as related scenarios. The VACE program contributed the evaluation infrastructure (e.g., data, scoring, tools) for a specific set of tasks, and the CHIL consortium, coordinated by the Karlsruhe Institute of Technology, contributed a separate set of evaluation infrastructure.
The meeting room data used for the 2006 test set was collected by the following sites in 2005 and 2006: Carnegie Mellon University (USA), University of Edinburgh (Scotland), IDIAP Research Institute (Switzerland), NIST (USA), Netherlands Organization for Applied Scientific Research (Netherlands) and Virginia Polytechnic Institute and State University (USA). Each site had its own independent camera setup, illumination, viewpoints, people and topics. Most of the datasets included High-Definition (HD) recordings, but those were subsequently converted to MPEG-2 for the evaluation.
2006 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 1 is distributed on 9 DVD-ROMs. 2011 Subscription Members will automatically receive two copies of this corpus. 2011 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $2500.
     
*
(2) 2008 NIST Speaker Recognition Evaluation Training Set Part 2 was developed by LDC and NIST (National Institute of Standards and Technology). It contains 950 hours of multilingual telephone speech and English interview speech along with transcripts and other materials used as training data in the 2008 NIST Speaker Recognition Evaluation (SRE). SRE is part of an ongoing series of evaluations conducted by NIST. These evaluations are an important contribution to the direction of research efforts and the calibration of technical capabilities. They are intended to be of interest to all researchers working on the general problem of text-independent speaker recognition. To this end, the evaluation is designed to be simple, to focus on core technology issues, to be fully supported, and to be accessible to those wishing to participate.
The 2008 evaluation was distinguished from prior evaluations, in particular those in 2005 and 2006, by including not only conversational telephone speech data but also conversational speech data of comparable duration recorded over a microphone channel involving an interview scenario.
The speech data in this release was collected in 2007 by LDC at its Human Subjects Data Collection Laboratories in Philadelphia and by the International Computer Science Institute (ICSI) at the University of California, Berkeley. This collection was part of the Mixer 5 project, which was designed to support the development of robust speaker recognition technology by providing carefully collected and audited speech from a large pool of speakers recorded simultaneously across numerous microphones and in different communicative situations and/or in multiple languages. Mixer participants were native English speakers and bilingual English speakers. The telephone speech in this corpus is predominantly English; all interview segments are in English. Telephone speech represents approximately 523 hours of the data, and microphone speech represents the other 427 hours.
The telephone speech segments include summed-channel excerpts in the range of 5 minutes from longer original conversations. The interview material includes single channel conversation interview segments of at least 8 minutes from a longer interview session. English language transcripts were produced using an automatic speech recognition (ASR) system.
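As a rough illustration of what "summed-channel" means, the two sides of a telephone conversation can be combined sample by sample into a single channel. The sketch below is not the procedure used to build this corpus: it assumes both sides are already decoded to 16-bit linear PCM samples of equal length, and the function name `sum_channels` is hypothetical.

```python
def sum_channels(side_a, side_b, lo=-32768, hi=32767):
    """Sum two equal-length sample sequences into one channel.

    Assumption: samples are 16-bit linear PCM integers, so each sum is
    clipped to the range [lo, hi] to avoid overflow.
    """
    if len(side_a) != len(side_b):
        raise ValueError("both sides must have the same number of samples")
    return [max(lo, min(hi, a + b)) for a, b in zip(side_a, side_b)]


# Tiny made-up example: the last pair exceeds the 16-bit range and is clipped.
mixed = sum_channels([100, -200, 30000], [50, 25, 10000])
print(mixed)  # [150, -175, 32767]
```

Once the channels are summed, the two speakers can no longer be separated by channel, which is part of what makes summed-channel material harder for speaker recognition systems.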
2008 NIST Speaker Recognition Evaluation Training Set Part 2 is distributed on 7 DVD-ROMs. 2011 Subscription Members will automatically receive two copies of this corpus. 2011 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $2000.
     
*
     
(3) French Gigaword Third Edition is a comprehensive archive of newswire text data that has been acquired over several years by LDC. This third edition updates French Gigaword Second Edition (LDC2009T28) and adds material collected from January 1, 2009 through December 31, 2010.
The two distinct international sources of French newswire in this edition, and the time spans of collection covered for each, are as follows:
- Agence France-Presse (afp_fre) May 1994 - Dec. 2010
- Associated Press French Service (apw_fre) Nov. 1994 - Dec. 2010
All text data are presented in SGML form, using a very simple, minimal markup structure; all text consists of printable ASCII, white space, and printable code points in the "Latin-1 Supplement" character table, as defined by the Unicode Standard (ISO 10646) for the "accented" characters used in French. The Supplement/accented characters are presented in UTF-8 encoding.
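The character inventory described above can be checked with a few lines of Python. This is a minimal sketch, not an LDC tool: the function name `unexpected_chars` is hypothetical, and it assumes that the allowed set is space, tab, newline and carriage return, printable ASCII (U+0021 to U+007E), and the whole Latin-1 Supplement block (U+00A0 to U+00FF).

```python
def unexpected_chars(text):
    """Return the set of characters outside the assumed allowed inventory."""
    allowed_whitespace = {" ", "\t", "\n", "\r"}
    unexpected = set()
    for ch in text:
        cp = ord(ch)
        if ch in allowed_whitespace:
            continue
        if 0x21 <= cp <= 0x7E:  # printable ASCII
            continue
        if 0xA0 <= cp <= 0xFF:  # Latin-1 Supplement block
            continue
        unexpected.add(ch)
    return unexpected


# Accented French letters such as "é" (U+00E9) fall inside the block and
# take two bytes each in UTF-8.
print("é".encode("utf-8"))                         # b'\xc3\xa9'
print(unexpected_chars("Le marché a été calme."))  # set()
# The euro sign (U+20AC) lies outside the block and is flagged.
print(unexpected_chars("prix : 5 €"))              # {'€'}
```

A check like this is mainly useful as a sanity test after decompressing and decoding a file, for example to confirm that it was read as UTF-8 rather than as a single-byte encoding.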
The overall totals for each source are summarized below. Note that the "Totl-MB" numbers show the amount of data when the files are uncompressed (i.e. approximately 5.7 gigabytes, total); the "Gzip-MB" column shows totals for compressed file sizes as stored on the DVD-ROM; the "K-wrds" numbers are simply the number of white space-separated tokens (of all types) after all SGML tags are eliminated.
| Source | #Files | Gzip-MB | Totl-MB | K-wrds | #DOCs | 
| afp_fre | 195 | 1503 | 4255 | 641381 | 2356888 | 
| apw_fre | 194 | 489 | 1446 | 221470 | 801075 | 
| TOTAL | 389 | 1992 | 5701 | 862851 | 3157963 | 
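The token count behind the "K-wrds" column (white space-separated tokens after SGML tags are removed) can be approximated as follows. This is a sketch under stated assumptions rather than LDC's actual counting script: the regular expression assumes every tag is a simple `<...>` span, and the sample document, its id, and the function names are hypothetical.

```python
import gzip
import re

# Assumption: every SGML tag is a simple "<...>" span with no nested ">".
TAG_RE = re.compile(r"<[^>]+>")


def count_tokens(sgml_text):
    """Count whitespace-separated tokens after removing SGML tags."""
    return len(TAG_RE.sub(" ", sgml_text).split())


def count_tokens_in_gzip(path):
    """Count tokens in one gzip-compressed, UTF-8 encoded SGML file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return count_tokens(handle.read())


# Hypothetical sample document, for illustration only.
SAMPLE = """<DOC id="AFP_FRE_19940512.0001" type="story">
<HEADLINE>
Exemple de titre
</HEADLINE>
<TEXT>
<P>
Le marché a été calme.
</P>
</TEXT>
</DOC>
"""

print(count_tokens(SAMPLE))  # 8
```

Summing `count_tokens_in_gzip` over all files for a source and dividing by 1000 would give a figure comparable to the "K-wrds" column, although small differences are possible if the original count treated markup or white space differently.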
French Gigaword Third Edition is distributed on 1 DVD-ROM. 2011 Subscription Members will automatically receive two copies of this corpus. 2011 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $4500.
 