Friday, July 15, 2011

LDC July 2011 Newsletter

LDC Sponsors a Student Group at 2011 International Linguistics Olympiad

LDC Receives META Prize from META-NET

New publications:

LDC2011S04

- 2005 NIST Speaker Recognition Evaluation Test Data -

LDC2011S03

- 2006 NIST Spoken Term Detection Evaluation Set -

LDC2011V04

- NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 2 -



LDC Sponsors a Student Group at 2011 International Linguistics Olympiad

LDC is happy to support the 2011 International Linguistics Olympiad (IOL) by sponsoring a student team. The IOL is one of the twelve International Science Olympiads and is an annual event that brings together students from around the world to compete in linguistically based challenges. This year’s competition takes place July 24-30 at Carnegie Mellon University, Pittsburgh, PA, USA. Students do not need a background in linguistics to participate, since the competition problems are typically solved through analysis and deductive reasoning.

Please visit the 2011 IOL website for additional details. We wish good luck to all of the participants!

LDC Receives META Prize from META-NET

LDC was awarded a ‘2nd META Prize’ from META-NET ‘for outstanding long term commitment to the preparation and distribution of language resources and technologies.’

The META Prize is awarded by META-NET to those who provide outstanding products or services that support the European Multilingual Information Society. META-NET is a Network of Excellence dedicated to fostering the technological foundations of a multilingual European information society. Several organizations were honored at this year’s META Forum in Budapest; LDC and ELRA were both honored for supporting and developing language resources.

New Publications

(1) 2005 NIST Speaker Recognition Evaluation Test Data was developed at LDC and NIST (National Institute of Standards and Technology). It consists of 525 hours of conversational telephone speech in English, Arabic, Mandarin Chinese, Russian and Spanish and associated English transcripts used as test data in the NIST-sponsored 2005 Speaker Recognition Evaluation (SRE). The ongoing series of yearly SRE evaluations conducted by NIST is intended to be of interest to researchers working on the general problem of text-independent speaker recognition. To that end, the evaluations are designed to be simple, to focus on core technology issues, and to be fully supported and accessible.

The task of the 2005 SRE evaluation was speaker detection, that is, to determine whether a specified speaker is speaking during a given segment of conversational speech. The task was divided into 20 distinct tests, each combining one of five training conditions with one of four test conditions. Further information about the task conditions is contained in The NIST Year 2005 Speaker Recognition Evaluation Plan.

The speech data consists of conversational telephone speech with "multi-channel" data collected by LDC simultaneously from a number of auxiliary microphones. The files are organized into two segment types: 10-second two-channel excerpts (continuous segments from single conversations that are estimated to contain approximately 10 seconds of actual speech in the channel of interest) and 5-minute two-channel conversations.

The data are stored as 8-bit u-law speech signals in NIST SPHERE format. In addition to the standard header fields, the SPHERE header for each file contains auxiliary information, including the language of the conversation and whether the data were recorded over a telephone line. English language word transcripts in .cmt format were produced using an automatic speech recognition (ASR) system with error rates in the range of 15-30%.
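As a rough illustration of how the header fields mentioned above might be inspected, the following Python sketch parses the plain-text header at the start of a SPHERE file. It assumes the usual SPHERE layout: a "NIST_1A" line, a line giving the header size in bytes, then "name -type value" lines terminated by "end_head". The function name parse_sphere_header is our own, and the "language" field in the sample is a hypothetical stand-in, since the actual names of the auxiliary fields are given in the corpus documentation, not here.

```python
def parse_sphere_header(raw: bytes) -> dict:
    """Parse the ASCII header at the start of a NIST SPHERE file.

    `raw` must contain at least the full header (commonly 1024 bytes).
    """
    if not raw.startswith(b"NIST_1A\n"):
        raise ValueError("not a NIST SPHERE file")

    # The second line holds the total header size in bytes, space-padded.
    size_line = raw.split(b"\n", 2)[1]
    header_size = int(size_line.strip().decode("ascii"))

    text = raw[:header_size].decode("ascii", errors="replace")
    fields = {}
    for line in text.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):
            continue  # skip blank lines and comments
        name, type_code, value = line.split(None, 2)
        if type_code == "-i":      # integer field
            fields[name] = int(value)
        elif type_code == "-r":    # real-valued field
            fields[name] = float(value)
        else:                      # "-sN": string of N characters
            fields[name] = value
    return fields


# Synthetic header for demonstration only; "language" is a hypothetical
# auxiliary field name, not taken from the corpus documentation.
sample_header = (
    b"NIST_1A\n   1024\n"
    b"sample_rate -i 8000\n"
    b"channel_count -i 2\n"
    b"sample_coding -s4 ulaw\n"
    b"language -s7 English\n"
    b"end_head\n"
).ljust(1024, b" ")

header_fields = parse_sphere_header(sample_header)
```

With a real file, the same function could be applied to the first bytes read from disk, for example open(path, "rb").read(1024), provided the header size line confirms that 1024 bytes is enough.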

2005 NIST Speaker Recognition Evaluation Test Data is distributed on 7 DVD-ROMs. 2011 Subscription Members will automatically receive two copies of this corpus. 2011 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$2000.

*

(2) 2006 NIST Spoken Term Detection Evaluation Set was compiled by researchers at NIST (National Institute of Standards and Technology) and contains approximately eighteen hours of Arabic, Chinese and English broadcast news, English conversational telephone speech and English meeting room speech used in NIST's 2006 Spoken Term Detection (STD) evaluation. The STD initiative is designed to facilitate research and development of technology for retrieving information from archives of speech data. Its goals are to explore promising new ideas in spoken term detection, to develop advanced technology incorporating these ideas, to measure the performance of this technology and to establish a community for the exchange of research results and technical insights.

The 2006 STD task was to find all of the occurrences of a specified "term" (a sequence of one or more words) in a given corpus of speech data. The evaluation was intended to develop technology for rapidly searching very large quantities of audio data. Although the evaluation used modest amounts of data, it was structured to simulate the very large data situation and to make it possible to extrapolate the speed measurements to much larger data sets. Therefore, systems were implemented in two phases: indexing and searching. In the indexing phase, the system processes the speech data without knowledge of the terms. In the searching phase, the system uses the terms, the index, and optionally the audio to detect term occurrences.
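The two-phase structure can be illustrated with a toy Python sketch. It assumes the speech has already been reduced to time-stamped word transcripts (for instance, 1-best ASR output), which is a simplification of our own: the description above does not say how evaluated systems represented the audio, and the names build_index and search are hypothetical. The point is only that indexing never sees the terms, while searching uses the terms plus the index.

```python
from collections import defaultdict


def build_index(transcripts):
    """Indexing phase: process the data without knowledge of the terms.

    `transcripts` maps a file id to a list of (word, start_sec, end_sec).
    """
    postings = defaultdict(list)
    for file_id, words in transcripts.items():
        for pos, (word, _start, _end) in enumerate(words):
            postings[word.lower()].append((file_id, pos))
    # Keep the word sequences inside the index so that searching
    # needs nothing but the index and the term.
    return {"postings": dict(postings), "words": transcripts}


def search(index, term):
    """Searching phase: find occurrences of a term (one or more words).

    Returns a list of (file_id, start_sec, end_sec) tuples.
    """
    tokens = term.lower().split()
    if not tokens:
        return []
    hits = []
    # Candidate positions come from the first word of the term.
    for file_id, pos in index["postings"].get(tokens[0], []):
        window = index["words"][file_id][pos:pos + len(tokens)]
        if [w.lower() for w, _, _ in window] == tokens:
            hits.append((file_id, window[0][1], window[-1][2]))
    return hits


# Tiny made-up transcript for demonstration only.
demo_transcripts = {
    "bn1": [("the", 0.0, 0.2), ("spoken", 0.2, 0.6),
            ("term", 0.6, 0.9), ("the", 0.9, 1.0)],
}
demo_index = build_index(demo_transcripts)
demo_hits = search(demo_index, "spoken term")
```

Because the index is built once and reused for every term, search time depends mainly on the number of candidate positions for the first word, which is the kind of behavior one would want when extrapolating to much larger archives.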

The evaluation corpus consists of three data genres: broadcast news (BNews), conversational telephone speech (CTS) and conference room meetings (CONFMTG). The broadcast news material was collected in 2003 and 2004 by LDC's broadcast collection system from the following sources: ABC (English), Aljazeera (Arabic), China Central TV (Chinese), CNN (English), CNBC (English), Dubai TV (Arabic), New Tang Dynasty TV (Chinese), Public Radio International (English) and Radio Free Asia (Chinese). The CTS data was taken from the Switchboard data sets (e.g., Switchboard-2 Phase 1 LDC98S75, Switchboard-2 Phase 2 LDC99S79) and the Fisher corpora (e.g., Fisher English Training Speech Part 1 LDC2004S13), also collected by LDC. The conference room meeting material consists of goal-oriented, small group round table meetings and was collected in 2004 and 2005 by NIST, the International Computer Science Institute (Berkeley, California), Carnegie Mellon University (Pittsburgh, PA), TNO (The Netherlands) and Virginia Polytechnic Institute and State University (Blacksburg, VA) as part of the AMI corpus project. The evaluation corpus also includes scoring software, which uses the inputs described in the STD evaluation plan to complete the evaluation of a system.

Each BNews recording is a 1-channel, PCM-encoded, 16 kHz, SPHERE-formatted file. CTS recordings are 2-channel, u-law encoded, 8 kHz, SPHERE-formatted files. The CONFMTG files contain a single recorded channel.

2006 NIST Spoken Term Detection Evaluation Set is distributed on 1 DVD-ROM. 2011 Subscription Members will automatically receive two copies of this corpus. 2011 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$800.

*

(3) NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 2 was developed by researchers at the Department of Computer Science and Engineering, University of South Florida (USF), Tampa, Florida and the Multimodal Information Group at the National Institute of Standards and Technology (NIST). It contains approximately thirteen hours of meeting room video data collected in 2001 and 2002 at NIST's Meeting Data Collection Laboratory and used in the VACE (Video Analysis and Content Extraction) 2005 evaluation.

The VACE program was established to develop novel algorithms for automatic video content extraction, multi-modal fusion, and event understanding. During VACE Phases I and II, the program made significant progress in the automated detection and tracking of moving objects including faces, hands, people, vehicles and text in four primary video domains: broadcast news, meetings, street surveillance, and unmanned aerial vehicle motion imagery. Initial results were also obtained on automatic analysis of human activities and understanding of video sequences.

Three performance evaluations were conducted under the auspices of the VACE program between 2004 and 2007. The 2005 evaluation was administered by USF in collaboration with NIST and guided by an advisory forum including the evaluation participants.

LDC has previously released NIST/USF Evaluation Resources for the VACE Program -- Meeting Data Training Set Part 1 LDC2011V01, NIST/USF Evaluation Resources for the VACE Program -- Meeting Data Training Set Part 2 LDC2011V02 and NIST/USF Evaluation Resources for the VACE Program -- Meeting Data Test Set Part 1 LDC2011V03.

NIST's Meeting Data Collection Laboratory is designed to collect corpora to support research, development and evaluation in meeting recognition technologies. It is equipped to look and sound like a conventional meeting space. The data collection facility includes five Sony EVI-D30 video cameras. Four of them have stationary views of a center conference table (one view from each surrounding wall) with a fixed focus and viewing angle, and an additional "floating" camera is used to focus on particular participants, the whiteboard or the conference table, depending on the meeting forum. The data is captured in a NIST-internal file format. The video data was extracted from the NIST format and encoded using the MPEG-2 standard in NTSC format. Further information concerning the video data parameters can be found in the documentation included with this corpus.

NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 2 is distributed on 8 DVD-ROMs. 2011 Subscription Members will automatically receive two copies of this corpus. 2011 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$2500.