Linguistic Data Consortium: April 2011

Membership Mailbag - Commercial licenses and LDC data

New Publications:

- Broadcast News Lattices -

- NIST/USF Evaluation Resources for the VACE Program - Meeting Data Training Set Part 2 -

Membership Mailbag - Commercial licenses and LDC data

LDC's Membership office responds to thousands of emailed queries a year, and, over time, we've noticed that some questions tend to crop up with regularity. To address the questions that you, our data users, have asked, we'd like to continue our periodic Membership Mailbag series of newsletter articles. This month, we'll review how to obtain a commercial license to LDC data.

Our non-member research licenses permit non-commercial linguistic education and research use of data. Not-for-profit members and non-members, including non-member commercial organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. To gain commercial rights to data, an organization must join LDC as a for-profit member. For-profit members gain commercial rights to data from the year joined unless that right is otherwise restricted by a corpus-specific user license. Furthermore, for-profit members can license data for commercial use from closed Membership Years at the Reduced Licensing Fee. If membership is not renewed for the following year, the organization still retains ongoing commercial rights to data licensed as a For-Profit member and any data from the Membership Year. Note that the organization will not have a commercial license to any new data obtained after the Membership Year has ended, unless membership is renewed.

Simply put, organizations that have not signed LDC’s for-profit membership agreement and paid membership fees do not have a commercial license to any LDC data.

In the case of a handful of corpora, such as American National Corpus (ANC) Second Release (LDC2005T35), Buckwalter Arabic Morphological Analyzer Version 2.0 (LDC2004L02), CELEX2 (LDC96L14) and all CSLU corpora, commercial licenses must be obtained separately from the owners of the data even if an organization is a for-profit member. A full list of corpus-specific user licenses can be found on our License Agreements page.

Got a question about LDC data? Forward it to ldc@ldc.upenn.edu. The answer may appear in a future Membership Mailbag article.

New Publications

(1) Broadcast News Lattices was developed by researchers at Microsoft and Johns Hopkins University (JHU) for the Johns Hopkins 2010 Summer Workshop on Speech Recognition with Conditional Random Fields. The lattices were generated using the IBM Attila speech recognition toolkit and were derived from transcripts of approximately 400 hours of English broadcast news recordings. They are intended to be used for training and decoding with Microsoft's Segmental Conditional Random Field (SCRF) toolkit for speech recognition, SCARF.

The goal of the JHU 2010 workshop was to advance the state of the art in core speech recognition by developing new kinds of features for use in a SCRF. The SCRF approach generalizes Conditional Random Fields to operate at the segment level, rather than at the traditional frame level. Every segment is labeled directly with a word. Features are then extracted, each of which measures some form of consistency between the underlying audio and the word hypothesis for a segment. These are combined in a log-linear model to produce the posterior probability of a word sequence given the audio.
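As a rough illustration of the log-linear combination described above, the following Python sketch scores a handful of candidate paths and normalizes them into posteriors. It is not the SCARF implementation: the function name `sequence_posteriors`, the input layout (a list of word sequences paired with per-segment feature vectors) and the feature weights are all assumptions made for this example, and a real system would sum over every path in a lattice rather than a short explicit list.

```python
import math
from collections import defaultdict


def sequence_posteriors(paths, weights):
    """Toy log-linear model over segment-level features (illustrative only).

    paths:   list of (words, segments) pairs, where `segments` holds one
             feature vector per word-labeled segment. Several paths may
             carry the same words with different segmentations.
    weights: one weight per feature (hypothetical values).
    """
    # Score each path as the weighted sum of its segment features.
    path_scores = [
        (tuple(words),
         sum(w * f for feats in segments for w, f in zip(weights, feats)))
        for words, segments in paths
    ]

    # Subtract the best score before exponentiating for numerical stability.
    top = max(score for _, score in path_scores)

    # Paths that share a word sequence pool their probability mass.
    mass = defaultdict(float)
    for words, score in path_scores:
        mass[words] += math.exp(score - top)

    total = sum(mass.values())
    return {words: m / total for words, m in mass.items()}
```

Each path's score is exponentiated and divided by the total over all paths, so the returned values sum to one across the distinct word sequences supplied.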

Broadcast News Lattices consists of training and test material, the source data for which was taken from various corpora distributed by LDC. The training lattices were derived from the following data sets:

1996 English Broadcast News Speech (HUB4) (LDC97S44); 1996 English Broadcast News Transcripts (HUB4) (LDC97T22) (104 hours)
1997 English Broadcast News Speech (HUB4) (LDC98S71); 1997 English Broadcast News Transcripts (HUB4) (LDC98T28) (97 hours)
TDT4 Multilingual Broadcast News Speech Corpus (LDC2005S11); TDT4 Multilingual Text and Annotations (LDC2005T16) (300 hours)

The test lattices are derived from the English broadcast news material in 2003 NIST Rich Transcription Evaluation Data (LDC2007S10).

The lattices were generated from an acoustic model that included LDA+MLLT, VTLN, fMLLR based SAT training, fMMI and mMMI discriminative training, and MLLR. The lattices are annotated with a field indicating the results of a second "confirmatory" decoding made with an independent speech recognizer. When there was a correspondence between a lattice link and the 1-best secondary output, the link was annotated with +1. Silence links are annotated with 0 and all others with -1. Correspondence was computed by finding the midpoint of a lattice link and comparing the link label with that of the word in the secondary decoding at that position. Thus, there are some cases where the same word shifted slightly in time receives a different confirmation score.
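The midpoint rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool used to build the corpus: the tuple layouts, the function name `confirmation_score` and the silence label `<sil>` are hypothetical, and times are assumed to be in seconds.

```python
def confirmation_score(link, secondary, silence_label="<sil>"):
    """Return +1, 0 or -1 for one lattice link (illustrative sketch).

    link:      (start, end, label) for the lattice link.
    secondary: list of (start, end, word) from the 1-best secondary decoding.
    silence_label is a hypothetical marker for silence links.
    """
    start, end, label = link
    if label == silence_label:
        return 0  # silence links are annotated with 0

    # Compare the link label with the secondary word at the link's midpoint.
    midpoint = (start + end) / 2.0
    for w_start, w_end, word in secondary:
        if w_start <= midpoint < w_end:
            return 1 if word == label else -1

    return -1  # no secondary word covers the midpoint
```

With a secondary decoding of `[(0.0, 0.5, "the"), (0.5, 1.0, "news")]`, a link `(0.1, 0.4, "the")` is confirmed, while a link `(0.3, 0.9, "the")` has its midpoint inside "news" and is not. This is the time-shift effect noted above.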

Broadcast News Lattices is distributed via web download.

2011 Subscription Members will automatically receive two copies of this corpus on disc. 2011 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$1000.

(2) NIST/USF Evaluation Resources for the VACE Program - Meeting Data Training Set Part 2 was developed by researchers at the Department of Computer Science and Engineering, University of South Florida (USF), Tampa, Florida and the Multimodal Information Group at the National Institute of Standards and Technology (NIST). It contains approximately fourteen hours of meeting room video data collected in 2001 and 2002 at NIST's Meeting Data Collection Laboratory and annotated for the VACE (Video Analysis and Content Extraction) 2005 face, person and hand detection and tracking tasks.

The VACE program was established to develop novel algorithms for automatic video content extraction, multi-modal fusion, and event understanding. During VACE Phases I and II, the program made significant progress in the automated detection and tracking of moving objects including faces, hands, people, vehicles and text in four primary video domains: broadcast news, meetings, street surveillance, and unmanned aerial vehicle motion imagery. Initial results were also obtained on automatic analysis of human activities and understanding of video sequences. Three performance evaluations were conducted under the auspices of the VACE program between 2004 and 2007. The 2005 evaluation was administered by USF in collaboration with NIST and guided by an advisory forum including the evaluation participants.

NIST's Meeting Data Collection Laboratory is designed to collect corpora to support research, development and evaluation in meeting recognition technologies. It is equipped to look and sound like a conventional meeting space. The data collection facility includes five Sony EVI-D30 video cameras, four of which have stationary views of a center conference table (one view from each surrounding wall) with a fixed focus and viewing angle, and an additional "floating" camera which is used to focus on particular participants, the whiteboard or the conference table depending on the meeting forum. The data is captured in a NIST-internal file format. The video data was extracted from the NIST format and encoded using the MPEG-2 standard in NTSC format. Further information concerning the video data parameters can be found in the documentation included with this corpus.

NIST/USF Evaluation Resources for the VACE Program - Meeting Data Training Set Part 2 is distributed on 8 DVD-ROMs.

2011 Subscription Members will automatically receive two copies of this corpus. 2011 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$2500.

Linguistic Data Consortium

Monday, April 18, 2011
