Friday, May 15, 2015

LDC 2015 May Newsletter

Early renewing members save again

Commercial use and LDC data

New publications:

Early renewing members save again

LDC's early renewal discount program has resulted in substantial savings for current year members. The 110 organizations that renewed their membership or joined early for Membership Year 2015 (MY2015) saved over US$65,000 on membership fees. MY2014 members are still eligible for a 5% discount when renewing through 2015.

LDC membership benefits include free membership year data as well as discounts on older corpora. For-profit members can use most LDC data for commercial applications. 

Commercial use and LDC data

For-profit organizations are reminded that an LDC membership is a pre-requisite for obtaining a commercial license to almost all LDC databases.  Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose.  LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit our Licensing page for further information,

New publications

(1) Coordination Annotation for the Penn Treebank is a stand-off annotation for the Wall Street Journal portion of Treebank-3 (PTB3) (LDC99T42) developed by researchers at the University of Düsseldorf and Indiana University. It marks all tokens that have a coordinating function (potentially among other functions).

Coordination is a syntactic structure that links together two or more elements known as conjuncts or conjoins. The presence of coordination is often signaled by the appearance of a coordinator (coordinating conjunction), such as and, or, but in English.

This annotation is presented in a single UTF-8 plain text tsv file with columns as follows:
section: Penn Treebank WSJ section number
file: Number of file within section
sentence: Number of sentence (starting with 0)
token: Number of token (starting with 0)
annotation: "P" if the token is a coordinating punctuation, "O" otherwise
Coordination Annotation for the Penn Treebank is available at no cost to all licensees of PTB3 and appears in their download queue associated with LDC99T42 as penn_coordination_anno_LDC2015T08.tgz.

*

(2) GALE Phase 3 Chinese Broadcast Conversation Speech Part 2 was developed by LDC and is comprised of approximately 112 hours of Mandarin Chinese broadcast conversation speech collected in 2007 and 2008 by LDC and Hong University of Science and Technology (HKUST), Hong Kong, during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) Program. Corresponding transcripts are released as GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 2 (LDC2015T09). 

Broadcast audio for the GALE program was collected at LDC’s Philadelphia, PA USA facilities and at three remote collection sites. The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources for a total of over 30,000 hours of collected broadcast audio over the life of the program.

The broadcast conversation recordings in this release feature interviews, call-in programs, and roundtable discussions focusing principally on current events from the following sources: Beijing TV, China Central TV, Hubei TV, Phoenix TV and Voice of America.

This release contains 209 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Chinese speaker following Audit Procedure Specification Version 2.0 which is included in this release. The broadcast auditing process served three principal goals: as a check on the operation of the broadcast collection system equipment by identifying failed, incomplete or faulty recordings, as an indicator of broadcast schedule changes by identifying instances when the incorrect program was recorded, and as a guide for data selection by retaining information about a program’s genre, data type and topic.

GALE Phase 3 Chinese Broadcast Conversation Speech Part 2 is distributed on DVD.  2015 Subscription Members will automatically receive two copies of this corpus.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(3) GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 2 was developed by LDC and contains transcriptions of approximately 112 hours of Chinese broadcast conversation speech collected in 2007 and 2008 by LDC and Hong University of Science and Technology (HKUST), Hong Kong, during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) Program. Corresponding audio data is released as GALE Phase 3 Chinese Broadcast Conversation Speech Part 2 (LDC2015S06). 

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 1,388,236 tokens. The transcripts were created with the LDC-developed transcription tool, XTrans, a multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings.

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and quick rich transcription specification (QRTR) both of which are included in the documentation with this release. QTR transcription consists of quick (near-) verbatim, time-aligned transcripts plus speaker identification with minimal additional mark-up. It does not include sentence unit annotation. QRTR annotation adds structural information such as topic boundaries and manual sentence unit annotation to the core components of a quick transcript. Files with QTR as part of the filename were developed using QTR transcription. Files with QRTR in the filename indicate QRTR transcription.

GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 2 is distributed via web download.  2015 Subscription Members will automatically receive two copies of this corpus on disc.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(4) SenSem (Sentence Semantics) Lexicons was developed by GRIAL, the Linguistic Applications Inter-University Research Group that includes the following Spanish institutions: the Universitat Autonoma de Barcelona, the Universitat de Barcelona, the Universitat de Lleida and the Universitat Oberta de Catalunya. It contains feature descriptions for approximately 1,300 Spanish verbs and 1,300 Catalan verbs in the SenSem Databank (LDC2015T02). GRIAL's work focuses on resources for applied linguistics, including lexicography, translation and natural language processing.

The verb features for each language consist of two groups: those codified manually, including definition, WordNet synset, Aktionsart, arguments and semantic functions; and those extracted automatically from the SenSem Databank. Among the latter are verb frequency, semantic construction, syntactic categories and constituent order. The verbs analyzed correspond to the 250 most frequent verbs in Spanish and 320 lemmas in Catalan. Further information about the SenSem project can be obtained from the GRIAL website. Data is presented in a single XML file per language.

SenSem Lexicons is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus on disc.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.  This data is made available to LDC not-for-profit members and all non-members under the Creative Commons Attribution-Noncommercial Share Alike 3.0 license and to LDC for-profit members under the terms of the For-Profit Membership Agreement.