
Friday, April 13, 2018

LDC 2018 April Newsletter


LDC at ICASSP 2018

LDC at the Philadelphia Science Carnival

New Publications:
_____________________________________________________________________
LDC at ICASSP 2018
LDC will be exhibiting at ICASSP 2018, held this year April 15-20 in Calgary, Canada. Stop by booth B2 to learn more about recent developments at the Consortium and new publications.

Also, be on the lookout for the following presentations featuring LDC work:

Enhancement and Analysis of Conversational Speech: JSALT 2017
Tuesday, April 17, 16:00 - 18:00
Session: Speech Analysis

Leveraging LSTM Models for Overlap Detection in Multi-Party Meetings
Wednesday, April 18, 13:30 - 15:30
Session: Speaker Diarization & Identification

A Novel LSTM-based Speech Preprocessor for Speaker Diarization in Realistic Mismatch Conditions
Wednesday, April 18, 13:30 - 15:30
Session: Speaker Diarization & Identification

LDC will post conference updates via our Twitter feed and Facebook page. We hope to see you there!  

LDC at the Philadelphia Science Carnival
LDC will share the fun of language with the community on Saturday, April 28, with a booth at the Philadelphia Science Carnival. Visitors will enjoy three language-oriented educational activities that include a language identification game and Chinese character recognition.

The Philadelphia Science Carnival is an annual event organized by Philadelphia’s Franklin Institute to acquaint children and adults with the joys of science.


New publications:

(1) Concretely Annotated New York Times was developed by Johns Hopkins University's Human Language Technology Center of Excellence. It adds multiple kinds and instances of automatically generated syntactic, semantic, and coreference annotations to The New York Times Annotated Corpus (LDC2008T19). Concrete is a schema for representing structured, hierarchical, and overlapping linguistic annotations. This release provides the output of multiple tools that produce the same annotation types under different annotation theories, aligned to a shared tokenization. Concretely Annotated New York Times contains all 1.8 million articles in The New York Times Annotated Corpus.
Concretely Annotated New York Times is distributed via hard drive.

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Any organization that licensed The New York Times Annotated Corpus (LDC2008T19) may request a copy of Concretely Annotated New York Times (LDC2018T12) for a $250 media fee.  Non-members may license this data for a fee.

*

(2) H2, E2, ERK1 Children's Writing was developed by the Cooperative State University Baden-Württemberg, University of Education. It consists of approximately 2,000 texts written over four months by 173 German schoolchildren aged six through eleven years. The data in this corpus was collected in elementary schools in Baden-Württemberg, Germany, and digitized at the Cooperative State University during the 2016/2017 school year. Three second, third, and fourth grade classrooms participated in the collection. Texts were written within regular class settings. The students were presented with a picture and asked to write a story describing it or, if unable to write a text, to list what they saw in the picture.

Of the 173 participants, 100 were multilingual, and further metadata is available for 166 of the 173 children. The following is included for each text in the database: school week of collection; school type; age; gender; grade/classroom; language spoken at home; and school materials used.

LDC has also released H1 Children's Writing (LDC2016T01).

H2, E2, ERK1 Children's Writing is distributed via web download.

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) TRAD Arabic-French Parallel Text -- Newsgroup was developed by ELDA as part of the PEA-TRAD project. It contains French translations of a subset of approximately 10,000 Arabic words from GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 (LDC2009T03). The PEA-TRAD project (Translation as a Support for Document Analysis) was supported by the French Ministry of Defense (DGA). Its purpose was to develop speech-to-speech translation technology for multiple languages (e.g., Arabic, Chinese, Pashto) from a variety of domains. This release consists of 398 segments (translation units) from 17 documents. The source data is Arabic newsgroup text collected and translated into English by LDC for the DARPA GALE (Global Autonomous Language Exploitation) program.

LDC has also released TRAD Chinese-French Parallel Text -- Blog (LDC2018T02).

TRAD Arabic-French Parallel Text -- Newsgroup is distributed via web download.

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Friday, May 15, 2015

LDC 2015 May Newsletter

Early renewing members save again

Commercial use and LDC data

New publications:

Early renewing members save again

LDC's early renewal discount program has resulted in substantial savings for current year members. The 110 organizations that renewed their membership or joined early for Membership Year 2015 (MY2015) saved over US$65,000 on membership fees. MY2014 members are still eligible for a 5% discount when renewing through 2015.

LDC membership benefits include free membership year data as well as discounts on older corpora. For-profit members can use most LDC data for commercial applications. 

Commercial use and LDC data

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit our Licensing page for further information.

New publications

(1) Coordination Annotation for the Penn Treebank is a stand-off annotation for the Wall Street Journal portion of Treebank-3 (PTB3) (LDC99T42) developed by researchers at the University of Düsseldorf and Indiana University. It marks all tokens that have a coordinating function (potentially among other functions).

Coordination is a syntactic structure that links together two or more elements known as conjuncts or conjoins. The presence of coordination is often signaled by the appearance of a coordinator (coordinating conjunction), such as "and," "or," or "but" in English.

This annotation is presented in a single UTF-8 plain-text tab-separated values (TSV) file with the following columns:
section: Penn Treebank WSJ section number
file: Number of file within section
sentence: Number of sentence (starting with 0)
token: Number of token (starting with 0)
annotation: "P" if the token is a coordinating punctuation, "O" otherwise
Coordination Annotation for the Penn Treebank is available at no cost to all licensees of PTB3 and appears in their download queue associated with LDC99T42 as penn_coordination_anno_LDC2015T08.tgz.
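
For readers who want to inspect the file programmatically, here is a minimal Python sketch; it is not part of the release, the extracted file name is an assumption (only the archive name above is documented), and a header row, if present, is simply skipped by the filter.

import csv

# Minimal sketch (assumptions noted above): read the coordination annotation
# TSV and list the tokens flagged as coordinating punctuation ("P").
COLUMNS = ["section", "file", "sentence", "token", "annotation"]

with open("penn_coordination_anno.tsv", encoding="utf-8", newline="") as f:
    for fields in csv.reader(f, delimiter="\t"):
        row = dict(zip(COLUMNS, fields))
        if row.get("annotation") == "P":
            print(row["section"], row["file"], row["sentence"], row["token"])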

*

(2) GALE Phase 3 Chinese Broadcast Conversation Speech Part 2 was developed by LDC and comprises approximately 112 hours of Mandarin Chinese broadcast conversation speech collected in 2007 and 2008 by LDC and the Hong Kong University of Science and Technology (HKUST) during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) Program. Corresponding transcripts are released as GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 2 (LDC2015T09).

Broadcast audio for the GALE program was collected at LDC’s Philadelphia, PA USA facilities and at three remote collection sites. The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources for a total of over 30,000 hours of collected broadcast audio over the life of the program.

The broadcast conversation recordings in this release feature interviews, call-in programs, and roundtable discussions focusing principally on current events from the following sources: Beijing TV, China Central TV, Hubei TV, Phoenix TV and Voice of America.

This release contains 209 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Chinese speaker following Audit Procedure Specification Version 2.0, which is included in this release. The broadcast auditing process served three principal goals: as a check on the operation of the broadcast collection system equipment by identifying failed, incomplete, or faulty recordings; as an indicator of broadcast schedule changes by identifying instances when the incorrect program was recorded; and as a guide for data selection by retaining information about a program's genre, data type, and topic.
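
As a rough illustration of working with the audio (the file name below is a placeholder, and the third-party soundfile package is just one of several libraries that read FLAC):

import soundfile as sf  # third-party package: pip install soundfile

# Minimal sketch: load one 16 kHz, single-channel, 16-bit FLAC recording.
samples, rate = sf.read("example_broadcast.flac", dtype="int16")
assert rate == 16000                   # corpus audio is 16,000 Hz, single channel
print(len(samples) / rate, "seconds")  # duration of the recording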

GALE Phase 3 Chinese Broadcast Conversation Speech Part 2 is distributed on DVD.  2015 Subscription Members will automatically receive two copies of this corpus.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(3) GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 2 was developed by LDC and contains transcriptions of approximately 112 hours of Chinese broadcast conversation speech collected in 2007 and 2008 by LDC and the Hong Kong University of Science and Technology (HKUST) during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) Program. Corresponding audio data is released as GALE Phase 3 Chinese Broadcast Conversation Speech Part 2 (LDC2015S06).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 1,388,236 tokens. The transcripts were created with the LDC-developed transcription tool, XTrans, a multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings.
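
As a rough sketch of how such a file might be read (the file name is a placeholder, and the exact column layout should be taken from the documentation included with the release):

# Minimal sketch: iterate over a UTF-8, tab-delimited transcript (TDF) file.
with open("example_transcript.tdf", encoding="utf-8") as f:
    for line in f:
        if line.startswith(";;"):           # assumption: skip comment/metadata lines
            continue
        fields = line.rstrip("\n").split("\t")
        print(len(fields), fields[:3])      # inspect field count and leading columns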

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and quick rich transcription specification (QRTR) both of which are included in the documentation with this release. QTR transcription consists of quick (near-) verbatim, time-aligned transcripts plus speaker identification with minimal additional mark-up. It does not include sentence unit annotation. QRTR annotation adds structural information such as topic boundaries and manual sentence unit annotation to the core components of a quick transcript. Files with QTR as part of the filename were developed using QTR transcription. Files with QRTR in the filename indicate QRTR transcription.

GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 2 is distributed via web download.  2015 Subscription Members will automatically receive two copies of this corpus on disc.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(4) SenSem (Sentence Semantics) Lexicons was developed by GRIAL, the Linguistic Applications Inter-University Research Group that includes the following Spanish institutions: the Universitat Autonoma de Barcelona, the Universitat de Barcelona, the Universitat de Lleida and the Universitat Oberta de Catalunya. It contains feature descriptions for approximately 1,300 Spanish verbs and 1,300 Catalan verbs in the SenSem Databank (LDC2015T02). GRIAL's work focuses on resources for applied linguistics, including lexicography, translation and natural language processing.

The verb features for each language consist of two groups: those codified manually, including definition, WordNet synset, Aktionsart, arguments and semantic functions; and those extracted automatically from the SenSem Databank. Among the latter are verb frequency, semantic construction, syntactic categories and constituent order. The verbs analyzed correspond to the 250 most frequent verbs in Spanish and 320 lemmas in Catalan. Further information about the SenSem project can be obtained from the GRIAL website. Data is presented in a single XML file per language.

SenSem Lexicons is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus on disc.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.  This data is made available to LDC not-for-profit members and all non-members under the Creative Commons Attribution-Noncommercial Share Alike 3.0 license and to LDC for-profit members under the terms of the For-Profit Membership Agreement.