Linguistic Data Consortium: LDC April 2016 Newsletter

Monday, April 18, 2016

New publications:

H1 Children's Writing
GALE Phase 4 Arabic Broadcast Conversation Parallel Sentences
HAVIC Pilot Transcription

_________________________________________________________________________

New Corpora

(1) H1 Children's Writing was developed by the Cooperative State University Baden-Württemberg, University of Education. It consists of 996 texts written over three months by 88 German school children aged seven through eleven.

Texts were written within regular class settings. The students were presented with a picture and were asked to write a story, to describe the picture or, if unable to write a text, to list what they saw in the picture. The pictures were designed to elicit words relevant to important spelling error categories, namely, the marking of short vowels with a silent consonant letter and the correct spelling of long vowels. The children were allowed at least 15 minutes to write the texts. This exercise was repeated weekly for 12 weeks.

Most of the participants were multilingual. The metadata with this release includes: school week of collection; school type (always elementary school); age; gender; grade/classroom; language spoken at home; and school materials used for German (Jojo).

In all, 996 texts representing 62,764 tokens were collected. The texts were digitized in two forms: (1) the original text, including all errors (achieved), and (2) the intended (target) text, where all spelling errors were removed. Annotations were added to both the achieved text and the target text to distinguish words that should not be analyzed for spelling errors, such as names or foreign words. For sentence-level analysis, syntax errors were annotated by marking substitutions, deletions and insertions at the word level. In such cases, the used word was analyzed for spelling, and the correct word was used for sentence structure analysis.

The original handwriting is presented as PDF documents and the converted text as UTF-8 plain text in CSV documents.
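The achieved/target distinction described above lends itself to a simple comparison. The following is only a rough sketch in Python, not the corpus developers' method: it assumes the achieved and target tokens have already been read from the CSV files and aligned one-to-one, and that the positions of words excluded from spelling analysis (names, foreign words) are known. The function names, the `skip` parameter and the sample sentence are all hypothetical and are not taken from the corpus.

```python
def misspelled_pairs(achieved_tokens, target_tokens, skip=frozenset()):
    """Return (achieved, target) pairs that differ.

    `skip` holds token positions excluded from spelling analysis,
    e.g. names or foreign words (hypothetical representation).
    """
    if len(achieved_tokens) != len(target_tokens):
        raise ValueError("token sequences must be aligned one-to-one")
    return [
        (a, t)
        for i, (a, t) in enumerate(zip(achieved_tokens, target_tokens))
        if i not in skip and a != t
    ]


def error_rate(achieved_tokens, target_tokens, skip=frozenset()):
    """Share of analyzed tokens whose achieved form differs from the target."""
    analyzed = len(target_tokens) - len(skip)
    if analyzed == 0:
        return 0.0
    return len(misspelled_pairs(achieved_tokens, target_tokens, skip)) / analyzed


# Invented example sentence, not drawn from the corpus.
achieved = ["Der", "Hunt", "rent", "zu", "Tim"]
target = ["Der", "Hund", "rennt", "zu", "Tim"]
name_positions = frozenset({4})  # "Tim" is a name, so it is not analyzed

pairs = misspelled_pairs(achieved, target, name_positions)
rate = error_rate(achieved, target, name_positions)
```

Word-level substitutions, deletions and insertions would break the one-to-one alignment assumed here, so a real analysis would need to consult the syntax-error annotations first.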

H1 Children's Writing is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) GALE Phase 4 Arabic Broadcast Conversation Parallel Sentences was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Modern Standard Arabic source sentences and corresponding English translations selected from broadcast conversation data collected by LDC in 2007 and 2008 and transcribed and translated by LDC or under its direction.

The data includes 170 source-translation document pairs, comprising 44,064 words (Arabic source) of translated data. Data is drawn from 45 distinct Arabic broadcast conversation sources.

GALE Phase 4 Arabic Broadcast Conversation Parallel Sentences is distributed via web download.

(3) HAVIC Pilot Transcription was developed by LDC and comprises approximately 72 hours of user-generated videos with transcripts based on the English speech audio extracted from the videos. This data set was created in collaboration with NIST (the National Institute of Standards and Technology) as part of the HAVIC (Heterogeneous Audio Visual Internet Collection) project, the goal of which is to advance multimodal event detection and related technologies.

LDC has developed a large, heterogeneous, annotated multimodal corpus for HAVIC that has been used in the NIST-sponsored MED (Multimedia Event Detection) task for several years. HAVIC Pilot Transcription supported an experiment to produce a verbatim transcript (quick and rich transcription) based on audio extracted from user-generated videos. It contains the pilot transcripts for selected MED 2011 video files as well as the associated videos.

Annotators generated the transcripts using XTrans, which supports manual transcription across multiple channels, languages and platforms. HAVIC transcription guidelines are included in the documentation for this release. All transcription files are in .tdf format, a plain-text, flat-table format with 13 tab-delimited fields. All video files are in .mp4 format (h264), with varying bit-rates and levels of audio fidelity and video resolution.
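Because the .tdf files are described as flat tables with 13 tab-delimited fields, they can be read with ordinary tabular tooling. The sketch below, in Python, is a minimal illustration under stated assumptions rather than an official reader: the function name `read_tdf` is hypothetical, and skipping lines that begin with `;;` as comments is an assumption about the file layout that should be checked against the release documentation. Any header row is returned like a data row.

```python
import csv
import io

EXPECTED_FIELDS = 13  # field count stated in the release description


def read_tdf(stream):
    """Yield each non-comment row of a .tdf stream as a list of 13 strings.

    Assumption: lines starting with ';;' are comments and can be skipped.
    """
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_no, row in enumerate(reader, start=1):
        if not row or row[0].startswith(";;"):
            continue  # blank line or assumed comment line
        if len(row) != EXPECTED_FIELDS:
            raise ValueError(
                f"line {line_no}: expected {EXPECTED_FIELDS} fields, got {len(row)}"
            )
        yield row


# Synthetic input with placeholder values, not real transcript data.
sample = ";;a comment line\n" + "\t".join(f"f{i}" for i in range(13)) + "\n\n"
rows = list(read_tdf(io.StringIO(sample)))
```

For a real file, the stream would be opened with `open(path, encoding="utf-8", newline="")`; the encoding is likewise an assumption to confirm in the documentation.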

HAVIC Pilot Transcription is distributed via web download.
