New publications:
_________________________________________________________________________
New Corpora
(1) H1 Children's Writing was developed by the Cooperative State
University Baden-Württemberg, University of
Education. It consists of 996 texts
written over three months by 88 German school children aged seven through
eleven.
Texts were written within
regular class settings. The students were presented with a picture and were
asked to write a story, to describe the picture or, if unable to write a text,
to list what they saw in the picture. The pictures were designed to elicit
output relevant to important spelling error categories, namely, the marking
of short vowels with a silent consonant letter and the correct spelling of
long vowels. The children were allowed at least 15 minutes to write the texts.
This exercise was repeated weekly for 12 weeks.
Most of the participants
were multilingual. The metadata with this release includes: school week of
collection; school type (always elementary school); age; gender;
grade/classroom; language spoken at home; and school materials used for German
(Jojo).
In all, 996 texts
representing 62,764 tokens were collected. The texts were digitized in two
forms: (1) the original text, including all errors (achieved), and (2) the
intended (target) text, where all spelling errors were removed. Annotations
were added to both the achieved text and the target text to distinguish words
that should not be analyzed for spelling errors, such as names or foreign
words. For sentence-level analysis, syntax errors were annotated by marking
substitutions, deletions and insertions at the word level. In such cases, the
used word was analyzed for spelling, and the correct word was used for sentence
structure analysis.
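The achieved/target pairing described above lends itself to a simple per-token comparison. The following is only a minimal sketch of that idea, not the corpus's actual schema or tooling: the `Token` class, its field names, and the sample words are all hypothetical and chosen for illustration. It assumes each token carries the achieved form, the target form, and a flag for words excluded from spelling analysis (such as names or foreign words).

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical record pairing what was written with what was intended."""
    achieved: str            # the form as the child wrote it
    target: str              # the intended, correctly spelled form
    analyzable: bool = True  # False for names, foreign words, etc.


def spelling_error_rate(tokens):
    """Return the share of analyzable tokens whose achieved form differs from the target."""
    analyzable = [t for t in tokens if t.analyzable]
    if not analyzable:
        return 0.0
    errors = sum(1 for t in analyzable if t.achieved != t.target)
    return errors / len(analyzable)


# Invented sample data, not taken from the corpus.
sample_tokens = [
    Token("Hunt", "Hund"),                      # misspelled
    Token("rent", "rennt"),                     # missing doubled consonant
    Token("schnell", "schnell"),                # correct
    Token("Bello", "Bello", analyzable=False),  # a name, excluded from analysis
]
rate = spelling_error_rate(sample_tokens)
```

In this invented sample, two of the three analyzable tokens differ from their targets; the excluded name does not count toward either total.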
Original handwriting is
presented as PDF documents and the converted text as UTF-8 plain text in CSV
documents.
H1 Children's Writing is distributed via web download.
2016 Subscription Members will automatically receive two
copies of this corpus. 2016 Standard Members may request a copy as part of
their 16 free membership corpora. Non-members may license this data for a fee.
*
(2) GALE Phase 4 Arabic Broadcast Conversation Parallel Sentences was developed by LDC. Along with other corpora, the
parallel text in this release comprised training data for Phase 4 of the DARPA
GALE (Global Autonomous Language Exploitation) Program. This corpus contains
Modern Standard Arabic source sentences and corresponding English translations
selected from broadcast conversation data collected by LDC in 2007 and 2008 and
transcribed and translated by LDC or under its direction.
GALE Phase 4 Arabic Broadcast Conversation Parallel
Sentences is distributed via web download.
2016 Subscription Members will automatically receive two
copies of this corpus. 2016 Standard Members may request a copy as part of
their 16 free membership corpora. Non-members may license this data for
a fee.
*
(3) HAVIC Pilot Transcription was developed by LDC and
comprises approximately 72 hours of user-generated videos with transcripts
based on the English speech audio extracted from the videos. This data set was
created in collaboration with NIST (the National
Institute of Standards and Technology) as part of the HAVIC (the Heterogeneous Audio Visual Internet Collection)
project, the goal of which is to advance multimodal event detection and related
technologies.
Annotators generated the transcripts using XTrans, which supports manual transcription across multiple channels, languages and platforms. HAVIC transcription guidelines are included in the documentation for this release. All transcription files are in .tdf format, a plain-text, flat-table format with 13 tab-delimited fields. All video files are in .mp4 format (H.264), with varying bit rates and levels of audio fidelity and video resolution.
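Because the .tdf files are plain text with 13 tab-delimited fields per row, they can be read with a few lines of code. The following is a minimal sketch, not an official reader: the function name `read_tdf` is hypothetical, and skipping lines that begin with `;;` is an assumption about comment lines that should be checked against the release documentation. Field names are deliberately not assigned here; consult the included guidelines for their meaning.

```python
import io

EXPECTED_FIELDS = 13  # field count stated in the release description


def read_tdf(stream, comment_prefix=";;"):
    """Yield each data row of a .tdf stream as a list of 13 strings.

    Blank lines and lines starting with comment_prefix are skipped
    (the comment marker is an assumption, not taken from the release notes).
    """
    for line_no, line in enumerate(stream, start=1):
        line = line.rstrip("\r\n")
        if not line or line.startswith(comment_prefix):
            continue
        # Split on tabs only, so that empty fields are preserved.
        fields = line.split("\t")
        if len(fields) != EXPECTED_FIELDS:
            raise ValueError(
                f"line {line_no}: expected {EXPECTED_FIELDS} fields, got {len(fields)}"
            )
        yield fields


# Synthetic input for illustration; real files would be opened with
# open(path, encoding="utf-8") and passed to read_tdf.
sample = ";;a comment line\n" + "\t".join(f"field{i}" for i in range(1, 14)) + "\n"
rows = list(read_tdf(io.StringIO(sample)))
```

Splitting on the tab character alone, rather than on arbitrary whitespace, matters here: a row with an empty field would otherwise appear to have fewer than 13 columns.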
HAVIC Pilot Transcription
is distributed via web download.
2016 Subscription Members will automatically receive two
copies of this corpus. 2016 Standard Members may request a copy as part of
their 16 free membership corpora. Non-members may license this data for a fee.