Thursday, January 21, 2016

LDC 2016 January Newsletter

CFP for LREC 2016 Novel Incentives Workshop

LDC Membership Discounts for MY 2016 Still Available

New publications:
______________________________________________________________

CFP for LREC 2016 Novel Incentives Workshop
The first workshop on novel incentives in linguistic data collection will take place on May 28, 2016 in conjunction with the Tenth International Conference on Language Resources and Evaluation (LREC2016) in Portoroz, Slovenia.

The workshop, Novel Incentives for Collecting Linguistic Data and Annotation from People: types, implementation, tasking requirements, workflow and results, opens the discussion on incentives in data collection by describing novel approaches and comparing them with traditional monetary incentives.

The workshop is accepting papers through February 6, 2016. For more information visit the workshop webpage.


LDC Membership Discounts for MY 2016 Still Available
If you are considering joining LDC for Membership Year 2016 (MY2016), there is still time to save on membership fees.  Any organization which joins or renews membership for 2016 through March 1, 2016, is entitled to a 5% discount on membership fees.  Organizations which held membership for MY2015 can receive a 10% discount on fees provided they renew prior to March 1, 2016. Publications planned for release in 2016 include multilingual language packs, BOLT discussion forum and DEFT narrative text corpora, HAVIC video clips and transcripts and the latest Arabic and Chinese treebanks.

New publications
(1) Arabic Treebank - Weblog was developed by LDC and consists of Arabic weblog data with part-of-speech, morphology, gloss and syntactic tree annotation.

The ongoing Penn Arabic Treebank Project (PATB) supports research in Arabic-language natural language processing and human language technology development. Generally, the PATB consists of two distinct phases: (a) part-of-speech (POS) tagging, which divides the text into lexical tokens and gives relevant information about each token such as lexical category, inflectional features, and a gloss (referred to as POS for convenience, although it includes morphological and gloss information not traditionally included with part-of-speech annotation), and (b) Arabic treebanking, which characterizes the constituent structures of word sequences, provides categories for each non-terminal node, and identifies null elements, co-reference, traces and so on.

The data contains 243,117 source tokens before clitics were split, and 308,996 tree tokens after clitics were separated for treebank annotation. The source material is weblogs collected by LDC from various sources.

Arabic Treebank - Weblog is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus.  2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*
(2) NewSoMe Corpus of Opinion in Blogs was compiled at Barcelona Media and consists of English and Spanish blogs annotated for opinions. It is part of the NewSoMe (News and Social Media) set of corpora presenting opinion annotations across several genres and covering multiple languages. NewSoMe is the result of an effort to build a unifying annotation framework for analyzing opinion in different genres, ranging from controlled text, such as news reports, to diverse types of user-generated content that includes blogs, product reviews and microblogs.

LDC has also released NewSoMe Corpus of Opinion in News Reports (LDC2015T17).

The data consists of 108 English documents and 191 Spanish documents. The annotation was carried out manually through the crowdsourcing platform CrowdFlower with seven annotations per layer that were aggregated for this data set. The layers annotated were topic, segment, cue, subjectivity, polarity and intensity.


NewSoMe Corpus of Opinion in Blogs is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus.  2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*
(3) GALE Phase 4 Chinese Weblog Parallel Sentences was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Chinese source sentences and corresponding English translations selected from newsgroup and weblog data collected by LDC and translated by LDC or under its direction.

GALE Phase 4 Chinese Weblog Parallel Sentences includes 231 source-translation document pairs, comprising 92,501 tokens of Chinese source text and its English translation.

Sentences were selected for translation in two steps. First, files were chosen using sentence selection scripts provided by GALE program participants SRI International and IBM. The output was then manually reviewed by LDC staff to eliminate problematic sentences. Selected files were reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Chinese to English translation guidelines and were provided with the full source documents containing the target sentences for their reference. Bilingual LDC staff performed quality control procedures on the completed translations.


GALE Phase 4 Chinese Weblog Parallel Sentences is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus.  2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee. 

Wednesday, December 16, 2015

LDC 2015 December Newsletter

Renew your LDC membership today

Spring 2016 LDC Data Scholarship Program - deadline approaching

LDC at LSA 2016

LDC to close for Winter Break

New publications
________________________________________________________________________

Renew your LDC membership today
Membership Year 2016 (MY2016) discounts are available for those who keep their membership current and join early in the year. Check here for further information including our planned publications for MY2016.

Now is also a good time to consider joining LDC for the current and open membership years, MY2015 and MY2014.  MY2015 includes data such as RATS Speech Activity Detection and updates to Penn Treebank. MY2014 remains open through the end of the 2015 calendar year and its publications include UN speech data, 2009 NIST LRE test set, 2007 ACE multilingual data, and multi-channel WSJ audio. For full descriptions of these data sets, visit our Catalog.

Spring 2016 LDC Data Scholarship Program - deadline approaching
The deadline for the Spring 2016 LDC Data Scholarship Program is right around the corner! Student applications are being accepted now through January 15, 2016, 11:59PM EST. The LDC Data Scholarship program provides university students with access to LDC data at no cost. This program is open to students pursuing both undergraduate and graduate studies in an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay.

Students will need to complete an application which consists of a data use proposal and letter of support from their adviser. For further information on application materials and program rules, please visit the LDC Data Scholarship page.

Students can email their applications to the LDC Data Scholarships program. Decisions will be sent by email from the same address.

LDC at LSA 2016
LDC will be exhibiting at the Annual Meeting of the Linguistic Society of America, held January 7-10, 2016 in Washington, DC. Stop by booth 110 to learn more about recent developments at the Consortium and new publications. Also, be on the lookout for the following presentations:

Satellite Workshop: Preparing Your Corpus for Archival Storage
Malcah Yaeger-Dror (University of Arizona) and Christopher Cieri (LDC)
Thursday, January 7, 2016 - 8:00am to 3:00pm, Salon 4

Broadening connections among researchers in linguistics and human language technologies
Jeff Good (University at Buffalo) and Christopher Cieri (LDC)
Friday, January 8, 2016 - 7:30am to 9:00am, Salon 1

Diachronic development of pitch contrast in Seoul Korean
Sunghye Cho (UPenn), Yong-cheol Lee (Cheongju University) and Mark Liberman (LDC)
Friday, January 8, 2016 - 2:00pm to 5:00pm, Salon 1

LDC will post conference updates via our Twitter feed and Facebook page. We hope to see you there!  

LDC to close for Winter Break
LDC will be closed from Friday, December 25, 2015 through Friday, January 1, 2016 in accordance with the University of Pennsylvania Winter Break Policy. Our offices will reopen on Monday, January 4, 2016. Requests received for membership renewals and corpora during the Winter Break will be processed at that time.

New publications
(1) 2006 CoNLL Shared Task - Arabic & Czech consists of Arabic and Czech dependency treebanks used as part of the CoNLL 2006 shared task on multi-lingual dependency parsing.

This corpus is cross listed with ELRA as ELRA-W0087.

The Conference on Computational Natural Language Learning (CoNLL) is accompanied every year by a shared task intended to promote natural language processing applications and evaluate them in a standard setting. In 2006, the shared task was devoted to the parsing of syntactic dependencies using corpora from up to thirteen languages. The task aimed to define and extend the then-current state of the art in dependency parsing, a technology that complemented previous tasks by producing a different kind of syntactic description of input text. More information about the 2006 shared task is available on the CoNLL-X web page.

The source data in this release consists principally of news and journal texts. The individual data sets are subsets of the following:
2006 CoNLL Shared Task - Arabic & Czech is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus. 2015 Standard Members may request a copy as part of their 16 free membership corpora. This data is being made available at no-cost for non-member organizations under a research license.


*
(2) 2006 CoNLL Shared Task - Ten Languages consists of dependency treebanks in ten languages used as part of the CoNLL 2006 shared task on multi-lingual dependency parsing. The languages covered in this release are: Bulgarian, Danish, Dutch, German, Japanese, Portuguese, Slovene, Spanish, Swedish and Turkish.

This corpus is cross listed and jointly released with ELRA as ELRA-W0086.

The Conference on Computational Natural Language Learning (CoNLL) is accompanied every year by a shared task intended to promote natural language processing applications and evaluate them in a standard setting. In 2006, the shared task was devoted to the parsing of syntactic dependencies using corpora from up to thirteen languages. The task aimed to define and extend the then-current state of the art in dependency parsing, a technology that complemented previous tasks by producing a different kind of syntactic description of input text. More information about the 2006 shared task is available on the CoNLL-X web page.

The source data in the treebanks in this release consists principally of various texts (e.g., textbooks, news, literature) annotated in dependency format. In general, dependency grammar is based on the idea that the verb is the center of the clause structure and that other units in the sentence are connected to the verb as directed links or dependencies. 
The individual data sets are:
2015 Subscription Members will automatically receive two copies of this corpus. 2015 Standard Members may request a copy as part of their 16 free membership corpora.  This data is being made available at no-cost for non-member organizations under a research license.
*
(3) GALE Phase 3 Chinese Broadcast News Speech was developed by LDC and is comprised of approximately 150 hours of Mandarin Chinese broadcast news speech collected in 2007 and 2008 by LDC and Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

Corresponding transcripts are released as GALE Phase 3 Chinese Broadcast News Transcripts (LDC2015T25).

The broadcast news recordings in this release feature news broadcasts focusing principally on current events from the following sources: Anhui TV, China Central TV (CCTV), Phoenix TV and Voice of America (VOA).

This release contains 279 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Chinese speaker following Audit Procedure Specification Version 2.0 which is included in this release.

GALE Phase 3 Chinese Broadcast News Speech is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus. 2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.
*
(4) GALE Phase 3 Chinese Broadcast News Transcripts was developed by LDC and contains transcriptions of approximately 150 hours of Chinese broadcast news speech collected in 2007 and 2008 by LDC and Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

Corresponding audio data is released as GALE Phase 3 Chinese Broadcast News Speech (LDC2015S13).

The broadcast news recordings for transcription feature news broadcasts focusing principally on current events from the following sources: Anhui TV,  China Central TV (CCTV), Phoenix TV and Voice of America (VOA).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 1,933,695 tokens. The transcripts were created with the LDC-developed transcription tool, XTrans, a multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings. 

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and quick rich transcription specification (QRTR) both of which are included in the documentation with this release.

GALE Phase 3 Chinese Broadcast News Transcripts is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus. 2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.


Tuesday, November 17, 2015

LDC 2015 November Newsletter

Invitation to Join for Membership Year (MY) 2016

Commercial use and LDC data

Spring 2016 Data Scholarship Program

LDC closed for Thanksgiving Break

New publications:

Invitation to Join for Membership Year (MY) 2016

Membership Year (MY) 2016 is open for joining.  We would like to invite all current and previous members of LDC to renew their membership as well as welcome new organizations to join the Consortium.  For MY2016, LDC is pleased to maintain membership fees at last year’s rates – membership fees will not increase.  Additionally, LDC will extend discounts on membership fees to members who keep their membership current and who join early in the year.

The details of our early renewal discounts for MY2016 are as follows:
  • Organizations who joined for MY2015 will receive a 10% discount when renewing before March 1, 2016. After March 1, 2016, MY2015 members are eligible for a 5% discount when renewing through the end of the year.
  • New members as well as organizations who did not join for MY2015, but who held membership in any of the previous MYs (1993-2014), will also be eligible for a 5% discount provided that they join/renew before March 1, 2016.

Publications for MY2016 are still being planned but we plan to release the following:
  • Arabic Treebank - Weblog ~ part-of-speech/morphological annotation and syntactic tree annotation of web text from various sources
  • BOLT ~ all phases, languages, genres, tasks
  • DEFT ~ Spanish and Chinese resources
  • Digital Archive of Southern Speech - NLP Version ~ colloquial speech in the Southern United States; NLP version normalizes filenames and formats
  • GALE Phase 3 and 4 data ~ all tasks and languages    
  • HAVIC ~ amateur video and transcripts
  • NewSoMe Corpus of Opinion in Blogs ~ opinion annotated English and Spanish blogs


Commercial use and LDC data
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases.  Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose.  LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit our Licensing page for further information.

Spring 2016 Data Scholarship Program
Applications are now being accepted through Friday, January 15, 2016 for the Spring 2016 LDC Data Scholarship program. The LDC Data Scholarship program provides university students with access to LDC data at no cost.  This program is open to students pursuing both undergraduate and graduate studies in an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay.

The application consists of two parts:

(1) Data Use Proposal. Applicants must submit a proposal describing their intended use of the data. The proposal should state which data the student plans to use and how the data will benefit their research project as well as information on the proposed methodology or algorithm.

(2) Letter of Support. Applicants must submit one letter of support from their thesis adviser or department chair. The letter must verify the student's need for data and confirm that the department or university lacks the funding to pay the full nonmember fee for the data or to join the Consortium.

For further information on application materials and program rules, please visit the LDC Data Scholarship page.

Students can email their applications to the LDC Data Scholarship program. Decisions will be sent by email from the same address.

The deadline for the Spring 2016 program cycle is January 15, 2016.

LDC closed for Thanksgiving Break

LDC will be closed on Thursday, November 26, 2015 and Friday, November 27, 2015 in observance of the US Thanksgiving Holiday.  Our offices will reopen on Monday, November 30, 2015.

New publications
(1) Articulation Index LSCP was developed by researchers at Laboratoire de Sciences Cognitives et Psycholinguistique (LSCP), Ecole Normale Supérieure. It revises and enhances a subset of Articulation Index (AIC) (LDC2005S22), a corpus of persons speaking English syllables. Changes include the addition of forced alignment to sound files, time alignment of syllable utterances and format conversions. AIC consists of 20 American English speakers (12 males, 8 females) pronouncing syllables, some of which form actual words, but most of which are nonsense syllables. All possible Consonant-Vowel (CV) and Vowel-Consonant (VC) combinations were recorded for each speaker twice, once in isolation and once within a carrier-sentence, for a total of 25768 recorded syllables.
Articulation Index LSCP alters AIC in the following ways.
  1. Time-alignments for the onset and offset of each word and syllable were generated through forced-alignment with a standard HMM-GMM (Hidden Markov Model-Gaussian Mixture Model) ASR system.
  2. The time-alignments for the beginning and end of the syllables (whether in isolation or within a carrier sentence) were manually adjusted. The time-alignments for the other words in carrier sentences were not manually adjusted.
  3. The recordings of isolated syllables were cut according to the manual time-alignments to remove the silent portions at the beginning and end, and the time-alignments were altered to correspond to the cut recordings.
  4. The file naming scheme was slightly altered for compatibility with the Kaldi speech recognition toolkit.
  5. AIC contains a wide-band (16 kHz, 16-bit PCM) and a narrow-band (8 kHz, 8-bit u-law) version of the recordings distributed in SPHERE format. The LSCP version contains the wide-band version only, distributed as wave files.
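As a rough illustration of step 3 above, the following Python sketch cuts a recording to a syllable's onset and offset and re-bases the alignment to the cut audio. It is not the tooling used by the corpus authors; the function name, the list-of-samples representation and the example times are assumptions made for illustration, and only the 16 kHz sample rate comes from the description above.

```python
def trim_to_alignment(samples, onset_s, offset_s, sample_rate=16000):
    """Cut samples to [onset_s, offset_s) and shift the alignment to match.

    Hypothetical helper: `samples` is any sequence of audio samples,
    and the times are in seconds relative to the uncut recording.
    """
    start = round(onset_s * sample_rate)
    end = round(offset_s * sample_rate)
    trimmed = samples[start:end]
    # After cutting, the syllable begins at time 0 and ends at its duration.
    new_onset = 0.0
    new_offset = (end - start) / sample_rate
    return trimmed, new_onset, new_offset


# One second of dummy samples at 16 kHz (made-up data, not from the corpus).
recording = [0] * 16000
clip, onset, offset = trim_to_alignment(recording, 0.25, 0.75)
```

Here the half second between 0.25 s and 0.75 s is kept, so the clip holds 8,000 samples and the new alignment runs from 0.0 s to 0.5 s.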
Articulation Index LSCP is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  This data is being made available at no-cost for non-member organizations under a research license.
*
(2) GALE Phase 4 Chinese Newswire Parallel Sentences was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Chinese source sentences and corresponding English translations selected from newswire data collected by LDC in 2008 and translated by LDC or under its direction.

GALE Phase 4 Chinese Newswire Parallel Sentences includes 627 source-translation document pairs, comprising 90,434 tokens of Chinese source text and its English translation. Data is drawn from six distinct Chinese newswire sources.

Sentences were selected for translation in two steps. First, files were chosen using sentence selection scripts.  Selected files were reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Chinese to English translation guidelines and were provided with the full source documents containing the target sentences for their reference. Bilingual LDC staff performed quality control procedures on the completed translations.

GALE Phase 4 Chinese Newswire Parallel Sentences is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*
(3) KHATT: Handwritten Arabic Text was developed by King Fahd University of Petroleum & Minerals, Technical University of Dortmund and Braunschweig University of Technology. It is comprised of scanned Arabic handwriting from 1,000 distinct male and female writers representing diverse countries, age groups, handedness and education levels. Participants produced text on a topic of their choice in an unrestricted style. KHATT was designed to promote research in areas such as text recognition and writer identification.

The majority of participants were natives of Saudi Arabia; the next largest group was from a collection of regional countries (Egypt, Jordan, Kuwait, Morocco, Palestine, Tunisia and Yemen). Most writers were between 16 and 25 years of age with high school or university qualifications.

KHATT: Handwritten Arabic Text is distributed on one USB drive.

2015 Subscription Members will automatically receive a copy of this corpus.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.


Friday, October 16, 2015

LDC 2015 October Newsletter

In this newsletter:
Fall 2015 LDC Data Scholarship recipients

New publications:
GALE Phase 4 Chinese  Broadcast News Parallel Sentences
Karlsruhe Children's Text


Fall 2015 LDC Data Scholarship recipients

Congratulations to the recipients of LDC's Fall 2015 data scholarships:
Anthony Beylerian - Keio University (Japan), MSc, Informatics and Computer Science.  Anthony has been awarded a copy of OntoNotes for his work in word sense disambiguation.
Siti Binte Faizal - Newcastle University (UK), PhD candidate, Speech and Language Sciences.  Siti has been awarded a copy of Levantine Arabic QT Training Speech and Text for her work in psycholinguistics.
Sara El-Kafrawy - Ain Shams University (Egypt), MSc candidate, Computer and Information Sciences.  Sara has been awarded a copy of GALE Arabic English Word Alignment and Arabic Gigaword for her work in machine translation.
Marwa Hadj Salah - University of Sfax (Tunisia), PhD candidate, Computer Science.  Marwa has been awarded a copy of Arabic English Parallel News and Arabic News Translation Text for her work in machine translation.
Tomoaki Goto - University of Tokyo (Japan), PhD candidate, Linguistics.  Tomoaki has been awarded a copy of Arabic Newswire English Translation for his work in syntax.
Richard Metzger - Pennsylvania State University (USA), PhD candidate, Electrical Engineering.  Richard has been awarded a copy of 2008 NIST Speaker Recognition Training Part 2 and Test for his work in speaker recognition.
Jun Ren - Massey University (New Zealand), PhD, Engineering.  Jun has been awarded a copy of TORGO Dysarthric Articulation for his work in speaker recognition.
Gozde Sahin - Istanbul Technical University (Turkey), PhD candidate, Computer Engineering and Informatics.  Gozde has been awarded a copy of 2009 CoNLL Parts 1 and 2 for her work in semantic role labeling.
Alexey Sholokhov - University of Eastern Finland (Finland), PhD candidate, Computer Sciences.  Alexey has been awarded a copy of RATS Speech Activity Detection for his work in speaker verification.
Stefan Watson - University of the West Indies (Jamaica),  PhD candidate, Physics.  Stefan has been awarded a copy of CMU Kids for his work in phonology and speech recognition.
For program information visit the Data Scholarship page. 

New publications         
(1) ACE 2007 Spanish DevTest - Pilot Evaluation was developed by LDC. This publication contains the complete set of Spanish development and test data to support the 2007 Automatic Content Extraction (ACE) technology evaluation, namely, newswire data annotated for entities and temporal expressions.

The objective of the ACE program was to develop automatic content extraction technology to support automatic processing of human language in text form from a variety of sources including newswire, broadcast programming and weblogs. In the 2007 evaluation, participants were tested on system performance for the recognition of entities, values, temporal expressions, relations, and events in Chinese and English and for the recognition of entities and temporal expressions in Arabic and Spanish. LDC's work in the ACE program is described in more detail on the LDC ACE project pages.

LDC has also released ACE 2007 Multilingual Training Corpus (LDC2014T18) which contains the Arabic and Spanish training data used in the 2007 evaluation.

The data consists of newswire material published in May 2005 from the following sources: Agence France-Presse, The Associated Press and Xinhua News Agency.

All files were annotated by two human annotators working independently. Discrepancies between the two annotations were adjudicated by a senior team member resulting in a gold standard file.

There are three annotation directories for each newswire story that contain an identical copy of the source text in SGML format and two associated annotated versions in XML format and tab delimited format. All text is UTF-8 encoded.

ACE 2007 Spanish DevTest - Pilot Evaluation is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*
(2) GALE Phase 4 Chinese Broadcast News Parallel Sentences was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Chinese source sentences and corresponding English translations selected from broadcast news data collected by LDC in 2008 and transcribed and translated by LDC or under its direction.

GALE Phase 4 Chinese Broadcast News Parallel Sentences includes 40 source-translation document pairs, comprising 156,429 tokens of Chinese source text and its English translation. Data is drawn from eight distinct Chinese programs broadcast in 2008 from China Central TV, a national and international broadcaster in Mainland China; and Voice of America, a U.S. government-funded broadcast programmer. The programs in this release feature news programs on current events topics.

The data was transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with the Quick Rich Transcription guidelines developed by LDC. Transcribers indicated sentence boundaries in addition to transcribing the text. Sentences were selected for translation in two steps. First, files were chosen using sentence selection scripts provided by GALE program participants SRI International and IBM. The output was then manually reviewed by LDC staff to eliminate problematic sentences. Selected files were reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Chinese to English translation guidelines and were provided with the full source documents containing the target sentences for their reference. Bilingual LDC staff performed quality control procedures on the completed translations.

Source data and translations are distributed in TDF format. TDF files are tab-delimited files containing one segment of text along with meta information about that segment. Each field in the TDF file is described in TDF_format.txt. All data are encoded in UTF-8.
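For readers who want to load such files, a minimal Python sketch of reading tab-delimited UTF-8 text follows. The column layout in the sample (file name, start time, end time, text) is made up for illustration; the actual fields are those described in TDF_format.txt, and real files may contain header or comment lines that this sketch does not handle.

```python
import csv
import io


def read_tdf(stream):
    """Yield each non-empty row of a tab-delimited text stream as a list of strings."""
    # QUOTE_NONE keeps quotation marks in the transcript text as literal characters.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:
            yield row


# Made-up sample with a hypothetical four-column layout.
sample = io.StringIO("file1\t0.0\t1.5\t你好\nfile1\t1.5\t3.0\t世界\n")
rows = list(read_tdf(sample))
```

With a file on disk, the same function can be given `open(path, encoding="utf-8", newline="")` in place of the in-memory sample.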

GALE Phase 4 Chinese Broadcast News Parallel Sentences is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(3) Karlsruhe Children's Text was developed by the Cooperative State University Baden-Württemberg, University of Education and Karlsruhe Institute of Technology. It consists of over 14,000 freely written, German sentences from more than 1,700 school children in grades one through eight.

The data collection was conducted in 2011-2013 at elementary and secondary schools in and around Karlsruhe, Germany. Students were asked to write as verbose a text as possible. Those in grades one to four were read two stories and were then asked to write their own stories. Students in grades five through eight were instructed to write on a specific theme, such as "Imagine the world in 20 years. What has changed?" The goal of the collection was to use the data to develop a spelling error classification system.

Annotators converted the handwritten text into digital form with all errors committed by the writers; they also created an orthographically correct version of every sentence. Metadata about the text was gathered, including the circumstances under which it was collected, information about the student writer and background about spelling lessons in the particular class. In a second step, the students' spelling errors were annotated into general groupings: grapheme level, syllable level, morphology and syntax. The files were anonymized in a third step.

This release also contains metadata regarding the writers’ language biography, teaching methodology, age, gender and school year. The average age of the participants was 11 years, and the gender distribution was nearly equal. Original handwriting is presented as JPEG format image files and the converted annotated text as UTF-8 plain text. Metadata is contained within each text file.

Karlsruhe Children's Text is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

Wednesday, September 16, 2015

LDC 2015 September Newsletter

New Publications
_______________________________________________________________

New Publications
(1) GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 4 was developed by LDC and contains 243,038 tokens of word aligned Chinese and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.

Some approaches to statistical machine translation include the incorporation of linguistic knowledge in word aligned text as a means to improve automatic word alignment and machine translation quality. 

This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. A set of word tags and alignment link tags are designed in the tagging scheme to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.

This release consists of Chinese source broadcast conversation (BC) and broadcast news (BN) programming collected by LDC in 2008 and 2009. The distribution by genre, words, character tokens and segments appears below:
Language  Genre  Files  Words    CharTokens  Segments
Chinese   BC     69     67,782   101,674     2,276
Chinese   BN     29     94,242   141,364     3,152
Total            98     162,024  243,038     5,428

Note that all token counts are based on the Chinese data only. One token is equivalent to one character and one word is equivalent to 1.5 characters.
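As a quick sanity check on the figures above, the column totals and the stated ratio of 1.5 characters per word can be recomputed from the per-genre rows. The following is only an illustrative sketch: the variable names are invented for this example, and the numbers are copied from the distribution listed above.

```python
# Figures copied from the distribution above:
# genre -> (files, words, char_tokens, segments)
rows = {
    "BC": (69, 67_782, 101_674, 2_276),
    "BN": (29, 94_242, 141_364, 3_152),
}
reported_total = (98, 162_024, 243_038, 5_428)

# Sum each column across genres and compare with the reported totals.
computed_total = tuple(sum(col) for col in zip(*rows.values()))
totals_match = computed_total == reported_total

# Characters per word for each genre; the note above gives 1.5.
ratios = {genre: chars / words for genre, (_, words, chars, _) in rows.items()}

print(totals_match)
for genre, ratio in ratios.items():
    print(genre, round(ratio, 2))
```

Both genres come out at roughly 1.5 characters per word, which suggests the word counts were derived from the character counts rather than from word segmentation.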
The Chinese word alignment tasks consisted of the following components:
  • Identifying, aligning, and tagging eight different types of links
  • Identifying, attaching, and tagging local-level unmatched words
  • Identifying and tagging sentence/discourse-level unmatched words
  • Identifying and tagging all instances of Chinese 的 (DE) except when they were a part of a semantic link
GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 4 is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus on disc.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*
(2) GALE Phase 3 and 4 Arabic Newswire Parallel Text was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phases 3 and 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Modern Standard Arabic source text and corresponding English translations selected from newswire data collected by LDC in 2007 and 2008 and transcribed and translated by LDC or under its direction.

This data includes 551 source-translation document pairs, comprising 156,775 tokens of Arabic source text and its English translation. Data is drawn from seven distinct Arabic newswire sources: Agence France Presse, Al Ahram, Al Hayat, Al-Quds Al-Arabi, An Nahar, Asharq Al-Awsat and Assabah.

The files in this release were transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with the Quick Rich Transcription guidelines developed by LDC. The transcribed and segmented files were reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Arabic to English translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations.  Source data and translations are distributed in TDF format.

GALE Phase 3 and 4 Arabic Newswire Parallel Text is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus on disc.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*
(3) NewSoMe Corpus of Opinion in News Reports was compiled at Barcelona Media and consists of Spanish, Catalan and Portuguese news reports annotated for opinions. It is part of the NewSoMe (News and Social Media) set of corpora presenting opinion annotations across several genres and covering multiple languages. NewSoMe is the result of an effort to build a unifying annotation framework for analyzing opinion in different genres, ranging from controlled text, such as news reports, to diverse types of user-generated content that includes blogs, product reviews and microblogs.

The source data in this release was obtained from various newspaper websites and consists of approximately 200 documents in each of Spanish, Catalan and Portuguese. The annotation was carried out manually through the crowdsourcing platform CrowdFlower with seven annotations per layer that were aggregated for this data set. The layers annotated were topic, segment, cue, subjectivity, polarity and intensity.

NewSoMe Corpus of Opinion in News Reports is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus on disc.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

Tuesday, August 18, 2015

LDC 2015 August Newsletter

Fall 2015 LDC Data Scholarship program - September 15 deadline approaching 

LDC at Interspeech 2015

2013 Data Pack deadline is September 15

LDC co-organizes LSA2016 Pre-conference Workshop

New publications:

Fall 2015 LDC Data Scholarship program - September 15 deadline approaching
Student applications for the Fall 2015 LDC Data Scholarship program are being accepted now through Tuesday, September 15, 2015, 11:59PM EST.  The LDC Data Scholarship program provides university students with access to LDC data at no cost.  This program is open to students pursuing both undergraduate and graduate studies in an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay. 

Students will need to complete an application which consists of a data use proposal and letter of support from their adviser.  For further information on application materials and program rules, please visit the LDC Data Scholarship page.  

Applicants can email their materials to the LDC Data Scholarship program.

LDC at Interspeech 2015
LDC will once again be exhibiting at Interspeech, held this year September 7-10 in Dresden, Germany.  Stop by booth 20 to learn more about recent developments at the Consortium and new publications.

Also, be on the lookout for the following presentations featuring LDC work:

Monday 7 September, Poster Session 3-9, 11:00–13:00
Investigating Consonant Reduction in Mandarin Chinese with Improved Forced Alignment: Jiahong Yuan and Mark Liberman (both LDC) 

Wednesday 9 September, Oral Session 36-5, 17:50–18:10
The Effect of Spectral Slope on Pitch Perception: Jianjing Kuang (UPenn) and Mark Liberman (LDC)

LDC will post conference updates via our Twitter feed and Facebook page. We hope to see you there!  

2013 Data Pack deadline is September 15
One month remains for not-for-profit and government organizations to create a custom data collection of eight corpora from among LDC’s 2013 releases. Selection options include: 1993-2007 United Nations Parallel Text, Chinese Treebank 8.0, CSC Deceptive Speech, GALE Arabic and Chinese speech and text releases, Greybeard, MADCAT training data, NIST 2012 Open Machine Translation (OpenMT) evaluation and progress sets, and more. The 2013 Data Pack is available for a flat rate of $3500 through September 15, 2015.

To license the Data Pack and select eight corpora, log in or register for an LDC user account and add the 2013 Data Pack and each of the eight data sets to your bin. Follow the check-out procedure, sign all applicable user agreements and select payment via wire transfer, purchase order or check. LDC will adjust the invoice total to reflect the data pack fee.

To pay via credit card, add the 2013 Data Pack to your bin and check out using the system prompts. At the completion of the transaction, send an email to LDC indicating the eight data sets to include in your order. 

LDC co-organizes LSA2016 Pre-conference Workshop
University of Arizona’s Malcah Yaeger-Dror and LDC’s Chris Cieri are organizing the upcoming LSA 2016 workshop “Preparing your Corpus for Archival Storage”. The session is sponsored by the National Science Foundation (BCS #1549994) and will be held on Thursday, January 7, 2016 in Washington, DC before the start of the 90th Annual Meeting of the Linguistic Society of America (LSA 2016).

The workshop will examine critical factors which must be considered when preparing data for comparison and sharing, following on from the topics discussed in the LSA 2012 workshop, "Coding for Sociolinguistic Archive Preparation". Invited speakers will discuss specific coding conventions for such factors as socioeconomic and educational speaker demographics, language choice, stance and footing.

There will be no additional registration fees to attend the session for those already taking part in the annual meeting. Students who are about to carry out their own fieldwork, or who have begun doing so, are eligible to apply for funding by November 2, 2015 to help defray the extra costs for attending the workshop. For more information about the speakers and topics, visit LDC’s workshop page.

New publications
(1) Arabic Learner Corpus was developed at the University of Leeds and consists of written essays and spoken recordings by Arabic learners collected in Saudi Arabia in 2012 and 2013. The corpus includes 282,732 words in 1,585 materials, produced by 942 students from 67 nationalities studying at pre-university and university levels. The average length of an essay is 178 words.

Two tasks were used to collect the written data, and participants had the choice to do one or both of them. In each of those tasks, learners were asked to write a narrative about a vacation trip and a discussion about the participant's study interest. Those choosing the first task generated a 40-minute timed essay without the use of any language reference materials. In the second task, participants completed the writing as a take-home assignment over two days and were permitted to use language reference materials.

The audio recordings were developed by allowing students a limited amount of time to talk about the topics above without using language reference materials.

The original handwritten essays were transcribed into an electronic text format. The corpus data consists of three types: (1) handwritten sheets scanned in PDF format; (2) audio recordings in MP3 format; and (3) textual Unicode data in plain text and XML formats (including the transcribed audio and transcripts of the handwritten essays). The audio files are either 44100 Hz 2-channel or 16000 Hz 1-channel MP3 files.

Arabic Learner Corpus is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus provided that they have completed the license agreement.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*
(2) GALE Phase 3 Arabic Broadcast Conversation Speech Part 1 was developed by LDC and consists of approximately 123 hours of Arabic broadcast conversation speech collected in 2007 by LDC, MediaNet (Tunis, Tunisia) and MTC (Rabat, Morocco) during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) program.

Corresponding transcripts are released as GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 1 (LDC2015T16).

Broadcast audio for the GALE program was collected at LDC’s facilities in Philadelphia, PA, USA and at three remote collection sites. The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources, for a total of over 30,000 hours of collected broadcast audio over the life of the program.

The broadcast conversation recordings in this release feature interviews, call-in programs and roundtable discussions focusing principally on current events from the following sources: Abu Dhabi TV, Al Alam News Channel, Al Arabiya, Aljazeera, Al Ordiniyah, Dubai TV, Lebanese Broadcasting Corporation, Oman TV, Saudi TV, and Syria TV.

This release contains 149 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. 
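For storage planning, the audio parameters above give a rough upper bound on the size of the decoded audio. The sketch below assumes the full 123 hours at 16000 Hz, one channel and 16 bits per sample; the function name is invented for this example, and the FLAC-compressed download will be smaller than this figure.

```python
def pcm_bytes(hours, sample_rate_hz=16_000, channels=1, bits_per_sample=16):
    """Size in bytes of uncompressed PCM audio of the given duration."""
    seconds = hours * 3600
    return seconds * sample_rate_hz * channels * (bits_per_sample // 8)

# Approximately 123 hours, per the corpus description above.
size = pcm_bytes(123)
print(size, round(size / 1e9, 2))  # bytes, then gigabytes (10^9 bytes)
```

Under these assumptions the decoded audio comes to about 14.17 GB, not counting file headers.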

GALE Phase 3 Arabic Broadcast Conversation Speech Part 1 is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*
(3) GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 1 was developed by LDC and contains transcriptions of approximately 123 hours of Arabic broadcast conversation speech collected in 2007 by LDC, MediaNet (Tunis, Tunisia) and MTC (Rabat, Morocco) during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) program.

Corresponding audio data is released as GALE Phase 3 Arabic Broadcast Conversation Speech Part 1 (LDC2015S11).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 733,233 tokens. The transcripts were created with the LDC-developed transcription tool, XTrans, a multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings. 
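Because the transcripts are tab-delimited UTF-8 text, they can be read with standard tools. The following is a minimal sketch rather than a description of the actual file layout: it assumes that the first non-comment line is a header row, that lines beginning with ";;" are comments, and that the text sits in a column named "transcript". All three are assumptions to be checked against the documentation included with the release, and the function names are invented for this example.

```python
import csv
import io

def read_tdf(lines, comment_prefix=";;"):
    """Yield one dict per data row of a tab-delimited transcript.

    Assumptions (not taken from the corpus documentation): the first
    non-comment line is a header row, and lines starting with
    `comment_prefix` are comments to skip.
    """
    kept = (ln for ln in lines if ln.strip() and not ln.startswith(comment_prefix))
    # QUOTE_NONE keeps quotation marks inside transcripts as literal text.
    reader = csv.DictReader(kept, delimiter="\t", quoting=csv.QUOTE_NONE)
    yield from reader

def count_tokens(rows, text_field="transcript"):
    """Count whitespace-separated tokens in the (hypothetical) text column."""
    return sum(len(row[text_field].split()) for row in rows)

# Tiny made-up sample standing in for a real file.
sample = io.StringIO(
    "file\tstart\tend\tspeaker\ttranscript\n"
    ";; illustrative comment line\n"
    "demo_file\t0.00\t2.50\tspk1\tمرحبا بكم\n"
    "demo_file\t2.50\t5.00\tspk2\tشكرا جزيلا لكم\n"
)
rows = list(read_tdf(sample))
print(len(rows), count_tokens(rows))
```

For a real file, the same function can be given a handle opened with `open(path, encoding="utf-8", newline="")`. Whitespace splitting is only one way to count tokens and may not reproduce the 733,233 figure reported above.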

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and quick rich transcription specification (QRTR), both of which are included in the documentation with this release.

GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 1 is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.