Thursday, June 16, 2016

LDC June 2016 Newsletter

Commercial use and LDC data

New publications:
_______________________________________________________________
Commercial use and LDC data
For-profit organizations are reminded that an LDC membership is a pre-requisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit our Licensing page for more information.

New Corpora

(1) Chinese Treebank 9.0 consists of approximately two million words of annotated and parsed text from Chinese newswire, government documents, magazine articles, various broadcast news and broadcast conversation programs, web newsgroups, weblogs, discussion forums, chat messages and transcribed conversational telephone speech. This new data set in the Chinese Treebank series adds more annotated web data and two new genres – chat messages and transcribed telephone speech.

There are 3,726 text files in this release, containing 132,076 sentences, 2,084,387 words and 3,247,331 characters (hanzi or foreign). The data is provided in UTF-8 encoding, and the annotation uses Penn Treebank-style labeled brackets. The data is provided in four formats: raw text, word-segmented, POS-tagged, and syntactically bracketed. All files were automatically verified and manually checked.
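As a rough illustration of the bracketed format, the sketch below pulls (tag, word) pairs out of a Penn Treebank-style labeled bracketing. The sample tree is invented for illustration and is not taken from the corpus, and the function name `leaves` is hypothetical; real files may carry headers, empty categories or other details this sketch does not handle.

```python
import re

# A leaf is an innermost "(TAG word)" pair with no nested brackets inside it.
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")

def leaves(bracketed):
    """Return (tag, word) pairs for every leaf in a labeled bracketing."""
    return LEAF.findall(bracketed)

# Invented example tree, not corpus data.
sample = "(IP (NP (NN 中国)) (VP (VV 发展)))"
for tag, word in leaves(sample):
    print(tag, word)
```

Joining the words gives the word-segmented view, and joining word and tag gives a POS-tagged view, so the three annotated formats can be thought of as progressively richer layers over the raw text.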

Chinese Treebank 9.0 is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) CHM150 (Corpus Hecho en México 150) was developed by the Speech Processing Laboratory of the Faculty of Engineering at the National Autonomous University of Mexico (UNAM) and consists of approximately 1.63 hours of Mexican Spanish speech, associated transcripts, and speaker metadata. The goal of this work was to support spoken term detection and forensic speaker identification.

This corpus is comprised of Mexican Spanish microphone speech from 75 male speakers and 75 female speakers in a quiet office environment. Speakers could answer pre-selected open questions or describe a particular painting shown to them on a computer monitor. Speaker metadata in this release includes age, gender, place of birth, place of residence and parents' nationalities.

CHM150 is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. This data is being made available at no cost to non-member organizations under a research license.

*

(3) GALE Phase 4 Arabic Weblog Parallel Sentences was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Modern Standard Arabic source text and corresponding English translations, selected from newsgroup and weblog data collected by LDC and translated by LDC or under its direction.

The data includes 1,067 source-translation document pairs, comprising 68,346 words (Arabic source) of translated data. 

Sentences were selected for translation in two steps. First, files were chosen using sentence selection scripts provided by GALE program participants SRI International and IBM. The output was then manually reviewed by LDC staff to eliminate problematic sentences. Selected files were reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Arabic to English translation guidelines and were provided with the full source documents containing the target sentences for their reference. Bilingual LDC staff performed quality control procedures on the completed translations.

GALE Phase 4 Arabic Weblog Parallel Sentences is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Monday, May 16, 2016

LDC May 2016 Newsletter

LDC at LREC 2016

New publications:
GALE Phase 4 Chinese Broadcast Conversation Speech
GALE Phase 4 Chinese Broadcast Conversation Transcripts 
_______________________________________________________________

LDC at LREC 2016

LDC will attend the 10th Language Resources and Evaluation Conference (LREC2016), hosted by ELRA, the European Language Resources Association. The conference will be held in Portorož, Slovenia from May 23-28 and features a broad range of sessions on language resources and human language technologies research. Seven LDC staff members will be presenting current work on topics including trends in HLT research, building language resources for autism spectrum disorders, data management plans, rapid development of morphological analyzers for typologically diverse languages, selection criteria for low resource language programs, multi-language speech collection for NIST LRE, novel incentives for collecting data and annotation from people, and more.

Following the conference, LDC’s presented papers and posters will be available on LDC’s Papers Page.


New Corpora

(1) SDP 2014 & 2015: Broad Coverage Semantic Dependency Parsing consists of data, tools, system results, and publications associated with the 2014 and 2015 tasks on Broad-Coverage Semantic Dependency Parsing (SDP) conducted in conjunction with the International Workshop on Semantic Evaluation (SemEval) and was developed by the SDP task organizers.

SemEval is an ongoing series of evaluations of computational semantic analysis systems intended to explore the nature of meaning in language. It evolved from the Senseval word sense disambiguation series to include semantic analysis tasks outside of word sense disambiguation.

This release is based on English, Chinese and Czech data from the following resources: Treebank-2 LDC95T17, Proposition Bank I LDC2004T14, NomBank v 1.0 LDC2008T23 and CCGBank LDC2005T13 (English); Chinese Treebank (e.g., Chinese Treebank 8.0 LDC2013T21) (Chinese); and Prague Dependency Treebank (e.g., Prague Dependency Treebank 2.0, LDC2006T01) (Czech).

The results are presented as graphs in three target representations: MRS-Derived Semantic Dependencies (DM), Enju Predicate–Argument Structures (PAS), and Prague Semantic Dependencies (PSD). As a fourth, additional target representation, CCGbank was converted to semantic dependency graphs (in the subdirectory ‘ccd’).


SDP 2014 & 2015: Broad Coverage Semantic Dependency Parsing is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus.  2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(2) GALE Phase 4 Chinese Broadcast Conversation Speech was developed by LDC and is comprised of approximately 172 hours of Mandarin Chinese broadcast conversation speech collected in 2008 by LDC and Hong Kong University of Science and Technology during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

Corresponding transcripts are released as GALE Phase 4 Chinese Broadcast Conversation Transcripts (LDC2016T12).

The broadcast conversation recordings in this release feature interviews, call-in programs and roundtable discussions focusing principally on current events and are contained in 236 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Chinese speaker following Audit Procedure Specification Version 2.0 which is included in this release.


GALE Phase 4 Chinese Broadcast Conversation Speech is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(3) GALE Phase 4 Chinese Broadcast Conversation Transcripts was developed by LDC and contains transcriptions of approximately 172 hours of Chinese broadcast conversation speech collected in 2008 by LDC and Hong Kong University of Science and Technology during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

Corresponding audio data is released as GALE Phase 4 Chinese Broadcast Conversation Speech (LDC2016S03).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 2,259,952 tokens.

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC’s quick transcription guidelines (QTR) and quick rich transcription specification (QRTR). QTR transcription consists of quick (near-) verbatim, time-aligned transcripts plus speaker identification with minimal additional mark-up. QRTR adds additional structural information such as topic boundaries and manual sentence unit annotation.


GALE Phase 4 Chinese Broadcast Conversation Transcripts is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus.  2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.


Monday, April 18, 2016

LDC April 2016 Newsletter

New publications:

_________________________________________________________________________

New Corpora

(1) H1 Children's Writing was developed by the Cooperative State University Baden-Württemberg and the University of Education. It consists of 996 texts written over three months by 88 German school children aged seven through eleven.

Texts were written within regular class settings. The students were presented with a picture and were asked to write a story, to describe the picture or if unable to write a text, to list what they saw in the picture. The pictures were designed to enhance the output with respect to important spelling error categories, namely, the marking of short vowels with a silent consonant letter and the correct spelling of the long vowel. The children were allowed at least 15 minutes to write the texts. This exercise was repeated weekly for 12 weeks.

Most of the participants were multilingual. The metadata with this release includes: school week of collection; school type (always elementary school); age; gender; grade/classroom; language spoken at home; and school materials used for German (Jojo).

In all, 996 texts representing 62,764 tokens were collected. The texts were digitized in two forms: (1) the original text, including all errors (achieved), and (2) the intended (target) text, where all spelling errors were removed. Annotations were added to both the achieved text and the target text to distinguish words that should not be analyzed for spelling errors, such as names or foreign words. For sentence-level analysis, syntax errors were annotated by marking substitutions, deletions and insertions at the word level. In such cases, the used word was analyzed for spelling, and the correct word was used for sentence structure analysis.

Original handwriting is presented as PDF documents and the converted text as UTF-8 plain text in CSV documents.

H1 Children's Writing is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(2) GALE Phase 4 Arabic Broadcast Conversation Parallel Sentences was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Modern Standard Arabic source sentences and corresponding English translations selected from broadcast conversation data collected by LDC in 2007 and 2008 and transcribed and translated by LDC or under its direction.

The data includes 170 source-translation document pairs, comprising 44,064 words (Arabic source) of translated data. Data is drawn from 45 distinct Arabic broadcast conversation sources.

GALE Phase 4 Arabic Broadcast Conversation Parallel Sentences is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(3) HAVIC Pilot Transcription was developed by LDC and is comprised of approximately 72 hours of user-generated videos with transcripts based on the English speech audio extracted from the videos. This data set was created in collaboration with NIST (the National Institute of Standards and Technology) as part of the HAVIC (the Heterogeneous Audio Visual Internet Collection) project, the goal of which is to advance multimodal event detection and related technologies.

LDC has developed a large, heterogeneous, annotated multimodal corpus for HAVIC that has been used in the NIST-sponsored MED (Multimedia Event Detection) task for several years. HAVIC Pilot Transcription supported an experiment to produce a verbatim transcript (quick and rich transcription) based on audio extracted from user-generated videos. It contains the pilot transcripts for selected MED 2011 video files as well as the associated videos.

Annotators generated the transcripts using XTrans, which supports manual transcription across multiple channels, languages and platforms. HAVIC transcription guidelines are included in the documentation for this release. All transcription files are in .tdf format, a plain-text, flat-table format with 13 tab-delimited fields. All video files are in .mp4 format (h264), with varying bit-rates and levels of audio fidelity and video resolution.
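A minimal sketch of reading such a flat table is shown below. It relies only on what is stated above (plain text, tab-delimited, 13 fields per row); the function name `read_tdf`, the treatment of lines beginning with `;;` as comments, and the sample row are assumptions made for illustration, so the release documentation should be consulted for the actual field names and conventions.

```python
import io

EXPECTED_FIELDS = 13  # field count given in the release description

def read_tdf(stream):
    """Yield each data row as a list of 13 strings.

    Blank lines and lines starting with ';;' are skipped; treating ';;'
    as a comment prefix is an assumption, not something stated above.
    """
    for line_no, line in enumerate(stream, start=1):
        line = line.rstrip("\r\n")
        if not line or line.startswith(";;"):
            continue
        fields = line.split("\t")
        if len(fields) != EXPECTED_FIELDS:
            raise ValueError(
                f"line {line_no}: expected {EXPECTED_FIELDS} fields, got {len(fields)}"
            )
        yield fields

# Invented sample: one comment line and one row of 13 placeholder values.
sample = ";; hypothetical comment\n" + "\t".join(f"v{i}" for i in range(13)) + "\n"
for row in read_tdf(io.StringIO(sample)):
    print(len(row), row[0], row[-1])
```

For a real file, the same function could be passed a handle opened with `open(path, encoding="utf-8")`. Rows are returned as plain lists instead of named records because the field names are not given here.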

HAVIC Pilot Transcription is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee. 

Tuesday, March 15, 2016

LDC March 2016 Newsletter

New publications:

GALE Phase 3 and 4 Arabic Web Parallel Text 
GALE Phase 3 and 4 Chinese Broadcast Conversation Parallel Text
_____________________________________________________________________

New Corpora
(1) DEFT Narrative Text was developed by LDC and contains proxy reports and their source newswire used to support DARPA's Deep Exploration and Filtering of Text (DEFT) program. One of the goals of the DEFT program was to develop technologies that can perform various NLP tasks on data in a variety of genres, both formal and informal.

LDC provided source data and annotations for DEFT system development. DEFT Narrative Text consists of "proxy reports" (and "multi-proxy reports") in English. (Multi-)proxy reports are intended to mimic the format and other features of some types of government analyst reports using content from newswire articles. The corresponding English newswire source documents are also included in the release.

LDC staff manually selected the source newswire from English Gigaword Fifth Edition (LDC2011T07).

The newswire source documents are XML files following the Gigaword corpus format. The proxy reports are in plain text format.

DEFT Narrative Text is distributed via web download.
2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(2) GALE Phase 3 and 4 Arabic Web Parallel Text was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phases 3 and 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Modern Standard Arabic source text and corresponding English translations selected from weblog and newsgroup data collected by LDC and translated by LDC or under its direction.

The data includes 124 source-translation document pairs, comprising 61,662 tokens of Arabic source text and its English translation. Data is drawn from four distinct Arabic weblog and newsgroup sources.

GALE Phase 3 and 4 Arabic Web Parallel Text is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(3) GALE Phase 3 and 4 Chinese Broadcast Conversation Parallel Text was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phases 3 and 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Chinese source text and corresponding English translations selected from broadcast conversation data collected by LDC between 2006 and 2008 and transcribed and translated by LDC or under its direction.

This data includes 63 source-translation document pairs, comprising 487,466 tokens of Chinese source text and its English translation. Data is drawn from 19 distinct Chinese programs broadcast between 2006 and 2008.

Data was manually selected for translation according to several criteria, including linguistic features, transcription features and topic features. The transcribed and segmented files were then reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Chinese to English translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations.

GALE Phase 3 and 4 Chinese Broadcast Conversation Parallel Text is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

Monday, February 15, 2016

LDC 2016 February Newsletter

Only two weeks left to enjoy 2016 membership savings

Spring 2016 LDC Data Scholarship recipients

How to Share Data through LDC webinar on YouTube

New publications:
_______________________________________________________________________

Only two weeks left to enjoy 2016 membership savings
There’s still time to save on 2016 membership fees. Now through March 1, all organizations receive a 5% discount when they join for MY2016. MY2015 members are eligible for an additional 5% off the fee (10% total savings) when they renew before March 1.  

To join, create or sign into your LDC user account, select your preferred membership type from the Catalog, add the item to your bin and follow the check-out process. The Membership Office will apply any discounts. Alternatively, if you have already received a renewal invoice from LDC, you can simply pay against that.

For more information on the benefits of membership, visit Join LDC.

Spring 2016 LDC Data Scholarship recipients
Congratulations to the recipients of LDC's Spring 2016 data scholarships:

Shefali Waldekar: Indian Institute of Technology Kharagpur (India), PhD Candidate, Electronics and Electrical Communications Engineering. Shefali is awarded copies of 2002 Rich Transcription Broadcast News and Conversational Telephone Speech and 2005 Spring NIST Rich Transcription (RT05-S) Evaluation Set for her research in audio diarization.

Nikola Ivanov Nikolov: University of Zurich and ETH Zurich (Switzerland), MSc candidate in Informatics. Nikola is awarded a copy of Annotated English Gigaword for his research in text summarization.

Om Prakash Singh: Indian Institute of Technology, Guwahati (India), Research scholar in spoken language identification. Om is awarded a copy of NIST Language Recognition Evaluation Test Set for his work in language identification.

Moshen Mohammadi: Iranian Research Institute for Electrical Engineering (Iran), PhD Candidate in Communications. Moshen is awarded copies of the 2008 NIST Speaker Recognition Evaluation Training Sets 1 and 2, the Evaluation Test Set and the Supplemental Set for his work in speaker recognition in noisy environments.

For program information visit the Data Scholarship page.

How to Share Data through LDC webinar on YouTube
LDC’s first webinar, How to Share Data through LDC, is now available for viewing on our YouTube page. Presented live on January 22, 2016, the webinar outlined in easy steps the process for submitting language resources to LDC for publication in the Catalog. In addition, discussion topics included the benefits of sharing data through LDC, the corpus life cycle, data delivery, quality control and more.

New Corpora
(1) BOLT Chinese Discussion Forums was developed by LDC and consists of 1,597,500 discussion forum threads in Chinese harvested from the Internet using a combination of manual and automatic processes.

The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference. The material in this release represents the Chinese source data in the discussion forum genre.

Collection was seeded based on the results of manual data scouting by native speaker annotators. When multiple threads from a forum were submitted, the entire forum was automatically harvested and added to the collection. The scale of the collection precluded manual review of all data. Only a small portion of the threads included in this release were manually reviewed, and it is expected that there may be some offensive or otherwise undesired content as well as some threads that contain a large amount of non-Chinese content. Language identification was performed on all threads in this corpus (using CLD2), and threads for which the results indicated a high probability of largely non-Chinese content are identified in this release.

BOLT Chinese Discussion Forums is distributed via web download as a multi-part zip file. Consult the Using LDC Data page (https://www.ldc.upenn.edu/data-management/using) for more information about this format.

2016 Subscription Members will automatically receive two copies of this corpus.  2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*
(2) GALE Phase 3 Arabic Broadcast Conversation Speech Part 2 was developed by LDC and is comprised of approximately 129 hours of Arabic broadcast conversation speech collected in 2007 and 2008 by LDC, MediaNet, Tunis, Tunisia and MTC, Rabat, Morocco during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) program.

Corresponding transcripts are released as GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 2 (LDC2016T06).

These broadcast conversation recordings feature interviews, call-in programs and roundtable discussions focusing principally on current events and are contained in 142 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Arabic speaker following Audit Procedure Specification Version 2.0 which is included in this release.

GALE Phase 3 Arabic Broadcast Conversation Speech Part 2 is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*
(3) GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 2 was developed by LDC and contains transcriptions of approximately 129 hours of Arabic broadcast conversation speech collected in 2007 and 2008 by LDC, MediaNet, Tunis, Tunisia and MTC, Rabat, Morocco during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) program.

Corresponding audio data is released as GALE Phase 3 Arabic Broadcast Conversation Speech Part 2 (LDC2016S01).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 845,791 tokens. The transcripts were created with the LDC tool, XTrans, which supports manual transcription and annotation of audio recordings.

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and quick rich transcription specification (QRTR) both of which are included in the documentation with this release.

GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 2 is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus.  2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.


Thursday, January 21, 2016

LDC 2016 January Newsletter

CFP for LREC 2016 Novel Incentives Workshop

LDC Membership Discounts for MY 2016 Still Available

New publications:
______________________________________________________________

CFP for LREC 2016 Novel Incentives Workshop
The first workshop on novel incentives in linguistic data collection will take place on May 28, 2016 in conjunction with the Tenth International Conference on Language Resources and Evaluation (LREC2016) in Portorož, Slovenia.

The workshop, Novel Incentives for Collecting Linguistic Data and Annotation from People: types, implementation, tasking requirements, workflow and results, opens the discussion on incentives in data collection by describing novel approaches and comparing them with traditional monetary incentives.

The workshop is accepting papers through February 6, 2016. For more information visit the workshop webpage.


LDC Membership Discounts for MY 2016 Still Available
If you are considering joining LDC for Membership Year 2016 (MY2016), there is still time to save on membership fees.  Any organization which joins or renews membership for 2016 through March 1, 2016, is entitled to a 5% discount on membership fees.  Organizations which held membership for MY2015 can receive a 10% discount on fees provided they renew prior to March 1, 2016. Publications planned for release in 2016 include multilingual language packs, BOLT discussion forum and DEFT narrative text corpora, HAVIC video clips and transcripts and the latest Arabic and Chinese treebanks.

New publications
(1) Arabic Treebank - Weblog was developed by LDC and consists of Arabic weblog data with part-of-speech, morphology, gloss and syntactic tree annotation.

The ongoing Penn Arabic Treebank Project (PATB) supports research in Arabic-language natural language processing and human language technology development. Generally, the PATB consists of two distinct phases: (a) part-of-speech (POS) tagging, which divides the text into lexical tokens and gives relevant information about each token such as lexical category, inflectional features, and a gloss (referred to as POS for convenience, although it includes morphological and gloss information not traditionally included with part-of-speech annotation), and (b) Arabic treebanking, which characterizes the constituent structures of word sequences, provides categories for each non-terminal node, and identifies null elements, co-reference, traces and so on.

The data contains 243,117 source tokens before clitics were split, and 308,996 tree tokens after clitics were separated for treebank annotation. The source material is weblogs collected by LDC from various sources.

Arabic Treebank - Weblog is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus.  2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*
(2) NewSoMe Corpus of Opinion in Blogs was compiled at Barcelona Media and consists of English and Spanish blogs annotated for opinions. It is part of the NewSoMe (News and Social Media) set of corpora presenting opinion annotations across several genres and covering multiple languages. NewSoMe is the result of an effort to build a unifying annotation framework for analyzing opinion in different genres, ranging from controlled text, such as news reports, to diverse types of user-generated content that includes blogs, product reviews and microblogs.

LDC has also released NewSoMe Corpus of Opinion in News Reports (LDC2015T17).

The data consists of 108 English documents and 191 Spanish documents. The annotation was carried out manually through the crowdsourcing platform CrowdFlower with seven annotations per layer that were aggregated for this data set. The layers annotated were topic, segment, cue, subjectivity, polarity and intensity.


NewSoMe Corpus of Opinion in Blogs is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus.  2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*
(3) GALE Phase 4 Chinese Weblog Parallel Sentences was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Chinese source sentences and corresponding English translations selected from newsgroup and weblog data collected by LDC and translated by LDC or under its direction.

GALE Phase 4 Chinese Weblog Parallel Sentences includes 231 source-translation document pairs, comprising 92,501 tokens of Chinese source text and its English translation.

Sentences were selected for translation in two steps. First, files were chosen using sentence selection scripts provided by GALE program participants SRI International and IBM. The output was then manually reviewed by LDC staff to eliminate problematic sentences. Selected files were reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Chinese to English translation guidelines and were provided with the full source documents containing the target sentences for their reference. Bilingual LDC staff performed quality control procedures on the completed translations.


GALE Phase 4 Chinese Weblog Parallel Sentences is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus.  2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee. 

Wednesday, December 16, 2015

LDC 2015 December Newsletter

Renew your LDC membership today

Spring 2016 LDC Data Scholarship Program - deadline approaching

LDC at LSA 2016

LDC to close for Winter Break

New publications
________________________________________________________________________

Renew your LDC membership today
Membership Year 2016 (MY2016) discounts are available for those who keep their membership current and join early in the year. Check here for further information including our planned publications for MY2016.

Now is also a good time to consider joining LDC for the current and open membership years, MY2015 and MY2014.  MY2015 includes data such as RATS Speech Activity Detection and updates to Penn Treebank. MY2014 remains open through the end of the 2015 calendar year and its publications include UN speech data, 2009 NIST LRE test set, 2007 ACE multilingual data, and multi-channel WSJ audio. For full descriptions of these data sets, visit our Catalog.

Spring 2016 LDC Data Scholarship Program - deadline approaching
The deadline for the Spring 2016 LDC Data Scholarship Program is right around the corner! Student applications are being accepted now through January 15, 2016, 11:59PM EST. The LDC Data Scholarship program provides university students with access to LDC data at no cost. This program is open to students pursuing both undergraduate and graduate studies in an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay.

Students will need to complete an application which consists of a data use proposal and letter of support from their adviser. For further information on application materials and program rules, please visit the LDC Data Scholarship page.

Students can email their applications to the LDC Data Scholarships program. Decisions will be sent by email from the same address.

LDC at LSA 2016
LDC will be exhibiting at the Annual Meeting of the Linguistic Society of America, held January 7-10, 2016 in Washington, DC. Stop by booth 110 to learn more about recent developments at the Consortium and new publications. Also, be on the lookout for the following presentations:

Satellite Workshop: Preparing Your Corpus for Archival Storage
Malcah Yaeger-Dror (University of Arizona) and Christopher Cieri (LDC)
Thursday, January 7, 2016 - 8:00am to 3:00pm, Salon 4

Broadening connections among researchers in linguistics and human language technologies
Jeff Good (University at Buffalo) and Christopher Cieri (LDC)
Friday, January 8, 2016 - 7:30am to 9:00am, Salon 1

Diachronic development of pitch contrast in Seoul Korean
Sunghye Cho (UPenn), Yong-cheol Lee (Cheongju University) and Mark Liberman (LDC)
Friday, January 8, 2016 - 2:00pm to 5:00pm, Salon 1

LDC will post conference updates via our Twitter feed and Facebook page. We hope to see you there!  

LDC to close for Winter Break
LDC will be closed from Friday, December 25, 2015 through Friday, January 1, 2016 in accordance with the University of Pennsylvania Winter Break Policy. Our offices will reopen on Monday, January 4, 2016. Requests received for membership renewals and corpora during the Winter Break will be processed at that time.

New publications
(1) 2006 CoNLL Shared Task - Arabic & Czech consists of Arabic and Czech dependency treebanks used as part of the CoNLL 2006 shared task on multi-lingual dependency parsing.

This corpus is cross listed with ELRA as ELRA-W0087.

The Conference on Computational Natural Language Learning (CoNLL) is accompanied every year by a shared task intended to promote natural language processing applications and evaluate them in a standard setting. In 2006, the shared task was devoted to the parsing of syntactic dependencies using corpora from up to thirteen languages. The task aimed to define and extend the then-current state of the art in dependency parsing, a technology that complemented previous tasks by producing a different kind of syntactic description of input text. More information about the 2006 shared task is available on the CoNLL-X web page.

The source data in this release consists principally of news and journal texts. The individual data sets are subsets of the following:

2006 CoNLL Shared Task - Arabic & Czech is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus. 2015 Standard Members may request a copy as part of their 16 free membership corpora. This data is being made available at no cost to non-member organizations under a research license.


*
(2) 2006 CoNLL Shared Task - Ten Languages consists of dependency treebanks in ten languages used as part of the CoNLL 2006 shared task on multi-lingual dependency parsing. The languages covered in this release are: Bulgarian, Danish, Dutch, German, Japanese, Portuguese, Slovene, Spanish, Swedish and Turkish.

This corpus is cross listed and jointly released with ELRA as ELRA-W0086.

The Conference on Computational Natural Language Learning (CoNLL) is accompanied every year by a shared task intended to promote natural language processing applications and evaluate them in a standard setting. In 2006, the shared task was devoted to the parsing of syntactic dependencies using corpora from up to thirteen languages. The task aimed to define and extend the then-current state of the art in dependency parsing, a technology that complemented previous tasks by producing a different kind of syntactic description of input text. More information about the 2006 shared task is available on the CoNLL-X web page.

The source data in the treebanks in this release consists principally of various texts (e.g., textbooks, news, literature) annotated in dependency format. In general, dependency grammar is based on the idea that the verb is the center of the clause structure and that other units in the sentence are connected to the verb as directed links or dependencies. 
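
As a rough illustration of the dependency idea described above, the following Python sketch stores each word with a link to its governing word and treats the verb as the root. The sentence, the relation labels, and the names Token, find_root and dependents_of are all hypothetical and chosen for illustration; they are not taken from the treebanks in this release.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    index: int      # 1-based position in the sentence
    form: str       # surface word
    head: int       # index of the governing token; 0 marks the root
    relation: str   # illustrative relation label (not from the corpus)


def find_root(tokens: List[Token]) -> Token:
    """Return the single token whose head is 0 (the root of the tree)."""
    roots = [t for t in tokens if t.head == 0]
    if len(roots) != 1:
        raise ValueError("expected exactly one root token")
    return roots[0]


def dependents_of(tokens: List[Token], head_index: int) -> List[Token]:
    """Return the tokens directly linked to the token at head_index."""
    return [t for t in tokens if t.head == head_index]


# Hypothetical example sentence: the verb "reads" is the root,
# and the other words depend on it.
SENTENCE = [
    Token(1, "She", 2, "subject"),
    Token(2, "reads", 0, "root"),
    Token(3, "books", 2, "object"),
]

if __name__ == "__main__":
    root = find_root(SENTENCE)
    print("root:", root.form)
    for dep in dependents_of(SENTENCE, root.index):
        print(f"{dep.form} --{dep.relation}--> {root.form}")
```

Each directed link runs from a dependent to its head, so the whole sentence can be recovered by following head indices back to the root.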
The individual data sets are:
2015 Subscription Members will automatically receive two copies of this corpus. 2015 Standard Members may request a copy as part of their 16 free membership corpora. This data is being made available at no cost to non-member organizations under a research license.
*
(3) GALE Phase 3 Chinese Broadcast News Speech was developed by LDC and comprises approximately 150 hours of Mandarin Chinese broadcast news speech collected in 2007 and 2008 by LDC and Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

Corresponding transcripts are released as GALE Phase 3 Chinese Broadcast News Transcripts (LDC2015T25).

The broadcast news recordings in this release feature news broadcasts focusing principally on current events from the following sources: Anhui TV, China Central TV (CCTV), Phoenix TV and Voice of America (VOA).

This release contains 279 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Chinese speaker following Audit Procedure Specification Version 2.0, which is included in this release.

GALE Phase 3 Chinese Broadcast News Speech is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus. 2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.
*
(4) GALE Phase 3 Chinese Broadcast News Transcripts was developed by LDC and contains transcriptions of approximately 150 hours of Chinese broadcast news speech collected in 2007 and 2008 by LDC and Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

Corresponding audio data is released as GALE Phase 3 Chinese Broadcast News Speech (LDC2015S13).

The broadcast news recordings for transcription feature news broadcasts focusing principally on current events from the following sources: Anhui TV,  China Central TV (CCTV), Phoenix TV and Voice of America (VOA).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 1,933,695 tokens. The transcripts were created with the LDC-developed transcription tool, XTrans, a multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings. 

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and quick rich transcription specification (QRTR), both of which are included in the documentation with this release.

GALE Phase 3 Chinese Broadcast News Transcripts is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus. 2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.