Tuesday, July 19, 2016

LDC July 2016 Newsletter

Fall 2016 Data Scholarship Program

2015 User Survey Results

New Publications:
______________________________________________________________________

Fall 2016 Data Scholarship Program

Applications are now being accepted through Thursday, September 15, 2016 for the Fall 2016 LDC Data Scholarship program. The LDC Data Scholarship program provides university students with access to LDC data at no cost.

This program is open to students pursuing both undergraduate and graduate studies in an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay. The selection process is highly competitive.

The application consists of two parts:
(1) Data Use Proposal. Applicants must submit a two-page proposal describing their intended use of the data. The proposal should state which data the student plans to use, how the data will benefit their research project, the proposed methodology or algorithm which will be used and how success will be measured.

Applicants should consult the Catalog for a complete list of data distributed by LDC. Due to certain restrictions, a handful of LDC corpora are restricted to members of the Consortium. Applicants are advised to select a maximum of one to two databases.

(2) Letter of Support. Applicants must submit one letter of support from their thesis adviser or department chair. The letter must be signed and printed on letterhead, describe the student and the research, evaluate the probability of success and confirm that the department or university lacks the funding to pay the full non-member fee for the data. 

For further information on application materials and program rules, please visit the LDC Data Scholarship page.


2015 User Survey Results
LDC conducted its fourth user survey in December 2015. This survey built on the previous surveys conducted in 2006, 2007 and 2012 to assess user sentiment and also asked for the evaluation of key LDC-related topics including:
·         Opinions on the new website and usability of the Catalog
·         Use and satisfaction with the enhanced user services and e-commerce system
·         LDC’s Data Management Plan capabilities
·         Suggestions for future publications and preferred data delivery methods
·         Use of web services for data access and processing

Overall, survey respondents were satisfied with LDC’s data, membership options, website, Catalog and enhanced user services. Participants cited the top five most useful corpora received between 2012 and 2015 as OntoNotes Release 5.0, TIMIT, TAC KBP Reference Knowledge Base, Penn Discourse Treebank V 2.0, and Multi-Channel WSJ Audio. Three-fourths of respondents prefer digital delivery of data, and the top three languages for current research demands were identified as English, Chinese and Spanish.

We thank everyone who participated in this survey. Responses will benefit the future of the Consortium and will help LDC to better meet the needs of our members and data licensees.


New Corpora

(1) English Speed Networking Conversational Transcripts was developed at the University of the West of England and contains 388 transcripts of English face-to-face and instant messaging conversations  about business ideas collected in 2014 and 2015 from participants (undergraduate students) playing different power roles.

This corpus was created to examine communication accommodation, specifically, the ways in which an individual's linguistic style is affected by social power and personality. The data was collected in two studies. In the first study, 40 participants had a series of paired five-minute face-to-face conversations playing either a high, low or neutral power role. The same procedure was followed in the second study except that participants discussed business ideas via instant messaging.

The face-to-face conversations were audio-recorded and transcribed verbatim.

All transcripts are presented as UTF-8 plain text files.

English Speed Networking Conversational Transcripts is distributed via web download.
2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US $400.00.

*

(2) Digital Archive of Southern Speech - NLP Version (DASS-NLP) was developed by LDC as an alternate version of Digital Archive of Southern Speech (DASS) (LDC2012S03) suitable for natural language processing and human language technology applications. Specifically, the original audio files have been converted to 16kHz 16-bit flac compressed wav and file names have been normalized to facilitate automatic processing.

DASS was developed by the University of Georgia. It is a subset of the Linguistic Atlas of the Gulf States (LAGS), which is in turn part of the Linguistic Atlas Project (LAP). DASS-NLP contains approximately 366 hours of English speech data from 30 female speakers and 34 male speakers, along with associated metadata about the speakers and the recordings, and maps in .jpeg format relating to the recording locations.

LAP consists of a set of survey research projects about the words and pronunciation of everyday American English, the largest project of its kind in the United States. Interviews with thousands of native speakers across the country have been carried out since 1929. LAGS surveyed the everyday speech of Georgia, Tennessee, Florida, Alabama, Mississippi, Arkansas, Louisiana, and Texas in a series of 914 audio-taped interviews conducted from 1968-1983.

The speakers' average age is 61 years; there are 30 women and 34 men from the Gulf States region represented in this release. The interviews cover common topics such as family, the weather, household articles and activities, agriculture and social conditions.   

Digital Archive of Southern Speech - NLP Version is distributed via web download.

2016 Not-for-Profit Subscription Members will automatically receive two copies of this corpus. 2016 For-Profit Subscription Members will receive two copies provided they have submitted a completed copy of the For-Profit Member User License Agreement for Digital Archive of Southern Speech – NLP Version (LDC2016S05). 2016 Standard Members may request a copy as part of their 16 free membership corpora. This data is being made available at no cost for non-member organizations under a research license.

 *

(3) GALE Phase 3 and 4 Chinese Broadcast News Parallel Text was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phases 3 and 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Chinese source text and corresponding English translations selected from broadcast news data collected by LDC between 2006 and 2008 and transcribed and translated by LDC or under its direction.

GALE Phase 3 and 4 Chinese Broadcast News Parallel Text includes 76 source-translation document pairs, comprising 614,608 tokens of Chinese source text and its English translation. Data is drawn from 16 distinct Chinese programs broadcast between 2006 and 2008 by China Central TV, a national and international broadcaster in Mainland China and Phoenix TV, a Hong Kong-based satellite television station. The programs in this release feature news programs on current events topics.

The files in this release were transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with the Quick Rich Transcription guidelines developed by LDC.

Source data and translations are distributed in TDF format. All data are encoded in UTF-8.

GALE Phase 3 and 4 Chinese Broadcast News Parallel Text is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US $1750.00.

 *

(4) IARPA Babel Cantonese Language Pack IARPA-babel101b-v0.4c was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 215 hours of Cantonese conversational and scripted telephone speech collected in 2011 along with corresponding transcripts.

The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

The Cantonese speech in this release represents that spoken in the Chinese provinces of Guangdong and Guangxi, and within those provinces, among five dialect groups. The gender distribution among speakers is approximately even; speakers' ages range from 16 years to 67 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

All audio data is presented as 8kHz 8-bit a-law encoded audio in sphere format. Transcripts are available in two versions: simplified Chinese characters and a romanization scheme based on the Yale system, both encoded in UTF-8.
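
For readers unfamiliar with A-law, the sketch below expands 8-bit A-law codes into signed 16-bit linear samples using the standard G.711 expansion logic. It is only an illustration: the function names `alaw_to_linear` and `decode_alaw` are hypothetical, and the sketch assumes the SPHERE header has already been stripped so that only raw sample bytes remain.

```python
def alaw_to_linear(byte: int) -> int:
    """Expand one 8-bit A-law code to a signed 16-bit linear PCM sample."""
    byte ^= 0x55                      # undo the even-bit inversion applied by the encoder
    mantissa = (byte & 0x0F) << 4     # 4-bit mantissa, shifted into position
    segment = (byte & 0x70) >> 4      # 3-bit segment number (exponent)
    if segment == 0:
        magnitude = mantissa + 8
    else:
        magnitude = (mantissa + 0x108) << (segment - 1)
    # In A-law, a set sign bit marks a positive sample.
    return magnitude if byte & 0x80 else -magnitude


def decode_alaw(payload: bytes) -> list[int]:
    """Decode a raw A-law payload (header assumed to be removed already)."""
    return [alaw_to_linear(b) for b in payload]
```

In practice an audio toolkit would normally handle both the header and the decoding; the sketch only shows why 8-bit samples still cover a 16-bit dynamic range.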

IARPA Babel Cantonese Language Pack IARPA-babel101b-v0.4c is distributed via web download.

2016 Subscription Members will receive two copies of this corpus provided they have submitted a completed copy of the IARPA User Agreement for Not-for-Profit Members or the IARPA User Agreement for For-Profit Members. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US $25.00 under a research license.


Thursday, June 16, 2016

LDC June 2016 Newsletter

Commercial use and LDC data

New publications:
_______________________________________________________________
Commercial use and LDC data
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit our Licensing page for more information.

New Corpora

(1) Chinese Treebank 9.0 consists of approximately two million words of annotated and parsed text from Chinese newswire, government documents, magazine articles, various broadcast news and broadcast conversation programs, web newsgroups, weblogs, discussion forums, chat messages and transcribed conversational telephone speech. This new data set in the Chinese Treebank series adds more annotated web data and two new genres – chat messages and transcribed telephone speech.

There are 3,726 text files in this release, containing 132,076 sentences, 2,084,387 words, 3,247,331 characters (hanzi or foreign). The data is provided in the UTF-8 encoding, and the annotation has Penn Treebank-style labeled brackets. The data is provided in four different formats: raw text, word segmented, POS-tagged, and syntactically bracketed formats. All files were automatically verified and manually checked.
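
As an illustration of what Penn Treebank-style labeled brackets look like to a program, the following sketch parses one bracketed tree into nested lists and pulls out the word/tag pairs. The example tree is invented for this note and is not taken from the corpus, the function names are hypothetical, and the sketch assumes every opening bracket is followed by a label (files that wrap each tree in an extra unlabeled bracket would need that wrapper removed first).

```python
import re


def parse_brackets(text):
    """Parse one bracketed tree into nested lists: [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        node = [tokens[pos]]              # constituent or POS label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read())       # nested constituent
            else:
                node.append(tokens[pos])  # leaf word
                pos += 1
        pos += 1                          # consume ")"
        return node

    return read()


def tagged_words(node):
    """Yield (word, tag) pairs from preterminal nodes."""
    if len(node) == 2 and isinstance(node[1], str):
        yield node[1], node[0]
    else:
        for child in node[1:]:
            if isinstance(child, list):
                yield from tagged_words(child)


# Invented example tree, not drawn from the release.
tree = parse_brackets("(IP (NP (NN 经济)) (VP (VV 发展)))")
pairs = list(tagged_words(tree))
```

The same nested structure can be flattened to recover the word-segmented or POS-tagged views, which is roughly how the four distributed formats relate to one another.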

Chinese Treebank 9.0 is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) CHM150 (Corpus Hecho en México 150) was developed by the Speech Processing Laboratory of the Faculty of Engineering at the National Autonomous University of Mexico (UNAM) and consists of approximately 1.63 hours of Mexican Spanish speech, associated transcripts, and speaker metadata. The goal of this work was to support spoken term detection and forensic speaker identification.

This corpus is comprised of Mexican Spanish microphone speech from 75 male speakers and 75 female speakers in a quiet office environment. Speakers could answer pre-selected open questions or describe a particular painting shown to them on a computer monitor. Speaker metadata in this release includes age, gender, place of birth, place of residence and parents' nationalities.

CHM150 is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. This data is being made available at no cost for non-member organizations under a research license.

*

(3) GALE Phase 4 Arabic Weblog Parallel Sentences was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Modern Standard Arabic source text and corresponding English translations, selected from newsgroup and weblog data collected by LDC and translated by LDC or under its direction.

The data includes 1,067 source-translation document pairs, comprising 68,346 words (Arabic source) of translated data. 

Sentences were selected for translation in two steps. First, files were chosen using sentence selection scripts provided by GALE program participants SRI International and IBM. The output was then manually reviewed by LDC staff to eliminate problematic sentences. Selected files were reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Arabic to English translation guidelines and were provided with the full source documents containing the target sentences for their reference. Bilingual LDC staff performed quality control procedures on the completed translations.

GALE Phase 4 Arabic Weblog Parallel Sentences is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Monday, May 16, 2016

LDC May 2016 Newsletter

LDC at LREC 2016

New publications:
GALE Phase 4 Chinese Broadcast Conversation Speech
GALE Phase 4 Chinese Broadcast Conversation Transcripts 
_______________________________________________________________

LDC at LREC 2016

LDC will attend the 10th Language Resources and Evaluation Conference (LREC 2016), hosted by ELRA, the European Language Resources Association. The conference will be held in Portorož, Slovenia from May 23-28 and features a broad range of sessions on language resources and human language technologies research. Seven LDC staff members will be presenting current work on topics including trends in HLT research, building language resources for autism spectrum disorders, data management plans, rapid development of morphological analyzers for typologically diverse languages, selection criteria for low resource language programs, multi-language speech collection for NIST LRE, novel incentives for collecting data and annotation from people, and more.

Following the conference, LDC’s presented papers and posters will be available on LDC’s Papers Page.


New Corpora

(1) SDP 2014 & 2015: Broad Coverage Semantic Dependency Parsing consists of data, tools, system results, and publications associated with the 2014 and 2015 tasks on Broad-Coverage Semantic Dependency Parsing (SDP) conducted in conjunction with the International Workshop on Semantic Evaluation (SemEval) and was developed by the SDP task organizers.

SemEval is an ongoing series of evaluations of computational semantic analysis systems intended to explore the nature of meaning in language. It evolved from the Senseval word sense disambiguation series to include semantic analysis tasks outside of word sense disambiguation.

This release is based on English, Chinese and Czech data from the following resources: Treebank-2 LDC95T17, Proposition Bank I LDC2004T14, NomBank v 1.0 LDC2008T23 and CCGBank LDC2005T13 (English); Chinese Treebank (e.g., Chinese Treebank 8.0 LDC2013T21) (Chinese); and Prague Dependency Treebank (e.g., Prague Dependency Treebank 2.0, LDC2006T01) (Czech).

The results are presented as graphs in three target representations: MRS-Derived Semantic Dependencies (DM), Enju Predicate–Argument Structures (PAS), and Prague Semantic Dependencies (PSD). As a fourth, additional target representation, CCGbank was converted to semantic dependency graphs (in the subdirectory ‘ccd’).


SDP 2014 & 2015: Broad Coverage Semantic Dependency Parsing is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus.  2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(2) GALE Phase 4 Chinese Broadcast Conversation Speech was developed by LDC and is comprised of approximately 172 hours of Mandarin Chinese broadcast conversation speech collected in 2008 by LDC and Hong Kong University of Science and Technology during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

Corresponding transcripts are released as GALE Phase 4 Chinese Broadcast Conversation Transcripts (LDC2016T12).

The broadcast conversation recordings in this release feature interviews, call-in programs and roundtable discussions focusing principally on current events and are contained in 236 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Chinese speaker following Audit Procedure Specification Version 2.0 which is included in this release.
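
To give a sense of scale, the short calculation below estimates the uncompressed size of the audio from the figures stated above (approximately 172 hours, 16000 Hz, 16-bit, single channel, 236 files). The helper name `pcm_bytes` is hypothetical, and the result is only an approximation: the hour count is itself approximate and the FLAC-compressed download will be smaller.

```python
def pcm_bytes(hours: float, sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> int:
    """Size in bytes of uncompressed linear PCM audio of the given duration."""
    seconds = hours * 3600
    return int(seconds * sample_rate_hz * (bits_per_sample // 8) * channels)


total = pcm_bytes(172, 16000, 16)       # whole release, uncompressed
avg_minutes = 172 * 60 / 236            # mean recording length per file
print(f"{total / 1e9:.1f} GB uncompressed, about {avg_minutes:.0f} minutes per file")
```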


GALE Phase 4 Chinese Broadcast Conversation Speech is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(3) GALE Phase 4 Chinese Broadcast Conversation Transcripts was developed by LDC and contains transcriptions of approximately 172 hours of Chinese broadcast conversation speech collected in 2008 by LDC and Hong Kong University of Science and Technology during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

Corresponding audio data is released as GALE Phase 4 Chinese Broadcast Conversation Speech (LDC2016S03).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 2,259,952 tokens.

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC’s quick transcription guidelines (QTR) and quick rich transcription specification (QRTR). QTR transcription consists of quick (near-) verbatim, time-aligned transcripts plus speaker identification with minimal additional mark-up. QRTR adds additional structural information such as topic boundaries and manual sentence unit annotation.


GALE Phase 4 Chinese Broadcast Conversation Transcripts is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus.  2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.


Monday, April 18, 2016

LDC April 2016 Newsletter

New publications:

_________________________________________________________________________

New Corpora

(1) H1 Children's Writing was developed by the Cooperative State University Baden-Württemberg and the University of Education. It consists of 996 texts written over three months by 88 German school children aged seven through eleven.

Texts were written within regular class settings. The students were presented with a picture and were asked to write a story, to describe the picture or if unable to write a text, to list what they saw in the picture. The pictures were designed to enhance the output with respect to important spelling error categories, namely, the marking of short vowels with a silent consonant letter and the correct spelling of the long vowel. The children were allowed at least 15 minutes to write the texts. This exercise was repeated weekly for 12 weeks.

Most of the participants were multilingual. The metadata with this release includes: school week of collection; school type (always elementary school); age; gender; grade/classroom; language spoken at home; and school materials used for German (Jojo).

In all, 996 texts representing 62,764 tokens were collected. The texts were digitized in two forms: (1) the original text, including all errors (achieved), and (2) the intended (target) text, where all spelling errors were removed. Annotations were added to both the achieved text and the target text to distinguish words that should not be analyzed for spelling errors, such as names or foreign words. For sentence-level analysis, syntax errors were annotated by marking substitutions, deletions and insertions at the word level. In such cases, the used word was analyzed for spelling, and the correct word was used for sentence structure analysis.
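
The relationship between an achieved text and its target text can be pictured as a word-level alignment. The sketch below uses Python's standard `difflib` to align two token sequences and report substitutions, extra words and missing words. It is not the annotation procedure used for this corpus: the function name, the labels and the German example sentences are all invented for illustration, and how such labels map onto the corpus's own substitution/deletion/insertion marks is an assumption.

```python
from difflib import SequenceMatcher


def word_edits(achieved: str, target: str):
    """Align achieved and target text word by word and list the differences."""
    a, t = achieved.split(), target.split()
    matcher = SequenceMatcher(a=a, b=t, autojunk=False)
    edits = []
    for op, a1, a2, t1, t2 in matcher.get_opcodes():
        if op == "replace":
            edits.append(("substitution", a[a1:a2], t[t1:t2]))
        elif op == "delete":
            # present in the achieved text only
            edits.append(("extra_word", a[a1:a2], []))
        elif op == "insert":
            # present in the target text only
            edits.append(("missing_word", [], t[t1:t2]))
    return edits


# Invented sentences, not taken from the corpus.
print(word_edits("ich sehe ein Hund", "ich sehe einen Hund"))
print(word_edits("ich sehe Hund", "ich sehe einen Hund"))
```

Keeping both sides of each difference mirrors the idea described above: the used word remains available for spelling analysis while the correct word supports sentence structure analysis.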

Original handwriting is presented as pdf documents and the converted text as UTF-8 plain text in csv documents.

H1 Children's Writing is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(2) GALE Phase 4 Arabic Broadcast Conversation Parallel Sentences was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Modern Standard Arabic source sentences and corresponding English translations selected from broadcast conversation data collected by LDC in 2007 and 2008 and transcribed and translated by LDC or under its direction.

The data includes 170 source-translation document pairs, comprising 44,064 words (Arabic source) of translated data. Data is drawn from 45 distinct Arabic broadcast conversation sources.

GALE Phase 4 Arabic Broadcast Conversation Parallel Sentences is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(3) HAVIC Pilot Transcription was developed by LDC and is comprised of approximately 72 hours of user-generated videos with transcripts based on the English speech audio extracted from the videos. This data set was created in collaboration with NIST (the National Institute of Standards and Technology) as part of the HAVIC (the Heterogeneous Audio Visual Internet Collection) project, the goal of which is to advance multimodal event detection and related technologies.

LDC has developed a large, heterogeneous, annotated multimodal corpus for HAVIC that has been used in the NIST-sponsored MED (Multimedia Event Detection) task for several years. HAVIC Pilot Transcription supported an experiment to produce a verbatim transcript (quick and rich transcription) based on audio extracted from user-generated videos. It contains the pilot transcripts for selected MED 2011 video files as well as the associated videos.

Annotators generated the transcripts using XTrans, which supports manual transcription across multiple channels, languages and platforms. HAVIC transcription guidelines are included in the documentation for this release. All transcription files are in .tdf format, a plain-text, flat-table format with 13 tab-delimited fields. All video files are in .mp4 format (h264), with varying bit-rates and levels of audio fidelity and video resolution.
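
A flat table with 13 tab-delimited fields is straightforward to load. The following is a minimal sketch, assuming one record per line; the function name `read_tdf` is hypothetical, and the `;;` comment prefix is an assumption that should be checked against the documentation included in the release. Field names are deliberately not hard-coded here, and any header row with 13 fields would be returned like an ordinary record.

```python
def read_tdf(lines, expected_fields=13, comment_prefix=";;"):
    """Split tab-delimited records, skipping blank and comment lines."""
    records = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line or line.startswith(comment_prefix):
            continue
        fields = line.split("\t")
        if len(fields) != expected_fields:
            raise ValueError(
                f"line {number}: expected {expected_fields} fields, found {len(fields)}"
            )
        records.append(fields)
    return records
```

A file would typically be passed in as an open handle, for example `with open(path, encoding="utf-8") as handle: records = read_tdf(handle)`, where `path` stands for whichever transcript file is being read.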

HAVIC Pilot Transcription is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee. 

Tuesday, March 15, 2016

LDC March 2016 Newsletter

New publications:

GALE Phase 3 and 4 Arabic Web Parallel Text 
GALE Phase 3 and 4 Chinese Broadcast Conversation Parallel Text
_____________________________________________________________________

New Corpora
(1) DEFT Narrative Text was developed by LDC and contains proxy reports and their source newswire used to support DARPA's Deep Exploration and Filtering of Text (DEFT) program. One of the goals of the DEFT program was to develop technologies that can perform various NLP tasks on data in a variety of genres, both formal and informal.

LDC provided source data and annotations for DEFT system development. DEFT Narrative Text consists of "proxy reports" (and "multi-proxy reports") in English. (Multi-)proxy reports are intended to mimic the format and other features of some types of government analyst reports using content from newswire articles. The corresponding English newswire source documents are also included in the release.

LDC staff manually selected the source newswire from English Gigaword Fifth Edition (LDC2011T07).

The newswire source documents are XML files following the Gigaword corpus format. The proxy reports are in plain text format.

DEFT Narrative Text is distributed via web download.
2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(2) GALE Phase 3 and 4 Arabic Web Parallel Text was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phases 3 and 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Modern Standard Arabic source text and corresponding English translations selected from weblog and newsgroup data collected by LDC and translated by LDC or under its direction.

The data includes 124 source-translation document pairs, comprising 61,662 tokens of Arabic source text and its English translation. Data is drawn from four distinct Arabic weblog and newsgroup sources.

GALE Phase 3 and 4 Arabic Web Parallel Text is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(3) GALE Phase 3 and 4 Chinese Broadcast Conversation Parallel Text was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phases 3 and 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Chinese source text and corresponding English translations selected from broadcast conversation data collected by LDC between 2006 and 2008 and transcribed and translated by LDC or under its direction.

This data includes 63 source-translation document pairs, comprising 487,466 tokens of Chinese source text and its English translation. Data is drawn from 19 distinct Chinese programs broadcast between 2006 and 2008.

Data was manually selected for translation according to several criteria, including linguistic features, transcription features and topic features. The transcribed and segmented files were then reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Chinese to English translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations.

GALE Phase 3 and 4 Chinese Broadcast Conversation Parallel Text is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

Monday, February 15, 2016

LDC February 2016 Newsletter

Only two weeks left to enjoy 2016 membership savings

Spring 2016 LDC Data Scholarship recipients

How to Share Data through LDC webinar on YouTube

New publications:
_______________________________________________________________________

Only two weeks left to enjoy 2016 membership savings
There’s still time to save on 2016 membership fees. Now through March 1, all organizations receive a 5% discount when they join for MY2016. MY2015 members are eligible for an additional 5% off the fee (10% total savings) when they renew before March 1.  

To join, create or sign into your LDC user account, select your preferred membership type from the Catalog, add the item to your bin and follow the check-out process. The Membership Office will apply any discounts. Alternatively, if you have already received a renewal invoice from LDC, you can simply pay against that.

For more information on the benefits of membership, visit Join LDC.

Spring 2016 LDC Data Scholarship recipients
Congratulations to the recipients of LDC's Spring 2016 data scholarships:

Shefali Waldekar: Indian Institute of Technology Kharagpur (India), PhD Candidate, Electronics and Electrical Communications Engineering. Shefali is awarded copies of 2002 Rich Transcription Broadcast News and Conversational Telephone Speech and 2005 Spring NIST Rich Transcription (RT05-S) Evaluation Set for her research in audio diarization.

Nikola Invanov Nikolov: University of Zurich and ETH Zurich (Switzerland), MSc candidate in Informatics. Nikola is awarded a copy of Annotated English Gigaword for his research in text summarization.

Om Prakash Singh: Indian Institute of Technology, Guwahati (India), Research scholar in spoken language identification. Om is awarded a copy of NIST Language Recognition Evaluation Test Set for his work in language identification.

Moshen Mohammadi: Iranian Research Institute for Electrical Engineering (Iran), PhD Candidate in Communications. Moshen is awarded copies of the 2008 NIST Speaker Recognition Evaluation Training Sets 1 and 2, the Evaluation Test Set and the Supplemental Set for his work in speaker recognition in noisy environments.

For program information visit the Data Scholarship page.

How to Share Data through LDC webinar on YouTube
LDC’s first webinar, How to Share Data through LDC, is now available for viewing on our YouTube page. Presented live on January 22, 2016, the webinar outlined in easy steps the process for submitting language resources to LDC for publication in the Catalog. In addition, discussion topics included the benefits of sharing data through LDC, the corpus life cycle, data delivery, quality control and more.

New Corpora
(1) BOLT Chinese Discussion Forums was developed by LDC and consists of 1,597,500 discussion forum threads in Chinese harvested from the Internet using a combination of manual and automatic processes.

The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference. The material in this release represents the Chinese source data in the discussion forum genre.

Collection was seeded based on the results of manual data scouting by native speaker annotators. When multiple threads from a forum were submitted, the entire forum was automatically harvested and added to the collection. The scale of the collection precluded manual review of all data. Only a small portion of the threads included in this release were manually reviewed, and it is expected that there may be some offensive or otherwise undesired content as well as some threads that contain a large amount of non-Chinese content. Language identification was performed on all threads in this corpus (using CLD2), and threads for which the results indicated a high probability of largely non-Chinese content are identified in this release.
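
The idea of flagging threads that are probably not Chinese can be illustrated with a much cruder heuristic than the language identifier named above. The sketch below does not use CLD2; it simply measures the share of alphabetic characters that fall in the basic CJK Unified Ideographs block. The function names and the 0.5 threshold are assumptions chosen for illustration, and the heuristic cannot distinguish Chinese from other languages written with the same characters.

```python
def han_ratio(text: str) -> float:
    """Fraction of alphabetic characters in the basic CJK Unified Ideographs block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    han = sum(1 for ch in letters if "\u4e00" <= ch <= "\u9fff")
    return han / len(letters)


def flag_non_chinese(threads: dict[str, str], threshold: float = 0.5) -> list[str]:
    """Return the ids of threads whose text is mostly non-Han (threshold is arbitrary)."""
    return [tid for tid, text in threads.items() if han_ratio(text) < threshold]
```

A real language identifier also considers character n-grams and script mixtures, so a ratio like this is at best a quick sanity check on the flags shipped with the release.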

BOLT Chinese Discussion Forums is distributed via web download as a multi-part zip file. Consult the Using LDC Data page (https://www.ldc.upenn.edu/data-management/using) for more information about this format.

2016 Subscription Members will automatically receive two copies of this corpus.  2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*
(2) GALE Phase 3 Arabic Broadcast Conversation Speech Part 2 was developed by LDC and comprises approximately 129 hours of Arabic broadcast conversation speech collected in 2007 and 2008 by LDC, MediaNet (Tunis, Tunisia) and MTC (Rabat, Morocco) during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) program.

Corresponding transcripts are released as GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 2 (LDC2016T06).

These broadcast conversation recordings feature interviews, call-in programs and roundtable discussions focusing principally on current events. They are contained in 142 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Arabic speaker following Audit Procedure Specification Version 2.0, which is included in this release.
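As a rough guide for planning storage, the figures above can be turned into an average file length and an uncompressed size. The short Python sketch below is only back-of-the-envelope arithmetic: it treats the approximate 129-hour total as exact and says nothing about the actual size of the FLAC-compressed download, which will be smaller.

```python
# Figures taken from the corpus description above (total duration is approximate).
TOTAL_HOURS = 129
NUM_FILES = 142
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2   # 16-bit PCM
CHANNELS = 1           # single-channel


def average_minutes_per_file(total_hours, num_files):
    """Mean recording length in minutes."""
    return total_hours * 60 / num_files


def uncompressed_bytes(total_hours, sample_rate_hz, bytes_per_sample, channels):
    """Size of the raw PCM samples, ignoring file headers."""
    seconds = total_hours * 3600
    return seconds * sample_rate_hz * bytes_per_sample * channels


avg_min = average_minutes_per_file(TOTAL_HOURS, NUM_FILES)
pcm_bytes = uncompressed_bytes(TOTAL_HOURS, SAMPLE_RATE_HZ, BYTES_PER_SAMPLE, CHANNELS)

print(f"average file length: {avg_min:.1f} minutes")
print(f"uncompressed audio:  {pcm_bytes / 1e9:.2f} GB")
```

Under these assumptions the recordings average a little under an hour each, and the decoded audio would occupy roughly 15 GB.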

GALE Phase 3 Arabic Broadcast Conversation Speech Part 2 is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*
(3) GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 2 was developed by LDC and contains transcriptions of approximately 129 hours of Arabic broadcast conversation speech collected in 2007 and 2008 by LDC, MediaNet (Tunis, Tunisia) and MTC (Rabat, Morocco) during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) program.

Corresponding audio data is released as GALE Phase 3 Arabic Broadcast Conversation Speech Part 2 (LDC2016S01).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 845,791 tokens. The transcripts were created with the LDC tool, XTrans, which supports manual transcription and annotation of audio recordings.
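Because the transcripts are plain tab-delimited UTF-8 text, they can be read with standard tools. The Python sketch below is a minimal illustration, not a description of the actual TDF layout: the column order, the `TRANSCRIPT_COLUMN` index and the sample rows are all hypothetical, so consult the documentation included with the release for the real field definitions.

```python
import csv
import io

# Hypothetical layout: (file id, start time, end time, speaker, transcript).
# The real TDF column order is defined in the corpus documentation.
TRANSCRIPT_COLUMN = 4


def count_tokens(tdf_stream, transcript_column=TRANSCRIPT_COLUMN):
    """Count whitespace-delimited tokens in one column of a tab-delimited stream."""
    reader = csv.reader(tdf_stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    total = 0
    for row in reader:
        if len(row) <= transcript_column:
            continue  # skip short or empty lines
        total += len(row[transcript_column].split())
    return total


# Made-up sample rows with placeholder tokens; real transcripts contain Arabic text.
sample = (
    "show_a\t0.00\t2.50\tspeaker1\tw1 w2 w3 w4\n"
    "show_a\t2.50\t4.00\tspeaker2\tw5 w6\n"
)
token_total = count_tokens(io.StringIO(sample))
print(token_total)
```

For a file on disk, the same function could be called on `open(path, encoding="utf-8", newline="")`. Whitespace splitting is only one possible definition of a token and may not reproduce the token count reported for the corpus.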

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and quick rich transcription specification (QRTR), both of which are included in the documentation with this release.

GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 2 is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus.  2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.


Thursday, January 21, 2016

LDC 2016 January Newsletter

CFP for LREC 2016 Novel Incentives Workshop

LDC Membership Discounts for MY 2016 Still Available

New publications:
______________________________________________________________

CFP for LREC 2016 Novel Incentives Workshop
The first workshop on novel incentives in linguistic data collection will take place on May 28, 2016 in conjunction with the Tenth International Conference on Language Resources and Evaluation (LREC2016) in Portoroz, Slovenia.

The workshop, Novel Incentives for Collecting Linguistic Data and Annotation from People: types, implementation, tasking requirements, workflow and results, opens the discussion on incentives in data collection, describing novel approaches and comparing them with traditional monetary incentives.

The workshop is accepting papers through February 6, 2016. For more information visit the workshop webpage.


LDC Membership Discounts for MY 2016 Still Available
If you are considering joining LDC for Membership Year 2016 (MY2016), there is still time to save on membership fees.  Any organization which joins or renews membership for 2016 through March 1, 2016, is entitled to a 5% discount on membership fees.  Organizations which held membership for MY2015 can receive a 10% discount on fees provided they renew prior to March 1, 2016. Publications planned for release in 2016 include multilingual language packs, BOLT discussion forum and DEFT narrative text corpora, HAVIC video clips and transcripts and the latest Arabic and Chinese treebanks.

New publications
(1) Arabic Treebank - Weblog was developed by LDC and consists of Arabic weblog data with part-of-speech, morphology, gloss and syntactic tree annotation.

The ongoing Penn Arabic Treebank Project (PATB) supports research in Arabic-language natural language processing and human language technology development. Generally, the PATB consists of two distinct phases: (a) part-of-speech (POS) tagging, which divides the text into lexical tokens and gives relevant information about each token such as lexical category, inflectional features, and a gloss (referred to as POS for convenience, although it includes morphological and gloss information not traditionally included with part-of-speech annotation), and (b) Arabic treebanking, which characterizes the constituent structures of word sequences, provides categories for each non-terminal node, and identifies null elements, co-reference, traces and so on.

The data contains 243,117 source tokens before clitics were split, and 308,996 tree tokens after clitics were separated for treebank annotation. The source material is weblogs collected by LDC from various sources.
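The two token counts give a sense of how much clitic separation expands the text. The following Python lines are simple arithmetic on the figures quoted above; the variable names are illustrative only.

```python
# Token counts quoted in the corpus description.
source_tokens = 243_117   # before clitics were split
tree_tokens = 308_996     # after clitics were separated

extra_tokens = tree_tokens - source_tokens
ratio = tree_tokens / source_tokens

print(f"tokens added by clitic separation: {extra_tokens}")
print(f"tree tokens per source token: {ratio:.3f}")
```

On these figures, clitic separation adds 65,879 tokens, or about 1.27 tree tokens for every source token.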

Arabic Treebank - Weblog is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus.  2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*
(2) NewSoMe Corpus of Opinion in Blogs was compiled at Barcelona Media and consists of English and Spanish blogs annotated for opinions. It is part of the NewSoMe (News and Social Media) set of corpora presenting opinion annotations across several genres and covering multiple languages. NewSoMe is the result of an effort to build a unifying annotation framework for analyzing opinion in different genres, ranging from controlled text, such as news reports, to diverse types of user-generated content that includes blogs, product reviews and microblogs.

LDC has also released NewSoMe Corpus of Opinion in News Reports (LDC2015T17).

The data consists of 108 English documents and 191 Spanish documents. The annotation was carried out manually through the crowdsourcing platform CrowdFlower with seven annotations per layer that were aggregated for this data set. The layers annotated were topic, segment, cue, subjectivity, polarity and intensity.
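The newsletter does not say how the seven annotations per layer were aggregated. As one common possibility, and purely as an assumption, the Python sketch below combines seven hypothetical polarity judgments by majority vote and returns no label when the top choice is tied. The label names and the `aggregate_by_majority` helper are invented for illustration.

```python
from collections import Counter


def aggregate_by_majority(labels):
    """Return the most frequent label, or None if there is no unique winner."""
    if not labels:
        return None
    counts = Counter(labels)
    top_count = max(counts.values())
    winners = [label for label, count in counts.items() if count == top_count]
    return winners[0] if len(winners) == 1 else None


# Seven made-up judgments for one item in a hypothetical polarity layer.
polarity_votes = [
    "positive", "positive", "negative", "positive",
    "neutral", "positive", "negative",
]
aggregated = aggregate_by_majority(polarity_votes)
print(aggregated)
```

An odd number of annotators, such as seven, rules out ties for binary layers, although ties remain possible when a layer has three or more labels.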


NewSoMe Corpus of Opinion in Blogs is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus.  2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*
(3) GALE Phase 4 Chinese Weblog Parallel Sentences was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Chinese source sentences and corresponding English translations selected from newsgroup and weblog data collected by LDC and translated by LDC or under its direction.

GALE Phase 4 Chinese Weblog Parallel Sentences includes 231 source-translation document pairs, comprising 92,501 tokens of Chinese source text and its English translation.

Sentences were selected for translation in two steps. First, files were chosen using sentence selection scripts provided by GALE program participants SRI International and IBM. The output was then manually reviewed by LDC staff to eliminate problematic sentences. Selected files were reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Chinese to English translation guidelines and were provided with the full source documents containing the target sentences for their reference. Bilingual LDC staff performed quality control procedures on the completed translations.


GALE Phase 4 Chinese Weblog Parallel Sentences is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus.  2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.