Linguistic Data Consortium: November 2014

Fall 2014 Data Scholarship Recipients

Invitation to Join for Membership Year (MY) 2015

Spring 2015 Data Scholarship Program

LDC is now on Twitter

LDC closed for Thanksgiving Break

New publications:

Boulder Lies and Truth
GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 2
GALE Phase 2 Chinese Web Parallel Text

Fall 2014 Data Scholarship Recipients

LDC is pleased to announce the student recipients of the Fall 2014 LDC Data Scholarship program. The following students will receive no-cost copies of LDC data:

Mohammed Abumatar ~ University of Jordan (Jordan), BSc candidate, Computer Engineering. Mohammed has been awarded copies of MADCAT Phase 1-3 Training Data for his work in handwriting recognition.

Ramy Baly ~ American University of Beirut (Lebanon), PhD candidate, Electrical and Computer Engineering. Ramy has been awarded copies of Arabic Treebank Parts 1-3 for his work in opinion mining.

Abbas Khosravanai ~ Amirkabir University of Technology (Iran), PhD candidate, Computer Engineering. Abbas has been awarded a copy of 2008 NIST Speaker Recognition for his work in robust speaker recognition.

Phuc Nguyen ~ University of North Texas (USA), PhD candidate, Computer Science and Engineering. Phuc has been awarded a copy of Message Understanding Conference (MUC) 7 for his work in named entity recognition.

Invitation to Join for Membership Year (MY) 2015
Membership Year (MY) 2015 is now open. We invite all current and previous members of LDC to renew their membership and welcome new organizations to join the Consortium. For MY2015, LDC is pleased to maintain membership fees at last year’s rates. Additionally, LDC will extend discounts on membership fees to members who keep their membership current and who join early in the year.

The details of our early renewal discounts for MY2015 are as follows:

Organizations who joined for MY2014 will receive a 10% discount when renewing before March 2, 2015. After March 2, 2015, MY2014 members are eligible for a 5% discount when renewing through the end of the year.

New members as well as organizations who did not join for MY2014, but who held membership in any of the previous MYs (1993-2013), will also be eligible for a 5% discount provided that they join/renew before March 2, 2015.
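
The discount rules above can be sketched as a small function. This is only an illustrative reading of the two paragraphs, not an official LDC calculator: the function name, its parameters, and the handling of a join date falling exactly on March 2, 2015 (treated here as not "before" the deadline) are assumptions.

```python
from datetime import date

# Early-renewal cutoff stated in the text: "before March 2, 2015".
EARLY_DEADLINE = date(2015, 3, 2)


def my2015_discount_percent(join_date: date, was_my2014_member: bool) -> int:
    """Return the MY2015 membership discount as a whole percentage.

    Assumption: a join date of exactly March 2, 2015 does not count as
    "before March 2, 2015"; the newsletter does not address that day.
    """
    is_early = join_date < EARLY_DEADLINE
    if was_my2014_member:
        # MY2014 members: 10% when renewing early, 5% for the rest of the year.
        return 10 if is_early else 5
    # New members and members from MY1993-MY2013: 5% only when joining early.
    return 5 if is_early else 0


print(my2015_discount_percent(date(2015, 2, 1), was_my2014_member=True))
print(my2015_discount_percent(date(2015, 6, 1), was_my2014_member=False))
```

Under these assumptions, an MY2014 member renewing in February 2015 would see 10%, while an organization that skipped MY2014 and joins in June would see no discount.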

Publications for MY2015 are still being finalized, but we plan to release the following:

CIEMPIESS - Mexican Spanish radio broadcast audio and transcripts
GALE Phase 3 and 4 data – all tasks and languages
Mandarin Chinese Phonetic Segmentation and Tone Corpus - phonetic segmentation and tone labels
RATS Speech Activity Detection – multilanguage audio for robust speech detection and language identification
SEAME - Mandarin-English code-switching speech
SenSem Spanish and Catalan Lexicon and Databank - sentence semantics and verbal lexicons

Spring 2015 Data Scholarship Program

Applications are now being accepted through Thursday, January 15, 2015, 11:59PM EST for the Spring 2015 LDC Data Scholarship program. The LDC Data Scholarship program provides university students with access to LDC data at no cost. During previous program cycles, LDC has awarded no-cost copies of LDC data to over 40 individual students and student research groups. This program is open to students pursuing both undergraduate and graduate studies in an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay.

The application consists of two parts:

(1) Data Use Proposal. Applicants must submit a proposal describing their intended use of the data. The proposal should state which data the student plans to use and how the data will benefit their research project as well as information on the proposed methodology or algorithm.

(2) Letter of Support. Applicants must submit one letter of support from their thesis adviser or department chair. The letter must verify the student's need for data and confirm that the department or university lacks the funding to pay the full non-member fee for the data or to join the Consortium.

For further information on application materials and program rules, please visit the LDC Data Scholarship page.

Students can email their applications to the LDC Data Scholarship program. Decisions will be sent by email from the same address.

The deadline for the Spring 2015 program cycle is January 15, 2015, 11:59PM EST.

LDC is now on Twitter
LDC now has a Twitter feed. Start following us today for updates on new corpora releases and the latest LDC news.

LDC closed for Thanksgiving Break
LDC will be closed on Thursday, November 27, 2014 and Friday, November 28, 2014 in observance of the US Thanksgiving Holiday. Our offices will reopen on Monday, December 1, 2014.

New publications

(1) Boulder Lies and Truth was developed at the University of Colorado Boulder and contains approximately 1,500 elicited English reviews of hotels and electronics for the purpose of studying deception in written language. Reviews were collected by crowd-sourcing with Amazon Mechanical Turk.

Each review was required to be original and was checked for plagiarism against the web. Reviews were annotated with respect to the following three dimensions:

Domain: Electronics (e.g., iPhone) or Hotels

Sentiment: Positive or Negative

Truth Value:

a) Truthful: a review about an object known by the writer reflecting the real sentiment of the writer toward the object of the review

b) Opposition: a review about an object known by the writer reflecting the opposite sentiment of the writer toward the object of the review (i.e., if the writer liked the object, they were asked to write a negative review; if the writer did not like the object, they were asked to write a positive review)

c) Deceptive (i.e., fabricated): a review written about an object not known by the writer either positive or negative in sentiment; the objects reviewed were provided via a URL from the tasks in (a) and (b)

Each review was judged a total of 30 times: (1) 10 times to evaluate its perceived quality (on a range from 1-5); (2) 10 times with judgments about its perceived truthfulness (e.g., truthful or somehow deceptive, a lie or a fabrication); and (3) 10 times for its perceived sentiment (i.e., star rating).
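
As a rough illustration of the annotation scheme described above, one review could be modeled as a record with the three annotated dimensions plus the three groups of judgments. This is a hypothetical sketch only: the class and field names are invented for illustration and do not reflect the corpus's actual file format or label inventory.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Domain(Enum):
    ELECTRONICS = "electronics"
    HOTELS = "hotels"


class Sentiment(Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"


class TruthValue(Enum):
    TRUTHFUL = "truthful"      # known object, real sentiment
    OPPOSITION = "opposition"  # known object, opposite sentiment
    DECEPTIVE = "deceptive"    # unknown object, fabricated review


JUDGMENTS_PER_TASK = 10  # each of the three judgment tasks was run 10 times


@dataclass
class Review:
    """Hypothetical in-memory record for one review (names are assumptions)."""
    text: str
    domain: Domain
    sentiment: Sentiment
    truth_value: TruthValue
    quality_judgments: List[int] = field(default_factory=list)       # 1-5 scale
    truthfulness_judgments: List[str] = field(default_factory=list)  # free-form labels
    sentiment_judgments: List[int] = field(default_factory=list)     # star ratings

    def total_judgments(self) -> int:
        return (len(self.quality_judgments)
                + len(self.truthfulness_judgments)
                + len(self.sentiment_judgments))

    def is_fully_judged(self) -> bool:
        # True when every judgment task has exactly 10 entries (30 in total).
        return all(len(j) == JUDGMENTS_PER_TASK for j in (
            self.quality_judgments,
            self.truthfulness_judgments,
            self.sentiment_judgments,
        ))
```

A record built this way makes the 30 judgments per review (3 tasks of 10 each) easy to check before analysis.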

Boulder Lies and Truth is distributed via web download.

2014 Subscription Members will receive two copies of this data on disc, provided they have completed the user license agreement. 2014 Standard Members may request a copy as part of their 16 free membership corpora. This data is available at no cost to non-members under the same user license agreement.

(2) GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 2 was developed by LDC and contains 65,069 tokens of word aligned Chinese and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.

Some approaches to statistical machine translation include the incorporation of linguistic knowledge in word aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. A set of word tags and alignment link tags are designed in the tagging scheme to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.

This release consists of Chinese source broadcast conversation (BC) programming collected by LDC in 2008.

The Chinese word alignment tasks consisted of the following components:

Identifying, aligning, and tagging eight different types of links

Identifying, attaching, and tagging local-level unmatched words

Identifying and tagging sentence/discourse-level unmatched words

Identifying and tagging all instances of Chinese 的(DE) except when they were a part of a semantic link

GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 2 is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) GALE Phase 2 Chinese Web Parallel Text was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Chinese source text and corresponding English translations selected from weblog and newsgroup data collected by LDC and translated by LDC or under its direction.

This release includes 46 source-translation document pairs, comprising 66,779 tokens of translated data. Data is drawn from four Chinese weblog and newsgroup sources.

Data was manually selected for translation according to several criteria, including linguistic features and topic features. The files were formatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Chinese to English translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations.

GALE Phase 2 Chinese Web Parallel Text is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Linguistic Data Consortium

Monday, November 17, 2014

LDC 2014 November Newsletter