Linguistic Data Consortium: 2013

Tuesday, December 17, 2013

LDC December 2013 Newsletter

Spring 2014 LDC Data Scholarship Program - deadline approaching

LDC to close for Winter Break

New publications:

GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 1

Maninkakan Lexicon

The ARRAU Corpus of Anaphoric Information

Spring 2014 LDC Data Scholarship Program - deadline approaching

The deadline for the Spring 2014 LDC Data Scholarship Program is right around the corner. Student applications are being accepted now through January 15, 2014, 11:59PM EST. The LDC Data Scholarship program provides university students with access to LDC data at no cost. This program is open to students pursuing both undergraduate and graduate studies in an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay.

Students will need to complete an application which consists of a data use proposal and letter of support from their adviser. For further information on application materials and program rules, please visit the LDC Data Scholarship page.

Students can email their applications to the LDC Data Scholarships program. Decisions will be sent by email from the same address.

LDC to close for Winter Break

LDC will be closed from Wednesday, December 25, 2013 through Wednesday, January 1, 2014 in accordance with the University of Pennsylvania Winter Break Policy. Our offices will reopen on Thursday, January 2, 2014. Requests received for membership renewals and corpora during the Winter Break will be processed at that time.
Best wishes for a happy holiday season!

New publications

GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 1 was developed by LDC and contains 179,842 tokens of word aligned Chinese and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.

Some approaches to statistical machine translation include the incorporation of linguistic knowledge in word aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. A set of word tags and alignment link tags are designed in the tagging scheme to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.

This release consists of Chinese source broadcast conversation (BC) and broadcast news (BN) programming collected by LDC in 2005 - 2007.

The Chinese word alignment tasks consisted of the following components:

Identifying, aligning, and tagging 8 different types of links
Identifying, attaching, and tagging local-level unmatched words
Identifying and tagging sentence/discourse-level unmatched words
Identifying and tagging all instances of Chinese 的(DE) except when they were a part of a semantic link.

GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 1 is distributed via web download. 2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Maninkakan Lexicon was developed by LDC and contains 5,834 entries of the Maninkakan language presented as a Maninkakan-English lexicon and a Maninkakan-French lexicon. It is the second publication in an ongoing LDC project to build an electronic dictionary of four Mandekan languages: Mawukakan, Maninkakan, Bambara and Jula. These are Eastern Manding languages in the Mande Group of the Niger-Congo language family. LDC released a Mawukakan Lexicon (LDC2005L01) in 2005.

More information about LDC’s work in the languages of West Africa and the challenges those languages present for language resource development can be found here.

Maninkakan is written using Latin script, Arabic script and the NKo alphabet. This lexicon is presented using a Latin-based transcription system because the Latin alphabet is familiar to the majority of Mandekan language speakers and because it is expected to facilitate the work of researchers interested in this resource.

The dictionary is provided in two formats, Toolbox and XML. Toolbox is a version of the widely used SIL Shoebox program adapted to display Unicode. The Toolbox files are provided in two fonts, Arial and Doulos SIL. The Arial files should display using the Arial font, which is standard on most operating systems. Doulos SIL, available as a free download, is a robust font that should display all characters without issue. Users should launch Toolbox using the *.prj files in the Arial or Doulous_SIL folders.

Maninkakan Lexicon is distributed via web download. 2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

The ARRAU (Anaphora Resolution and Underspecification) Corpus of Anaphoric Information was developed by the University of Essex and the University of Trento. It contains annotations of multi-genre English texts for anaphoric relations with information about agreement and explicit representation of multiple antecedents for ambiguous anaphoric expressions and discourse antecedents for expressions which refer to abstract entities such as events, actions and plans.

The source texts in this release include task-oriented dialogues from the TRAINS-91 and TRAINS-93 corpora (the latter released through LDC, TRAINS Spoken Dialog Corpus LDC95S25), narratives from the English Pear Stories, articles from the Wall Street Journal portions of the Penn Treebank (Treebank-2 LDC95T7) and the RST Discourse Treebank LDC2002T07, and the Vieira/Poesio Corpus which consists of training and test files from Treebank-2 and RST Discourse Treebank.

The texts were annotated using the ARRAU guidelines which treat all noun phrases (NPs) as markables. Different semantic roles are recognized by distinguishing between referring expressions (that update or refer to a discourse model), and non-referring ones (including expletives, predicative expressions, quantifiers, and coordination). A variety of linguistic features were also annotated, including morphosyntactic agreement, grammatical function, semantic type (person, animate, concrete, action, time, other abstract) and genericity. The annotation was carried out using the MMAX2 annotation tool which allows text units to be marked at different levels.

The files in MMAX format have been organized so that they can be visualized using the MMAX2 tool or directly used as input/output for the BART toolkit which performs automatic coreference resolution including all necessary preprocessing steps.

The ARRAU Corpus of Anaphoric Information is distributed via web download.

2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Monday, November 18, 2013

LDC November 2013 Newsletter

Invitation to Join for Membership Year 2014
Spring 2014 LDC Data Scholarship Program
LDC to Close for Thanksgiving Break

New publications:

Chinese Treebank 8.0
CSC Deceptive Speech

Invitation to Join for Membership Year (MY) 2014

Membership Year (MY) 2014 is open for joining. We would like to invite all current and previous members of LDC to renew their membership as well as welcome new organizations to join the Consortium. For MY2014, LDC is pleased to maintain membership fees at last year’s rates – membership fees will not increase. Additionally, LDC will extend discounts on membership fees to members who keep their membership current and who join early in the year.

The details of our early renewal discounts for MY2014 are as follows:

·   Organizations who joined for MY2013 will receive a 5% discount when renewing. This discount will apply throughout 2014, regardless of time of renewal. MY2013 members renewing before Monday, March 3, 2014 will receive an additional 5% discount, for a total 10% discount off the membership fee.

·    New members as well as organizations who did not join for MY2013, but who held membership in any of the previous MYs (1993-2012), will also be eligible for a 5% discount provided that they join/renew before March 3, 2014.

Not-for-Profit/US Government
Standard US$2400 (MY 2014 Fee)
              US$2280 (with 5% discount)*
              US$2160 (with 10% discount)**

Subscription US$3850 (MY 2014 Fee)
                    US$3658 (with 5% discount)*
                    US$3465 (with 10% discount)**

For-Profit
Standard US$24000 (MY 2014 Fee)
               US$22800 (with 5% discount)*
               US$21600 (with 10% discount)**

Subscription US$27500 (MY 2014 Fee)
                    US$26125 (with 5% discount)*
                    US$24750 (with 10% discount)**

* For new members, MY2013 Members renewing for MY2014, and any previous year Member who renews before March 3, 2014

** For MY2013 Members renewing before March 3, 2014
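The discounted figures in the fee schedule above can be reproduced with a short calculation. The following is a minimal sketch, not an official LDC calculator: the function name and the dictionary are hypothetical, and the round-half-up rule is an assumption chosen because it matches the one published figure that is not a whole-dollar amount (5% off US$3850).

```python
def discounted_fee(base_fee: int, discount_pct: int) -> int:
    """Return the fee in whole US dollars after a percentage discount.

    Integer arithmetic avoids floating-point surprises. Rounding half up
    is an assumption, not a documented LDC rule.
    """
    return (base_fee * (100 - discount_pct) + 50) // 100


# Base MY2014 fees as listed in the newsletter.
MY2014_FEES = {
    ("Not-for-Profit/US Government", "Standard"): 2400,
    ("Not-for-Profit/US Government", "Subscription"): 3850,
    ("For-Profit", "Standard"): 24000,
    ("For-Profit", "Subscription"): 27500,
}

for (category, level), fee in MY2014_FEES.items():
    five = discounted_fee(fee, 5)    # new, returning, or MY2013 members
    ten = discounted_fee(fee, 10)    # MY2013 members renewing before March 3, 2014
    print(f"{category} / {level}: US${fee} -> US${five} (5%) -> US${ten} (10%)")
```

The 10% rate is the sum of the two 5% discounts applied to the base fee, not a discount compounded on an already reduced fee.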

Publications for MY2014 are still being planned; here are the working titles of data sets we intend to provide:

2009 NIST Language Recognition Evaluation
Callfriend Farsi Speech and Transcripts
GALE data -- all phases and genres
Hispanic-English Speech
MADCAT Phase 4 Training
MALACH Czech ASR
NIST OpenMT Five Language Progress Set

In addition to receiving new publications, current year members of LDC also enjoy the benefit of licensing older data at reduced costs; current year for-profit members may use most data for commercial applications.

Spring 2014 LDC Data Scholarship Program

Applications are now being accepted through Wednesday, January 15, 2014, 11:59PM EST for the Spring 2014 LDC Data Scholarship program! The LDC Data Scholarship program provides university students with access to LDC data at no cost. During previous program cycles, LDC has awarded no-cost copies of LDC data to over 35 individual students and student research groups.

This program is open to students pursuing both undergraduate and graduate studies in an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay. The selection process is highly competitive.

The application consists of two parts:

(1) Data Use Proposal. Applicants must submit a proposal describing their intended use of the data. The proposal should state which data the student plans to use and how the data will benefit their research project as well as information on the proposed methodology or algorithm.

Applicants should consult the LDC Catalog for a complete list of data distributed by LDC. Due to certain restrictions, a handful of LDC corpora are available only to members of the Consortium. Applicants are advised to select a maximum of one to two datasets; students may apply for additional datasets during the following cycle once they have completed processing of the initial datasets and published or presented work in some juried venue.

(2) Letter of Support. Applicants must submit one letter of support from their thesis adviser or department chair. The letter must verify the student's need for data and confirm that the department or university lacks the funding to pay the full Non-member Fee for the data or to join the Consortium.

For further information on application materials and program rules, please visit the LDC Data Scholarship page.

Students can email their applications to the LDC Data Scholarship program. Decisions will be sent by email from the same address.

The deadline for the Spring 2014 program cycle is January 15, 2014, 11:59PM EST.

LDC to Close for Thanksgiving Break

LDC will be closed on Thursday, November 28, 2013 and Friday, November 29, 2013 in observance of the US Thanksgiving Holiday. Our offices will reopen on Monday, December 2, 2013.

New publications

Chinese Treebank 8.0 consists of approximately 1.5 million words of annotated and parsed text from Chinese newswire, government documents, magazine articles, various broadcast news and broadcast conversation programs, web newsgroups and weblogs.

The Chinese Treebank project began at the University of Pennsylvania in 1998, continued at the University of Colorado and then moved to Brandeis University. The project’s goal is to provide a large, part-of-speech tagged and fully bracketed Chinese language corpus. The first delivery, Chinese Treebank 1.0, contained 100,000 syntactically annotated words from Xinhua News Agency newswire. It was later corrected and released in 2001 as Chinese Treebank 2.0 (LDC2001T11) and consisted of approximately 100,000 words. The LDC released Chinese Treebank 4.0 (LDC2004T05), an updated version containing roughly 400,000 words, in 2004. A year later, LDC published the 500,000 word Chinese Treebank 5.0 (LDC2005T01). Chinese Treebank 6.0 (LDC2007T36), released in 2007, consisted of 780,000 words. Chinese Treebank 7.0 (LDC2010T08), released in 2010, added new annotated newswire data, broadcast material and web text to the approximate total of one million words. Chinese Treebank 8.0 adds new annotated data from newswire, magazine articles and government documents.

There are 3,007 text files in this release, containing 71,369 sentences, 1,620,561 words, 2,589,848 characters (hanzi or foreign). The data is provided in UTF-8 encoding, and the annotation has Penn Treebank-style labeled brackets. Details of the annotation standard can be found in the segmentation, POS-tagging and bracketing guidelines included in the release. The data is provided in four different formats: raw text, word segmented, POS-tagged, and syntactically bracketed formats. All files were automatically verified and manually checked.
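As an illustration of how the word-segmented and POS-tagged views relate to the bracketed one, the sketch below pulls the (tag, word) leaves out of a Penn Treebank-style bracketed string. The example tree is invented for illustration and is not taken from the release; the helper name is hypothetical, and the actual file layout in Chinese Treebank 8.0 may carry extra wrappers or metadata that this sketch does not handle.

```python
import re

# A leaf is "(TAG token)": a tag and a token with no nested parentheses.
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")


def leaves(bracketed: str) -> list[tuple[str, str]]:
    """Return (POS tag, word) pairs in surface order from a bracketed tree."""
    return LEAF.findall(bracketed)


# Made-up example sentence, not from the corpus.
example = "(IP (NP (NN 经济)) (VP (VV 发展)))"

tagged = leaves(example)                         # POS-tagged view
segmented = " ".join(word for _, word in tagged)  # word-segmented view
raw = "".join(word for _, word in tagged)         # raw-text view

print(tagged)
print(segmented)
print(raw)
```

Note that this pattern also matches empty-category leaves (for example, tags such as -NONE-), which would need to be filtered out before rebuilding the surface text.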

Chinese Treebank 8.0 is distributed via web download. 2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

CSC Deceptive Speech was developed by Columbia University, SRI International and University of Colorado Boulder. It consists of 32 hours of audio interview from 32 native speakers of Standard American English (16 male, 16 female) recruited from the Columbia University student population and the community. The purpose of the study was to distinguish deceptive speech from non-deceptive speech using machine learning techniques on extracted features from the corpus.

The participants were told that they were participating in a communication experiment which sought to identify people who fit the profile of the top entrepreneurs in America. To this end, the participants performed tasks and answered questions in six areas. They were later told that they had received low scores in some of those areas and did not fit the profile. The subjects then participated in an interview where they were told to convince the interviewer that they had actually achieved high scores in all areas and that they did indeed fit the profile. The task of the interviewer was to determine how he thought the subjects had actually performed, and he was allowed to ask them any questions other than those that were part of the performed tasks. For each question from the interviewer, subjects were asked to indicate whether the reply was true or contained any false information by pressing one of two pedals hidden from the interviewer under a table.

Interviews were conducted in a double-walled sound booth and recorded to digital audio tape on two channels using Crown CM311A Differoid headworn close-talking microphones, then downsampled to 16 kHz before processing.

The interviews were orthographically transcribed by hand using the NIST EARS transcription guidelines. Labels for local lies were obtained automatically from the pedal-press data and hand-corrected for alignment, and labels for global lies were annotated during transcription based on the known scores of the subjects versus their reported scores. The orthographic transcription was force-aligned using the SRI telephone speech recognizer adapted for full-bandwidth recordings. There are several segmentations associated with the corpus: the implicit segmentation of the pedal presses, derived semi-automatically sentence-like units (EARS SLASH-UNITS or SUs) which were hand labeled, intonational phrase units and the units corresponding to each topic of the interview.

CSC Deceptive Speech is distributed on 1 DVD-ROM. 2013 Subscription Members will automatically receive two copies of this data provided they have completed and returned the User License Agreement for CSC Deceptive Speech (LDC2013S09). 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Wednesday, October 16, 2013

LDC October 2013 Newsletter

Fall 2013 LDC Data Scholarship Recipients

New publications:

GALE Phase 2 Chinese Broadcast News Speech

GALE Phase 2 Chinese Broadcast News Transcripts

OntoNotes Release 5.0

Fall 2013 LDC Data Scholarship Recipients

LDC is pleased to announce the student recipients of the Fall 2013 LDC Data Scholarship program. This program provides university and college students with access to LDC data at no cost. Students were asked to complete an application which consisted of a proposal describing their intended use of the data, as well as a letter of support from their thesis adviser. We received many solid applications and have chosen six proposals to support. The following students will receive no-cost copies of LDC data:

Shamama Afnan - Clemson University (USA), MS candidate, Electrical Engineering. Shamama has been awarded a copy of 2008 NIST Speaker Recognition Training and Test data for her work in speaker recognition.

Seyedeh Firoozabadi - University of Connecticut (USA), PhD candidate, Biomedical Engineering. Seyedeh has been awarded a copy of TIDIGITS and TI-46 Word for her work in speech recognition.

Lei Liu - Beijing Foreign Studies University (China), PhD candidate, Foreign Language Education. Lei has been awarded a copy of Treebank-3 and Prague Czech-English Dependency Treebank 2.0 for his work in parsing.

Monisankha Pal - Indian Institute of Technology, Kharagpur (India), PhD candidate, Electronics and Electrical Communication Engineering. Monisankha has been awarded a copy of CSR-I (WSJ0) and CSR-II (WSJ1) for his work in speaker recognition.

Sachin Pawar - Indian Institute of Technology, Bombay (India), PhD candidate, Computer Science and Engineering. Sachin has been awarded a copy of ACE 2004 Multilingual Training Corpus for his work in named-entity recognition.

Sergio Silva - Federal University of Rio Grande do Sul (Brazil), MS candidate, Computer Science. Sergio has been awarded a copy of 2004 and 2005 Spring NIST Rich Transcription data for his work in diarization.

New publications

(1) GALE Phase 2 Chinese Broadcast News Speech was developed by LDC and is comprised of approximately 126 hours of Mandarin Chinese broadcast news speech collected in 2006 and 2007 by the Linguistic Data Consortium (LDC) and Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

Corresponding transcripts are released as GALE Phase 2 Chinese Broadcast News Transcripts (LDC2013T20).

Broadcast audio for the GALE program was collected at LDC's Philadelphia, PA USA facilities and at three remote collection sites: HKUST (Chinese), Medianet (Tunis, Tunisia) (Arabic), and MTC (Rabat, Morocco) (Arabic). The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources for a total of over 30,000 hours of collected broadcast audio over the life of the program.

The recordings in this release feature news broadcasts focusing principally on current events from the following sources: Anhui TV, a regional television station in Anhui Province, Mainland China; China Central TV (CCTV), a national and international broadcaster in Mainland China; and Phoenix TV, a Hong Kong-based satellite television station.

This release contains 248 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Chinese speaker following Audit Procedure Specification Version 2.0 which is included in this release. The broadcast auditing process served three principal goals: as a check on the operation of the broadcast collection system equipment by identifying failed, incomplete or faulty recordings, as an indicator of broadcast schedule changes by identifying instances when the incorrect program was recorded, and as a guide for data selection by retaining information about a program's genre, data type and topic.

GALE Phase 2 Chinese Broadcast News Speech is distributed on 2 DVD-ROMs. 2013 Subscription Members will automatically receive two copies of this data. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) GALE Phase 2 Chinese Broadcast News Transcripts was developed by LDC and contains transcriptions of approximately 110 hours of Chinese broadcast news speech collected in 2006 and 2007 by LDC and Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

Corresponding audio data is released as GALE Phase 2 Chinese Broadcast News Speech (LDC2013S08).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 1,593,049 tokens. The transcripts were created with the LDC-developed transcription tool, XTrans, a multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings.

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC’s quick transcription guidelines (QTR) and quick rich transcription specification (QRTR) both of which are included in the documentation with this release. QTR transcription consists of quick (near-)verbatim, time-aligned transcripts plus speaker identification with minimal additional mark-up. It does not include sentence unit annotation. QRTR annotation adds structural information such as topic boundaries and manual sentence unit annotation to the core components of a quick transcript.

GALE Phase 2 Chinese Broadcast News Transcripts is distributed via web download. 2013 Subscription Members will automatically receive two copies of this data. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) OntoNotes Release 5.0 is the final release of the OntoNotes project, a collaborative effort between BBN Technologies, the University of Colorado, the University of Pennsylvania and the University of Southern California's Information Sciences Institute. The goal of the project was to annotate a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference).

OntoNotes Release 5.0 contains the content of earlier releases -- OntoNotes Release 1.0 LDC2007T21, OntoNotes Release 2.0 LDC2008T04, OntoNotes Release 3.0 LDC2009T24 and OntoNotes Release 4.0 LDC2011T03 -- and adds source data from, and/or additional annotations for, newswire (News), broadcast news (BN), broadcast conversation (BC), telephone conversation (Tele) and web data (Web) in English and Chinese and newswire data in Arabic. Also contained is English pivot text (Old Testament and New Testament text). This cumulative publication consists of 2.9 million words.

The OntoNotes project built on two time-tested resources, following the Penn Treebank for syntax and the Penn PropBank for predicate-argument structure. Its semantic representation includes word sense disambiguation for nouns and verbs, with some word senses connected to an ontology, and coreference.

Documents describing the annotation guidelines and the routines for deriving various views of the data from the database are included in the documentation directory of this release. The annotation is provided both in separate text files for each annotation layer (Treebank, PropBank, word sense, etc.) and in the form of an integrated relational database (ontonotes-v5.0.sql.gz) with a Python API to provide convenient cross-layer access.

OntoNotes Release 5.0 is distributed on 1 DVD-ROM. 2013 Subscription Members will automatically receive two copies of this data. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data at no charge subject to shipping and handling fees.

Tuesday, September 17, 2013

LDC September 2013 Newsletter

New LDC Website Coming Soon

LDC Spoken Language Sampler - 2nd Release

New publications:

GALE Phase 2 Arabic Broadcast Conversation Speech Part 2

GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 2

Semantic Textual Similarity (STS) 2013 Machine Translation

New LDC Website Coming Soon

Look for LDC's new website in the coming weeks. We've revamped the design and site plan to make it easier than ever to find what you're looking for. The features you use the most -- the catalog, new corpus releases and user login -- will be a short click away. We expect the LDC website to be occasionally unavailable for a few days at the end of September as we make the switch and thank you in advance for your understanding.

LDC Spoken Language Sampler - 2nd Release

The LDC Spoken Language Sampler – 2nd Release is now available. It contains speech and transcript samples from recent releases and is available at no cost. Follow the link above to the catalog page, download and browse.

New publications:

(1) GALE Phase 2 Arabic Broadcast Conversation Speech Part 2 was developed by LDC and is comprised of approximately 128 hours of Arabic broadcast conversation speech collected in 2007 by LDC as part of the DARPA GALE (Global Autonomous Language Exploitation) Program. The data was collected at LDC’s Philadelphia, PA USA facilities and at three remote collection sites. The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources for a total of over 30,000 hours of collected broadcast audio over the life of the program.

LDC's local broadcast collection system is highly automated, easily extensible and robust and capable of collecting, processing and evaluating hundreds of hours of content from several dozen sources per day. The broadcast material is served to the system by a set of free-to-air (FTA) satellite receivers, commercial direct satellite systems (DSS) such as DirecTV, direct broadcast satellite (DBS) receivers, and cable television (CATV) feeds. The mapping between receivers and recorders is dynamic and modular; all signal routing is performed under computer control, using a 256x64 A/V matrix switch. Programs are recorded in a high bandwidth A/V format and are then processed to extract audio, to generate keyframes and compressed audio/video, to produce time-synchronized closed captions (in the case of North American English) and to generate automatic speech recognition (ASR) output.

The broadcast conversation recordings in this release feature interviews, call-in programs and round table discussions focusing principally on current events from several sources. This release contains 141 audio files presented in .wav, 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Arabic speaker following Audit Procedure Specification Version 2.0 which is included in this release.
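For readers who want to confirm that a file matches the stated audio format (16000 Hz, single-channel, 16-bit PCM), the following is a minimal sketch using Python's standard wave module. The helper names are hypothetical, and the sketch generates a short silent clip in memory in place of a file from the release.

```python
import io
import wave

# Format stated in the release description: 16000 Hz, mono, 16-bit PCM.
EXPECTED = {"channels": 1, "sample_width_bytes": 2, "frame_rate": 16000}


def describe_wav(fileobj) -> dict:
    """Read the header of a WAV file (path or file-like object)."""
    with wave.open(fileobj, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "frame_rate": wav.getframerate(),
            "duration_seconds": wav.getnframes() / wav.getframerate(),
        }


def make_silent_wav(seconds: float = 0.5) -> io.BytesIO:
    """Build a stand-in clip in memory; a real check would open a corpus file."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)        # 2 bytes = 16 bits per sample
        wav.setframerate(16000)
        wav.writeframes(b"\x00\x00" * int(16000 * seconds))
    buf.seek(0)
    return buf


info = describe_wav(make_silent_wav())
matches = all(info[key] == value for key, value in EXPECTED.items())
print(info, "matches expected format:", matches)
```

At this format, one hour of uncompressed audio occupies 16000 samples x 2 bytes x 3600 seconds, or about 115 MB, before the file header.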

GALE Phase 2 Arabic Broadcast Conversation Speech Part 2 is distributed on 2 DVD-ROMs.

2013 Subscription Members will automatically receive two copies of this data. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 2 was developed by LDC and contains transcriptions of approximately 128 hours of Arabic broadcast conversation speech collected in 2007 by LDC, MediaNet, Tunis, Tunisia and MTC, Rabat, Morocco during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) program. The source broadcast conversation recordings feature interviews, call-in programs and round table discussions focusing principally on current events from several sources.

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 763,945 tokens. The transcripts were created with the LDC-developed transcription tool, XTrans, a multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings.

GALE Phase 2 Arabic Broadcast Conversation Transcripts - Part 2 is distributed via web download.

(3) Semantic Textual Similarity (STS) 2013 Machine Translation was developed as part of the STS 2013 Shared Task which was held in conjunction with *SEM 2013, the second joint conference on lexical and computational semantics organized by the ACL (Association for Computational Linguistics) interest groups SIGLEX and SIGSEM. It is comprised of one text file containing 750 English sentence pairs translated from Arabic and Chinese newswire and web data sources.

The goal of the Semantic Textual Similarity (STS) task was to create a unified framework for the evaluation of semantic textual similarity modules and to characterize their impact on natural language processing (NLP) applications. STS measures the degree of semantic equivalence. The STS task was proposed as an attempt at creating a unified framework that allows for an extrinsic evaluation of multiple semantic components that otherwise have historically tended to be evaluated independently and without characterization of impact on NLP applications. More information is available at the STS 2013 Shared Task homepage.

The source data is Arabic and Chinese newswire and web data collected by LDC that was translated and used in the DARPA GALE (Global Autonomous Language Exploitation) program and in several NIST Open Machine Translation evaluations. Of the 750 sentence pairs, 150 pairs are from the GALE Phase 5 collection and 600 pairs are from NIST 2008-2012 Open Machine Translation (OpenMT) Progress Test Sets (LDC2013T07).

The data was built to identify semantic textual similarity between two short text passages. The corpus is comprised of two tab delimited sentences per line. The first sentence is a translation and the second sentence is a post-edited translation. Post-editing is a process to improve machine translation with a minimum of manual labor. The gold standard similarity values and other STS datasets can be obtained from the STS homepage, linked above.
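A file laid out as described, with two tab-delimited sentences per line, can be read with a few lines of Python. This is a minimal sketch under that assumption only: the function name is hypothetical, the sample sentences are invented and not taken from the corpus, and the count of changed pairs is a simple illustration, not the STS gold-standard similarity score.

```python
def read_pairs(lines):
    """Parse lines of 'translation<TAB>post-edited translation' into tuples."""
    pairs = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {number}: expected 2 fields, got {len(fields)}")
        pairs.append((fields[0], fields[1]))
    return pairs


# Invented sample standing in for the corpus file.
sample = [
    "The minister said talks will resume.\tThe minister said talks would resume.\n",
    "Prices rose sharply last week.\tPrices rose sharply last week.\n",
]

pairs = read_pairs(sample)
changed = sum(1 for translation, post_edited in pairs if translation != post_edited)
print(f"{len(pairs)} pairs read, {changed} changed by post-editing")
```

To read an actual file, pass an open file handle (opened with encoding="utf-8") in place of the sample list.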

Semantic Textual Similarity (STS) 2013 Machine Translation is distributed via web download.