Monday, November 17, 2014

LDC 2014 November Newsletter

Fall 2014 Data Scholarship Recipients

Invitation to Join for Membership Year (MY) 2015

Spring 2015 Data Scholarship Program

LDC is now on Twitter

LDC closed for Thanksgiving Break

New publications:

Fall 2014 Data Scholarship Recipients
LDC is pleased to announce the student recipients of the Fall 2014 LDC Data Scholarship program.  The following students will receive no-cost copies of LDC data:
Mohammed Abumatar ~ University of Jordan (Jordan), BSc candidate, Computer Engineering.  Mohammed has been awarded copies of MADCAT Phase 1-3 Training Data for his work in handwriting recognition.

Ramy Baly ~ American University of Beirut (Lebanon), PhD candidate, Electrical and Computer Engineering.  Ramy has been awarded copies of Arabic Treebank Parts 1-3 for his work in opinion mining.

Abbas Khosravanai ~ Amirkabir University of Technology (Iran), PhD candidate, Computer Engineering.  Abbas has been awarded a copy of 2008 NIST Speaker Recognition for his work in robust speaker recognition.

Phuc Nguyen ~ University of North Texas (USA), PhD candidate, Computer Science and Engineering.  Phuc has been awarded a copy of Message Understanding Conference (MUC) 7 for his work in named entity recognition.

Invitation to Join for Membership Year (MY) 2015
Membership Year (MY) 2015 is open for joining.  We would like to invite all current and previous members of LDC to renew their membership as well as welcome new organizations to join the Consortium.  For MY2015, LDC is pleased to maintain membership fees at last year’s rates – membership fees will not increase.  Additionally, LDC will extend discounts on membership fees to members who keep their membership current and who join early in the year.

The details of our early renewal discounts for MY2015 are as follows:

Organizations who joined for MY2014 will receive a 10% discount when renewing before March 2, 2015. After March 2, 2015, MY2014 members are eligible for a 5% discount when renewing through the end of the year.

New members as well as organizations who did not join for MY2014, but who held membership in any of the previous MYs (1993-2013), will also be eligible for a 5% discount provided that they join/renew before March 2, 2015.
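
The discount rules above can be summarized in a short Python sketch. The function name and its parameters are hypothetical, and two points are assumptions rather than stated policy: a join date of exactly March 2, 2015 is treated as falling outside the early window, and all dates are assumed to lie within MY2015.

```python
from datetime import date

# Early-renewal cutoff taken from the text: discounts change on March 2, 2015.
EARLY_DEADLINE = date(2015, 3, 2)


def my2015_discount_percent(joined_my2014: bool, join_date: date) -> int:
    """Return the MY2015 discount percentage for an organization.

    Assumption: join_date is within MY2015, and March 2 itself counts
    as "not before March 2" (the text does not say which tier applies).
    """
    early = join_date < EARLY_DEADLINE
    if joined_my2014:
        # MY2014 members: 10% when renewing early, 5% for the rest of the year.
        return 10 if early else 5
    # New members and members from MY1993-2013: 5% only when joining early.
    return 5 if early else 0
```

For example, under these assumptions an MY2014 member renewing in January 2015 would get 10%, while a new member joining in April 2015 would get no discount.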

Publications for MY2015 are still being planned, but we expect to release the following:

  • CIEMPIESS - Mexican Spanish radio broadcast audio and transcripts
  • GALE Phase 3 and 4 data - all tasks and languages
  • Mandarin Chinese Phonetic Segmentation and Tone Corpus - phonetic segmentation and tone labels
  • RATS Speech Activity Detection - multilanguage audio for robust speech detection and language identification
  • SEAME - Mandarin-English code-switching speech
  • SenSem Spanish and Catalan Lexicon and Databank - sentence semantics and verbal lexicons

Spring 2015 Data Scholarship Program
Applications are now being accepted through Thursday, January 15, 2015, 11:59PM EST for the Spring 2015 LDC Data Scholarship program. The LDC Data Scholarship program provides university students with access to LDC data at no cost. During previous program cycles, LDC has awarded no-cost copies of LDC data to over 40 individual students and student research groups. This program is open to students pursuing both undergraduate and graduate studies in an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay.

The application consists of two parts:

(1) Data Use Proposal. Applicants must submit a proposal describing their intended use of the data. The proposal should state which data the student plans to use and how the data will benefit their research project as well as information on the proposed methodology or algorithm.

(2) Letter of Support. Applicants must submit one letter of support from their thesis adviser or department chair. The letter must verify the student's need for data and confirm that the department or university lacks the funding to pay the full non-member fee for the data or to join the Consortium.

For further information on application materials and program rules, please visit the LDC Data Scholarship page.

Students can email their applications to the LDC Data Scholarship program. Decisions will be sent by email from the same address.

The deadline for the Spring 2015 program cycle is January 15, 2015, 11:59PM EST.

LDC is now on Twitter
LDC now has a Twitter feed. Start following us today for updates on new corpora releases and the latest LDC news.

LDC closed for Thanksgiving Break
LDC will be closed on Thursday, November 27, 2014 and Friday, November 28, 2014 in observance of the US Thanksgiving Holiday.  Our offices will reopen on Monday, December 1, 2014.


New publications

(1) Boulder Lies and Truth was developed at the University of Colorado Boulder and contains approximately 1,500 elicited English reviews of hotels and electronics for the purpose of studying deception in written language. Reviews were collected by crowd-sourcing with Amazon Mechanical Turk.

Each review was required to be original and was checked for plagiarism against the web. Reviews were annotated with respect to the following three dimensions:
Domain: Electronics (e.g., iPhone) or Hotels
Sentiment: Positive or Negative
Truth Value:

a) Truthful: a review about an object known by the writer reflecting the real sentiment of the writer toward the object of the review

b) Opposition: A review about an object known by the writer reflecting the opposite sentiment of the writer toward the object of the review (i.e., if the writer liked the object they were asked to write a negative review; if the writer did not like the object, they were asked to write a positive review)

c) Deceptive (i.e., fabricated): a review written about an object not known by the writer either positive or negative in sentiment; the objects reviewed were provided via a URL from the tasks in (a) and (b)

Each review was judged a total of 30 times: (1) 10 times to evaluate its perceived quality (on a range from 1-5); (2) 10 times with judgments about its perceived truthfulness (e.g., truthful or somehow deceptive, a lie or a fabrication); and (3) 10 times for its perceived sentiment (i.e., star rating).

Boulder Lies and Truth is distributed via web download.

2014 Subscription Members will receive two copies of this data on disc, provided they have completed the user license agreement.  2014 Standard Members may request a copy as part of their 16 free membership corpora.  This data is available at no cost to non-members under the same user license agreement.

*

(2) GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 2 was developed by LDC and contains 65,069 tokens of word aligned Chinese and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.

Some approaches to statistical machine translation include the incorporation of linguistic knowledge in word aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. A set of word tags and alignment link tags are designed in the tagging scheme to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.

This release consists of Chinese source broadcast conversation (BC) programming collected by LDC in 2008.

The Chinese word alignment tasks consisted of the following components:
  • Identifying, aligning, and tagging eight different types of links
  • Identifying, attaching, and tagging local-level unmatched words
  • Identifying and tagging sentence/discourse-level unmatched words
  • Identifying and tagging all instances of Chinese 的(DE) except when they were a part of a semantic link

GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 2 is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc.  2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(3) GALE Phase 2 Chinese Web Parallel Text was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Chinese source text and corresponding English translations selected from weblog and newsgroup data collected by LDC and translated by LDC or under its direction.

This release includes 46 source-translation document pairs, comprising 66,779 tokens of translated data. Data is drawn from four Chinese weblog and newsgroup sources.

Data was manually selected for translation according to several criteria, including linguistic features and topic features. The files were formatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Chinese to English translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations.

GALE Phase 2 Chinese Web Parallel Text is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc.  2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

Thursday, October 16, 2014

LDC 2014 October Newsletter

LDC at NWAV 43 

LDC Data Scholarship Update 

New publications:
Chinese Discourse Treebank 0.5 
GALE Arabic-English Word Alignment -- Broadcast Training Part 2 
United Nations Proceedings Speech

LDC at NWAV 43 

LDC will be exhibiting at the 43rd New Ways of Analyzing Variation Conference (NWAV 43)  held this year October 23-26 in Chicago, Illinois. Please stop by our table in the Old Town Room on the third floor of the Hilton to learn more about the most recent developments at the Consortium and to check out our latest giveaways. As always, LDC will post conference updates via our Facebook page. We hope to see you in Chicago!

LDC Data Scholarship Update

LDC received many solid applications for the Fall 2014 LDC Data Scholarship Program.  We are in the process of reviewing submissions and will announce recipients soon. The LDC Data Scholarship program provides university students with access to LDC data at no cost. Students were asked to complete an application which consisted of a proposal describing their intended use of the data, as well as a letter of support from their thesis adviser.

Data use proposals in this cycle included a range of research interests from opinion mining tagging to deceptive speech classification.

New publications

(1) Chinese Discourse Treebank 0.5 was developed at Brandeis University as part of the Chinese Treebank Project and consists of approximately 73,000 words of Chinese newswire text annotated for discourse relations. It follows the lexically grounded approach of the Penn Discourse Treebank (PDTB) (LDC2008T05) with adaptations based on the linguistic and statistical characteristics of Chinese text. Discourse relations are lexically anchored by discourse connectives (e.g., because, but, therefore), which are viewed as predicates that take abstract objects such as propositions, events and states as their arguments. Along with PDTB-style schemes for English, Turkish, Hindi and Czech, Chinese Discourse Treebank provides an additional perspective on how the PDTB approach can be extended for cross-lingual annotation of discourse relations.

Data was selected from the newswire material in Chinese Treebank 8.0 (LDC2013T21), specifically, from Xinhua News Agency stories. There are approximately 5,500 annotation instances. Following the PDTB format, each annotation instance consists of 27 vertical bar delimited fields. The fields specify the attributes of the discourse relation as a whole, as well as the attributes of its two arguments. Not all fields are filled in this release. Filled fields are indicated by a pair of angle brackets; the remaining fields are place holders for future releases.
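
As a rough illustration of the format described above, the following Python sketch splits one annotation instance into its 27 vertical-bar-delimited fields. The function names are hypothetical, the field meanings are not modeled (consult the corpus documentation for those), and treating any non-empty field as "filled" is an assumption made only for this example.

```python
from typing import List

# Field count taken from the text: 27 vertical-bar-delimited fields.
EXPECTED_FIELDS = 27


def split_annotation(line: str) -> List[str]:
    """Split one annotation instance into its fields and check the count."""
    fields = line.rstrip("\n").split("|")
    if len(fields) != EXPECTED_FIELDS:
        raise ValueError(
            f"expected {EXPECTED_FIELDS} fields, got {len(fields)}"
        )
    return fields


def non_empty_positions(fields: List[str]) -> List[int]:
    """Return zero-based positions of fields holding any text.

    Assumption: an empty field is a placeholder; the release may mark
    filled fields differently.
    """
    return [i for i, value in enumerate(fields) if value.strip()]


# Made-up placeholder line: 27 fields, only the first and last non-empty.
sample = "|".join(["x"] + [""] * 25 + ["y"])
fields = split_annotation(sample)
```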

Chinese Discourse Treebank 0.5 is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc.  2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(2) GALE Arabic-English Word Alignment -- Broadcast Training Part 2 was developed by LDC and contains 215,923 tokens of word aligned Arabic and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program. Some approaches to statistical machine translation include the incorporation of linguistic knowledge in word aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. A set of word tags and alignment link tags are designed in the tagging scheme to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.

This release consists of Arabic source broadcast news and broadcast conversation data collected by LDC from 2007-2009.

The Arabic word alignment tasks consisted of the following components:

Normalizing tokenized tokens as needed

Identifying different types of links

Identifying sentence segments not suitable for annotation

Tagging unmatched words attached to other words or phrases

GALE Arabic-English Word Alignment -- Broadcast Training Part 2 is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc.  2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(3) United Nations Proceedings Speech was developed by the United Nations (UN) and contains approximately 8,500 hours of recorded proceedings in the six official UN languages, Arabic, Chinese, English, French, Russian and Spanish. The data was recorded in 2009-2012 from sessions 64-66 of the General Assembly (GA) and First Committee (FC) (Disarmament and International Security), and meetings 6434-6763 of the Security Council.

Recordings were made using a customized system following a daily, internally circulated instruction from the Meetings Management Section. Most of the subjects and information related to a particular meeting or session are published in a UN Journal, which can be found here.

Data is presented as either MP3 or FLAC-compressed WAV files. All are 16-bit, single-channel files sampled at either 22,050 or 8,000 Hz, organized by committee and session number, then by language. The folder labeled "Floor" indicates the microphone used by the particular speaker. Those files may include other languages, for instance, if the speaker's language was not among the six official UN languages.

United Nations Proceedings Speech is distributed on one hard drive.

2014 Subscription Members will receive one copy of this data, provided they have completed the user license agreement.  2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

Monday, September 22, 2014

LDC 2014 September Newsletter


LDC at Interspeech 2014, Singapore

New publications:


LDC at Interspeech 2014, Singapore

LDC is off to Singapore to participate in Interspeech 2014. This year’s conference will be held from September 14-18 at Singapore’s Max Atria at the Expo Center. Please stop by LDC’s exhibition booth to learn more about recent developments at the Consortium and new publications. LDC will continue to post conference updates via our Facebook page. We hope to see you there!   
 
New publications

(1) ACE 2007 Multilingual Training Corpus was developed by LDC and contains the complete set of Arabic and Spanish training data for the 2007 Automatic Content Extraction (ACE) technology evaluation, specifically, Arabic and Spanish newswire data and Arabic weblogs annotated for entities and temporal expressions. The objective of the ACE program was to develop automatic content extraction technology to support automatic processing of human language in text form from a variety of sources including newswire, broadcast programming and weblogs. In the 2007 evaluation, participants were tested on system performance for the recognition of entities, values, temporal expressions, relations, and events in Chinese and English and for the recognition of entities and temporal expressions in Arabic and Spanish. LDC's work in the ACE program is described in more detail on the LDC ACE project pages.

The Arabic data is composed of newswire (60%) published in October 2000-December 2000 and weblogs (40%) published during the period November 2004-February 2005. The Spanish data set consists entirely of newswire material from multiple sources published in January 2005-April 2005. A document pool was established for each language based on genre and epoch requirements. Humans reviewed the pool to select individual documents suitable for ACE annotation, such as documents that were representative of their genre and contained targeted ACE entity types. One annotator completed the entity and temporal expression (TIMEX2) markup in the first pass annotation. This work was reviewed in the second pass by a senior annotator. TIMEX2 values were normalized by an annotator specifically trained for that task.

The table below describes the amount of data included in the current release and its annotation status. Corpus content for each language and data type is represented in the three stages of annotation: first pass annotation (1P), second pass annotation (2P) and TIMEX2 normalization and additional quality control (NORM).

Arabic
         Words                         Files
         1P        2P        NORM      1P     2P     NORM
NW       58,015    58,015    58,015    257    257    257
WL       40,338    40,338    40,338    121    121    121
Total    98,353    98,353    98,353    378    378    378

Spanish
         Words                         Files
         1P        2P        NORM      1P     2P     NORM
NW       100,401   100,401   100,401   352    352    352
Total    100,401   100,401   100,401   352    352    352

For a given document, there is a source .sgm file together with the .ag.xml and .apf.xml annotation files in each of the three directories "1p", "2p" and "timex2norm". In other words, for each newswire story or weblog entry, the three annotation directories each contain an identical copy of the source text (SGML .sgm file) along with distinct versions of the associated annotations (XML .ag.xml and .apf.xml files and plain text .tab files). All files are presented in UTF-8.
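
The layout described above can be sketched in Python as a mapping from annotation stage to the files expected for one document. This is only an illustration: the function name, the root path and the document ID are hypothetical, and the sketch assumes the three stage directories sit directly under a single root, which may not match the actual directory tree of the release.

```python
from pathlib import PurePosixPath
from typing import Dict, List

# Directory and extension names are taken from the description above.
STAGES = ("1p", "2p", "timex2norm")
EXTENSIONS = (".sgm", ".ag.xml", ".apf.xml", ".tab")


def expected_files(root: str, doc_id: str) -> Dict[str, List[PurePosixPath]]:
    """Map each annotation stage to the files expected for one document.

    Assumption: stage directories are direct children of `root`.
    """
    base = PurePosixPath(root)
    return {
        stage: [base / stage / f"{doc_id}{ext}" for ext in EXTENSIONS]
        for stage in STAGES
    }


# Hypothetical root and document ID, for illustration only.
files = expected_files("data", "DOC001")
```

Each stage then lists the same source file name plus that stage's own version of the annotation files.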

ACE 2007 Multilingual Training Corpus is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*
(2) GALE Arabic-English Word Alignment -- Broadcast Training Part 1 was developed by LDC and contains 267,257 tokens of word aligned Arabic and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.

Some approaches to statistical machine translation include the incorporation of linguistic knowledge in word aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. A set of word tags and alignment link tags are designed in the tagging scheme to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.

This release consists of Arabic source broadcast news and broadcast conversation data collected by LDC from 2007-2009. The distribution by genre, words, tokens and segments appears below:

Language   Genre   Files   Words     Tokens    Segments
Arabic     BC      231     79,485    103,816   4,114
Arabic     BN      92      131,789   163,441   7,227
Totals             323     211,274   267,257   11,341

Note that word count is based on the untokenized Arabic source, and token count is based on the tokenized Arabic source.

The Arabic word alignment tasks consisted of the following components:
Normalizing tokenized tokens as needed
Identifying different types of links
Identifying sentence segments not suitable for annotation
Tagging unmatched words attached to other words or phrases

GALE Arabic-English Word Alignment -- Broadcast Training Part 1 is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc.  2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*
(3) GALE Phase 2 Chinese Newswire Parallel Text Part 2 was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains 117,895 tokens of Chinese source text and corresponding English translations selected from newswire data collected by LDC in 2007 and translated by LDC or under its direction.

This release includes 177 source-translation document pairs, comprising 117,895 tokens of translated data. Data is drawn from four distinct Chinese newswire sources: China News Service, Guangming Daily, People's Daily and People's Liberation Army Daily.

Data was manually selected for translation according to several criteria, including linguistic features and topic features. The files were formatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Chinese to English translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations.

Source data and translations are distributed in TDF format. TDF files are tab-delimited files containing one segment of text along with meta information about that segment. Each field in the TDF file is described in TDF_format.text. All data are encoded in UTF-8.
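
A minimal Python sketch of reading such a tab-delimited file is shown below. It makes no assumptions about field names or order, which are defined in the corpus documentation. The function name is hypothetical, and whether a given file carries header or comment lines is left to the caller to handle.

```python
import csv
import io
from typing import Iterable, List


def read_tdf_rows(stream: Iterable[str]) -> List[List[str]]:
    """Read tab-delimited rows from an open text stream.

    QUOTE_NONE keeps quotation marks in the text as literal characters.
    Empty lines are skipped.
    """
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row]


# Made-up two-row sample standing in for real file content.
sample = io.StringIO("a\tb\tc\nd\te\tf\n")
rows = read_tdf_rows(sample)
```

For a real file, the stream could be opened with `open(path, encoding="utf-8", newline="")`, since the data are UTF-8 encoded.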

GALE Phase 2 Chinese Newswire Parallel Text Part 2 is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

Friday, August 15, 2014

LDC 2014 August Newsletter


Fall 2014 LDC Data Scholarship program- September 15 deadline approaching
Neural Engineering Data Consortium publishes first release

New publications:



Fall 2014 LDC Data Scholarship program- September 15 deadline approaching!
Student applications for the Fall 2014 LDC Data Scholarship program are being accepted now through Monday, September 15, 2014, 11:59PM EST.  The LDC Data Scholarship program provides university students with access to LDC data at no cost.  This program is open to students pursuing both undergraduate and graduate studies in an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay.  

Students will need to complete an application which consists of a data use proposal and letter of support from their adviser.  For further information on application materials and program rules, please visit the LDC Data Scholarship page.

Applicants can email their materials to the LDC Data Scholarship program. Decisions will be sent by email from the same address.

Neural Engineering Data Consortium publishes first release
The Neural Engineering Data Consortium (NEDC) has announced its first release, the Temple University Hospital Electroencephalogram (TUH EEG) corpus. The TUH EEG corpus is a database of over 20,000 EEG recordings which will aid the development of technology to automatically interpret EEG scans. NEDC, directed by Professors Iyad Obeid and Joe Picone of Temple University, Philadelphia, PA USA, designs, collects and distributes data and resources in support of neural engineering research.

NEDC is surveying community needs to help set priorities for future effort. You can complete the survey here.

New publications
(1) GALE Phase 2 Arabic Broadcast News Speech Part 1 was developed by LDC and is comprised of approximately 165 hours of Arabic broadcast news speech collected in 2006 and 2007 by LDC, MediaNet, Tunis, Tunisia and MTC, Rabat, Morocco during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. Corresponding transcripts are released as GALE Phase 2 Arabic Broadcast News Transcripts Part 1 (LDC2014T17).

Broadcast audio for the GALE program was collected at LDC’s Philadelphia, PA USA facilities and at three remote collection sites: Hong Kong University of Science and Technology, Hong Kong (Chinese), MediaNet (Tunis, Tunisia) (Arabic), and MTC (Rabat, Morocco) (Arabic). The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources for a total of over 30,000 hours of collected broadcast audio over the life of the program.

The broadcast recordings in this release feature news programs focusing principally on current events from the following sources: Abu Dhabi TV, a television station based in Abu Dhabi, United Arab Emirates; Al Alam News Channel, based in Iran; Alhurra, a U.S. government-funded regional broadcaster; Aljazeera, a regional broadcaster located in Doha, Qatar; Dubai TV, a broadcast station in the United Arab Emirates; Al Iraqiyah, an Iraqi television station; Kuwait TV, a national broadcast station in Kuwait; Lebanese Broadcasting Corporation, a Lebanese television station; Nile TV, a broadcast programmer based in Egypt; Saudi TV, a national television station based in Saudi Arabia; and Syria TV, the national television station in Syria.

This release contains 200 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Arabic speaker following Audit Procedure Specification Version 2.0 which is included in this release. The broadcast auditing process served three principal goals: as a check on the operation of the broadcast collection system equipment by identifying failed, incomplete or faulty recordings; as an indicator of broadcast schedule changes by identifying instances when the incorrect program was recorded; and as a guide for data selection by retaining information about a program’s genre, data type and topic.

GALE Phase 2 Arabic Broadcast News Speech Part 1 is distributed on three DVD-ROMs.

2014 Subscription Members will automatically receive two copies of this data.  2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*
(2) GALE Phase 2 Arabic Broadcast News Transcripts Part 1 was developed by LDC and contains transcriptions of approximately 165 hours of Arabic broadcast news speech collected in 2006 and 2007 by LDC, MediaNet, Tunis, Tunisia and MTC, Rabat, Morocco during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) program. Corresponding audio data is released as GALE Phase 2 Arabic Broadcast News Speech Part 1 (LDC2014S07).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 897,868 tokens. The transcripts were created with the LDC-developed transcription tool, XTrans, a multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings. 

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and quick rich transcription specification (QRTR) both of which are included in the documentation with this release. QTR transcription consists of quick (near-)verbatim, time-aligned transcripts plus speaker identification with minimal additional mark-up. It does not include sentence unit annotation. QRTR annotation adds structural information such as topic boundaries and manual sentence unit annotation to the core components of a quick transcript. Files with QTR as part of the filename were developed using QTR transcription. Files with QRTR in the filename indicate QRTR transcription.
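
The filename convention above could be checked with a small Python helper like the one below. The function name and the example filenames are hypothetical, and matching case-insensitively is an assumption; the actual naming pattern is documented with the release.

```python
def transcription_style(filename: str) -> str:
    """Guess the transcription style from a transcript filename.

    Returns "QRTR", "QTR", or "unknown". Matching is case-insensitive,
    which is an assumption rather than a documented rule.
    """
    upper = filename.upper()
    if "QRTR" in upper:
        return "QRTR"
    if "QTR" in upper:
        return "QTR"
    return "unknown"


# Hypothetical filenames, for illustration only.
styles = [
    transcription_style("example_program.qrtr.tdf"),
    transcription_style("example_program.qtr.tdf"),
    transcription_style("readme.txt"),
]
```

Because the letters "QTR" do not occur as a contiguous run inside "QRTR", the two checks cannot be confused with one another.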

GALE Phase 2 Arabic Broadcast News Transcripts Part 1 is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc.  2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*
(3) TAC KBP Reference Knowledge Base was developed by LDC in support of the NIST-sponsored TAC-KBP evaluation series. It is a knowledge base built from English Wikipedia articles and their associated infoboxes and covers over 800,000 entities.

TAC (Text Analysis Conference) is a series of workshops organized by NIST (the National Institute of Standards and Technology) to encourage research in natural language processing and related applications by providing a large test collection, common evaluation procedures, and a forum for researchers to share their results. TAC's KBP track (Knowledge Base Population) encourages the development of systems that can match entities mentioned in natural texts with those appearing in a knowledge base and extract novel information about entities from a document collection and add it to a new or existing knowledge base.

Consult the LDC TAC-KBP project page for further information about LDC's resource development for the TAC-KBP program.

The source data, Wikipedia infoboxes and articles, was taken from an October 2008 snapshot of Wikipedia.

TAC KBP Reference Knowledge Base contains a set of entities, each with a canonical name and title for the Wikipedia page, an entity type, an automatically parsed version of the data from the infobox in the entity's Wikipedia article, and a stripped version of the text of the Wiki article. Each entity is assigned one of four types: PER (person), ORG (organization), GPE (geo-political entity) and UKN (unknown). All data files are presented as UTF-8 encoded XML.

TAC KBP Reference Knowledge Base is distributed on one DVD-ROM.


2014 Subscription Members will automatically receive two copies of this data.  2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

Friday, July 18, 2014

LDC July 2014 Newsletter


New publications:








Fall 2014 Data Scholarship Program


Applications are now being accepted through Monday, September 15, 2014, 11:59PM EST for the Fall 2014 LDC Data Scholarship program. The LDC Data Scholarship program provides university students with access to LDC data at no cost.
 

This program is open to students pursuing both undergraduate and graduate studies in an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay. The selection process is highly competitive.
 

The application consists of two parts:
(1) Data Use Proposal. Applicants must submit a proposal describing their intended use of the data. The proposal should state which data the student plans to use and how the data will benefit their research project as well as information on the proposed methodology or algorithm.

 

Applicants should consult the LDC Catalog for a complete list of data distributed by LDC. Due to certain restrictions, a handful of LDC corpora are available only to members of the Consortium. Applicants are advised to select no more than two databases.
 

(2) Letter of Support. Applicants must submit one letter of support from their thesis adviser or department chair. The letter must confirm that the department or university lacks the funding to pay the full non-member fee for the data and verify the student's need for data.
 

For further information on application materials and program rules, please visit the LDC Data Scholarship page.



New publications

(1) 2009 NIST Language Recognition Evaluation Test Set contains approximately 215 hours of conversational telephone speech and radio broadcast conversation collected by LDC in the following 23 languages and dialects: Amharic, Bosnian, Cantonese, Creole (Haitian), Croatian, Dari, English (American), English (Indian), Farsi, French, Georgian, Hausa, Hindi, Korean, Mandarin, Pashto, Portuguese, Russian, Spanish, Turkish, Ukrainian, Urdu and Vietnamese.


The goal of the NIST (National Institute of Standards and Technology) Language Recognition Evaluation (LRE) is to establish the baseline of current performance capability for language recognition of conversational telephone speech and to lay the groundwork for further research efforts in the field. NIST conducted language recognition evaluations in 1996, 2003, 2005 and 2007. The 2009 evaluation increased the number of target languages. In addition to conversational telephone speech, the test data includes segments from multilingual Voice of America (VOA) radio broadcasts that were assessed as being of telephone bandwidth; these broadcasts account for most of the test data. Further information regarding this evaluation can be found in the evaluation plan, which is included in the documentation for this release.


LDC released the prior LREs as:

2003 NIST Language Recognition Evaluation (LDC2006S31)
2005 NIST Language Recognition Evaluation (LDC2008S05)
2007 NIST Language Recognition Evaluation Test Set (LDC2009S04)
2007 NIST Language Recognition Evaluation Supplemental Training Set (LDC2009S05)

The VOA speech data was collected by LDC in 2000 and 2001 and constitutes approximately 75% of the test set. The telephone speech was taken from LDC's Mixer 3 collection recorded between 2005 and 2007.


All test speech segments are presented as a sampled data stream in standard 8-bit 8-kHz μ-law format. Each segment is stored separately in a single channel SPHERE format file. The test segments contain three nominal durations of speech: 3 seconds, 10 seconds and 30 seconds. Actual speech durations vary, but were constrained to be within the ranges of 2-4 seconds, 7-13 seconds and 23-35 seconds, respectively.
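
As an illustration of the format described above, the following sketch decodes 8-bit μ-law bytes to 16-bit linear samples and maps a measured speech duration to its nominal category. It assumes the standard G.711 μ-law encoding and raw sample bytes with the SPHERE header already removed; the function names are hypothetical and are not part of the release.

```python
SAMPLE_RATE_HZ = 8000  # 8-kHz sampling, one byte per sample

# Nominal duration (seconds) -> allowed range of actual speech (seconds),
# as stated in the corpus description.
NOMINAL_RANGES = {3: (2.0, 4.0), 10: (7.0, 13.0), 30: (23.0, 35.0)}


def mulaw_decode(byte):
    """Decode one G.711 mu-law byte to a signed 16-bit linear sample."""
    u = ~byte & 0xFF              # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def segment_seconds(raw_samples):
    """Length in seconds of a headerless mu-law byte string."""
    return len(raw_samples) / SAMPLE_RATE_HZ


def nominal_duration(speech_seconds):
    """Return 3, 10 or 30 for a speech duration, or None if out of range."""
    for nominal, (low, high) in NOMINAL_RANGES.items():
        if low <= speech_seconds <= high:
            return nominal
    return None


# Synthetic example: 9.5 seconds of mu-law silence (byte 0xFF decodes to 0).
silence = bytes([0xFF]) * int(9.5 * SAMPLE_RATE_HZ)
print(segment_seconds(silence), nominal_duration(segment_seconds(silence)))
```

Note that the ranges in the description apply to the amount of speech in a segment, so a file's total length is only a rough guide to its nominal category.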


2009 NIST Language Recognition Evaluation Test Set is distributed on two DVD-ROMs. 2014 Subscription Members will automatically receive two copies of this data. 2014 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.


*

(2) GALE Arabic-English Word Alignment Training Part 3 -- Web was developed by LDC and contains 217,158 tokens of word-aligned Arabic and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.


Some approaches to statistical machine translation incorporate linguistic knowledge in word-aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. The tagging scheme defines a set of word tags and alignment link tags that describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.


Other releases available in this series are:

GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web (LDC2012T16)
GALE Chinese-English Word Alignment and Tagging Training Part 2 -- Newswire (LDC2012T20)
GALE Chinese-English Word Alignment and Tagging Training Part 3 -- Web (LDC2012T24)
GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web (LDC2013T05)
GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 1 (LDC2013T23)
GALE Arabic-English Word Alignment Training Part 1 -- Newswire and Web (LDC2014T05)
GALE Arabic-English Word Alignment Training Part 2 -- Newswire (LDC2014T10)

This release consists of Arabic source web data collected by LDC. The distribution by genre, words, character tokens and segments appears below:



Language   Genre   Files   Words     CharTokens   Segments
Arabic     WB      2,449   154,144   217,158      7,332


Note that word count is based on the untokenized Arabic source, and token count is based on the tokenized Arabic source.


The Arabic word alignment tasks consisted of the following components:

Normalizing tokenized tokens as needed
Identifying different types of links
Identifying sentence segments not suitable for annotation
Tagging unmatched words attached to other words or phrases

GALE Arabic-English Word Alignment Training Part 3 -- Web is distributed via web download. 2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) GALE Phase 2 Chinese Newswire Parallel Text Part 1 was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains 117,173 tokens of Chinese source text and corresponding English translations selected from newswire data collected by LDC in 2007 and transcribed by LDC or under its direction.


This release includes 167 source-translation document pairs, comprising 117,173 tokens of translated data. Data is drawn from four distinct Chinese newswire sources: China News Service, Guangming Daily, People's Daily and People's Liberation Army Daily.


The data was transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with Quick Rich Transcription guidelines developed by LDC. Transcribers indicated sentence boundaries in addition to transcribing the text. Data was manually selected for translation according to several criteria, including linguistic features, transcription features and topic features. The transcribed and segmented files were then reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Chinese to English translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations.


Source data and translations are distributed in TDF format. TDF files are tab-delimited files containing one segment of text along with meta information about that segment. Each field in the TDF file is described in TDF_format.text. All data are encoded in UTF-8.
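
A tab-delimited file of this kind can be read with standard tools. The sketch below is a minimal illustration, not the official reader: the field names in the example are hypothetical placeholders, and the actual field list should be taken from the format description included with the release.

```python
import csv
import io


def read_tdf(text, field_names):
    """Parse tab-delimited text into one dict per segment line.

    field_names is supplied by the caller; the names used in the
    example below are placeholders, not the real TDF fields.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    # Skip empty lines; pair each value with its field name.
    return [dict(zip(field_names, row)) for row in reader if row]


# Made-up sample with three hypothetical fields.
fields = ["doc_id", "segment_id", "text"]
sample = "doc1\t1\t你好，世界\ndoc1\t2\t第二句\n"

segments = read_tdf(sample, fields)
print(len(segments), segments[0]["text"])
```

When reading from disc, the file would be opened with encoding="utf-8" and its contents passed to read_tdf.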


GALE Phase 2 Chinese Newswire Parallel Text Part 1 is distributed via web download. 2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.