New publications
English News Text Treebank: Penn Treebank Revised
TS Wikipedia
The Walking Around Corpus
_________________________________________________________________________
Fall 2015 Data Scholarship Program
Applications are now being accepted through Tuesday, September 15, 2015 for the Fall 2015 LDC Data Scholarship program. The LDC Data Scholarship program provides university students with access to LDC data at no cost.
This program is open to students pursuing both undergraduate and graduate studies in an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay. The selection process is highly competitive.
The application consists of two parts:
(1) Data Use Proposal. Applicants must submit a proposal describing their intended use of the data. The proposal should state which data the student plans to use, how the data will benefit their research project, and the proposed methodology or algorithm.
Applicants should consult the LDC Catalog for a complete list of data distributed by LDC. Due to certain restrictions, a handful of LDC corpora are available only to members of the Consortium. Applicants are advised to select no more than two databases.
(2) Letter of Support. Applicants must submit one letter of support from their thesis adviser or department chair. The letter must confirm that the department or university lacks the funding to pay the full non-member fee for the data and verify the student's need for data.
For further information on application materials and program rules, please visit the LDC Data Scholarship page.
New publications
(1) English News Text Treebank: Penn Treebank Revised was developed by LDC with
funding through a gift from Google Inc. It consists of a combination of
automated and manual revisions of the Penn Treebank annotation of Wall Street
Journal (WSJ) stories. The data comprises 1,203,648 word-level tokens in 49,191
sentence-level tokens, spanning all 2,312 of the original Penn Treebank WSJ
files.
This release includes revised tokenization, part-of-speech, and syntactic
treebank annotation intended to bring the full WSJ treebank section into
compliance with the agreed-upon policies and updates implemented for current
English treebank annotation specifications at LDC. Examples of corpora
annotated under those specifications include English Web Treebank
(LDC2012T13), OntoNotes (LDC2013T19), and English translation treebanks such as
English Translation Treebank: An-Nahar Newswire (LDC2012T02).
English Treebank Supplemental Guidelines are included in this release.
2015 Subscription Members will automatically receive two
copies of this corpus on disc. 2015 Standard Members may request a copy
as part of their 16 free membership corpora. Non-members may license this
data for a fee.
(2) TS Wikipedia is a collection of approximately 1.6 million processed Turkish
Wikipedia pages. The data is tokenized and includes part-of-speech tags,
morphological analysis, lemmas, bi-grams and tri-grams.
The data is in a word-per-line format with five tab-separated columns: token,
part-of-speech tag, morphological analysis, lemma and, where needed, a
corrected spelling of the token. All data is presented in UTF-8 XML files and
was selected and filtered to reduce non-Turkish characters, mathematical
formulas and non-Turkish entries.
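As a rough illustration of the five-column layout described above, the following Python sketch splits word-per-line rows into named fields. It is only a sketch under stated assumptions: the field names (`token`, `pos`, `morph`, `lemma`, `corrected`) and the `parse_rows` helper are hypothetical, the sample row is invented for illustration and not drawn from the corpus, and the XML wrapping of the actual files is not described here, so any line with fewer than four tab-separated fields is simply skipped.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class TokenRow(NamedTuple):
    """One word-per-line row; field names are illustrative assumptions."""
    token: str
    pos: str
    morph: str
    lemma: str
    corrected: Optional[str]  # None when no corrected spelling is given


def parse_rows(lines: Iterable[str]) -> Iterator[TokenRow]:
    """Yield TokenRow objects from tab-separated lines.

    Lines with fewer than four fields (blank lines, or markup lines in
    the surrounding XML, whose exact structure is assumed unknown here)
    are skipped rather than guessed at.
    """
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) < 4:
            continue
        # Pad so a missing fifth column is treated as "no correction".
        fields += [""] * (5 - len(fields))
        token, pos, morph, lemma, corrected = fields[:5]
        yield TokenRow(token, pos, morph, lemma, corrected or None)


if __name__ == "__main__":
    # Invented sample row, not taken from TS Wikipedia.
    sample = ["kitaplar\tNoun\tkitap+Noun+A3pl\tkitap\t\n"]
    for row in parse_rows(sample):
        print(row.token, row.lemma, row.corrected)
```

Treating the fifth column as optional mirrors the "where needed" wording above; a stricter reader could instead reject any row that does not have exactly five fields.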
TS Wikipedia is distributed via web download.
2015 Subscription Members will automatically receive two
copies of this corpus on disc. 2015 Standard Members may request a copy
as part of their 16 free membership corpora. Non-members may license this
data for a fee. TS Wikipedia is made available to for-profit members
under the LDC For-Profit Membership Agreement and to not-for-profit members and
non-members under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0
license.
(3) The Walking Around Corpus was developed by Stony Brook University and
comprises approximately 33 hours of navigational telephone dialogues from 72
speakers (36 speaker pairs). Participants were Stony Brook University students
who identified themselves as native English speakers.
This corpus was elicited using a navigation task in which
one person directed another to walk to 18 unique destinations on Stony Brook
University’s West campus. The direction-giver remained inside the lab and gave
directions on a landline telephone to the pedestrian who used a mobile phone.
As they visited each location, the pedestrians took a picture of each of the 18
destinations using the mobile phone. Pairs conversed spontaneously as they
completed the task. The pedestrians' locations were tracked using their cell
phones' GPS systems. The pedestrians did not have any maps or pictures of the
target destinations and therefore relied on the direction-giver's verbal
directions and descriptions to locate and photograph the target destinations.
Each digital audio file was transcribed with time stamps.
The corpus material also includes the visual materials (pictures and maps) used
to elicit the dialogues, data about the speakers' relationship, spatial
abilities and memory performance, and other information.
All audio is presented as 8000 Hz, 16-bit FLAC-compressed WAV. Transcripts are
presented as XLS spreadsheets.
The Walking Around Corpus is distributed via web download.
2015 Subscription Members will automatically receive two
copies of this corpus on disc. 2015 Standard Members may request a copy
as part of their 16 free membership corpora. Non-members may license this
data for a fee.