Fall 2016 Data Scholarship Program
2015 User Survey Results
New Corpora
______________________________________________________________________
Fall 2016 Data Scholarship Program
Applications are now being accepted through Thursday, September 15, 2016 for the Fall 2016 LDC Data Scholarship program. The LDC Data Scholarship program provides university students with access to LDC data at no cost.
This program is open to students pursuing both undergraduate and graduate studies in an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay. The selection process is highly competitive.
The application consists of two parts:
(1) Data Use Proposal. Applicants must submit a two-page proposal describing their intended use of the data. The proposal should state which data the student plans to use, how the data will benefit their research project, which methodology or algorithm will be used, and how success will be measured.
Applicants should consult the Catalog for a complete list of data distributed by LDC. Due to certain restrictions, a handful of LDC corpora are available only to members of the Consortium. Applicants are advised to select no more than two databases.
(2) Letter of Support. Applicants must submit one letter of support from their thesis adviser or department chair. The letter must be signed and printed on letterhead, describe the student and the research, evaluate the probability of success and confirm that the department or university lacks the funding to pay the full non-member fee for the data.
For further information on application materials and program rules, please visit the LDC Data Scholarship page.
2015 User Survey Results
LDC conducted its fourth user survey in December 2015. This survey
built on the previous surveys conducted in 2006, 2007 and 2012 to assess user
sentiment and also asked for the evaluation of key LDC-related topics
including:
· Opinions on the new website and usability of the Catalog
· Use of and satisfaction with the enhanced user services and e-commerce system
· LDC’s Data Management Plan capabilities
· Suggestions for future publications and preferred data delivery methods
· Use of web services for data access and processing
Overall, survey respondents were satisfied with LDC’s data,
membership options, website, Catalog and enhanced user services. Participants
cited the top five most useful corpora received between 2012 and 2015 as OntoNotes
Release 5.0, TIMIT, TAC KBP Reference Knowledge Base, Penn
Discourse Treebank V 2.0, and Multi-Channel WSJ Audio. Three-fourths
of respondents prefer digital delivery of data, and the top three languages for
current research demands were identified as English, Chinese and Spanish.
We thank everyone who participated in this survey. Responses will
benefit the future of the Consortium and will help LDC to better meet the needs
of our members and data licensees.
New Corpora
(1) English Speed Networking Conversational Transcripts was
developed at the University
of the West of England and contains 388 transcripts of English
face-to-face and instant messaging conversations about business ideas
collected in 2014 and 2015 from participants (undergraduate students) playing
different power roles.
This corpus was created to examine communication accommodation,
specifically, the ways in which an individual's linguistic style is affected by
social power and personality. The data was collected in two studies. In the
first study, 40 participants had a series of paired five-minute face-to-face
conversations playing either a high, low or neutral power role. The same procedure
was followed in the second study except that participants discussed business
ideas via instant messaging.
The face-to-face conversations were audio-recorded and transcribed
verbatim.
All transcripts are presented as UTF-8 plain text files.
English Speed Networking Conversational Transcripts is distributed
via web download.
2016 Subscription Members will automatically receive two copies of
this corpus. 2016 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for US $400.00.
*
(2) Digital Archive of Southern Speech - NLP Version (DASS-NLP) was
developed by LDC as an alternate version of Digital Archive of Southern Speech
(DASS) (LDC2012S03) suitable for natural language processing and human language
technology applications. Specifically, the original audio files have been
converted to 16 kHz, 16-bit FLAC-compressed WAV, and file names have been
normalized to facilitate automatic processing.
DASS was developed by the University of Georgia. It is a subset of the Linguistic
Atlas of the Gulf States (LAGS), which is in turn part of the Linguistic Atlas
Project (LAP). DASS-NLP contains approximately 366 hours of English speech data
from 30 female speakers and 34 male speakers, along with associated metadata
about the speakers and the recordings, and maps in .jpeg format relating to the
recording locations.
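
For a rough sense of scale, the figures above (about 366 hours of 16 kHz, 16-bit audio) can be turned into an uncompressed size estimate. The sketch below is only an illustration: the function name is hypothetical, single-channel audio is an assumption not stated in this announcement, and the FLAC-compressed download will be smaller than the result.

```python
def uncompressed_pcm_bytes(hours, sample_rate_hz, bits_per_sample, channels=1):
    """Return the size in bytes of raw PCM audio with the given parameters."""
    seconds = hours * 3600
    bytes_per_sample = bits_per_sample // 8
    return seconds * sample_rate_hz * bytes_per_sample * channels


# Figures from this release: roughly 366 hours at 16 kHz, 16-bit.
# channels=1 (mono) is an assumption, not something stated in the announcement.
dass_bytes = uncompressed_pcm_bytes(366, 16_000, 16, channels=1)
print(f"{dass_bytes / 1e9:.1f} GB before compression")
```

Under the mono assumption this comes to roughly 42 GB of raw samples, which is an upper bound on the FLAC-compressed download rather than its actual size.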
LAP consists of a set of survey research projects about the words
and pronunciation of everyday American English, the largest project of its kind
in the United States. Interviews with thousands of native speakers across the
country have been carried out since 1929. LAGS surveyed the everyday speech of
Georgia, Tennessee, Florida, Alabama, Mississippi, Arkansas, Louisiana, and
Texas in a series of 914 audio-taped interviews conducted from 1968 to 1983.
The speakers' average age is 61 years; there are 30 women and 34
men from the Gulf States region represented in this release. The interviews
cover common topics such as family, the weather, household articles and
activities, agriculture and social conditions.
Digital Archive of Southern Speech - NLP Version is distributed
via web download.
2016 Not-for-Profit Subscription Members will automatically
receive two copies of this corpus. 2016 For-Profit Subscription
Members will receive two copies provided they have submitted a completed
copy of the For-Profit Member User
License Agreement for Digital Archive of Southern Speech - NLP Version
(LDC2016S05). 2016 Standard Members may request a copy as part of
their 16 free membership corpora. This data is being made available at no cost
to non-member organizations under a research license.
*
(3) GALE Phase 3 and 4 Chinese Broadcast News Parallel Text was
developed by LDC. Along with other corpora, the parallel text in this release
comprised training data for Phases 3 and 4 of the DARPA GALE (Global Autonomous
Language Exploitation) Program. This corpus contains Chinese source text and
corresponding English translations selected from broadcast news data collected
by LDC between 2006 and 2008 and transcribed and translated by LDC or under its
direction.
GALE Phase 3 and 4 Chinese Broadcast News Parallel Text includes
76 source-translation document pairs, comprising 614,608 tokens of Chinese
source text and its English translation. Data is drawn from 16 distinct Chinese
programs broadcast between 2006 and 2008 by China Central TV, a national and
international broadcaster in Mainland China, and Phoenix TV, a Hong Kong-based
satellite television station. The programs in this release are news
programs covering current events.
The files in this release were transcribed by LDC staff and/or transcription
vendors under contract to LDC in accordance with the Quick Rich Transcription
guidelines developed by LDC.
Source data and translations are distributed in TDF format. All
data are encoded in UTF-8.
GALE Phase 3 and 4 Chinese Broadcast News Parallel Text is distributed
via web download.
2016 Subscription Members will automatically receive two copies of
this corpus. 2016 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for US $1750.00.
*
(4) IARPA Babel Cantonese Language Pack IARPA-babel101b-v0.4c was
developed by Appen for the IARPA (Intelligence Advanced Research Projects
Activity) Babel program. It contains approximately 215 hours of Cantonese
conversational and scripted telephone speech collected in 2011 along with
corresponding transcripts.
The Babel program focuses on underserved languages and seeks to
develop speech recognition technology that can be rapidly applied to any human
language to support keyword search performance over large amounts of recorded
speech.
The Cantonese speech in this release represents that spoken in the
Chinese provinces of Guangdong and Guangxi, and within those provinces, among
five dialect groups. The gender distribution among speakers is approximately
even; speakers' ages range from 16 years to 67 years. Calls were made using
different telephones (e.g., mobile, landline) from a variety of environments
including the street, a home or office, a public place, and inside a vehicle.
All audio data is presented as 8 kHz, 8-bit A-law encoded audio in
SPHERE format. Transcripts are available in two versions: simplified Chinese
characters and a romanization scheme based on the Yale system, both encoded in
UTF-8.
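
Readers who have not worked with A-law audio may find a small decoding sketch helpful. The code below applies the standard G.711 A-law expansion to turn each 8-bit code into a signed 16-bit linear sample. It is an illustration only and not LDC or Appen tooling: the function names are hypothetical, and it assumes the SPHERE header has already been stripped and that the payload is plain, uncompressed A-law bytes.

```python
def alaw_byte_to_pcm16(code):
    """Decode one 8-bit A-law code (0-255) to a signed 16-bit linear sample."""
    code ^= 0x55                     # undo the bit inversion applied by the encoder
    magnitude = (code & 0x0F) << 4   # 4-bit mantissa
    segment = (code & 0x70) >> 4     # 3-bit segment number (exponent)
    if segment == 0:
        magnitude += 8
    else:
        magnitude = (magnitude + 0x108) << (segment - 1)
    # After the inversion is undone, a set top bit marks a positive sample.
    return magnitude if code & 0x80 else -magnitude


def decode_alaw(payload):
    """Decode a bytes object of A-law codes into a list of 16-bit samples."""
    return [alaw_byte_to_pcm16(b) for b in payload]


# Hypothetical two-byte payload: the smallest positive and negative samples.
samples = decode_alaw(bytes([0xD5, 0x55]))
print(samples)
```

In practice an established audio tool or library would normally handle both the SPHERE container and the A-law expansion; the sketch is meant only to show what "8-bit A-law" means at the sample level.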
IARPA Babel Cantonese Language Pack IARPA-babel101b-v0.4c is distributed via web
download.
2016 Subscription Members will receive two copies of this corpus
provided they have submitted a completed copy of the IARPA User Agreement
for Not-for-Profit Members or the IARPA User Agreement for For-Profit Members.
2016 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for US $25.00 under a research license.