2012 User Survey Results
LDC to Close for Thanksgiving Break
New publications:
Spring 2013 LDC Data Scholarship Program
Applications are now being accepted through January 15, 2013, 11:59 PM EST for the Spring
2013 LDC Data Scholarship program! The LDC Data Scholarship program provides
university students with access to LDC data at no cost. During previous program
cycles, LDC has awarded no-cost copies of LDC data to over 25 individual
students and student research groups.
This program is open to students pursuing both undergraduate and graduate studies in
an accredited college or university. LDC Data Scholarships are not restricted
to any particular field of study; however, students must demonstrate a
well-developed research agenda and a bona fide inability to pay. The selection
process is highly competitive.
The application consists of two parts:
(1) Data Use Proposal.
Applicants must submit a proposal describing their intended use of the data.
The proposal should state which data the student plans to use and how the data
will benefit their research project as well as information on the proposed
methodology or algorithm.
Applicants should consult the LDC Corpus Catalog for a
complete list of data distributed by LDC. Due to certain restrictions, a
handful of LDC corpora are available only to members of the Consortium. Applicants
are advised to select a maximum of one to two datasets; students may apply for
additional datasets during the following cycle once they have completed
processing of the initial datasets and published or presented work in some juried
venue.
(2) Letter of Support.
Applicants must submit one letter of support from their thesis adviser or
department chair. The letter must verify the student's need for data and
confirm that the department or university lacks the funding to pay the full
Non-member Fee for the data or to join the consortium.
For further information on application materials and program rules, please visit
the LDC Data Scholarship page. Students can email
their applications to the LDC Data Scholarship program.
Decisions will be sent by email from the same address. The deadline for the
Spring 2013 program cycle is January 15, 2013, 11:59 PM EST.
Membership Year (MY) 2013 is open for joining! We would like to invite all current
and previous members of LDC to renew their membership as well as welcome new
organizations to join the consortium. For MY2013, LDC is pleased to maintain
membership fees at last year’s rates – membership fees will not increase.
Additionally, LDC will extend discounts on membership fees to members who keep their
membership current and who join early in the year.
The details of our early renewal discounts for MY2013 are as follows:
· Organizations who joined for MY2012
will receive a 5% discount when renewing. This discount will apply throughout
2013, regardless of time of renewal. MY2012 members renewing before March 1,
2013 will receive an additional 5% discount, for a total 10% discount off the
membership fee.
· New members as well as
organizations who did not join for MY2012, but who held membership in any of
the previous MYs (1993-2011), will also be eligible for a 5% discount provided
that they join/renew before March 1, 2013.
The following table provides exact pricing information.
| | MY2013 Fee | MY2013 Fee with 5% Discount* | MY2013 Fee with 10% Discount** |
| Not-for-Profit/US Government | | | |
| Standard | US$2400 | US$2280 | US$2160 |
| Subscription | US$3850 | US$3658 | US$3465 |
| For-Profit | | | |
| Standard | US$24000 | US$22800 | US$21600 |
| Subscription | US$27500 | US$26125 | US$24750 |
* For new members, MY2012 Members renewing for MY2013, and any previous year Member who renews before March 1, 2013
** For MY2012 Members renewing before March 1, 2013
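The discount rules and table above can be checked with a few lines of arithmetic. The following is a minimal sketch, not an LDC tool: the function names and the dictionary layout are hypothetical, and the round-half-up rule is an assumption inferred from the table listing US$3658 for 5% off US$3850.

```python
from datetime import date

# MY2013 list prices in US dollars, copied from the table above.
LIST_FEES = {
    ("not-for-profit", "standard"): 2400,
    ("not-for-profit", "subscription"): 3850,
    ("for-profit", "standard"): 24000,
    ("for-profit", "subscription"): 27500,
}

# Early-join discounts apply to organizations joining before this date.
EARLY_DEADLINE = date(2013, 3, 1)


def discount_percent(was_my2012_member, join_date):
    """Return the MY2013 discount (0, 5 or 10 percent) under the stated rules."""
    early = join_date < EARLY_DEADLINE
    if was_my2012_member:
        # 5% for renewing at any time in 2013, plus 5% for renewing early.
        return 10 if early else 5
    # New members and members from earlier years: 5% only when joining early.
    return 5 if early else 0


def discounted_fee(org_type, plan, was_my2012_member, join_date):
    """Return the fee in whole US dollars after any applicable discount."""
    fee = LIST_FEES[(org_type, plan)]
    pct = discount_percent(was_my2012_member, join_date)
    # Integer arithmetic with round-half-up to whole dollars (an assumption).
    return (fee * (100 - pct) + 50) // 100


if __name__ == "__main__":
    print(discounted_fee("not-for-profit", "standard", True, date(2013, 2, 1)))
```

Keeping the computation in integers avoids floating-point surprises on the one table entry that does not divide evenly.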
Publications for MY2013 are still being planned; here are the working titles of data sets we intend to provide:
- Arabic Treebank – Weblog
- Chinese-English Biomedical Parallel Text
- GALE data – all phases and tasks
- Hispanic-English Speech
- Maninkakan Lexicon
- OpenMT 2008-2012 Progress set
In addition to receiving new publications, current year members of the LDC also
enjoy the benefit of licensing older data at reduced costs; current year
for-profit members may use most data for commercial applications.
This past year, LDC members who joined early or kept their membership current saved almost US$70,000 collectively on membership fees. Be sure to keep an eye on your mail: all previous and current LDC members will be sent an invitation-to-join letter and a renewal invoice for MY2013. Renew early for MY2013 to save today!
LDC is offering early renewal discounts on membership fees for Membership Year 2013,
making now a good time to consider joining or renewing membership. LDC
membership has the following advantages:
- LDC membership provides cost-effective access to an extensive and growing catalog that spans 20 years and includes over 500 multilingual speech, text, and video resources. Even if your organization only needs a few datasets from a given membership year, membership is often the most economical way to obtain current corpora. Additionally, the generous discounts that member organizations receive on older corpora reduce the cost of acquiring such datasets.
- All members enjoy unlimited use of LDC data within their organizations. For universities, there is no difference in cost between a departmental membership and one that is university-wide. Departments can therefore combine resources and establish one LDC membership for use by the entire university community. Likewise, for-profit members with multiple branches can maintain one membership for use by their entire organizations.
For-profit organizations are reminded that an LDC membership is a prerequisite for
obtaining a commercial license to almost all LDC databases. Non-member
organizations, including non-member for-profit organizations, cannot use LDC
data to develop or test products for commercialization, nor can they use LDC
data in any commercial product or for any commercial purpose. LDC data users
should consult corpus-specific license agreements for limitations, including
commercial restrictions, on the use of certain corpora. In the case of a small
group of corpora, commercial licenses must be obtained separately from the
owners of the data.
Earlier this year, LDC sent a survey to its user communities. Like previous iterations
in 2006 and 2007, the survey solicited community input and suggestions on key
LDC-related topics, including:
- Satisfaction levels with LDC’s data, homepage and Catalog
- Reflections on LDC’s 20th Anniversary year
- Suggestions for future publications
- Speculations on the future of HLT-related fields, specifically on mobile technologies, cloud computing, social networking and open data
Survey respondents were generally satisfied with LDC’s data, membership options,
homepage and Catalog, though there were requests for additional data options
and data acquisition methods. Some of the data respondents requested are
already in our pipeline for the end of 2012 or for Membership Year (MY) 2013,
so please be on the lookout for Publications updates.
Respondents were also very supportive of LDC’s 20th Anniversary, posting
testimonials and well-wishes in the 20th Anniversary section.
LDC would like to thank all survey participants. Survey participants will receive
access to full survey results shortly.
LDC will be closed on Thursday, November 22, 2012 and Friday, November 23, 2012 in observance of the US Thanksgiving Holiday. Our offices will reopen on Monday, November 26, 2012.
New publications
(1) Annotated English Gigaword was developed by Johns Hopkins University's Human
Language Technology Center of Excellence. It adds automatically generated syntactic and
discourse structure annotation to English Gigaword Fifth Edition (LDC2011T07) and also
contains an API and tools for reading the dataset's XML files. The goal of the
annotation is to provide a standardized corpus for knowledge extraction and
distributional semantics that enables broader involvement in large-scale
knowledge-acquisition efforts by researchers.
Annotated English Gigaword contains the nearly ten million documents (over four billion
words) of the original English Gigaword Fifth Edition from seven news sources:
- Agence France-Presse, English Service (afp_eng)
- Associated Press Worldstream, English Service (apw_eng)
- Central News Agency of Taiwan, English Service (cna_eng)
- Los Angeles Times/Washington Post Newswire Service (ltw_eng)
- Washington Post/Bloomberg Newswire Service (wpb_eng)
- New York Times Newswire Service (nyt_eng)
- Xinhua News Agency, English Service (xin_eng)
The following layers of annotation were added:
- Tokenized and segmented sentences
- Treebank-style constituent parse trees
- Syntactic dependency trees
- Named entities
- In-document coreference chains
The annotation was performed in a three-step process: (1) the data was preprocessed
and sentences selected for annotation (sentences with more than 100 tokens were
excluded); (2) syntactic parses were derived; and (3) the parsed output was
post-processed to derive syntactic dependencies, named entities and coreference
chains. Over 183 million sentences were parsed.
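The sentence-selection rule in step (1) can be illustrated with a short filter. This is only a sketch of the stated length cutoff, assuming sentences arrive as lists of tokens; it does not reproduce the tokenizer, parser or tools actually used to build the corpus, and the function name is hypothetical.

```python
MAX_TOKENS = 100  # sentences with more than 100 tokens were excluded


def select_for_parsing(sentences, max_tokens=MAX_TOKENS):
    """Split tokenized sentences into (kept, skipped) lists by token count.

    `sentences` is an iterable of token lists. How the real pipeline
    tokenized its input is not described here, so tokenization is assumed
    to have happened already.
    """
    kept, skipped = [], []
    for tokens in sentences:
        if len(tokens) <= max_tokens:
            kept.append(tokens)
        else:
            skipped.append(tokens)
    return kept, skipped


if __name__ == "__main__":
    sample = [["A", "short", "sentence", "."], ["word"] * 101]
    kept, skipped = select_for_parsing(sample)
    print(len(kept), len(skipped))
```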
Annotated English Gigaword is distributed on one hard drive. 2012 Subscription Members
will automatically receive one copy of this data on hard drive. 2012
Standard Members may request a copy as part of their 16 free membership
corpora. 2011 Members who licensed English Gigaword Fifth Edition (LDC2011T07) may request a
no-cost copy of Annotated English Gigaword. Non-member organizations who
licensed English Gigaword Fifth Edition may request a copy of Annotated English
Gigaword for the US$200 media fee.
*
(2) Chinese-English Semiconductor Parallel Text was developed by The MITRE Corporation. It consists of parallel
sentences from a collection of abstracts from scientific articles on
semiconductors published in Mandarin and translated into English by translators
with particular expertise in the technical area. Translators were instructed to
err on the side of literal translation if required, but to maintain the
technical writing style of the source and to make the resulting English as
natural as possible. The translators followed specific guidelines for
translation, and those are included in this distribution.
There are 2,169 lines of parallel Mandarin and English, with a total of 125,302
characters of Mandarin and 64,851 words of English, presented in a separate
UTF-8 plain text file for each language. The sentences were translated in
sequential order but are presented in scrambled order; the two files are scrambled
identically, so sentences at the same line number are translations of each other.
For example, the 31st line of the English file is a translation of the 31st line
of the Mandarin file. The original line sequence is not provided.
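Because alignment is purely by line number, the two files can be paired with a simple line-by-line read. The sketch below assumes one UTF-8 text file per language as described; the function name and the file paths passed to it are hypothetical, since the actual file names in the distribution are not given here.

```python
def read_parallel(mandarin_path, english_path):
    """Return a list of (mandarin, english) sentence pairs aligned by line number."""

    def read_lines(path):
        # Iterating over a text file splits on newlines only, so each list
        # index corresponds to exactly one line of the file.
        with open(path, encoding="utf-8") as handle:
            return [line.rstrip("\n") for line in handle]

    mandarin = read_lines(mandarin_path)
    english = read_lines(english_path)
    if len(mandarin) != len(english):
        raise ValueError(
            f"line counts differ: {len(mandarin)} Mandarin vs {len(english)} English"
        )
    return list(zip(mandarin, english))
```

With zero-based indexing, `pairs[30]` holds the 31st line of each file. Checking that both files have the same number of lines guards against a truncated download silently misaligning every pair.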
Chinese-English Semiconductor Parallel Text is distributed via web download.
2012 Subscription Members will automatically receive two copies of this data on
disc. 2012 Standard Members may request a copy as part of their 16 free
membership corpora.
*
(3) GALE Phase 2 Arabic Newswire Parallel Text was developed by LDC. Along with other corpora, the
parallel text in this release comprised training data for Phase 2 of the DARPA
GALE (Global Autonomous Language Exploitation) Program. This corpus contains
Modern Standard Arabic source text and corresponding English translations
selected from newswire data collected in 2007 by LDC and transcribed by LDC or
under its direction.
GALE Phase 2 Arabic Newswire Parallel Text includes 400 source-translation pairs,
comprising 181,704 tokens of Arabic source text and its English translation.
Data is drawn from six distinct Arabic newswire sources: Al Ahram, Al Hayat,
Al-Quds Al-Arabi, An Nahar, Asharq Al-Awsat and Assabah.
The files in this release were transcribed by LDC staff and/or transcription
vendors under contract to LDC in accordance with the Quick Rich Transcription
guidelines developed by LDC. Transcribers indicated sentence boundaries in
addition to transcribing the text. Data was manually selected for translation
according to several criteria, including linguistic features, transcription
features and topic features. The transcribed and segmented files were then
reformatted into a human-readable translation format and assigned to
translation vendors. Translators followed LDC's Arabic to English translation
guidelines. Bilingual LDC staff performed quality control procedures on the
completed translations.
GALE Phase 2 Arabic Newswire Parallel Text is distributed via web download.
2012 Subscription Members will automatically receive two copies of this data on
disc. 2012 Standard Members may request a copy as part of their 16 free
membership corpora.