Invitation to Join for Membership Year (MY) 2016
Commercial use and LDC data
Spring 2016 Data Scholarship Program
LDC closed for Thanksgiving Break
New publications
Invitation to Join for Membership Year (MY) 2016
Membership Year (MY) 2016 is open for joining. We would like to invite all current and previous members of LDC to renew their membership as well as welcome new organizations to join the Consortium. For MY2016, LDC is pleased to maintain membership fees at last year’s rates – membership fees will not increase. Additionally, LDC will extend discounts on membership fees to members who keep their membership current and who join early in the year.
The details of our early renewal discounts for MY2016 are as follows:
- Organizations who joined for MY2015 will receive a 10% discount when renewing before March 1, 2016. After March 1, 2016, MY2015 members are eligible for a 5% discount when renewing through the end of the year.
- New members as well as organizations who did not join for MY2015, but who held membership in any of the previous MYs (1993-2014), will also be eligible for a 5% discount provided that they join/renew before March 1, 2016.
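For readers who want to check which discount applies to their organization, the two rules above can be sketched as a small function. This is only an illustration, not an official LDC calculator: the function name and parameters are hypothetical, and the sketch assumes that joining on March 1, 2016 itself counts as after the early deadline and that "the end of the year" means December 31, 2016.

```python
from datetime import date

EARLY_DEADLINE = date(2016, 3, 1)  # rules say "before March 1, 2016"
YEAR_END = date(2016, 12, 31)      # assumed reading of "the end of the year"


def my2016_discount_percent(join_date: date, was_my2015_member: bool) -> int:
    """Return the MY2016 discount, in percent, under the assumptions above."""
    if join_date > YEAR_END:
        return 0
    early = join_date < EARLY_DEADLINE  # March 1 itself is treated as late (assumption)
    if was_my2015_member:
        # MY2015 members: 10% before March 1, 5% through the end of the year.
        return 10 if early else 5
    # New members and members from MY1993-2014: 5% before March 1 only.
    return 5 if early else 0


if __name__ == "__main__":
    print(my2016_discount_percent(date(2016, 2, 15), was_my2015_member=True))
    print(my2016_discount_percent(date(2016, 6, 1), was_my2015_member=False))
```

Returning whole percentages rather than fractions keeps the comparison logic free of floating-point rounding.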
Publications for MY2016 are still being planned, but we expect to release the following:
- Arabic Treebank - Weblog ~ part-of-speech/morphological annotation and syntactic tree annotation of web text from various sources
- BOLT ~ all phases, languages, genres, tasks
- DEFT ~ Spanish and Chinese resources
- Digital Archive of Southern Speech - NLP Version ~ colloquial speech in the Southern United States; NLP version normalizes filenames and formats
- GALE Phase 3 and 4 data ~ all tasks and languages
- HAVIC ~ amateur video and transcripts
- NewSoMe Corpus of Opinion in Blogs ~ opinion annotated English and Spanish blogs
Commercial use and LDC data
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit our Licensing page for further information.
Spring 2016 Data Scholarship Program
Applications are now being accepted through Friday, January 15, 2016 for the Spring 2016 LDC Data Scholarship program. The LDC Data Scholarship program provides university students with access to LDC data at no cost. This program is open to students pursuing undergraduate or graduate studies at an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay.
The application consists of two parts:
(1) Data Use Proposal. Applicants must submit a proposal describing their intended use of the data. The proposal should state which data the student plans to use and how the data will benefit their research project as well as information on the proposed methodology or algorithm.
(2) Letter of Support. Applicants must submit one letter of support from their thesis adviser or department chair. The letter must verify the student's need for data and confirm that the department or university lacks the funding to pay the full nonmember fee for the data or to join the Consortium.
For further information on application materials and program rules, please visit the LDC Data Scholarship page.
Students can email their applications to the LDC Data Scholarship program. Decisions will be sent by email from the same address.
The deadline for the Spring 2016 program cycle is January 15, 2016.
LDC closed for Thanksgiving Break
LDC will be closed on Thursday, November 26, 2015 and Friday, November 27, 2015 in observance of the US Thanksgiving Holiday. Our offices will reopen on Monday, November 30, 2015.
New publications
(1) Articulation Index LSCP was developed by researchers at Laboratoire de Sciences Cognitives et Psycholinguistique (LSCP), Ecole Normale Supérieure. It revises and enhances a subset of Articulation Index (AIC) (LDC2005S22), a corpus of persons speaking English syllables. Changes include the addition of forced alignment to sound files, time alignment of syllable utterances and format conversions. AIC consists of 20 American English speakers (12 males, 8 females) pronouncing syllables, some of which form actual words, but most of which are nonsense syllables. All possible Consonant-Vowel (CV) and Vowel-Consonant (VC) combinations were recorded for each speaker twice, once in isolation and once within a carrier sentence, for a total of 25,768 recorded syllables.
Articulation Index LSCP alters AIC in the following ways:
- Time alignments for the onset and offset of each word and syllable were generated through forced alignment with a standard HMM-GMM (Hidden Markov Model-Gaussian Mixture Model) ASR system.
- The time alignments for the beginning and end of the syllables (whether in isolation or within a carrier sentence) were manually adjusted. The time alignments for the other words in carrier sentences were not manually adjusted.
- The recordings of isolated syllables were cut according to the manual time alignments to remove the silent portions at the beginning and end, and the time alignments were altered to correspond to the cut recordings.
- The file naming scheme was slightly altered for compatibility with the Kaldi speech recognition toolkit.
- AIC contains a wide-band (16 kHz, 16-bit PCM) and a narrow-band (8 kHz, 8-bit u-law) version of the recordings, distributed in SPHERE format. The LSCP version contains only the wide-band version, distributed as wave files.
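The cutting step described in the list above can be sketched roughly as follows. This is a minimal illustration using only Python's standard-library wave module, not the tooling the LSCP researchers actually used; the function name trim_wave, its parameters and the example file names are hypothetical, and the sketch assumes uncompressed PCM wave input with onset and offset given in seconds.

```python
import wave


def trim_wave(src_path, dst_path, onset_s, offset_s):
    """Write the [onset_s, offset_s] span of src_path to dst_path.

    Returns the alignment relative to the cut file as (new_onset, new_offset),
    i.e. the original times shifted so that the syllable starts at 0.0.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total = src.getnframes()
        # Convert seconds to frame indices and clamp them to the file length.
        start = max(0, min(int(round(onset_s * rate)), total))
        end = max(start, min(int(round(offset_s * rate)), total))
        src.setpos(start)
        frames = src.readframes(end - start)

    with wave.open(dst_path, "wb") as dst:
        # Same channels, sample width and rate; the frame count in the
        # header is corrected automatically when the file is closed.
        dst.setparams(params)
        dst.writeframes(frames)

    return 0.0, (end - start) / rate


if __name__ == "__main__":
    # Hypothetical file names and alignment times, for illustration only.
    print(trim_wave("syllable_full.wav", "syllable_cut.wav", 0.25, 0.75))
```

Shifting the alignment by the onset, so that the cut recording starts at time zero, is one simple way to keep the time alignments consistent with the trimmed audio.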
Articulation Index LSCP is distributed via web download.
2015 Subscription Members will automatically receive two copies of this corpus. 2015 Standard Members may request a copy as part of their 16 free membership corpora. This data is being made available at no cost to non-member organizations under a research license.
*
(2) GALE Phase 4 Chinese Newswire Parallel Sentences was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Chinese source sentences and corresponding English translations selected from newswire data collected by LDC in 2008 and translated by LDC or under its direction.
GALE Phase 4 Chinese Newswire Parallel Sentences includes 627 source-translation document pairs, comprising 90,434 tokens of Chinese source text and its English translation. Data is drawn from six distinct Chinese newswire sources.
Sentences were selected for translation in two steps. First, files were chosen using sentence selection scripts. Selected files were reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Chinese to English translation guidelines and were provided with the full source documents containing the target sentences for their reference. Bilingual LDC staff performed quality control procedures on the completed translations.
GALE Phase 4 Chinese Newswire Parallel Sentences is distributed via web download.
2015 Subscription Members will automatically receive two copies of this corpus. 2015 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(3) KHATT: Handwritten Arabic Text was developed by King Fahd University of Petroleum & Minerals, Technical University of Dortmund and Braunschweig University of Technology. It comprises scanned Arabic handwriting from 1,000 distinct male and female writers representing diverse countries, age groups, handedness and education levels. Participants produced text on a topic of their choice in an unrestricted style. KHATT was designed to promote research in areas such as text recognition and writer identification.
The majority of participants were natives of Saudi Arabia; the next largest group came from other countries in the region (Egypt, Jordan, Kuwait, Morocco, Palestine, Tunisia and Yemen). Most writers were between 16 and 25 years of age with high school or university qualifications.
KHATT: Handwritten Arabic Text is distributed on one USB drive.
2015 Subscription Members will automatically receive a copy of this corpus. 2015 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.