Renew your LDC membership today
Spring 2016 LDC Data Scholarship Program - deadline approaching
LDC at LSA 2016
LDC to close for Winter Break
New publications
________________________________________________________________________
Renew your LDC membership today
Membership Year 2016 (MY2016) discounts are available for those who keep their
membership current and join early in the year. Check here for
further information including our planned publications for MY2016.
Now is also a good time to consider joining LDC for the current and open membership years, MY2015 and MY2014. MY2015 includes data such as RATS Speech Activity Detection and updates to Penn Treebank. MY2014 remains open through the end of the 2015 calendar year and its publications include UN speech data, 2009 NIST LRE test set, 2007 ACE multilingual data, and multi-channel WSJ audio. For full descriptions of these data sets, visit our Catalog.
Spring 2016 LDC Data Scholarship Program - deadline approaching
The deadline for the Spring 2016 LDC
Data Scholarship Program is right around the corner! Student applications are
being accepted now through January 15, 2016, 11:59PM EST. The LDC Data
Scholarship program provides university students with access to LDC data at no
cost. This program is open to students pursuing both undergraduate and graduate
studies in an accredited college or university. LDC Data Scholarships are not
restricted to any particular field of study; however, students must demonstrate
a well-developed research agenda and a bona fide inability to pay.
Students will need to complete an application which consists of a data use proposal and letter of support from their adviser. For further information on application materials and program rules, please visit the LDC Data Scholarship page.
Students can email their applications to the LDC Data Scholarships program. Decisions will be sent by email from the same address.
LDC at LSA 2016
LDC will be exhibiting at the Annual Meeting of
the Linguistic Society of America, held January 7-10, 2016 in
Washington, DC. Stop by booth 110 to learn more about recent
developments at the Consortium and new publications. Also, be on
the lookout for the following presentations:
Satellite Workshop: Preparing Your Corpus for Archival Storage
Malcah Yaeger-Dror (University of Arizona) and Christopher Cieri (LDC)
Thursday, January 7, 2016 - 8:00am to 3:00pm, Salon 4
Broadening connections among researchers in linguistics and human language technologies
Jeff Good (University at Buffalo) and Christopher Cieri (LDC)
Friday, January 8, 2016 - 7:30am to 9:00am, Salon 1
Diachronic development of pitch contrast in Seoul Korean
Sunghye Cho (UPenn), Yong-cheol Lee (Cheongju University) and Mark Liberman (LDC)
Friday, January 8, 2016 - 2:00pm to 5:00pm, Salon 1
LDC to close for Winter Break
LDC will be closed from Friday, December
25, 2015 through Friday, January 1, 2016 in accordance with the University of
Pennsylvania Winter Break Policy. Our offices will reopen on Monday, January
4, 2016. Requests received for membership renewals and corpora during the
Winter Break will be processed at that time.
New publications
(1) 2006 CoNLL Shared Task - Arabic & Czech
consists of Arabic and Czech dependency treebanks used as part of the CoNLL
2006 shared task on multi-lingual dependency parsing.
This corpus is cross-listed with
ELRA as ELRA-W0087.
The Conference
on Computational Natural Language Learning (CoNLL) is accompanied every year by a shared task intended to
promote natural language processing applications and evaluate them in a
standard setting. In 2006, the shared task was devoted to the parsing of
syntactic dependencies using corpora from up to thirteen languages. The task
aimed to define and extend the then-current state of the art in dependency
parsing, a technology that complemented previous tasks by producing a different
kind of syntactic description of input text. More information about the 2006
shared task is available on the CoNLL-X web page.
The source data in this release
consists principally of news and journal texts. The individual data sets are
subsets of the following:
2006 CoNLL Shared Task - Arabic
& Czech is distributed via web download.
2015 Subscription Members will
automatically receive two copies of this corpus. 2015 Standard Members
may request a copy as part of their 16 free membership corpora. This data
is being made available at no cost to non-member organizations under a research license.
*
(2) 2006 CoNLL Shared Task - Ten Languages
consists of dependency treebanks in ten languages used as part of the CoNLL
2006 shared task on multi-lingual dependency parsing. The languages covered in
this release are: Bulgarian, Danish, Dutch, German, Japanese, Portuguese, Slovene,
Spanish, Swedish and Turkish.
This corpus is cross-listed and
jointly released with ELRA as ELRA-W0086.
The Conference
on Computational Natural Language Learning (CoNLL) is accompanied every year by a shared task intended to
promote natural language processing applications and evaluate them in a
standard setting. In 2006, the shared task was devoted to the parsing of
syntactic dependencies using corpora from up to thirteen languages. The task
aimed to define and extend the then-current state of the art in dependency
parsing, a technology that complemented previous tasks by producing a different
kind of syntactic description of input text. More information about the 2006
shared task is available on the CoNLL-X web page.
The source data in the treebanks in
this release consists principally of various texts (e.g., textbooks, news,
literature) annotated in dependency format. In general, dependency grammar is
based on the idea that the verb is the center of the clause structure and that
other units in the sentence are connected to the verb as directed links or
dependencies.
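As a rough illustration of the head-and-dependent idea described above, the sketch below reads one toy sentence and recovers its root and the words attached to it. It assumes the tab-separated, ten-column layout used for the CoNLL-X shared task (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), which this newsletter does not itself specify. The sample sentence, its tags and the function names are hypothetical and are not taken from the corpus.

```python
# Assumed column order (CoNLL-X style, not stated in this newsletter):
# ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL PHEAD PDEPREL
# The sentence and labels below are invented for illustration only.
SAMPLE = "\n".join([
    "1\tThe\tthe\tD\tD\t_\t2\tdet\t_\t_",
    "2\tdog\tdog\tN\tN\t_\t3\tsubj\t_\t_",
    "3\tchased\tchase\tV\tV\t_\t0\troot\t_\t_",
    "4\tthe\tthe\tD\tD\t_\t5\tdet\t_\t_",
    "5\tcat\tcat\tN\tN\t_\t3\tobj\t_\t_",
])


def parse_sentence(block):
    """Turn one sentence block into a list of token dictionaries."""
    tokens = []
    for line in block.splitlines():
        if not line.strip():
            continue  # skip blank lines
        cols = line.split("\t")
        tokens.append({
            "id": int(cols[0]),
            "form": cols[1],
            "head": int(cols[6]),   # 0 means the token has no head (root)
            "deprel": cols[7],
        })
    return tokens


def root_forms(tokens):
    """Word forms whose head is 0, i.e. the root(s) of the sentence."""
    return [t["form"] for t in tokens if t["head"] == 0]


def dependents_of(tokens, head_id):
    """Word forms directly linked to the token with the given ID."""
    return [t["form"] for t in tokens if t["head"] == head_id]


tokens = parse_sentence(SAMPLE)
print(root_forms(tokens))         # the verb sits at the root
print(dependents_of(tokens, 3))   # words linked directly to the verb
```

In this toy sentence the verb is the root and the two nouns depend on it, which matches the verb-centered view of clause structure described above.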
The individual data sets are:
- BulTreeBank (Bulgarian)
- The Danish Dependency Treebank (Danish)
- The Alpino Treebank (Dutch)
- The TIGER Corpus (German)
- Treebank Tuba-J/S (Japanese)
- Floresta Sinta(c)tica (Portuguese)
- Slovene Dependency Treebank, SDT V0.1 (Slovene)
- Cast3LB (Spanish)
- Talbanken05 (Swedish)
- METU-Sabanci Turkish Treebank (Turkish)
2015 Subscription Members will
automatically receive two copies of this corpus. 2015 Standard Members
may request a copy as part of their 16 free membership corpora. This data
is being made available at no cost to non-member organizations under a research license.
*
(3) GALE Phase 3 Chinese Broadcast News Speech
was developed by LDC and comprises approximately 150 hours of Mandarin
Chinese broadcast news speech collected in 2007 and 2008 by LDC and Hong Kong
University of Science and Technology (HKUST), Hong Kong, during Phase 3 of the
DARPA GALE (Global Autonomous Language Exploitation) Program.
Corresponding transcripts are
released as GALE Phase 3 Chinese Broadcast News Transcripts (LDC2015T25).
The broadcast news recordings in
this release feature news broadcasts focusing principally on current events
from the following sources: Anhui TV, China Central TV (CCTV), Phoenix TV
and Voice of America (VOA).
This release contains 279 audio
files presented in FLAC-compressed
Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each
file was audited by a native Chinese speaker following Audit Procedure
Specification Version 2.0, which is included in this release.
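For a sense of scale, the audio parameters above can be turned into rough size and duration estimates. The sketch below is back-of-the-envelope arithmetic only: it takes the "approximately 150 hours" figure at face value, and the function names are hypothetical. The byte count is for uncompressed PCM; the FLAC files in the release will be smaller by an amount that depends on the audio.

```python
SAMPLE_RATE_HZ = 16000   # from the release description
BYTES_PER_SAMPLE = 2     # 16-bit PCM
CHANNELS = 1             # single-channel


def pcm_bytes(seconds):
    """Uncompressed PCM size in bytes for a given duration."""
    return int(seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


def mean_file_minutes(total_hours, n_files):
    """Average duration per file in minutes."""
    return total_hours * 60 / n_files


# One second of audio at these settings:
print(pcm_bytes(1))

# Approximate totals, assuming exactly 150 hours across 279 files:
print(pcm_bytes(150 * 3600) / 1e9, "GB uncompressed")
print(round(mean_file_minutes(150, 279), 1), "minutes per file on average")
```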
GALE Phase 3 Chinese Broadcast News
Speech is distributed via web download.
2015 Subscription Members will
automatically receive two copies of this corpus. 2015 Standard Members
may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.
*
(4) GALE Phase 3 Chinese Broadcast News Transcripts
was developed by LDC and contains transcriptions of approximately 150 hours of
Chinese broadcast news speech collected in 2007 and 2008 by LDC and Hong Kong
University of Science and Technology (HKUST), Hong Kong, during Phase 3 of the
DARPA GALE (Global Autonomous Language Exploitation) Program.
Corresponding audio data is released
as GALE Phase 3 Chinese Broadcast News Speech (LDC2015S13).
The broadcast news recordings for
transcription feature news broadcasts focusing principally on current events
from the following sources: Anhui TV, China Central TV (CCTV), Phoenix TV
and Voice of America (VOA).
The transcript files are in
plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed
data totals 1,933,695 tokens. The transcripts were created with the
LDC-developed transcription tool, XTrans, a
multi-platform, multilingual, multi-channel transcription tool that supports
manual transcription and annotation of audio recordings.
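Because the transcripts are plain-text, tab-delimited and UTF-8 encoded, they can be read with standard tools. The following is a minimal sketch, not a description of the actual TDF layout: the column order in the sample (file, start, end, speaker, transcript), the ";;" comment prefix and the function name read_tdf are all assumptions made for illustration, so the release documentation should be checked for the real field list.

```python
import csv
import io


def read_tdf(text, comment_prefix=";;"):
    """Split tab-delimited text into rows of fields.

    Blank lines and lines starting with comment_prefix are skipped.
    The comment prefix is an assumption, not taken from the release notes.
    """
    rows = []
    reader = csv.reader(io.StringIO(text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields or fields[0].startswith(comment_prefix):
            continue
        rows.append(fields)
    return rows


# Hypothetical sample with an invented column order:
# file, start time, end time, speaker, transcript
sample = (
    "a.flac\t0.0\t1.5\tspk1\t你好\n"
    ";;a comment line\n"
    "\n"
    "b.flac\t1.5\t3.0\tspk2\t谢谢\n"
)

for row in read_tdf(sample):
    print(row[0], row[3], row[4])
```

When reading a real file, opening it with `open(path, encoding="utf-8", newline="")` and passing its contents to the function keeps the Chinese text intact.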
The files in this corpus were
transcribed by LDC staff and/or by transcription vendors under contract to LDC.
Transcribers followed LDC's quick transcription guidelines (QTR) and quick rich
transcription specification (QRTR), both of which are included in the
documentation with this release.
GALE Phase 3 Chinese Broadcast News
Transcripts is distributed via web download.
2015 Subscription Members will
automatically receive two copies of this corpus. 2015 Standard Members
may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.