In this newsletter:
Fall 2016 LDC Data Scholarship recipients
Chilin HK and LDC partner on distribution of parallel patent data
New publications:
Fall 2016 LDC Data Scholarship recipients
Congratulations to the recipients of LDC's Fall 2016 data scholarships:
Tiba Zaki Abdulhameed: Western Michigan University (USA); PhD Candidate, Computer Science. Tiba is awarded copies of GALE Phase 2 Arabic Broadcast Conversation Speech and Transcripts for her research in dialectal ASR.
Abhishek Abhishek: Indian Institute of Technology Guwahati (India); PhD Candidate, Computer Science and Engineering. Abhishek is awarded copies of ACE 2004 Multilingual Training Corpus and The New York Times Annotated Corpus for his research in coreference resolution and relation extraction.
Sara Ebrahim: Ain Shams University (Egypt); MSc, Computer Science. Sara is awarded copies of LDC Standard Arabic Morphological Analyzer and NIST OpenMT 2008 Evaluation Selected References and System Translations for her work in machine translation.
Katherine Metcalf: Indiana University (USA); PhD Candidate, Computer Science. Katherine is awarded a copy of Emotional Prosody Speech and Transcripts for her research in acoustic/prosodic approaches to classifying emotional states.
Mousmita Sarma: Gauhati University (India); Post-Masters Research, Electronics and Communications Technology. Mousmita is awarded copies of Switchboard-1 Release 2 and IARPA Babel Assamese Language Pack for her research in Assamese dialect identification.
For program information, visit the Data Scholarship page.
Chilin HK and LDC partner on distribution of parallel patent data
Chilin HK Limited (Chilin) and LDC are pleased to announce
that the parallel data source developed by Chilin, A Corpus of Chinese-English
Parallel Sentences Extracted from Patents, is now available through the LDC
Catalog. This is a special release in addition to the LDC scheduled corpora for
membership year 2016, available under separate terms.
The Chilin corpus derives primarily from the training corpus and
test sets developed specifically for the Tokyo-based NTCIR 2009 & 2010
competitions on patent machine translation (Patent MT), which drew more
than 30 international teams:
NTCIR-9: http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings9/NTCIR/01-NTCIR9-PATENTMT-GotoI.pdf
The training corpus is drawn from a much larger curated
corpus of parallel Chinese-English sentences and sentence fragments. These
were in turn winnowed from an even larger corpus of more than 300k parallel
Chinese-English patents in different fields, work that began at the Research
Centre on Language Information Sciences, City University of Hong Kong
(authors: Benjamin Tsou, Bin Lu, and Kapo Chow). This
data set is available from LDC under the following reference:
Not-for-profit organizations may license this data set for
US$25.00 under the LDC Not-for-Profit Membership Agreement or under the LDC
User Agreement for Non-Members for use in linguistic research, education and
non-commercial technology development. For-profit organizations may license
this data for US$5000, discounted to US$4000 for LDC for-profit members, under
a commercial license.
New Corpora
(1) IARPA Babel Turkish Language Pack IARPA-babel105b-v0.5 was developed by Appen for the IARPA (Intelligence
Advanced Research Projects Activity) Babel program. It contains approximately
213 hours of Turkish conversational and scripted telephone speech collected in
2012 along with corresponding transcripts.
The Babel program focuses on underserved languages and
seeks to develop speech recognition technology that can be rapidly applied to
any human language to support keyword search performance over large amounts of
recorded speech.
The Turkish
speech in this release represents that spoken in seven dialect regions in
Turkey. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 70
years. Calls were made using different telephones (e.g., mobile,
landline) from a variety of environments including the street, a home or
office, a public place, and inside a vehicle.
Transcripts are encoded in UTF-8.
IARPA Babel Turkish Language Pack IARPA-babel105b-v0.5 is distributed
via web download.
2016 Subscription Members will receive two copies of this
corpus provided they have submitted
a completed copy of the special license agreement. 2016 Standard Members
may request a copy as part of their 16 free membership corpora. Non-members may
license this data for a fee.
(2) KAFD: Arabic Font Database was developed by King Fahd University of Petroleum & Minerals and Qassim University. It comprises approximately 2.5 million scanned pages of printed Arabic text in a variety of fonts, sizes and resolutions, along with corresponding transcripts. KAFD was designed for research in Arabic text recognition.
The scanned Arabic texts were collected from publications covering various subjects such as religion, medicine, science and history. Texts were printed in 40 different fonts, 10 sizes and four styles. Scans were made at 100, 200, 300 and 600 dpi (dots per inch).
The database is available in two formats: at the page level and at the line level. Images are presented as TIFF images and transcripts are in plain text format. Individual font folders are compressed into RAR archives.
The data is divided into training, validation and test sets.
2016 Subscription Members will automatically receive two
copies of this corpus. 2016 Standard Members may request a copy as part
of their 16 free membership corpora. Non-members may license this data
for a fee.
(3) Richer Event Description was developed by the University of Colorado Boulder-CLEAR (Computational Language and Education Research), Carnegie Mellon University and LDC. It consists of coreference, bridging and event-event relations (temporal, causal, subevent and reporting relations) annotations over 95 English newswire, discussion forum and narrative text documents, covering all events, times and non-eventive entities within each document.
RED annotation is intended to join different annotation layers and to provide a rich representation of event phenomena.
Documents were annotated twice -- in a markable pass and in an event annotation phase. Annotation and source documents are divided into three partitions: (1) 20 newswire summarization documents, (2) 20 discussion forum documents and newswire annotations used in the original RED pilot annotations, and (3) 55 documents annotated by a range of DEFT (Deep Exploration and Filtering of Text) annotation formats. Data is presented as UTF-8 encoded XML and plain text.
2016 Subscription Members will automatically receive two
copies of this corpus. 2016 Standard Members may request a copy as part
of their 16 free membership corpora. Non-members may license this data
for a fee.