Monday, June 18, 2018

LDC 2018 June Newsletter

LDC Catalog certified as CoreTrustSeal data repository

LDC data and commercial technology development

New Publications:
IARPA Babel Cebuano Language Pack IARPA-babel301b-v2.0b
__________________________________________________________________________

LDC Catalog certified as CoreTrustSeal data repository
LDC is pleased to announce that the Catalog has been awarded the CoreTrustSeal for recognition as a trustworthy data repository. This means that the Catalog meets a series of standards covering data access, rights management, curation, and storage developed by the ICSU World Data System and the Data Seal of Approval. LDC joins the other 136 certified repositories around the globe in the commitment to promote sustainable and trustworthy data infrastructures.

LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

(1) BOLT Chinese SMS/Chat was developed by LDC and consists of naturally-occurring Short Message Service (SMS) and Chat (CHT) data collected through data donations and live collection involving native speakers of Chinese. The corpus contains 14,877 conversations totaling 3,005,810 words across 497,543 messages.

The BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources – discussion forums, text messaging, and chat – in Chinese, Egyptian Arabic, and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking, and co-reference. The data in this release was collected using two methods: new collection via LDC's collection platform, and donation of SMS or chat archives from BOLT collection participants.

BOLT Chinese SMS/Chat is distributed via web download.

2018 Subscription Members will automatically receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) Multi-Language Conversational Telephone Speech 2011 -- Central European was developed by LDC and is comprised of approximately 44 hours of telephone speech in two distinct language varieties of Central Europe: Czech and Slovak.

The data were collected primarily to support research and technology evaluation in automatic language identification, and portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation (LRE). Participants were recruited by native speakers who contacted acquaintances in their social network. Those native speakers made one call, up to 15 minutes, to each acquaintance. Human auditors labeled the calls for callee gender, dialect type, and noise.

LDC has also released the following as part of the Multi-Language Conversational Telephone Speech 2011 series:
·        Slavic Group (LDC2016S11)
·        Turkish (LDC2017S09)
·        South Asian (LDC2017S14)
·        Central Asian (LDC2018S03)

Multi-Language Conversational Telephone Speech 2011 -- Central European is distributed via web download.

2018 Subscription Members will automatically receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) TAC KBP English Entity Linking - Comprehensive Training and Evaluation Data 2009-2013 was developed by LDC and contains training and evaluation data produced in support of the TAC KBP English Entity Linking tasks in 2009, 2010, 2011, 2012, and 2013. It includes queries and gold standard entity type information, Knowledge Base links, and equivalence class clusters for NIL entities. Also included are the source documents for the queries, specifically, English newswire, discussion forum, and web data. The corresponding knowledge base is available as TAC KBP Reference Knowledge Base (LDC2014T16). Also included in this package are the results of an Entity Linking IAA (Inter-Annotator Agreement) study conducted in 2010.

TAC KBP encourages the development of systems that can match entities mentioned in natural texts with those appearing in a knowledge base and extract novel information about entities from a document collection and add it to a new or existing knowledge base. English Entity Linking was first conducted as part of the 2009 TAC KBP evaluations. Its goal is to measure systems' ability to determine whether an entity, specified by a query, has a matching node in a reference knowledge base (KB) and, if so, to create a link between the two. If there is no matching node for a query entity in the KB, EL systems are required to cluster the mention together with others referencing the same entity.

TAC KBP English Entity Linking - Comprehensive Training and Evaluation Data 2009-2013 is distributed via web download.

2018 Subscription Members will automatically receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(4) IARPA Babel Cebuano Language Pack IARPA-babel301b-v2.0b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 191 hours of Cebuano conversational and scripted telephone speech collected in 2013 and 2014 along with corresponding transcripts.

The Cebuano speech in this release represents that spoken in the Cebu-North Kana, Sialo, and Mindanao dialect regions of the Philippines. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 75 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Cebuano Language Pack IARPA-babel301b-v2.0b is available via web download.

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.


Tuesday, May 15, 2018

LDC 2018 May Newsletter


New Publications:
__________________________________________________________

New publications:

(1) Rhythm and Pitch contains approximately 27 minutes of spontaneous English conversations and radio news stories annotated with the Rhythm and Pitch (RaP) scheme. Speech data for annotation was taken from two corpora released by LDC, CALLHOME American English Speech (LDC97S42) and Boston University Radio Speech Corpus (LDC96S36).

The RaP system permits the capture of both intonational and rhythmic aspects of speech. Four labeling tiers are used for annotating speech prosody. These tiers carry information about the syllabic organization and orthography of the speech, its rhythmic structure, tonal patterns, and other information. More information about the RaP system is available on the RaP homepage.

Speech data are presented as flac compressed 16-bit wav files. The Boston data are one channel 16kHz files, while the CALLHOME data are either one or two channel 8kHz files. Annotations are UTF-8 encoded Praat TextGrids.

Rhythm and Pitch is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) GALE Phase 4 Arabic Broadcast News Speech was developed by LDC and is comprised of approximately 37 hours of Arabic broadcast news speech collected in 2008 and 2009 by LDC and MediaNet, Tunis, Tunisia and MTC, Rabat, Morocco during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

Corresponding transcripts are released as GALE Phase 4 Arabic Broadcast News Transcripts (LDC2018T14).

The recordings in this release feature news broadcasts focusing principally on current events from the following sources: Abu Dhabi TV, a television station based in Abu Dhabi, United Arab Emirates; Al Arabiya, a news television station based in Dubai; Al Baghdadya, an Iraqi broadcast programmer; Alhurra, a U.S. government-funded regional broadcaster; Al Iraqiyah, an Iraqi television station; Aljazeera, a regional broadcaster located in Doha, Qatar; Al Ordiniyah, a national broadcast station in Jordan; Kuwait TV, a national broadcast station based in Kuwait; Radio Sawa, a U.S. government-funded regional broadcaster; Saudi TV, a national television station based in Saudi Arabia; Syria TV, the national television station in Syria; and Yemen TV, a television station based in Yemen.

This release contains 51 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Arabic speaker following Audit Procedure Specification Version 2.0, which is included in this release.
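For readers planning storage, the audio parameters above give a rough upper bound on the decompressed size of the release. The sketch below is only back-of-the-envelope arithmetic: the function name is hypothetical, the 37-hour figure is the approximate duration quoted for this corpus, and container headers are ignored.

```python
def pcm_size_bytes(hours, sample_rate_hz=16000, bits_per_sample=16, channels=1):
    """Estimate the raw PCM payload size for audio of the given duration.

    Defaults mirror the format described above (16000 Hz, 16-bit, single channel).
    File headers and FLAC compression are deliberately ignored.
    """
    seconds = hours * 3600
    bytes_per_sample = bits_per_sample // 8
    return int(seconds * sample_rate_hz * bytes_per_sample * channels)


# Approximate duration of the release, as stated in the corpus description.
estimated = pcm_size_bytes(37)
print(f"about {estimated / 1e9:.2f} GB uncompressed")
```

The FLAC files as distributed will be smaller than this estimate; how much smaller depends on the material.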

GALE Phase 4 Arabic Broadcast News Speech is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) GALE Phase 4 Arabic Broadcast News Transcripts was developed by LDC and contains transcriptions of approximately 37 hours of Arabic broadcast news speech collected in 2008 and 2009 by the Linguistic Data Consortium (LDC), MediaNet, Tunis, Tunisia and MTC, Rabat, Morocco during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) program.

Corresponding audio data is released as GALE Phase 4 Arabic Broadcast News Speech (LDC2018S05).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 204,735 tokens. The transcripts were created with the LDC tool XTrans, which supports manual transcription and annotation of audio recordings.
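As an illustration of working with tab-delimited, UTF-8 transcripts like these, the following minimal sketch reads rows with Python's standard csv module and counts whitespace-separated tokens in one column. Several details are assumptions rather than facts from the release: the function names, the idea that the file starts with one header row, the `;;` comment prefix, and the position of the transcript column. Consult the corpus documentation for the actual layout. Whitespace tokenization may also differ from the method behind the 204,735-token figure.

```python
import csv
import io


def iter_tdf_rows(stream, has_header=True, comment_prefix=";;"):
    """Yield data rows from a tab-delimited stream.

    Skips empty lines, lines starting with comment_prefix (assumed convention),
    and the first remaining row when has_header is True.
    """
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    header_pending = has_header
    for row in reader:
        if not row or row[0].startswith(comment_prefix):
            continue
        if header_pending:
            header_pending = False
            continue
        yield row


def count_tokens(stream, transcript_col, **kwargs):
    """Count whitespace-separated tokens in the given column of every data row."""
    return sum(
        len(row[transcript_col].split())
        for row in iter_tdf_rows(stream, **kwargs)
        if len(row) > transcript_col
    )


# Tiny made-up sample; real files would be opened with
# open(path, encoding="utf-8", newline="").
sample = (
    "file\tstart\tend\ttranscript\n"
    ";; a comment line\n"
    "show1.flac\t0.0\t2.5\tmarhaban bikum\n"
    "show1.flac\t2.5\t4.0\tahlan\n"
)
print(count_tokens(io.StringIO(sample), transcript_col=3))
```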

GALE Phase 4 Arabic Broadcast News Transcripts is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.


Friday, April 13, 2018

LDC 2018 April Newsletter


LDC at ICASSP 2018

LDC at the Philadelphia Science Carnival

New Publications:
_____________________________________________________________________

LDC at ICASSP 2018
LDC will be exhibiting at ICASSP 2018, held this year April 15-20 in Calgary, Canada. Stop by booth B2 to learn more about recent developments at the Consortium and new publications.

Also, be on the lookout for the following presentations featuring LDC work:

Enhancement and Analysis of Conversational Speech: JSALT 2017
Tuesday, April 17, 16:00 - 18:00
Session: Speech Analysis

Leveraging LSTM Models for Overlap Detection in Multi-Party Meetings
Wednesday, April 18, 13:30 - 15:30
Session: Speaker Diarization & Identification

A Novel LSTM-based Speech Preprocessor for Speaker Diarization in Realistic Mismatch Conditions
Wednesday, April 18, 13:30 - 15:30
Session: Speaker Diarization & Identification

LDC will post conference updates via our Twitter feed and Facebook page. We hope to see you there!  

LDC at the Philadelphia Science Carnival
LDC will share the fun of language with the community on Saturday, April 28, with a booth at the Philadelphia Science Carnival. Visitors will enjoy three language-oriented educational activities that include a language identification game and Chinese character recognition.

The Philadelphia Science Carnival is an annual event organized by Philadelphia’s Franklin Institute to acquaint children and adults with the joys of science.


New publications:

(1) Concretely Annotated New York Times was developed by Johns Hopkins University's Human Language Technology Center of Excellence. It adds multiple kinds and instances of automatically-generated syntactic, semantic, and coreference annotations to The New York Times Annotated Corpus (LDC2008T19). Concrete is a schema for representing structured, hierarchical, and overlapping linguistic annotations. This release provides multiple tool outputs producing the same annotation types as different annotation theories under a shared tokenization. Concretely Annotated New York Times contains all of the 1.8 million articles in The New York Times Annotated Corpus.

Concretely Annotated New York Times is distributed via hard drive.

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Any organization that licensed The New York Times Annotated Corpus (LDC2008T19) may request a copy of Concretely Annotated New York Times (LDC2018T12) for a $250 media fee.  Non-members may license this data for a fee.

*

(2) H2, E2, ERK1 Children's Writing was developed by the Cooperative State University Baden-Württemberg, University of Education. It consists of approximately 2,000 texts written over four months by 173 German school children aged six through eleven. The data in this corpus was collected by elementary schools in Baden-Württemberg, Germany, and digitized at the Cooperative State University during the 2016/2017 school year. Three second, third, and fourth grade classrooms participated in the collection. Texts were written within regular class settings. The students were presented with a picture and were asked to write a story to describe the picture or, if unable to write a text, to list what they saw in the picture.

There were 173 total participants. 100 students were multilingual, and further metadata is available for 166 of the 173 children. The following is included for each text in the database: school week of collection; school type; age; gender; grade/classroom; language spoken at home; and school materials used.

LDC has also released H1 Children's Writing (LDC2016T01).

H2, E2, ERK1 Children's Writing is distributed via web download.

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) TRAD Arabic-French Parallel Text -- Newsgroup was developed by ELDA as part of the PEA-TRAD project. It contains French translations of a subset of approximately 10,000 Arabic words from GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 (LDC2009T03). The PEA-TRAD project (Translation as a Support for Document Analysis) was supported by the French Ministry of Defense (DGA). Its purpose was to develop speech-to-speech translation technology for multiple languages (e.g., Arabic, Chinese, Pashto) from a variety of domains. This release consists of 398 segments (translation units) from 17 documents. The source data is Arabic newsgroup text collected and translated into English by LDC for the DARPA GALE (Global Autonomous Language Exploitation) program.

LDC has also released TRAD Chinese-French Parallel Text -- Blog (LDC2018T02).

TRAD Arabic-French Parallel Text -- Newsgroup is distributed via web download.

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Thursday, March 15, 2018

LDC March 2018 Newsletter


New Publications:
______________________________________________________________________

New publications:

(1) BOLT Arabic Discussion Forums was developed by LDC and consists of 813,080 discussion forum threads in Egyptian Arabic harvested from the Internet using a combination of manual and automatic processes. The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. The material in this release represents the unannotated Arabic source data in the discussion forum genre.

Collection was seeded based on the results of manual data scouting by native speaker annotators. Scouts were instructed to seek content in Egyptian Arabic that was original, interactive and informal. Upon locating an appropriate thread, scouts submitted the URL and some simple judgments about it to a database, via a web browser plug-in. The scale of the collection precluded manual review of all data. Only a small portion of the threads included in this release were manually reviewed, and it is expected that there may be some offensive or otherwise undesired content as well as some threads that contain a large amount of non-Arabic content. It should also be noted that many threads may contain a mixture of Egyptian and other varieties of Arabic, even among the threads that are primarily Arabic.

BOLT Arabic Discussion Forums is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) LORELEI Somali Representative Language Pack - Monolingual and Parallel Text was developed by LDC and is comprised of approximately 13 million words of monolingual Somali text, approximately 800,000 of which are translated into English. Another 100,000 words are also translated from English into Somali. The LORELEI (Low Resource Languages for Emergent Incidents) Program is concerned with building Human Language Technology for low resource languages in the context of emergent situations like natural disasters or disease outbreaks.

Data was collected in the following genres: discussion forums, news, reference, social network and weblog. Both monolingual text collection and parallel text creation involved a combination of manual and automatic methods, which are detailed in the included documentation. All harvested content was initially converted from its original HTML form into a relatively uniform XML format. Also included in this release are two tools: one to recreate original source data from the processed XML material and the other to condition text data users download from Twitter.

LORELEI Somali Representative Language Pack - Monolingual and Parallel Text is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) SPADE (Syntactic Phrase Alignment Dataset for Evaluation) consists of annotated parse trees and alignment on English sentential paraphrases extracted from machine translation evaluation corpora and separated into development and test sets.

Reference translations from machine translation evaluation corpora were used as sentential paraphrases. They were sourced from the following data sets released by LDC from the NIST (National Institute of Standards and Technology) open machine translation evaluation series (OpenMT): LDC2010T14, LDC2010T17, LDC2010T21, and LDC2013T03.

Reference translations of 10 to 30 words were randomly extracted for annotation from NIST OpenMT corpora. Gold standard annotations of HPSG (head-driven phrase structure grammar) trees and phrase alignments were performed, resulting in 20,276 phrases extracted from 201 sentential paraphrases and 15,721 paraphrase alignments.

SPADE is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Friday, February 16, 2018

LDC February 2018 Newsletter

Only two weeks left to enjoy 2018 membership discounts

Spring 2018 LDC Data Scholarship recipients
__________________________________________________________________________

Only two weeks left to enjoy 2018 membership discounts

There is still time to save on 2018 membership fees. Through March 1, all organizations receive a discount on the 2018 membership fee (up to 10%) when they choose to join or renew.

For more information on membership benefits, visit Join LDC.


Spring 2018 LDC Data Scholarship recipients

Congratulations to the recipients of LDC's Spring 2018 Data Scholarship:

Margarida Madaleno: London School of Economics, PhD Economic Geography. Madaleno is awarded a copy of Treebank 3 for her research in emotional well-being.

Gary Munnelly: Trinity College Dublin, PhD Computer Science and Statistics. Munnelly is awarded a copy of the New York Times Annotated Corpus for his research in named entity recognition and disambiguation in cultural heritage data sets.

Barlian Henryanu Prasetio: University of Miyazaki, PhD Environmental Robotics. Prasetio is awarded copies of SUSAS and SUSAS Transcripts for his work in voice stress recognition systems.

For information about the program, visit the Data Scholarship page.


LDC data and commercial technology development

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

(1) Multi-Language Conversational Telephone Speech 2011 – Central Asian was developed by LDC and is comprised of approximately 37 hours of telephone speech in three distinct language varieties of Central Asia: Dari, Farsi and Pashto.

The data were collected primarily to support research and technology evaluation in automatic language identification, and portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation (LRE). Participants were recruited by native speakers who contacted acquaintances in their social network. Those native speakers made one call, up to 15 minutes, to each acquaintance.

LDC has also released the following as part of the Multi-Language Conversational Telephone Speech 2011 series:
·        Slavic Group (LDC2016S11)
·        Turkish (LDC2017S09)
·        South Asian (LDC2017S14)

Multi-Language Conversational Telephone Speech 2011 – Central Asian is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) LORELEI Amharic Representative Language Pack - Monolingual and Parallel Text was developed by LDC and is comprised of approximately 25 million words of monolingual Amharic text, approximately 600,000 of which are translated into English. Another 80,000 words are also translated from English into Amharic. The LORELEI (Low Resource Languages for Emergent Incidents) Program is concerned with building human language technology for low resource languages in the context of emergent situations like natural disasters or disease outbreaks.

Data was collected in the following genres: discussion forums, news, reference, social network and weblog. Both monolingual text collection and parallel text creation involved a combination of manual and automatic methods, which are detailed in the included documentation. All harvested content was initially converted from its original HTML form into a relatively uniform XML format. Also included in this release are two tools: one to recreate original source data from the processed XML material and the other to condition text data users download from Twitter.

LORELEI Amharic Representative Language Pack - Monolingual and Parallel Text is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) TAC KBP Comprehensive English Source Corpora 2009-2014 was developed by LDC and contains the 3,877,207 English source documents used in support of the TAC KBP tasks from 2009-2014. Text Analysis Conference (TAC) is a series of workshops organized by the National Institute of Standards and Technology (NIST). TAC was developed to encourage research in natural language processing and related applications by providing a large test collection, common evaluation procedures, and a forum for researchers to share their results. Through its various evaluations, the Knowledge Base Population (KBP) track of TAC encourages the development of systems that can match entities mentioned in natural texts with those appearing in a knowledge base and extract novel information about entities from a document collection and add it to a new or existing knowledge base.

The source data consists of newswire, broadcast material, and web text collected by LDC. Documents are released as a collection of zip files for overall compactness, and ease and efficiency of use. When unpacked, the documents are all UTF-8 text files with a basic markup structure.
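Because the documents ship as zip archives of UTF-8 text, they can be read in place without unpacking everything to disk. The sketch below uses only Python's standard zipfile module; the function name and the example archive name are hypothetical, and the markup inside each document is left untouched since its structure is not described here.

```python
import zipfile


def iter_documents(zip_path):
    """Yield (member_name, text) for every file in a zip archive.

    Assumes every member is a UTF-8 text file, as the description above states;
    a member with a different encoding would raise UnicodeDecodeError.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            with archive.open(info) as handle:
                yield info.filename, handle.read().decode("utf-8")


# Usage sketch with a hypothetical archive name:
# for name, text in iter_documents("source_docs_part1.zip"):
#     print(name, len(text))
```

Reading one member at a time keeps memory use proportional to the largest single document rather than to the whole archive.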

TAC KBP Comprehensive English Source Corpora 2009-2014 is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(4) IARPA Babel Tok Pisin Language Pack IARPA-babel207b-v1.0e was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 200 hours of Tok Pisin conversational and scripted telephone speech collected in 2013 along with corresponding transcripts.

The Tok Pisin speech in this release represents that spoken in the Papuan dialect region of Papua New Guinea. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 65 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Tok Pisin Language Pack IARPA-babel207b-v1.0e is available via web download.

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Tuesday, January 16, 2018

LDC January 2018 Newsletter

Membership Discounts for MY2018 Still Available

New Publications:
___________________________________________________________________________

Membership Discounts for MY2018 Still Available
Join LDC while membership savings are still available. Now through March 1, 2018, renewing MY2017 members will receive a 10% discount off the membership fee. New or non-consecutive member organizations will receive a 5% discount. Membership remains the most economical way to access LDC releases. This year’s planned publications include Multilanguage Conversational Telephone Speech, IARPA Babel Language Packs (telephone speech and transcripts), DIRHA (Distant-speech Interaction for Robust Home Applications), TRAD (Chinese-French and Arabic-French parallel text), data from BOLT, DEFT, LORELEI, RATS and TAC KBP, and more. Browse the Members pages for details on membership options and benefits. 

New publications:

(1) DEFT Spanish Treebank was developed by LDC and the Language and Computation Center (CLiC), University of Barcelona. It contains treebank annotation of international Spanish newswire text and Latin American Spanish discussion forum data created for the DARPA Deep Exploration and Filtering of Text (DEFT) program. DEFT Spanish Treebank supported the program's goal of deep natural language understanding.

Newswire source files were selected from Spanish Gigaword Third Edition (LDC2011T12) and were manually sentence-segmented for DEFT. Discussion forum source files were selected from Spanish discussion forum source data collected by LDC, consisting of continuous multi-posts of 100-1000 words.

This release contains 114 files (54,394 tokens) of newswire data and 60 files (55,307 tokens) of discussion forum data, all of which were annotated with constituents and syntactic functions.

DEFT Spanish Treebank is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) DIRHA English WSJ Audio was developed as part of the Distant-Speech Interaction for Robust Home Applications (DIRHA) Project which addressed natural spontaneous speech interaction with distant microphones in a domestic environment. It is comprised of approximately 85 hours of real and simulated read speech by six native American English speakers. The target utterances were taken from CSR-I (WSJ0) Complete (LDC93S6A), specifically, the 5,000 word subset of read speech from Wall Street Journal news text.

Speech was collected in a real apartment setting with typical domestic background noise and inter/intra-room reverberation effects. Annotations, speaker metadata and images of the apartment setting are also included.

DIRHA English WSJ Audio is distributed via web download.

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) TRAD Chinese-French Parallel Text -- Blog was developed by ELDA as part of the PEA-TRAD project. It contains French translations of a subset of approximately 10,000 Chinese words from GALE Phase 1 Chinese Blog Parallel Text (LDC2008T06).

The PEA-TRAD project (Translation as a Support for Document Analysis) was supported by the French Ministry of Defense (DGA). Its purpose was to develop speech-to-speech translation technology for multiple languages (e.g., Arabic, Chinese, Pashto) from a variety of domains.

The source data for TRAD Chinese-French Parallel Text is Chinese blog text collected and translated into English by LDC for the DARPA GALE (Global Autonomous Language Exploitation) program. Information about the ELDA translation team, translation guidelines and validation results is contained in the documentation accompanying this release.

TRAD Chinese-French Parallel Text -- Blog is distributed via web download.

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Friday, December 15, 2017

LDC December 2017 Newsletter

Spring 2018 LDC Data Scholarship Program - deadline approaching

Lingo Boingo: a web portal to language games

__________________________________________________________________________

Spring 2018 LDC Data Scholarship Program - deadline approaching
Students can apply for the Spring 2018 Data Scholarship Program now through January 15, 2018. The LDC Data Scholarship program provides students with access to LDC data at no cost. For more information on application requirements and program rules, please visit the LDC Data Scholarships page.

Lingo Boingo: a web portal to language games
LDC is pleased to announce a new collaborative project, Lingo Boingo (https://lingoboingo.org/), a web portal that brings together new and existing language games that are fun to play and that provide useful annotations and judgments for linguistic research. Gamers and grammar lovers can choose from a list of challenging games, which will continue to expand through the efforts of LDC and external collaborators. For more information, contact jfiumara@ldc.upenn.edu. Start playing today!

Renew your LDC membership today
Membership Year 2018 (MY2018) is open for joining, and discounts are available for those who keep their membership current and join early in the year. Now through March 1, 2018, current MY2017 members who renew will receive a 10% discount on the membership fee. New or returning organizations will receive a 5% discount through March 1.

In addition to receiving new publications, current year LDC members also enjoy the benefit of licensing older data at reduced costs from our Catalog of over 700 holdings; current year for-profit members may use most data for commercial applications. Visit Join LDC for details on membership, user accounts and payment.

Plans for MY2018 publications are in progress. Among the expected releases are:
  • Multilanguage conversational telephone speech: developed to support language identification research in related languages (Central Asian, Central European language groups)
  • DIRHA (Distant-speech Interaction for Robust Home Applications):  Wall Street Journal read speech with noise and reverberation, suitable for various multi-microphone signal processing and distant speech recognition tasks
  • TRAD corpora: Chinese-French and Arabic-French parallel text (newswire, web data)
  • IARPA Babel Language Packs (telephone speech and transcripts): languages include Cebuano, Guarani, Kazakh, Lithuanian, Telugu, Tok Pisin
  • BOLT: discussion forum, SMS, word-aligned, and tagged data in all languages (Egyptian Arabic, English, Chinese)
  • DEFT: Spanish Treebank (newswire, web data)
  • RATS:  Language Identification data set (Dari, Farsi, Levantine Arabic, Pashto, Urdu; degraded audio signals)
  • TAC KBP: comprehensive English source and entity linked data (broadcast, telephone speech, newswire, web data)
  • German children’s handwriting: longitudinal study of weekly writing in classroom setting with enhanced output for specific spelling patterns

New publications:

(1) CHiME3 was developed as part of The 3rd CHiME Speech Separation and Recognition Challenge and contains approximately 342 hours of English speech and transcripts from noisy environments and 50 hours of noisy environment audio. The CHiME Challenges focus on distant-microphone automatic speech recognition (ASR) in real-world environments. CHiME3 involved two types of data: speech data recorded in very noisy environments (on a bus, in a cafe, in a pedestrian area, and at a street junction) and noisy utterances generated by artificially mixing clean speech data with noisy backgrounds.

Data is divided into training, development and test sets. All data is provided as 16-bit WAV files sampled at 16 kHz. The audio data consists of background noises, speech data enhanced with the baseline speech enhancement technique, unsegmented noisy speech data, and segmented noisy speech data.
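
As a quick sanity check after download, the stated audio format (16-bit WAV, 16 kHz) can be verified with Python's standard `wave` module. This is only a minimal sketch: the function names `describe_wav` and `matches_chime3_format` are hypothetical helpers, not part of any CHiME or LDC tooling, and the file path in the usage note is a placeholder.

```python
import wave


def describe_wav(path):
    """Return (sample_rate_hz, bits_per_sample, channels, duration_seconds)."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8  # sample width is reported in bytes
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, bits, channels, duration


def matches_chime3_format(path):
    """True if the file is 16-bit audio sampled at 16 kHz, as the release notes state."""
    rate, bits, _, _ = describe_wav(path)
    return rate == 16000 and bits == 16
```

For example, `matches_chime3_format("some_utterance.wav")` (placeholder file name) would return `False` for a file that had been resampled or converted to a different bit depth.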

LDC has also released two CHiME2 corpora -- CHiME2 Grid and CHiME2 WSJ0.

CHiME3 is distributed via USB drive.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*

(2) GALE Phase 4 Chinese Broadcast News Speech was developed by LDC and comprises approximately 134 hours of Mandarin Chinese broadcast news speech collected in 2008 by LDC and Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

Corresponding transcripts are released as GALE Phase 4 Chinese Broadcast News Transcripts (LDC2017T18).

The broadcast news recordings in this release feature news broadcasts focusing principally on current events from the following sources: China Central TV (CCTV), a national and international broadcaster in Mainland China; Phoenix TV, a Hong Kong-based satellite television station; and Voice of America (VOA), a U.S. government-funded broadcast programmer.

This release contains 256 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Chinese speaker following Audit Procedure Specification Version 2.0 which is included in this release.

GALE Phase 4 Chinese Broadcast News Speech is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee. 
*
 
(3) GALE Phase 4 Chinese Broadcast News Transcripts was developed by LDC and contains transcriptions of approximately 134 hours of Chinese broadcast news speech collected in 2008 by LDC and Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

Corresponding audio data is released as GALE Phase 4 Chinese Broadcast News Speech (LDC2017S25).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 1,696,879 tokens. The transcripts were created with the LDC tool, XTrans, which supports manual transcription and annotation of audio recordings.
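
Because the transcripts are plain-text, tab-delimited and UTF-8 encoded, they can be read with Python's standard `csv` module. The sketch below rests on assumptions that should be checked against the documentation shipped with the corpus: that header lines begin with `file;`, that comment lines begin with `;;`, and that the transcript text sits in the eighth column (index 7). The helper names `read_tdf` and `count_tokens` are hypothetical.

```python
import csv


def read_tdf(path, transcript_col=7, skip_prefixes=("file;", ";;")):
    """Read a tab-delimited transcript file and return its data rows.

    transcript_col and skip_prefixes are assumptions about the layout,
    not values confirmed by the release notes.
    """
    rows = []
    with open(path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE: treat quote characters in transcripts as ordinary text
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for fields in reader:
            if not fields or fields[0].startswith(skip_prefixes):
                continue  # blank, header, or comment line
            if len(fields) <= transcript_col:
                continue  # row too short to hold a transcript field
            rows.append(fields)
    return rows


def count_tokens(rows, transcript_col=7):
    """Count whitespace-separated tokens in the transcript column."""
    return sum(len(row[transcript_col].split()) for row in rows)
```

Whitespace splitting is only a rough proxy for tokenization, particularly for Chinese text, so a count produced this way should not be expected to reproduce the token total quoted above.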

GALE Phase 4 Chinese Broadcast News Transcripts is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.