Linguistic Data Consortium: November 2018

Join LDC for Membership Year 2019

Spring 2019 Data Scholarship Program

Commercial use and LDC data

New publications:

AISHELL-1

Avatar Education Portuguese

BOLT Egyptian Arabic Treebank - Discussion Forum

IARPA Babel Telugu Language Pack IARPA-babel303b-v1.0a

_________________________________________________________________

Join LDC for Membership Year 2019

Membership Year 2019 (MY2019) is open, and discounts are available for those who keep their membership current and join early in the year. Through March 1, 2019, current MY2018 members who renew their LDC membership will receive a 10% discount on the membership fee. New or returning organizations will receive a 5% discount through March 1.

In addition to receiving new publications, current LDC members also enjoy the benefit of licensing older data at reduced costs from our Catalog of over 750 holdings. Current-year for-profit members may use most data for commercial applications.

Plans for MY2019 publications are in progress. Among the expected releases are:

SRI Speech-Based Collaborative Learning Corpus: speech from over 100 US middle school students performing collaborative learning tasks; includes audio recordings, orthographic transcriptions, manual annotation of collaboration, and related documentation

Chinese Abstract Meaning Representation (AMR): developed by Nanjing Normal University and Brandeis University, semantic representation of approximately 10,000 Chinese sentences following the basic principles of AMR using web source data from Chinese Treebank 8.0 (LDC2013T21)

Multilanguage conversational telephone speech: developed to support language identification research in related languages (Arabic, East Asian, English, Mandarin)

TAC KBP: English entity discovery and linking, nugget detection and event argument data, Chinese slot-filling data

CALLFRIEND Second Edition: updated releases with .wav format audio, simplified directory structure and enhanced documentation and metadata (English, Egyptian Arabic, Mandarin Chinese-Taiwan)

HAVIC Med Progress Test data: English web video, metadata, and annotations for developing multimedia systems

IARPA Babel Language Packs (telephone speech and transcripts): languages include Amharic, Guarani, Igbo, and Lithuanian

BOLT: discussion forums, SMS, word-aligned and tagged data in all languages (Chinese, Egyptian Arabic, English)

It's also not too late to join for MY2017 (through December 31, 2018) and MY2018 (through December 31, 2019). Data sets from those years include 2010 NIST Speaker Recognition Evaluation Test Set, RATS Keyword Spotting and Language Identification releases, CHiME, Noisy TIMIT Speech, Concretely Annotated New York Times and English Gigaword, DIRHA English WSJ Audio, LORELEI Amharic and Somali Language Packs, and DEFT Spanish Treebank. For full descriptions of all LDC data sets, browse our Catalog.

Visit Join LDC for details on membership, user accounts and payment.

Spring 2019 Data Scholarship Program

Applications are being accepted through January 15, 2019 for the Spring 2019 LDC Data Scholarship program, which provides university students with no-cost access to LDC data. Consult the LDC Data Scholarship page for more information about program rules and submission requirements.

Commercial use and LDC data

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

(1) AISHELL-1 was developed by Beijing Shell Shell Technology Co., Ltd. It contains approximately 520 hours of Chinese Mandarin speech from 400 speakers recorded simultaneously on three different devices with associated transcripts.

The goal of the collection was to support speech recognition system development in 11 domains, including smart homes, autonomous driving, entertainment, finance, and science and technology. Participants read 500 sentences covering the domains; sentences were chosen for their speech and phonetic characteristics. The speech was recorded in a quiet indoor environment on a high-fidelity microphone and two mobile phones (Android and iOS).

Speakers were recruited from different accent areas across China, including North, South and Yue-Gui-Min regions. There were 214 female speakers and 186 male speakers. Additional demographic information about the participants is included in this release.

AISHELL-1 is distributed via hard drive.

2018 Subscription Members will automatically receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) Avatar Education Portuguese was developed by the University of Pernambuco and consists of approximately 80 minutes of Brazilian Portuguese microphone speech with phonetic and orthographic transcriptions. The data was developed for Avatar Education, an animated virtual assistant designed to enhance communication and interaction in educational contexts, such as online learning.

The corpus contains 1,400 speakers (700 male, 700 female) who generated 1,400 utterances from read and spontaneous speech. Utterances were transcribed at the word level (without time alignments) and at the phoneme level (with time alignment labels).

Avatar Education Portuguese is distributed via web download.

(3) BOLT Egyptian Arabic Treebank - Discussion Forum was developed by LDC and consists of Egyptian Arabic web discussion forum data with part-of-speech annotation, morphology, gloss and syntactic tree annotation collected for the DARPA Broad Operational Language Translation (BOLT) Program.

The annotations in this release follow Penn Arabic Treebank (PATB) annotation guidelines. Two kinds of morphological analysis are synchronized in the corpus: LDC Standard Morphological Analyzer (SAMA) Version 3.1 (LDC2010L01) was used for Modern Standard Arabic tokens, and CALIMA (Columbia Arabic Language and dIalect Morphological Analyzer) was used for Egyptian Arabic tokens.

This release contains 440,448 tokens before clitics were split and 508,548 tree tokens after clitics were split for treebank annotation. The source material is web discussion forums collected by LDC from various sources.

The unannotated Egyptian Arabic source data is released as BOLT Arabic Discussion Forums (LDC2018T10).

BOLT Egyptian Arabic Treebank - Discussion Forum is distributed via web download.

(4) IARPA Babel Telugu Language Pack IARPA-babel303b-v1.0a was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 201 hours of Telugu conversational and scripted telephone speech collected in 2013 and 2014 along with corresponding transcripts.

The Telugu speech in this release represents that spoken in the Central, East, South and North Telugu dialect regions of India. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 65 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Telugu Language Pack IARPA-babel303b-v1.0a is available via web download.

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Linguistic Data Consortium

Thursday, November 15, 2018