Linguistic Data Consortium: March 2021

LDC data and commercial technology development

New publications:
Columbia Games Corpus
Global TIMIT Mandarin Chinese
BOLT Chinese Co-reference – Discussion Forum, SMS/Chat, and Conversational Telephone Speech

_________________________________________________________________________

LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:
(1) Columbia Games Corpus was developed by the Spoken Language Group, Columbia University and the Department of Linguistics, Northwestern University. It consists of approximately 10 hours of spontaneous English conversation from 13 subjects playing a series of computer games that required verbal communication to achieve joint goals of identifying and moving images on the screen to reach a combined number of points. This publication also includes corresponding manually time-aligned orthographic transcripts and annotation marking discourse and turn-taking.

Columbia Games Corpus is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) Global TIMIT Mandarin Chinese was developed by LDC and Shanghai Jiao Tong University and consists of five hours of read speech with corresponding transcripts; the sentences were drawn from Chinese Gigaword Fifth Edition (LDC2011T13). Each of the fifty speakers read 120 sentences: 20 sentences that were read by all speakers, 40 sentences that were each read by 10 speakers, and 60 sentences that were read by that speaker alone, for a total of 3220 sentence types.
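The total of 3220 sentence types follows from the reading design described above. The short Python sketch below reproduces the arithmetic; it assumes that every speaker read exactly 120 sentences and that each of the 10-speaker sentences was read by exactly 10 speakers. The function and variable names are illustrative only and are not part of the corpus or its documentation.

```python
# Illustrative arithmetic only; names here are hypothetical, not from the corpus.
SPEAKERS = 50


def count_sentence_types(speakers: int = SPEAKERS) -> dict:
    """Count distinct sentences (types) implied by the reading design."""
    # 20 sentences are the same for every speaker, so they add 20 types.
    read_by_all = 20
    # Each speaker reads 40 sentences that are shared with 9 other speakers,
    # so the readings divide by 10 to give the number of distinct sentences.
    read_by_ten = (40 * speakers) // 10
    # Each speaker reads 60 sentences that nobody else reads.
    read_by_one = 60 * speakers
    return {
        "read_by_all": read_by_all,
        "read_by_ten": read_by_ten,
        "read_by_one": read_by_one,
        "total": read_by_all + read_by_ten + read_by_one,
    }


if __name__ == "__main__":
    for group, count in count_sentence_types().items():
        print(f"{group}: {count}")
```

Under these assumptions the three groups contribute 20, 200, and 3000 sentence types respectively, which sum to the stated 3220.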

The corpus was recorded at Shanghai Jiao Tong University, China. Speakers (25 female, 25 male) were students at the university and had achieved Class 2 Level 1 or better on Putonghua Shuiping Ceshi (the national standard Mandarin proficiency test).

The Global TIMIT project aimed to create a series of corpora in a variety of languages that share the key features of the original TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1), which was designed for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems.

Global TIMIT Mandarin Chinese is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) BOLT Chinese Co-reference – Discussion Forum, SMS/Chat, and Conversational Telephone Speech was developed by Raytheon BBN Technologies and consists of co-reference annotation on Chinese informal text.

Co-reference annotation aims to fill in connections between specific mentions in the text that refer to the same entities and events in the discourse context. BOLT co-reference annotation was performed on BOLT treebank annotation (i.e., Chinese Treebank 9.0 (LDC2016T13)) and covers noun phrases (including proper nouns, nominals, pronouns, and null arguments), possessives, proper noun pre-modifiers, and verbs.

Discussion forum data was collected from the web using a combination of manual and automatic processes. SMS/Chat material was donated or collected via live platforms. Telephone speech data was taken from LDC's Chinese CALLHOME and CALLFRIEND telephone collections.

The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.

BOLT Chinese Co-reference – Discussion Forum, SMS/Chat, and Conversational Telephone Speech is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Linguistic Data Consortium

Monday, March 15, 2021
