Linguistic Data Consortium: September 2023

Friday, September 15, 2023

LDC September 2023 Newsletter

LDC data and commercial technology development

New publications:

CALLFRIEND Russian Speech

CALLFRIEND Russian Text
________________________________________________________________

LDC data and commercial technology development

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

CALLFRIEND Russian Speech was developed by LDC and consists of 48 hours of telephone conversations (100 recordings) between native speakers of Russian. The calls were recorded in 1999 as part of the CALLFRIEND collection, a project designed primarily to support research in automatic language identification. One hundred native Russian speakers living in the continental United States each made a single phone call, lasting up to 30 minutes, to a family member or friend living in the United States.

All recordings involved domestic calls routed through LDC’s automated telephone collection platform and were stored as 2-channel (4-wire), 8-kHz mu-law samples taken directly from a public telephone network via a T-1 circuit. Each audio file in this release is a FLAC-compressed MS-WAV (RIFF) file containing 2-channel, 8-kHz, 16-bit PCM sample data.
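To illustrate the relationship between the 8-bit mu-law samples captured from the network and the 16-bit PCM data in the released files, here is a minimal Python sketch of the standard G.711 mu-law expansion. It is an assumption that a conversion of this kind was used; the announcement does not describe LDC's actual processing, and the function names are hypothetical.

```python
from typing import List


def mulaw_to_pcm16(code: int) -> int:
    """Expand one 8-bit G.711 mu-law code word to a signed linear sample.

    Assumption: standard G.711 mu-law expansion. The result always fits
    in a signed 16-bit integer (range -32124..32124).
    """
    u = ~code & 0xFF              # mu-law code words are stored bit-inverted
    sign = u & 0x80               # top bit: 1 means negative
    exponent = (u >> 4) & 0x07    # 3-bit segment number
    mantissa = u & 0x0F           # 4-bit step within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_mulaw(raw: bytes) -> List[int]:
    """Decode a byte string of mu-law code words into linear samples."""
    return [mulaw_to_pcm16(b) for b in raw]
```

For example, `decode_mulaw(bytes([0xFF, 0x80, 0x00]))` yields silence followed by the largest positive and largest negative values the encoding can represent.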

This release also includes call metadata, such as speaker gender, the number of speakers on each channel and call duration.

Corresponding transcripts and a lexicon are available in CALLFRIEND Russian Text (LDC2023T09).

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

CALLFRIEND Russian Text contains the corresponding transcripts and a lexicon for CALLFRIEND Russian Speech, that is, 48 hours of telephone conversations (100 recordings) between native Russian speakers.

Each line of a transcript has four main fields (begin_offset, end_offset, speaker_label, transcript_text) separated by tabs. Each transcript file contains a list of time-stamped segments ordered by their begin_offset values, with no blank lines.
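The transcript layout described above can be read with a few lines of Python. This is a minimal sketch, not LDC tooling: the `Segment` type and `parse_transcript` function are hypothetical names, and it is an assumption that the offsets are decimal numbers (the announcement does not state their unit or precision). Any text after the third tab is kept as part of transcript_text.

```python
from typing import Iterable, List, NamedTuple


class Segment(NamedTuple):
    begin_offset: float   # assumed numeric; unit not stated in the announcement
    end_offset: float
    speaker_label: str
    transcript_text: str


def parse_transcript(lines: Iterable[str]) -> List[Segment]:
    """Parse tab-separated transcript lines into Segment records.

    Raises ValueError on a blank or malformed line, or if segments are
    not ordered by begin_offset, since the format rules out both.
    """
    segments: List[Segment] = []
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        fields = line.split("\t", 3)  # at most 4 fields; extra tabs stay in the text
        if len(fields) != 4:
            raise ValueError(f"line {line_no}: expected 4 tab-separated fields")
        begin, end, speaker, text = fields
        segment = Segment(float(begin), float(end), speaker, text)
        if segments and segment.begin_offset < segments[-1].begin_offset:
            raise ValueError(f"line {line_no}: segments out of order")
        segments.append(segment)
    return segments


# Invented sample lines for illustration only; not taken from the corpus.
sample = [
    "0.00\t1.25\tA\tалло",
    "1.30\t2.80\tB\tпривет",
]
segments = parse_transcript(sample)
```

To read a real file, pass an open file handle, for example `parse_transcript(open(path, encoding="utf-8"))`; the UTF-8 encoding is likewise an assumption to verify against the corpus documentation.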

The lexicon covers the word forms in the 97 transcript files. The main lexicon table contains three columns per row: Cyrillic orthography, phonetic transliteration and numeric representation of syllabic stress.

Corresponding speech data is available as CALLFRIEND Russian Speech (LDC2023S08).

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.