Thursday, March 19, 2026

LDC March 2026 Newsletter

LDC data and commercial technology development 

New publications:

Ancient Chinese WordNet

CALLHOME Spanish Second Edition

CALLHOME Spanish Lexicon Second Edition

________________________________________________________________

LDC data and commercial technology development

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

Ancient Chinese WordNet was developed by Nanjing Normal University and contains lexical and semantic information for Ancient Chinese vocabulary from the Pre-Qin period (before 221 BCE). The WordNet comprises 38,781 word forms and 55,100 senses, each manually linked to a corresponding synset in Princeton WordNet 1.6 and covering 22 noun categories, 15 verb categories, and additional adjective and adverb categories. The Ancient Chinese WordNet project began in 2012 with the goal of creating a structured lexical database to support linguistic research and natural language processing applications involving historical Chinese language materials.

2026 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

*

CALLHOME Spanish Second Edition was developed by LDC and contains 38 hours of speech from 120 unscripted telephone conversations between native Spanish speakers. This publication is a re-release of the original CALLHOME Spanish collection, combining CALLHOME Spanish Speech (LDC96S35) and CALLHOME Spanish Transcripts (LDC96T17), with additional transcription and updated directory structure, file formats, and documentation.

This corpus contains the 120 calls from CALLHOME Spanish Speech, which represented training and development data and a subset of evaluation data. Participants spoke on topics of their choice in a single telephone call lasting up to 30 minutes. Calls were manually audited for language, recording quality, channel characteristics, dialect, and region. For this second edition, all audio was converted from SPHERE files to FLAC format, and the original training/development/test partitioning was removed.

This release also features revised transcripts conforming to updated LDC transcription guidelines that addressed normalization of annotation formats, standardization of speaker-produced and background noises, application of foreign-language marking, whitespace cleanup, and corrections and consistency fixes.

The CALLHOME series consists of telephone conversations and transcripts developed by LDC and Rutgers, The State University of New Jersey, in support of research in speaker identification, language identification and related technologies. Languages in the series include American English, Egyptian Arabic, German, Japanese, Mandarin Chinese, and Spanish.

2026 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

*

CALLHOME Spanish Lexicon Second Edition was developed by LDC and contains 45,547 Spanish words with morphological, phonological, stress and frequency information. This second edition updates file formats, directory structure and documentation. The first edition is available as CALLHOME Spanish Lexicon (LDC96L16). The words in the lexicon were derived from 80 transcripts representing unscripted telephone conversations between native Spanish speakers contained in CALLHOME Spanish Second Edition (LDC2026S04) and from various Spanish news texts.

The lexicon contains nine tab-separated information fields: (1) headword: orthographic form; (2) morph: morphological analysis of the headword; (3) pron: pronunciation of the headword; (4) stress: primary stress information of the word; (5) callh freq: frequency of the headword in CALLHOME transcripts; (6) madrid freq: frequency of the headword in Madrid Radio transcripts; (7) ap freq: frequency of the headword in Associated Press newswire; (8) reut freq: frequency of the headword in Reuters newswire; and (9) norte freq: frequency of the headword in El Norte newswire.
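For readers who plan to load the lexicon programmatically, the nine-field layout described above could be read along the following lines. This is a minimal sketch, not LDC's tooling: the Python names (`LexiconEntry`, `parse_entry`, and the attribute names) are hypothetical, the assumption that the five frequency fields hold integer counts should be checked against the corpus documentation, and the sample line is invented for illustration rather than taken from the data.

```python
from typing import NamedTuple


class LexiconEntry(NamedTuple):
    """One lexicon line; attribute names are hypothetical, chosen to mirror the field list."""
    headword: str     # (1) orthographic form
    morph: str        # (2) morphological analysis
    pron: str         # (3) pronunciation
    stress: str       # (4) primary stress information
    callh_freq: int   # (5) frequency in CALLHOME transcripts
    madrid_freq: int  # (6) frequency in Madrid Radio transcripts
    ap_freq: int      # (7) frequency in Associated Press newswire
    reut_freq: int    # (8) frequency in Reuters newswire
    norte_freq: int   # (9) frequency in El Norte newswire


def parse_entry(line: str) -> LexiconEntry:
    """Split one tab-separated line into its nine fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    # Assumption: the frequency fields are integer counts.
    freqs = [int(value) for value in fields[4:]]
    return LexiconEntry(*fields[:4], *freqs)


# Invented line for illustration only; the placeholder values are not real corpus data.
sample = "casa\tMORPH\tPRON\tSTRESS\t12\t3\t40\t25\t7"
entry = parse_entry(sample)

# Combined frequency across the four non-CALLHOME sources.
non_callhome_freq = (
    entry.madrid_freq + entry.ap_freq + entry.reut_freq + entry.norte_freq
)
```

Keeping the first four fields as plain strings avoids guessing at the internal notation of the morphological, pronunciation, and stress fields, which is defined in the corpus documentation.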

This release also includes a pronunciation dictionary derived from the lexicon in CMUdict format and the grapheme-to-phoneme (G2P) tools used to automatically generate pronunciations for the original lexicon.

2026 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.
