Linguistic Data Consortium

Thursday, July 16, 2026

LDC July 2026 Newsletter

Fall 2026 LDC data scholarship program

New publications:

2012 NIST Speaker Recognition Evaluation Test Set

CALLHOME American English Second Edition

CALLHOME American English Lexicon (PRONLEX) Second Edition

___________________________________________________________________

Fall 2026 LDC data scholarship program

Student applications for the Fall 2026 LDC data scholarship program are being accepted now through September 15, 2026. This program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and letter of support from their advisor. For application requirements and program rules, visit the LDC Data Scholarship page.

New publications:

2012 NIST Speaker Recognition Evaluation Test Set was developed by LDC and NIST and contains 10,321 hours of English conversational telephone speech and in-person recorded studio sessions for evaluation and modeling, along with answer keys, trial files and documentation from the NIST-sponsored 2012 Speaker Recognition Evaluation (SRE12). SRE12 introduced a revised evaluation structure in which training data for target speakers was drawn from prior SRE corpora developed by LDC and was provided in advance of the evaluation period.

Test data was drawn from Mixer 7 English Speech (LDC2025S08) and REMIX Telephone Collection (LDC2023S09). Those datasets also provided segments for modeling data; other modeling segments were drawn from Mixer 3 Speech (LDC2023S02), Mixer 4 and 5 Speech (LDC2020S03), and Mixer 6 Speech (LDC2013S03). The test data contains English speech only; some non-English speech is contained in modeling segments.

This release comprises 130,844 test segments: 83,778 call segments and 47,066 interview segments. Modeling data consists of 46,948 segments.

2026 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

CALLHOME American English Second Edition was developed by LDC and contains 56 hours of speech from 120 unscripted telephone conversations between native American English speakers. This publication is a re-release of the original CALLHOME American English collection, combining CALLHOME American English Speech (LDC97S42) and CALLHOME American English Transcripts (LDC97T14), with additional transcription and updated directory structure, file formats, and documentation.

This release contains the 120 telephone conversations published in CALLHOME American English Speech which represented training and development data and a subset of evaluation data. Participants spoke on topics of their choice in a single telephone call lasting up to 30 minutes. Calls were manually audited for gender, language, recording quality, channel characteristics, dialect, and region. For this second edition, all audio was converted from SPHERE files to FLAC format, and the original training/development/evaluation partitioning was removed.

This release also features revised transcripts conforming to updated LDC transcription guidelines that addressed normalization of annotation formats, standardization of speaker-produced and background noises, application of foreign-language marking, whitespace cleanup, and corrections and consistency fixes.

The CALLHOME series consists of telephone conversations and transcripts developed by LDC and Rutgers, The State University of New Jersey, in support of research in speaker identification, language identification and related technologies. Languages in the series include American English, Egyptian Arabic, German, Japanese, Mandarin Chinese, and Spanish.

2026 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

CALLHOME American English Lexicon (PRONLEX) Second Edition was developed by LDC and contains 90,988 English words with citation-form pronunciations. This second edition updates file formats, directory structure and documentation. The first edition is available as CALLHOME American English Lexicon (PRONLEX) (LDC97L20).

The words in the lexicon were derived from Wall Street Journal text used in the continuous speech recognition publication series CSR-1 WSJ0 Complete (LDC93S6A), transcripts from the Switchboard telephone collection (LDC97S62), and transcripts representing unscripted telephone conversations between native American English speakers contained in CALLHOME American English Second Edition (LDC2026S08).

PRONLEX transcription is a phonemic transcription system designed to support speech recognition. It provides a consistent, simplified representation of how words are pronounced in standard American English: each word receives a single systematic base form, and pronunciation variation can be generated later through rules or modeling rather than listed for each word. The transcription was created using a modified ARPABET phoneme set.

The lexicon contains three tab-separated information fields: (1) word: orthographic representation of the word; (2) pron: transcribed citation-form pronunciations using the modified ARPABET phoneme set; and (3) comments: optional comment on the entry. It is presented as a UTF-8 encoded TSV file and includes a pronunciation dictionary derived from the lexicon in UTF-8 encoded CMUdict format.
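For readers planning to load the lexicon programmatically, the following is a minimal Python sketch of reading the three fields described above. It assumes one entry per line with no header row, and that an absent comment appears as a missing or empty third field; neither detail is confirmed here, so the corpus documentation should be consulted. The class name, function name and sample rows are hypothetical, and the phoneme symbols in the sample are placeholders rather than actual PRONLEX symbols.

```python
import csv
import io
from typing import NamedTuple, Optional


class PronlexEntry(NamedTuple):
    word: str
    pron: str
    comment: Optional[str]  # third field is optional per the description


def read_entries(handle):
    """Yield entries from a tab-separated stream of word, pron[, comments].

    Assumption: no header row, one entry per line.
    """
    # QUOTE_NONE keeps apostrophes and quote marks in words as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or not row[0]:
            continue  # skip blank lines
        word = row[0]
        pron = row[1] if len(row) > 1 else ""
        comment = row[2] if len(row) > 2 and row[2] else None
        yield PronlexEntry(word, pron, comment)


# Made-up rows for illustration only; not taken from the corpus.
SAMPLE = "word_a\tX Y Z\nword_b\tX Y\tillustrative comment\n"

entries = list(read_entries(io.StringIO(SAMPLE)))
for entry in entries:
    print(entry)
```

With a real file, the same function could be given a handle opened with `open(path, encoding="utf-8", newline="")`, where `path` is wherever the TSV file was saved.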

2026 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.

Monday, June 15, 2026

LDC June 2026 Newsletter

Maintaining LDC organization user accounts

LDC data and commercial technology development

New publications:

KAIROS Phase 1 Evaluation Source Data, Annotation, and Assessment

Multi-Language Conversational Telephone Speech 2014 – Spanish & Portuguese

LORELEI Multiway Translated Text

________________________________________________________________

Maintaining LDC organization user accounts

LDC encourages organization account administrators to review their LDC organization user accounts at least annually to remove users who are no longer affiliated with the organization. Users no longer affiliated with an organization cannot continue to access LDC data through the organization’s LDC account. As stated in LDC’s membership agreements and license agreements, LDC data cannot be shared outside the member/licensing organization. LDC reserves the right to deactivate user accounts if any suspicious activity is detected. Visit the User Accounts page for further information on user types and privileges.

LDC data and commercial technology development

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

KAIROS Phase 1 Evaluation Source Data, Annotation, and Assessment was developed by LDC and contains the English and Spanish source data (text, video, images), manual annotations, reference knowledge graphs, the system output assessed during the evaluation, and human assessment results from the Phase 1 evaluation of the DARPA KAIROS Program. The Phase 1 evaluation focused on the improvised explosive bombing scenario with nine complex events and two surprise complex events in the mass shooting scenario.

Source data for each complex event consisted of 10-15 documents that included multimodal English and Spanish event-relevant and off-topic distractor documents. Manual annotation and assessment of event-relevant documents for 10 complex events are included in this release. Scenario-relevant events and relations were labeled for each document to develop a structured representation of temporally-ordered events, relations and arguments that expressed the scenario-relevant events in each complex event. A reference knowledge graph (Graph G) was developed for each event; systems were expected to match the Graph G with a given schema library. Assessment data includes human assessment judgments and the system output that was manually assessed for the end-to-end evaluation task.

The DARPA KAIROS (Knowledge-directed Artificial Intelligence Reasoning Over Schemas) program aimed to build technology capable of understanding and reasoning about complex real-world events in order to provide actionable insights to end users. KAIROS systems utilized formal event representations in the form of schema libraries that specified the steps, preconditions and constraints for an open set of complex events; schemas were then used in combination with event extraction to characterize and make predictions about real-world events in a large multilingual, multimedia corpus.

2026 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Multi-Language Conversational Telephone Speech 2014 – Spanish & Portuguese was developed by LDC and is comprised of 123 hours of Spanish and Portuguese telephone speech. The data was collected to support research and technology evaluation in automatic language identification; portions of these recordings were used in the NIST 2015 and 2017 language recognition evaluations. The collection focused on language pair discrimination for 20 languages/dialects, some of which could be considered mutually intelligible or closely related.

This corpus contains 569 recordings covering Brazilian Portuguese, Caribbean Spanish, European Spanish and Latin American Spanish. Participants were recruited by native speakers who contacted acquaintances in their social network. Those native speakers made one call, up to 8 minutes, to each acquaintance. Human auditors labeled the calls for language, quality, callee gender, dialect type and noise.

2026 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

LORELEI Multiway Translated Text consists of a fixed set of English texts (around 100,000 words) translated into 24 languages. It was developed by LDC for the DARPA LORELEI Program; the translations were included in the LORELEI representative language packs created by LDC in 2016-2019.

The common word set was composed of English news documents (50%), LORELEI-domain English news documents (25%), and a phrasebook and elicitation corpus (25%). The phrasebook contained everyday colloquial phrases. The elicitation corpus was designed to represent linguistic structures. Texts were translated by a combination of professional translators and crowd-sourced translators.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations like natural disasters or disease outbreaks. Linguistic resources for LORELEI include Representative Language Packs and Incident Language Packs for over two dozen low resource languages, comprising data, annotations, basic natural language processing tools, lexicons and grammatical resources. Representative languages were selected to provide broad typological coverage, while incident languages were selected to evaluate system performance on a language whose identity was disclosed at the start of the evaluation.

2026 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Sunday, May 17, 2026

LDC May 2026 Newsletter

New publications:

MADCAT Phases 1-3 Composite Evaluation Set

CALLHOME German Second Edition

CALLHOME German Lexicon Second Edition

___________________________________________________________________

New publications:

MADCAT Phases 1-3 Composite Evaluation Set contains the evaluation data created by LDC for Phases 1-3 of the DARPA MADCAT program and the NIST OpenHaRT 2010 and 2013 evaluations. It consists of handwritten Arabic documents scanned at high resolution and annotated for the physical coordinates of each line and token, digital transcripts, and English translations with content and annotation layers integrated in a single MADCAT XML output.

This release includes 1,643 images and corresponding annotation files. Source documents were web text and newswire collected by LDC. Arabic speaking scribes copied documents by hand, following specific instructions as to the writing style, writing implement and paper. Each page was scanned and the images annotated.

The goal of the MADCAT program was to automatically convert foreign language text images into English transcripts for use by humans and downstream processes, including summarization and information extraction. The core evaluation task in MADCAT was the translation of handwritten Arabic documents.

2026 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

CALLHOME German Second Edition was developed by LDC and contains 48 hours of speech from 100 unscripted telephone conversations between native German speakers. This publication is a re-release of the original CALLHOME German collection, combining CALLHOME German Speech (LDC97S43) and CALLHOME German Transcripts (LDC97T15), with additional transcription and updated directory structure, file formats, and documentation.

This release contains the 100 telephone conversations published in CALLHOME German Speech which represented training data (80 calls) and development data (20 calls). Participants spoke on topics of their choice in a single telephone call lasting up to 30 minutes. Calls were manually audited for language, recording quality, channel characteristics, dialect, and region. For this second edition, all audio was converted from SPHERE files to FLAC format, and the original training/development partitioning was removed.

2026 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

CALLHOME German Lexicon Second Edition was developed by LDC and contains 318,809 German words with morphological, phonological, stress and frequency information. This second edition updates file formats, directory structure and documentation. The first edition is available as CALLHOME German Lexicon (LDC97L18).

The words in the lexicon were derived from the CELEX German lexicon (CELEX2 (LDC96L14)) and from 100 training and development transcripts representing unscripted telephone conversations between native German speakers contained in CALLHOME German Second Edition (LDC2026S04).

The lexicon has seven tab-separated information fields: (1) headword: orthographic form; (2) morph: morphological analysis of the headword; (3) pron: pronunciation of the headword; (4) stress: primary stress information of the word; (5) celex: whether the headword appears in the CELEX German lexicon; (6) train_freq: frequency of the headword in the CALLHOME training transcripts; and (7) dev_freq: frequency of the headword in the CALLHOME development transcripts. This release also includes a pronunciation dictionary derived from the lexicon in CMUdict format.
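As an illustration of working with the seven fields listed above, here is a small Python sketch that sums each headword's training and development frequencies. It assumes the file has no header row and that the two frequency fields hold plain integers; both are assumptions to verify against the release documentation. The function name and sample rows are hypothetical, and the sample values are placeholders rather than real lexicon content.

```python
import csv
import io

# Field order as described in the newsletter text.
FIELDS = ["headword", "morph", "pron", "stress", "celex", "train_freq", "dev_freq"]


def read_lexicon(handle):
    """Yield one dict per lexicon line, with frequencies converted to int.

    Assumption: no header row and integer-valued frequency fields.
    """
    reader = csv.DictReader(
        handle, fieldnames=FIELDS, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        row["train_freq"] = int(row["train_freq"])
        row["dev_freq"] = int(row["dev_freq"])
        yield row


# Made-up rows for illustration only; not taken from the corpus.
SAMPLE = (
    "wort_a\tm1\tp1\ts1\tc1\t12\t3\n"
    "wort_b\tm2\tp2\ts2\tc2\t0\t5\n"
)

# Combined frequency of each headword across both transcript sets.
totals = {
    row["headword"]: row["train_freq"] + row["dev_freq"]
    for row in read_lexicon(io.StringIO(SAMPLE))
}
print(totals)
```

The celex field is left uninterpreted here because the newsletter does not say how membership in the CELEX lexicon is encoded.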

2026 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.

Wednesday, April 15, 2026

LDC April 2026 Newsletter

New publications:

DEFT Chinese and English Light and Rich ERE Parallel Annotation

MATERIAL Tagalog-English Language Pack

LORELEI Somali Representative Language Pack

____________________________________________________________________

New publications:

DEFT Chinese and English Light and Rich ERE Parallel Annotation was developed by LDC and consists of 179 Chinese discussion forum documents and their English translations annotated for entities, relations, and events (ERE). Light ERE annotation labels entity mentions, together with the relations and events between and among those entities, for a target set of entity, relation and event types, and includes coreference. Rich ERE annotation expands types and tagging in the entities, relations, and events annotation tasks and replaces strict event coreference with a more loosely defined event hopper annotation. All 179 Chinese-English document pairs were annotated following Light ERE annotation guidelines; a subset of 171 pairs was also labeled with Rich ERE annotation. The source data and English translations were drawn from BOLT Chinese Discussion Forum Parallel Training Data (LDC2017T05), originally collected and translated by LDC under the DARPA BOLT program.

DARPA's Deep Exploration and Filtering of Text (DEFT) program aimed to address remaining capability gaps in state-of-the-art natural language processing technologies related to inference, causal relationships and anomaly detection. LDC supported the DEFT program by collecting, creating and annotating a variety of data sources.

2026 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

MATERIAL Tagalog-English Language Pack was developed by Appen for the IARPA MATERIAL program and contains 100 hours of Tagalog conversational telephone speech, transcripts, English translations, annotations and queries. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments. Transcripts cover approximately 30% of the speech files, 2% of which were translated into English. This release also includes domain annotations, English queries and their relevance annotations.

The MATERIAL program focused on underserved languages with the ultimate goal of building cross-language information retrieval systems to find speech and text content using English search queries.

2026 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.

LORELEI Somali Representative Language Pack contains over 13 million words of Somali monolingual text, 800,000 words of which were translated into English, and 106,000 Somali words translated from English data. Approximately 73,000 words were annotated for simple named entities, around 23,000 words were annotated for full entity (including nominals and pronouns), and over 10,000 words were covered by noun phrase chunking annotation. Data was collected from discussion forum, news, reference, social network, and weblog sources.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

2026 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Thursday, March 19, 2026

LDC March 2026 Newsletter

LDC data and commercial technology development

New publications:

Ancient Chinese WordNet

CALLHOME Spanish Second Edition

CALLHOME Spanish Lexicon Second Edition

________________________________________________________________

LDC data and commercial technology development

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

Ancient Chinese WordNet was developed by Nanjing Normal University and contains lexical and semantic information for Ancient Chinese vocabulary from the Pre-Qin period (before 221 BCE). The WordNet comprises 38,781 word forms and 55,100 senses, each manually linked to a corresponding synset in Princeton WordNet 1.6 and covering 22 noun categories, 15 verb categories, and additional adjective and adverb categories. The Ancient Chinese WordNet project began in 2012 with the goal of creating a structured lexical database to support linguistic research and natural language processing applications involving historical Chinese language materials.

2026 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

CALLHOME Spanish Second Edition was developed by LDC and contains 38 hours of speech from 120 unscripted telephone conversations between native Spanish speakers. This publication is a re-release of the original CALLHOME Spanish collection, combining CALLHOME Spanish Speech (LDC96S35) and CALLHOME Spanish Transcripts (LDC96T17), with additional transcription and updated directory structure, file formats, and documentation.

This corpus contains the 120 calls from CALLHOME Spanish Speech which represented training and development data and a subset of evaluation data. Participants spoke on topics of their choice in a single telephone call lasting up to 30 minutes. Calls were manually audited for language, recording quality, channel characteristics, dialect, and region. For this second edition, all audio was converted from SPHERE files to FLAC format, and the original training/development/test partitioning was removed.

2026 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

CALLHOME Spanish Lexicon Second Edition was developed by LDC and contains 45,547 Spanish words with morphological, phonological, stress and frequency information. This second edition updates file formats, directory structure and documentation. The first edition is available as CALLHOME Spanish Lexicon (LDC96L16). The words in the lexicon were derived from 80 transcripts representing unscripted telephone conversations between native Spanish speakers contained in CALLHOME Spanish Second Edition (LDC2026S04) and from various Spanish news texts.

The lexicon contains nine tab-separated information fields: (1) headword: orthographic form; (2) morph: morphological analysis of the headword; (3) pron: pronunciation of the headword; (4) stress: primary stress information of the word; (5) callh freq: frequency of the headword in CALLHOME transcripts; (6) madrid freq: frequency of the headword in Madrid Radio transcripts; (7) ap freq: frequency of the headword in Associated Press newswire; (8) reut freq: frequency of the headword in Reuters newswire; and (9) norte freq: frequency of the headword in El Norte newswire.

This release also includes a pronunciation dictionary derived from the lexicon in CMUdict format and the grapheme-to-phoneme (G2P) tools used to automatically generate pronunciations for the original lexicon.
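For readers unfamiliar with CMUdict-style dictionaries, the Python sketch below groups pronunciations by word. It assumes the widely used CMUdict conventions: one entry per line, the word followed by whitespace and space-separated phones, alternate pronunciations marked with a parenthesized index such as `(2)`, and comment lines starting with `;;;`. Whether the dictionary in this release follows every one of these conventions is an assumption to check against its documentation. The function name is hypothetical, and the sample entries and phone symbols are made up.

```python
import re
from collections import defaultdict

# Matches a variant marker such as "(2)" at the end of an entry key.
VARIANT = re.compile(r"\(\d+\)$")


def parse_cmudict(lines):
    """Map each word to a list of pronunciations (each a list of phones)."""
    prons = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue  # skip blank lines and comment lines
        key, *phones = line.split()
        word = VARIANT.sub("", key)  # fold "casa(2)" into "casa"
        prons[word].append(phones)
    return dict(prons)


# Made-up entries for illustration only; not taken from the corpus.
SAMPLE = [
    ";;; illustrative entries",
    "casa  k a s a",
    "casa(2)  k a z a",
    "sol  s o l",
]

lexicon = parse_cmudict(SAMPLE)
print(lexicon["casa"])
```

Folding variant markers into a single key keeps all pronunciations of a word together, which is usually what a recognizer or aligner front end needs.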

2026 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.

Monday, February 16, 2026

LDC February 2026 Newsletter

LDC membership discounts expire March 2

Spring 2026 data scholarship recipient

New publications:

2022 NIST Language Recognition Evaluation Test and Development Sets

KAIROS Schema Learning Background Source Data

LORELEI Russian Representative Language Pack

_________________________________________________________________

LDC membership discounts expire March 2

Time is running out to save on 2026 membership fees. Renew your LDC membership, rejoin the Consortium, or become a new member by March 2 to receive a 10% discount. For more information on membership benefits and options, visit Join LDC.

Spring 2026 data scholarship recipient

Congratulations to the recipient of LDC’s Spring 2026 data scholarship:

Doma Akshitha Reddy: Chaitanya Bharathi Institute of Technology (India): Bachelor of Engineering, Information Technology. Doma is awarded copies of TIMIT Acoustic-Phonetic Continuous Speech Corpus and The CMU Kids Corpus for their work in child speech.

Since 2010, LDC has awarded scholarships to successful student applicants twice each year. To date more than 242 corpora have been distributed to 162 students across 38 countries. We proudly celebrate their achievements and the contributions their research has made to the broader community.

The next round of applications will be accepted in September 2026. For information about the program, visit the Data Scholarships page.

New publications:

2022 NIST Language Recognition Evaluation Test and Development Sets was developed by LDC and NIST and contains the test and development data, metadata, answer keys, and documentation for the 2022 NIST Language Recognition Evaluation (LRE22). The source data is comprised of 222 hours of conversational telephone speech (CTS) and broadcast narrowband speech (BNBS) in 14 languages: Afrikaans, Tunisian Arabic, Algerian Arabic, Libyan Arabic, South African English, Indian-accented South African English, North African French, Ndebele, Oromo, Tigrinya, Tsonga, Venda, Xhosa and Zulu.

For the CTS collections, a small number of native speakers made single calls to multiple individuals in their social network. Calls lasted 8-15 minutes; speakers were free to discuss any topic. The BNBS data was collected from streaming radio programming, focused on broadcasts that included narrowband speech (e.g., call-ins to a talk show). Portions of the CTS callee call sides and portions of each broadcast recording were manually audited by native speakers to verify language and quality.

LRE22 emphasized language recognition for African languages, including low resource languages, and expanded the range of test segment durations. Further information about the 2022 evaluation can be found in the 2022 NIST Language Recognition Evaluation Plan.

2026 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

KAIROS Schema Learning Background Source Data was developed by LDC and includes 14,000 English and Spanish documents representing text, audio, video, image, and multimedia resources collected during the DARPA KAIROS program as supplemental background source data for the KAIROS Schema Learning Corpus (SLC). The purpose of the supplemental collection was to increase the amount of English and Spanish data with multimedia components for schema learning and to add domains not well represented in existing Spanish data. The supplemental data in this release includes material from the business and logistics domains, instructional documents and multimedia news.

The complete set of SLC background source data (including the data in this publication) totaled 16.2 million English, Russian and Spanish documents and more than 125,000 audio, video, image, or multimedia resources. A large portion of that data was drawn from pre-existing LDC datasets.

The SLC and KAIROS Schema Learning Complex Event Annotation (LDC2025T07), containing English and Spanish text, audio, video, and image material labeled for 93 real-world complex events, constitute the data used by KAIROS system developers for schema learning.

KAIROS systems utilized formal event representations in the form of schema libraries that specified the steps, preconditions and constraints for an open set of complex events; schemas were then used in combination with event extraction to characterize and make predictions about real-world events in a large multilingual, multimedia corpus.

2026 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

LORELEI Russian Representative Language Pack contains over 1.26 billion words of Russian monolingual text, 360,000 words of which were translated into English, 3 million words of found Russian-English parallel text, and 87,000 Russian words translated from English data. Approximately 83,000 words were annotated for simple named entities; around 26,000 words were annotated for full entity (including nominals and pronouns), entity linking and situation frames (identifying entities, needs and issues); and nearly 9,000 words were covered by noun phrase chunking annotation. Data was collected from discussion forum, news, reference, social network, and weblog sources.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

2026 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Thursday, January 15, 2026

LDC January 2026 Newsletter

Renew your LDC membership today

New publications:

CALLHOME Japanese Second Edition

CALLHOME Japanese Lexicon Second Edition

MATERIAL Swahili-English Language Pack

_____________________________________________________________

Renew your LDC membership today

The importance of curated resources for language-related education, research and technology development drives LDC’s mission to create them, to accept data contributions from researchers across the globe and to broadly share such resources through the LDC Catalog. LDC members enjoy no-cost access to new corpora released annually, as well as the ability to license legacy data sets from among our 1000 holdings at reduced fees. Ensure that your data needs continue to be met by renewing your LDC membership or by joining the Consortium today.

Now through March 2, 2026, any organization that joins the Consortium or renews their membership will receive a 10% discount off the 2026 membership fee. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for more details on membership options and benefits.

New publications:

CALLHOME Japanese Second Edition was developed by LDC and contains 49 hours of speech from 120 telephone conversations between native Japanese speakers. This publication is a re-release of the original CALLHOME Japanese collection, combining CALLHOME Japanese Speech (LDC96S37) and CALLHOME Japanese Transcripts (LDC96T18), with additional transcription and updated directory structure, file formats, and documentation.

This corpus contains the 120 calls from CALLHOME Japanese Speech which represented training and development data and a subset of evaluation data. Participants spoke on topics of their choice in a single telephone call lasting up to 30 minutes. Calls were manually audited for language, recording quality, channel characteristics, dialect, and region. For this second edition, all audio was converted from SPHERE files to FLAC format, and the original training/development/test partitioning was removed.

2026 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

CALLHOME Japanese Lexicon Second Edition was developed by LDC and contains 80,688 Japanese words with morphological, phonological and stress information. This second edition updates file formats, directory structure and documentation. The first edition is available as CALLHOME Japanese Lexicon (LDC96L17). The words in the lexicon were derived from 80 transcripts representing telephone conversations between native Japanese speakers contained in CALLHOME Japanese Second Edition (LDC2026S02).

The lexicon contains seven tab-separated information fields: (1) headword: orthographic form in kanji or katakana or hiragana (if only written in hiragana); (2) hiragana: orthographic form in hiragana; (3) romanization: orthographic form in romaji; (4) pron: pronunciation of the headword; (5) morph: morphological analysis of the headword; (6) train freq: frequency of the headword in the transcripts; and (7) gloss: glosses of the headword. This release also includes a pronunciation dictionary derived from the lexicon in CMUdict format and the grapheme-to-phoneme (G2P) tools used to automatically generate pronunciations for the original lexicon.

2026 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.

MATERIAL Swahili-English Language Pack was developed by Appen for the IARPA MATERIAL program and contains 112 hours of Swahili conversational telephone speech, transcripts, English translations, annotations and queries. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments. Transcripts cover approximately 30% of the speech files, 3% of which were translated into English. This release also includes domain annotations, English queries and their relevance annotations.

The MATERIAL program focused on underserved languages with the ultimate goal of building cross-language information retrieval systems to find speech and text content using English search queries.

2026 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.