LDC data and commercial technology development
New publications:
2015 NIST Language Recognition Evaluation Test Set
The Xi’an Multi-Language Learner Corpus
_________________________________________________________________
LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.
New publications:
2015 NIST Language Recognition Evaluation Test Set was developed by LDC and NIST. It contains the evaluation test set for the 2015 NIST Language Recognition Evaluation (LRE15): approximately 867 hours of conversational telephone speech (CTS) and broadcast narrowband speech (BNBS) collected by LDC in 20 languages across six clusters of related languages: Arabic (Egyptian, Iraqi, Levantine, Maghrebi, Modern Standard Arabic); Spanish (Caribbean, European, Latin American, Brazilian Portuguese); English (British, Indian, General American English); Chinese (Cantonese, Mandarin, Min Nan, Wu); Slavic (Polish, Russian); and French (West African, Haitian Creole).
The CTS data includes calls between individuals in the same social networks lasting 8-15 minutes and telephone speech from the IARPA Babel series collected in 2012-2013 from speakers using a range of phone types in diverse settings with varying noise conditions. The BNBS data was collected by LDC from streaming and satellite radio programming, focusing on programs that included narrowband speech (e.g., call-ins to a talk show).
The goal of NIST's Language Recognition Evaluations is to establish the baseline of current performance capability for CTS language recognition and to lay the groundwork for further research efforts. LRE15 expanded the range of test segment durations and added a test condition that allowed systems to make use of unrestricted training data when developing models.
2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
Data for The Xi’an Multi-Language Learner Corpus was collected in 2023 and 2024 from students at Xi’an International Studies University (XISU) and Yunnan Minzu University (YMU) who were linguistics majors or were studying one of the foreign languages available at XISU and YMU. Off-topic essays and incomplete texts were excluded.
2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.