Monday, February 15, 2021

LDC 2021 February Newsletter

2021 Membership Discounts Expire March 1 

New Publications:
Althingi Parliamentary Speech
Penn Discourse Treebank 2.0 – German Translation 
TAC-KBP English Surprise Slot Filling – Comprehensive Training and Evaluation Data 2010
________________________________________________________________________


2021 Membership Discounts Expire March 1

Time is running out to save on 2021 membership fees. Renew your LDC membership, rejoin the Consortium, or become a new member by March 1 to receive a discount of up to 10%. For more information on membership benefits and options, visit Join LDC.

New publications:


(1) Althingi Parliamentary Speech consists of approximately 540 hours of recorded speech from Althingi, the Icelandic Parliament, along with corresponding transcripts, a pronunciation dictionary, and language models. Speeches date from 2005-2016. This data set was collected in 2016 by the ASR for Althingi project at Reykjavik University in collaboration with the Althingi speech department. The purpose of that project was to develop an ASR (automatic speech recognition) system for Icelandic parliamentary speech to replace the procedure of manually transcribing performed speeches. 

The mean speech length is 6 minutes, with speeches ranging from under 1 minute up to around 30 minutes. The corpus features 197 speakers (105 male, 92 female) and is split into training, development, and evaluation sets. 

Althingi Parliamentary Speech is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*


(2) Penn Discourse Treebank 2.0 – German Translation was developed at the University of Potsdam’s Applied Computational Linguistics group and consists of approximately one million tokens derived from Penn Discourse Treebank Version 2.0 (LDC2008T05) translated into German and annotated for shallow discourse relations. The aim of the Penn Discourse Treebank project is to annotate the Wall Street Journal section in Treebank-2 (LDC95T7) with discourse relations. PDTB-German is based on a subset of PDTB2.0 used in the 2016 CoNLL Shared Task on Multilingual Shallow Discourse Parsing.

Data is in CoNLL format. Text was automatically translated with DeepL, and the annotations were projected using word alignments produced with GIZA++.

Penn Discourse Treebank 2.0 – German Translation is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*


(3) TAC-KBP English Surprise Slot Filling – Comprehensive Training and Evaluation Data 2010 contains the training and evaluation data (queries, manual runs, final assessment results) produced by LDC to support the Surprise Slot Filling Track in 2010, the only year in which the track was run.

The regular English Slot Filling track involved mining information about entities from text using a specified set of "slots" or attributes. The goal of the Surprise Slot Filling task was to support the development of information extraction systems that could rapidly adapt to new types of relations and events. Surprise Slot Filling participants were given four new slot types -- "diseases", "awards-won" and "charity-supported" for persons, and "products" for organizations -- along with annotation guidelines and training data. They were instructed to develop their systems and to run them on the source collection in four days.

The corresponding source document collections cover English newswire, broadcast material, and web text. These documents are included in TAC KBP Comprehensive English Source Corpora 2009-2014 (LDC2018T03). The corresponding Knowledge Base (KB) for much of the data - a 2008 snapshot of Wikipedia - is contained in TAC KBP Reference Knowledge Base (LDC2014T16).

TAC-KBP English Surprise Slot Filling – Comprehensive Training and Evaluation Data 2010 is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

 

Friday, January 15, 2021

LDC 2021 January Newsletter

Renew Your LDC Membership Today

New Publications:
LORELEI Akan Representative Language Pack
ATIS – Seven Languages
BOLT English Treebank – SMS/Chat

_____________________________________________________________________


Renew Your LDC Membership Today 
Curated language resources are more important than ever to support research and language technology development, including the expanding fields around remote work, pandemic-related technologies, and non-contact interactions. LDC members enjoy no-cost access to 30+ new corpora released annually, as well as the ability to license legacy data sets at reduced fees. Ensure that your data needs continue to be met by renewing your LDC membership or by joining the Consortium today. 

Now through March 1, 2021, 2020 members receive a 10% discount on 2021 membership, and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for more details on membership options and benefits. 


New publications:


(1) LORELEI Akan Representative Language Pack consists of Akan monolingual text, Akan-English parallel text, annotations, supplemental resources, and related software tools developed by LDC for the DARPA LORELEI program.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations like natural disasters or disease outbreaks. Linguistic resources for LORELEI include Representative Language Packs and Incident Language Packs for over two dozen low resource languages, comprising data, annotations, basic natural language processing tools, lexicons, and grammatical resources. Representative languages were selected to provide broad typological coverage, while incident languages were selected to evaluate system performance on a language whose identity was disclosed at the start of the evaluation.

Data was collected from discussion forum, news, reference, social network, and weblog sources. Data volumes are as follows:

  • Over 3.3 million words of Akan monolingual text, all of which were translated into English
  • 115,000 Akan words translated from English data


Approximately 2,300 words were annotated for named entities, full entity including nominals and pronouns, entity linking, simple semantic annotation, and situation frame annotation (identifying entities, needs, and issues). Around 2,000 words have morphological segmentation annotation.

LORELEI Akan Representative Language Pack is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*


(2) ATIS – Seven Languages was developed by Amazon Web Services, Inc. and consists of 5,871 English utterances from ATIS (Air Travel Information Services) corpora, specifically ATIS2 (LDC93S5), ATIS3 Training Data (LDC94S19), and ATIS3 Test Data (LDC95S26), translated into six languages: Spanish, German, French, Portuguese, Chinese, and Japanese.

The ATIS collection was developed to support the research and development of speech understanding systems. Participants were presented with various hypothetical travel planning scenarios and asked to solve them by interacting with partially or completely automated ATIS systems. The resulting utterances were recorded and transcribed. Data was collected in the early 1990s at five US sites: Raytheon BBN, Carnegie Mellon University, MIT Laboratory of Computer Science, National Institute for Standards and Technology, and SRI International.

The data is separated into 4,978 utterances for training and 893 utterances for testing following the original ATIS division. The source English utterances were manually translated into the six languages and are included in this release. Each utterance was annotated with named entities via table lookup; markers include city, airline, airport names, and dates.

ATIS – Seven Languages is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data at no cost.

*


(3) BOLT English Treebank – SMS/Chat was developed by LDC and consists of English SMS and text chat data with part-of-speech and syntactic structure annotation.

The source data consists of 115,667 tokens/words in 484 files of English SMS and text chat collected by LDC using two methods: new collection via LDC's collection platform and donation of SMS or chat archives from BOLT collection participants. 

All data was annotated for word-level tokenization, part-of-speech, and syntactic structure. Annotation conformed to Penn Treebank II style, incorporating changes to those guidelines that were developed under the GALE (Global Autonomous Language Exploitation) program. Those changes primarily concerned the tokenization of hyphenated words, the part-of-speech and tree changes necessitated by the tokenization changes, and updates to the syntactic annotation to comply with updated annotation guidelines. Supplementary guidelines for English treebanks and web text are included with this release.

The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging, and chat -- in Chinese, Egyptian Arabic, and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.

BOLT English Treebank – SMS/Chat is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

 

Tuesday, December 15, 2020

LDC 2020 December Newsletter

LDC 2021 Membership Discounts Now Available 
Approaching Deadline for Spring 2021 Data Scholarship Applications
LDC Closed for Winter Break Dec. 24-Jan. 5

New Publications:
BOLT English Co-reference – Discussion Forum, SMS/Chat, and Conversational Telephone Speech 
Phonemes of Arabic
Global TIMIT Mandarin Chinese – Guanzhong Dialect

_______________________________________________________________________

LDC 2021 Membership Discounts Now Available
Now through March 1, 2021, current 2020 members receive a 10% discount for renewing their membership and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for details on membership options and benefits. 

Approaching Deadline for Spring 2021 Data Scholarship Applications 
Attention students: don’t miss out on the chance to receive no-cost access to LDC data for your research. Applications for Spring 2021 data scholarships are due January 15, 2021. For more information on requirements and program rules, see LDC Data Scholarships.

LDC Closed for Winter Break Dec. 24-Jan. 5
LDC will be closed from Thursday, December 24, 2020 through Tuesday, January 5, 2021 in accordance with the University of Pennsylvania Winter Break Policy. Our offices will reopen on Wednesday, January 6, 2021. Requests received by the Membership Office during Winter Break will be processed when the office reopens.  


New publications:
(1) BOLT English Co-reference – Discussion Forum, SMS/Chat, and Conversational Telephone Speech was developed by Raytheon BBN Technologies for the BOLT co-reference task and consists of co-reference annotation on English discussion forum, SMS/Chat, and conversational telephone speech. 

Co-reference annotation aims to fill in connections between specific mentions in the text that refer to the same entities and events in the discourse context. BOLT co-reference annotation was performed on BOLT treebank annotation and covers noun phrases (including proper nouns, nominals, pronouns, and null arguments), possessives, proper noun pre-modifiers, and verbs. 

The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging, and chat -- in Chinese, Egyptian Arabic, and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking, and co-reference.

BOLT English Co-reference – Discussion Forum, SMS/Chat, and Conversational Telephone Speech is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) Phonemes of Arabic was developed at the Florida Institute of Technology. It contains approximately one hour of speech from native Arabic speakers that includes all Arabic sounds (consonants and vowels) and 24 words with specific consonant-vowel patterns. 

 

Arabic has three short vowels, three long vowels, and 28 consonants. Speakers recorded all sounds, repeating each sound three times. Each speaker also recorded 24 Arabic words with a specified consonant-vowel pattern and repeated each word three times. The speakers (19 male) were from the following countries: Egypt, Iraq, Lebanon, Libya, Morocco, Saudi Arabia, and Syria.


Phonemes of Arabic is distributed via web download. 

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee. 

*

(3) Global TIMIT Mandarin Chinese – Guanzhong Dialect was developed by LDC and Xi’an Jiaotong University and consists of approximately five hours of read speech and transcripts in the Guanzhong dialect of Mandarin Chinese as spoken in Shaanxi province. It is comprised of 50 speakers reading 120 sentences from Chinese Gigaword Fifth Edition (LDC2011T13). Among the 120 sentences, 20 sentences were read by all speakers, 40 sentences were read by 10 speakers, and 60 sentences were read by one speaker, for a total of 3220 sentence types.
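The total of 3220 sentence types can be checked with a short calculation. The sketch below is one reading of the design that reproduces the stated figure, not something taken from the corpus documentation: it assumes each of the 50 speakers reads 20 + 40 + 60 = 120 sentences, and that the 40-sentence sets are shared within non-overlapping groups of 10 speakers. The function name and parameters are illustrative.

```python
def sentence_types(speakers, read_by_all, read_by_group, group_size, read_by_one):
    """Count distinct sentences under the assumed Global TIMIT-style design.

    Assumption (not from the corpus documentation): speakers are divided
    into non-overlapping groups of `group_size`, and each group shares its
    own set of `read_by_group` sentences.
    """
    groups = speakers // group_size
    return read_by_all + read_by_group * groups + read_by_one * speakers


# 20 sentences common to everyone, 40 per group of 10, 60 unique per speaker
total = sentence_types(speakers=50, read_by_all=20, read_by_group=40,
                       group_size=10, read_by_one=60)
print(total)  # 20 + 40*5 + 60*50 = 3220
```

Under these assumptions the breakdown is 20 common sentences, 200 group-level sentences, and 3000 single-speaker sentences.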

The corpus was recorded at Xi’an Jiaotong University, Xi’an, China. Speakers (25 female, 25 male) were born in Weinan, Shaanxi and spoke the Guanzhong dialect.

The Global TIMIT project aimed to create a series of corpora in a variety of languages with a similar set of key features as in the original TIMIT data set, which was designed for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. Specifically, those features include:

  • A large number of fluently read sentences, containing a representative sample of phonetic, lexical, syntactic, semantic, and pragmatic patterns
  • A relatively large number of speakers
  • Time-aligned lexical and phonetic transcription of all utterances
  • Some sentences read by all speakers, others read by a few speakers, and others read by just one speaker


Global TIMIT Mandarin Chinese – Guanzhong Dialect is distributed via web download.  

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Monday, November 16, 2020

LDC 2020 November Newsletter

Join LDC for Membership Year 2021
Spring 2021 Data Scholarship Application Deadline

New Publications:
Global TIMIT Learner Simple English
LORELEI Ukrainian Representative Language Pack
TAC KBP Event Argument – Comprehensive Training and Evaluation Data 2016-2017
________________________________________________________________________

Join LDC for Membership Year 2021 

Membership Year 2021 (MY2021) is open and discounts are available for those who keep their membership current and join early. Current MY2020 members who renew their LDC membership before March 1, 2021 will receive a 10% discount off the membership fee. New or returning organizations will receive a 5% discount when joining by March 1.

In addition to receiving new publications, current LDC members also enjoy the benefit of licensing older data at reduced costs from our Catalog of over 850 holdings. Current-year for-profit members may use most data for commercial applications.

Plans for MY2021 publications are in progress. Among the expected releases are:

  • Global TIMIT Mandarin Chinese: 6,000 linguistically rich utterances featuring time-aligned lexical and phonetic transcription
  • Columbia Games Corpus: 12 spontaneous task-oriented dyadic conversations elicited from native Standard American English speakers playing computer games, transcribed and annotated for discourse/pragmatic phenomena
  • My Science Tutor Children’s Conversational Speech: 400+ hours of speech from 1,371 US third, fourth, and fifth grade students participating in sessions with a virtual science tutor, transcripts included
  • The SSNCE Database of Tamil Dysarthric Speech: Tamil speech from 20 dysarthric speakers aged 12-40 years and a control group (10 speakers) with time-aligned phonetic transcripts
  • Icelandic Parliamentary Speech: 6,493 Icelandic Parliament recordings from 2005-2016 with 196 speakers, aligned and segmented and divided into training, development, and evaluation sets for ASR development
  • LORELEI: representative and incident language packs containing monolingual text, bi-text, translations, annotations, supplemental resources, and related tools (Akan, Kinyarwanda, and Wolof)
  • BOLT: co-reference, treebank, propbank, and translation resources for discussion forum, SMS/Chat, and conversational telephone speech data in all languages (Chinese, Egyptian Arabic, and English)
  • TAC KBP: training and evaluation data for English surprise slot filling (2010) and English sentiment slot filling (2013-2014) tasks 


It’s also not too late to join for MY2019 (through December 31, 2020) and MY2020 (through December 31, 2021). Data sets from those years include Penn Discourse Treebank Version 3.0, DEFT Committed Belief Annotation (Chinese, English, Spanish), 2018 NIST Speaker Recognition Evaluation Test Set, Mixer 4 and 5 Speech, AMR Annotation Release 3.0, and Penn Parsed Corpora of Historical English.

For full descriptions of all LDC data sets, browse our Catalog.  

Visit Join LDC for details on membership, user accounts and payment.


Spring 2021 Data Scholarship Application Deadline
Applications are now being accepted through January 15, 2021 for the Spring 2021 LDC Data Scholarship program which provides university students with no-cost access to LDC data. Consult the LDC Data Scholarship page for more information about program rules and submission requirements.


New publications:
(1) Global TIMIT Learner Simple English was developed by LDC and Shanghai Jiao Tong University and consists of approximately 12 hours of L1 and L2 English read speech and transcripts. It is comprised of two separate data sets of 50 speakers reading 120 sentences from TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1) deemed “simple” to read by Chinese learners of English. Among the 120 sentences, 20 sentences were read by all speakers, 40 sentences were read by 10 speakers, and 60 sentences were read by one speaker, for a total of 820 sentence types. 


L1 Simple English was recorded at the University of Pennsylvania, USA; participants were 25 female and 25 male native American English speakers. L2 Simple English was recorded at Shanghai Jiao Tong University, China. L2 speakers (25 female, 25 male) were Chinese learners of English considered fluent.


The Global TIMIT project aimed to create a series of corpora in a variety of languages with a similar set of key features as in the original TIMIT data set, which was designed for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. Specifically, those features include:

  • A large number of fluently read sentences, containing a representative sample of phonetic, lexical, syntactic, semantic, and pragmatic patterns
  • A relatively large number of speakers
  • Time-aligned lexical and phonetic transcription of all utterances
  • Some sentences read by all speakers, others read by a few speakers, and others read by just one speaker

Global TIMIT Learner Simple English is distributed via web download.  

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) LORELEI Ukrainian Representative Language Pack consists of Ukrainian monolingual text, Ukrainian-English parallel and comparable text, annotations, supplemental resources, and related software tools developed by LDC for the DARPA LORELEI program.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations like natural disasters or disease outbreaks. Linguistic resources for LORELEI include Representative Language Packs and Incident Language Packs for over two dozen low resource languages, comprising data, annotations, basic natural language processing tools, lexicons, and grammatical resources. Representative languages were selected to provide broad typological coverage, while incident languages were selected to evaluate system performance on a language whose identity was disclosed at the start of the evaluation.


Data was collected from discussion forum, news, reference, social network, and weblog sources. Data volumes are as follows:

  • 111 million words of Ukrainian monolingual text, approximately 700,000 words of which were translated into English
  • 86,000 Ukrainian words translated from English data
  • 174,000 words of found parallel text
  • over 2,000,000 words of comparable text

Approximately 75,000 words were annotated for named entities and up to 50,000 words contain additional annotation, including situation frames (identifying entities, needs and issues) and entity linking and detection.

LORELEI Ukrainian Representative Language Pack is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) TAC KBP Event Argument – Comprehensive Training and Evaluation Data 2016-2017 was developed by LDC and contains training and evaluation data produced in support of the 2016 TAC KBP Event Argument Linking Pilot and Evaluation tasks and the 2017 Event Argument Linking Training Evaluation task. 

The Event Argument Extraction and Linking task required systems to extract event arguments (entities or attributes playing a role in an event) from unstructured text, indicate the role they play in an event, and link the arguments appearing in the same event to each other. Since the extracted information must be suitable as input to a knowledge base, systems constructed tuples indicating the event type, the role played by the entity in the event, and the most canonical mention of the entity from the source document. The event types and roles were drawn from an externally specified ontology of 31 event types, which included financial transactions, communication events, and attacks.

This corpus includes source documents, manual runs, assessments and event hoppers, a form of identity coreference for events. Source data is Chinese, English, and Spanish newswire and discussion forum text collected by LDC.

TAC KBP Event Argument – Comprehensive Training and Evaluation Data 2016-2017 is distributed via web download. 

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee. 

Thursday, October 15, 2020

LDC 2020 October Newsletter

Fall 2020 Data Scholarship Recipients 

Membership Year 2021 Publication Preview 

LDC data and commercial technology development  
 
New Publications: 
Global TIMIT Learner Treebank English 
Corpus of Law, Academic, and News 
IARPA Babel Mongolian Language Pack IARPA-babel401b-v2.0b 

__________________________________________________________________

 
Fall 2020 Data Scholarship Recipients
Congratulations to the recipients of LDC's Fall 2020 data scholarships: 
 
Nicole Dodd: University of California, Davis (USA); MA, Linguistics. Nicole is awarded a copy of Arabic Treebank Part 3 v. 3.2 LDC2010T08 for her research in relative clause processing in Standard Arabic. 
 
Satwik Dutta: University of Texas at Dallas (USA); PhD, Electrical Engineering. Satwik is awarded copies of The CMU Kids Corpus LDC97S63 and CSLU: Kids’ Speech Version 1.1 LDC2007S18 for his work in speech activity detection.
 
Pedram Hosseini: George Washington University (USA); PhD, Computer Science. Pedram is awarded copies of Penn Discourse Treebank Version 3.0 LDC2019T05 and The New York Times Annotated Corpus LDC2008T19 for his research in automatic detection of causal relations in text.
 
Mariano Maisonnave: Universidad Nacional del Sur (Argentina); PhD, Computer Science. Mariano is awarded a copy of ACE 2005 Multilingual Training Corpus LDC2006T06 for his work in event extraction.  
 
Mark Sullivan: California State University, Los Angeles (USA); Masters, Applied and Advanced Studies in Education. Mark is awarded a copy of ETS Corpus of Non-Native Written English LDC2014T06 for his research in sentence boundary problems of Chinese L1 speakers in English compositions.  
 
For information about the program, visit the Data Scholarships page. 
 
Membership Year 2021 Publication Preview
The 2021 Membership Year is just around the corner and plans for next year’s publications are in progress. Among the expected releases are: 

  • Global TIMIT Mandarin Chinese: 6,000 linguistically rich utterances featuring time-aligned lexical and phonetic transcription 
  • Columbia Games Corpus: 12 spontaneous task-oriented dyadic conversations elicited from native Standard American English speakers playing computer games, transcribed and annotated for discourse/pragmatic phenomena 
  • My Science Tutor Children’s Conversational Speech: 400+ hours of speech from 1,371 US third, fourth, and fifth grade students participating in sessions with a virtual science tutor, transcripts included 
  • The SSNCE Database of Tamil Dysarthric Speech: Tamil speech from 20 dysarthric speakers aged 12-40 years and a control group (10 speakers) with time-aligned phonetic transcripts 
  • Icelandic Parliamentary Speech: 6,493 Icelandic Parliament recordings from 2005-2016 with 196 speakers, aligned and segmented and divided into training, development, and evaluation sets for ASR development 
  • LORELEI: representative and incident language packs containing monolingual text, bi-text, translations, annotations, supplemental resources, and related tools (Akan, Kinyarwanda, and Wolof) 
  • BOLT: co-reference, treebank, propbank, and translation resources for discussion forum, SMS/Chat, and conversational telephone speech data in all languages (Chinese, Egyptian Arabic, and English) 
  • TAC KBP: training and evaluation data for English surprise slot filling (2010) and English sentiment slot filling (2013-2014) tasks  

Check your inbox in the coming weeks for more information about membership renewal.  
 
LDC data and commercial technology development 
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.
 
New publications: 

(1) Global TIMIT Learner Treebank English was developed by LDC and LAIX Inc. and consists of approximately 24 hours of L1 and L2 English read speech and transcripts. It is comprised of two separate data sets of 50 speakers reading 120 sentences from Treebank-3 (LDC99T42). Among the 120 sentences, 20 sentences were read by all speakers, 40 sentences were read by 10 speakers, and 60 sentences were read by one speaker, for a total of 3220 sentence types.  
 
L1 English Treebank was recorded at the University of Pennsylvania, USA; participants were 25 female and 25 male native American English speakers. L2 English Treebank was recorded at LAIX Inc., Shanghai, China. L2 speakers (25 female, 25 male) were Chinese learners of English considered fluent and who had passed specified standards on English assessment tests.       
 
The Global TIMIT project aimed to create a series of corpora in a variety of languages with a similar set of key features as in the original TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1), which was designed for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. Specifically, those features include:

  • A large number of fluently read sentences, containing a representative sample of phonetic, lexical, syntactic, semantic, and pragmatic patterns
  • A relatively large number of speakers 
  • Time-aligned lexical and phonetic transcription of all utterances 
  • Some sentences read by all speakers, others read by a few speakers, and others read by just one speaker 

Global TIMIT Learner Treebank English is distributed via web download.   
 
2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee. 


* 


(2) Corpus of Law, Academic and News consists of 400 Persian documents divided into three genres: legal, academic, and news. The legal section contains texts from official publications, including the civil penal code, the criminal penal code, and the constitution of the Islamic Republic of Iran. The academic sub-corpus is comprised of published academic abstracts in various disciplinary areas, such as Art and Humanities, Social Sciences, and Natural Sciences. The news sub-corpus was extracted from an archive of ten Iranian news outlets spanning the period 2010-2020.
 
Each document contains metadata in the file's header with information such as specific text type, dates, and source, and also contains annotations marking title and body paragraphs.  
 
Corpus of Law, Academic and News is distributed via web download. 
 
2020 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.


* 


(3) IARPA Babel Mongolian Language Pack IARPA-babel401b-v2.0b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 204 hours of Halh Mongolian conversational and scripted telephone speech collected in 2014 along with corresponding transcripts. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 61 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle. 

  

The Babel program focused on underserved languages and sought to develop speech recognition technology that could be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.   

  

This is the last release in the IARPA Babel series, which consists of 25 language packs in total.

  

IARPA Babel Mongolian Language Pack IARPA-babel401b-v2.0b is distributed via web download.


2020 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.