Showing posts with label English Conversational Telephone Speech.

Wednesday, November 15, 2023

LDC November 2023 Newsletter

Join LDC for Membership Year 2024 

Spring 2024 data scholarship application deadline

New publications:

REMIX Telephone Collection

News Sub-domain Named Entity Recognition

___________________________________________________________________


Join LDC for Membership Year 2024 

It’s time to renew your LDC membership for 2024. Current (2023) members who renew their membership before March 1, 2024 will receive a 10% discount. New or returning organizations will receive a 5% discount if they join the Consortium by March 1.

In addition to receiving new publications, current LDC members enjoy the benefit of licensing older data from our Catalog of 940+ holdings at reduced fees. Current-year for-profit members may use most data for commercial applications.

Plans for 2024 publications are in progress. Among the expected releases are: 

  • KASET: 147 hours of Sorani Kurdish and Kurmanji Kurdish conversational telephone speech and web broadcasts, 65 hours transcribed 
  • AIDA topic source data and annotations: multimodal source data and annotations in multiple languages (Russian, Ukrainian, English, Spanish) for information and entity extraction 
  • RATS Low Speech Density Data: 87 hours of Levantine Arabic, English, Persian, Pushto, and Urdu audio files selected from RATS speech activity detection and keyword spotting data sets, also including communications systems sounds and silence
  • Call My Net 1: 364 hours of conversational telephone speech recordings in Tagalog, Cebuano, Cantonese and Mandarin from speakers in the Philippines and China using various handsets under diverse noise conditions 
  • Ravnursson Faroese Speech and Transcripts: 109 hours of read speech from 433 native speakers with transcripts 
  • Diaspora Tibetan Speech: elicited, read and spontaneous speech from 73 native Tibetan speakers in Kathmandu’s diaspora Tibetan community, some recordings transcribed
  • IARPA MATERIAL language packs: conversational telephone speech, transcripts, English translations, annotations and queries in multiple languages (e.g., Bulgarian, Somali, Georgian)
  • LORELEI: representative and incident language packs containing monolingual text, bi-text, translations, annotations, supplemental resources and related tools in various languages (e.g., Farsi, Hungarian, Hindi, Amharic)

For full descriptions of all LDC data sets, browse our Catalog. Visit Join LDC for details on membership, user accounts and payment.

Spring 2024 data scholarship application deadline

Applications for the Spring 2024 LDC data scholarship program, which provides university students with no-cost access to LDC data, are being accepted through January 15, 2024. Consult the LDC Data Scholarships page for more information about program rules and submission requirements.

New publications:
 
REMIX Telephone Collection was developed by LDC and contains 320 hours of English conversational telephone speech from 358 speakers who had completed all tasks in one of the previous LDC Mixer collections, specifically, Mixers 4-7. The data was collected in 2012; recordings in this corpus were used to support the NIST 2012 Speaker Recognition Evaluation. Speakers completed up to 12 calls, each lasting up to 10 minutes, conversing on suggested topics. They were asked to make half of their calls in a "noisy" environment, e.g., from a speakerphone, a busy street, a noisy store or office, or a room with loud background noise. Speaker metadata is included.

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

*
News Sub-domain Named Entity Recognition was developed at the University of Pennsylvania and contains over 20,000 English news sentences annotated with named entities and categorized into sub-domains. The sentences were extracted from The New York Times Annotated Corpus (LDC2008T19). Named entity annotation was based on the CoNLL-2003 guidelines and annotation scheme. Sentences were labeled with person (PER), location (LOC) and organization (ORG) tags using phrase matching with a manual second pass. Sub-domains are: Arts (+Weekend/Cultural), Business (+Financial), Classifieds (+Obituary), Editorial, Foreign, Metropolitan, Sports and Others. "Others" includes topics such as Real Estate, New Jersey Weekly, Book Review, Job Market, Science, and Health & Fitness.

2023 members can access this corpus through their LDC accounts provided they have submitted a signed copy of the special license agreement. Non-members may license this data for a fee.

Tuesday, December 15, 2020

LDC 2020 December Newsletter

LDC 2021 Membership Discounts Now Available 
Approaching Deadline for Spring 2021 Data Scholarship Applications
LDC Closed for Winter Break Dec. 24-Jan. 5

New Publications:
BOLT English Co-reference – Discussion Forum, SMS/Chat, and Conversational Telephone Speech 
Phonemes of Arabic
Global TIMIT Mandarin Chinese – Guanzhong Dialect

_______________________________________________________________________

LDC 2021 Membership Discounts Now Available
Now through March 1, 2021, current 2020 members receive a 10% discount for renewing their membership and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for details on membership options and benefits. 

Approaching Deadline for Spring 2021 Data Scholarship Applications 
Attention students: don’t miss out on the chance to receive no-cost access to LDC data for your research. Applications for Spring 2021 data scholarships are due January 15, 2021. For more information on requirements and program rules, see LDC Data Scholarships.

LDC Closed for Winter Break Dec. 24-Jan. 5
LDC will be closed from Thursday, December 24, 2020 through Tuesday, January 5, 2021 in accordance with the University of Pennsylvania Winter Break Policy. Our offices will reopen on Wednesday, January 6, 2021. Requests received by the Membership Office during Winter Break will be processed when the office reopens.  


New publications:
(1) BOLT English Co-reference – Discussion Forum, SMS/Chat, and Conversational Telephone Speech was developed by Raytheon BBN Technologies for the BOLT co-reference task and consists of co-reference annotation on English discussion forum, SMS/Chat, and conversational telephone speech. 

Co-reference annotation aims to fill in connections between specific mentions in the text that refer to the same entities and events in the discourse context. BOLT co-reference annotation was performed on BOLT treebank annotation and covers noun phrases (including proper nouns, nominals, pronouns, and null arguments), possessives, proper noun pre-modifiers, and verbs. 

The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging, and chat -- in Chinese, Egyptian Arabic, and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking, and co-reference.

BOLT English Co-reference – Discussion Forum, SMS/Chat, and Conversational Telephone Speech is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) Phonemes of Arabic was developed at the Florida Institute of Technology. It contains approximately one hour of speech from native Arabic speakers that includes all Arabic sounds (consonants and vowels) and 24 words with specific consonant-vowel patterns. 

Arabic has three short vowels, three long vowels, and 28 consonants. Speakers recorded all sounds, repeating each sound three times. Each speaker also recorded 24 Arabic words with a specified consonant-vowel pattern and repeated each word three times. The speakers (19 male) were from the following countries: Egypt, Iraq, Lebanon, Libya, Morocco, Saudi Arabia, and Syria.


Phonemes of Arabic is distributed via web download. 

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee. 

*

(3) Global TIMIT Mandarin Chinese – Guanzhong Dialect was developed by LDC and Xi’an Jiaotong University and consists of approximately five hours of read speech and transcripts in the Guanzhong dialect of Mandarin Chinese as spoken in Shaanxi province. It comprises 50 speakers, each reading 120 sentences from Chinese Gigaword Fifth Edition (LDC2011T13). Of each speaker's 120 sentences, 20 were read by all speakers, 40 were read by 10 speakers, and 60 were read by one speaker, for a total of 3220 sentence types.
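
The sentence-type total can be checked from the reading design. The following is a minimal sketch of that arithmetic, assuming each speaker's 120 sentences divide into 20 shared by all 50 speakers, 40 shared among groups of 10 speakers, and 60 unique to that speaker. The function and variable names are illustrative only and do not come from the corpus documentation.

```python
def count_sentence_types(n_speakers, tiers):
    """Count distinct sentences given (sentences_per_speaker, readers_per_sentence) tiers."""
    total = 0
    for per_speaker, readers in tiers:
        # Total readings in this tier across all speakers.
        readings = per_speaker * n_speakers
        if readings % readers:
            raise ValueError("readings must divide evenly among readers")
        # Each distinct sentence accounts for `readers` readings.
        total += readings // readers
    return total


N_SPEAKERS = 50
# Assumed tiers: (sentences each speaker reads, speakers who read each sentence).
TIERS = [(20, 50), (40, 10), (60, 1)]

total_types = count_sentence_types(N_SPEAKERS, TIERS)
total_readings = N_SPEAKERS * sum(per_speaker for per_speaker, _ in TIERS)
print(total_types, total_readings)
```

Under these assumptions the tiers contribute 20, 200 and 3000 distinct sentences respectively, which matches the stated total of 3220 sentence types.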

The corpus was recorded at Xi’an Jiaotong University, Xi’an, China. Speakers (25 female, 25 male) were born in Weinan, Shaanxi and spoke the Guanzhong dialect.

The Global TIMIT project aimed to create a series of corpora in a variety of languages with key features similar to those of the original TIMIT data set, which was designed for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. Specifically, those features include:

  • A large number of fluently read sentences, containing a representative sample of phonetic, lexical, syntactic, semantic, and pragmatic patterns
  • A relatively large number of speakers
  • Time-aligned lexical and phonetic transcription of all utterances
  • Some sentences read by all speakers, others read by a few speakers, and others read by just one speaker


Global TIMIT Mandarin Chinese – Guanzhong Dialect is distributed via web download.  

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.