Linguistic Data Consortium: 2023

Friday, December 15, 2023

LDC December 2023 Newsletter

LDC 2024 membership discounts now available

Approaching deadline for Spring 2024 data scholarship applications

LDC closed for Winter Break Dec. 25-Jan. 1

New publications:

Kasdi-Merbah (University) Emotional Database in Arabic Speech

TAC-KBP Belief and Sentiment – Comprehensive Training and Evaluation Data 2016-2017
______________________________________________________________

LDC 2024 membership discounts now available

Now through March 1, 2024, current 2023 members receive a 10% discount for renewing their membership, and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for details on membership options and benefits.

Approaching deadline for Spring 2024 data scholarship applications

Attention students: don’t miss out on the chance to receive no-cost access to LDC data for your research. Applications for Spring 2024 data scholarships are due January 15, 2024. For more information on requirements and program rules, see LDC Data Scholarships.

LDC closed for Winter Break Dec. 25-Jan. 1

LDC will be closed from Monday, December 25, 2023 through Monday, January 1, 2024 in accordance with the University of Pennsylvania Winter Break Policy. Our offices will reopen on Tuesday, January 2, 2024. Requests received by the Membership Office during Winter Break will be processed when the office reopens.

New publications:

Kasdi-Merbah (University) Emotional Database in Arabic Speech was developed by the University of Kasdi Merbah Ouargla and contains two hours of Modern Standard Arabic prompted speech from 500 speakers (254 female, 246 male) representing 5,000 utterances. Each speaker read ten sentences, with two sentences each for five different emotions (sadness, fear, anger, happiness, neutral).

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

TAC-KBP Belief and Sentiment – Comprehensive Training and Evaluation Data 2016-2017 includes all training and evaluation data developed by LDC for the Belief and Sentiment tracks: source documents (Chinese, English, and Spanish newswire and discussion forums); gold standard entity, relation, and event annotation; and belief and sentiment annotation.

The goal of the TAC-KBP Belief and Sentiment track was to provide information about beliefs and sentiments held by entities toward other entities, as well as toward events and relations. The gold standard set of labeled entities, relations, and events was used to create a system for automatically labeling belief and sentiment about each possible target (entity, relation or event) and for identifying the entity holding the belief or sentiment.

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Wednesday, November 15, 2023

LDC November 2023 Newsletter

Join LDC for Membership Year 2024

Spring 2024 data scholarship application deadline

New publications:

REMIX Telephone Collection

News Sub-domain Named Entity Recognition

___________________________________________________________________

Join LDC for Membership Year 2024

It’s time to renew your LDC membership for 2024. Current (2023) members who renew their membership before March 1, 2024 will receive a 10% discount. New or returning organizations will receive a 5% discount if they join the Consortium by March 1.

In addition to receiving new publications, current LDC members enjoy the benefit of licensing older data from our Catalog of 940+ holdings at reduced fees. Current-year for-profit members may use most data for commercial applications.

Plans for 2024 publications are in progress. Among the expected releases are:

KASET: 147 hours of Sorani Kurdish and Kurmanji Kurdish conversational telephone speech and web broadcasts, 65 hours transcribed

AIDA topic source data and annotations: multimodal source data and annotations in multiple languages (Russian, Ukrainian, English, Spanish) for information and entity extraction

RATS Low Speech Density Data: 87 hours of Levantine Arabic, English, Persian, Pushto, and Urdu audio files selected from RATS speech activity detection and keyword spotting data sets, also including communications systems sounds and silence

Call My Net 1: 364 hours of conversational telephone speech recordings in Tagalog, Cebuano, Cantonese and Mandarin from speakers in the Philippines and China using various handsets under diverse noise conditions

Ravnursson Faroese Speech and Transcripts: 109 hours of read speech from 433 native speakers with transcripts

Diaspora Tibetan Speech: elicited, read and spontaneous speech from 73 native Tibetan speakers in Katmandu’s diaspora Tibetan community, some recordings transcribed

IARPA MATERIAL language packs: conversational telephone speech, transcripts, English translations, annotations and queries in multiple languages (e.g., Bulgarian, Somali, Georgian)

LORELEI: representative and incident language packs containing monolingual text, bi-text, translations, annotations, supplemental resources and related tools in various languages (e.g., Farsi, Hungarian, Hindi, Amharic)

For full descriptions of all LDC data sets, browse our Catalog. Visit Join LDC for details on membership, user accounts and payment.

Spring 2024 data scholarship application deadline

Applications are now being accepted through January 15, 2024 for the Spring 2024 LDC data scholarship program which provides university students with no-cost access to LDC data. Consult the LDC Data Scholarships page for more information about program rules and submission requirements.

New publications:

REMIX Telephone Collection was developed by LDC and contains 320 hours of English conversational telephone speech from 358 speakers who had completed all tasks in one of the previous LDC Mixer collections, specifically, Mixers 4-7. The data was collected in 2012; recordings in this corpus were used to support the NIST 2012 Speaker Recognition Evaluation. Speakers completed up to 12 calls lasting up to 10 minutes conversing on suggested topics. They were asked that half of the calls be made in a "noisy" environment, e.g., from a speakerphone, a busy street, noisy store or office, or a room with loud background noise. Speaker metadata is included.

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

News Sub-domain Named Entity Recognition was developed at the University of Pennsylvania and contains over 20,000 English news sentences annotated with named entities and categorized into sub-domains. The sentences were extracted from The New York Times Annotated Corpus (LDC2008T19). Named entity annotation was based on the CoNLL-2003 guidelines and annotation scheme. Sentences were labeled with person (PER), location (LOC) and organization (ORG) tags using phrase matching with a manual second pass. Sub-domains are: Arts (+Weekend/Cultural), Business (+Financial), Classifieds (+Obituary), Editorial, Foreign, Metropolitan, Sports and Others. "Others" includes topics such as Real Estate, New Jersey Weekly, Book Review, Job Market, Science, and Health & Fitness.

2023 members can access this corpus through their LDC accounts provided they have submitted a signed copy of the special license agreement. Non-members may license this data for a fee.

Monday, October 16, 2023

LDC October 2023 Newsletter

Membership Year 2024 publication preview

Fall 2023 data scholarship recipients

New publications:

AIDA Scenario 1 Practice Topic Source Data

AIDA Scenario 1 and 2 Reference Knowledge Base

_______________________________________________________________

Membership Year 2024 publication preview

The 2024 membership year is approaching and plans for next year’s publications are in progress. Among the expected releases are:

KASET: 147 hours of Sorani Kurdish and Kurmanji Kurdish conversational telephone speech and web broadcasts, 65 hours transcribed

AIDA topic source data and annotations: multimodal source data and annotations in multiple languages (Russian, Ukrainian, English, Spanish) for information and entity extraction

RATS Low Speech Density Data: 87 hours of Levantine Arabic, English, Persian, Pushto, and Urdu audio files selected from RATS speech activity detection and keyword spotting data sets, also including communications systems sounds and silence

Call My Net 1: 364 hours of conversational telephone speech recordings in Tagalog, Cebuano, Cantonese and Mandarin from speakers in the Philippines and China using various handsets under diverse noise conditions

Ravnursson Faroese Speech and Transcripts: 109 hours of read speech from 433 native speakers with transcripts

Diaspora Tibetan Speech: elicited, read and spontaneous speech from 73 native Tibetan speakers in Katmandu’s diaspora Tibetan community, some recordings transcribed

IARPA MATERIAL language packs: conversational telephone speech, transcripts, English translations, annotations and queries in multiple languages (e.g., Bulgarian, Somali, Georgian)

LORELEI: representative and incident language packs containing monolingual text, bi-text, translations, annotations, supplemental resources and related tools in various languages (e.g., Farsi, Hungarian, Hindi, Amharic)

Check your inbox in the coming weeks for more information about membership renewal. 

Fall 2023 data scholarship recipients

Congratulations to the recipients of LDC's Fall 2023 data scholarships:

Nessma Diab: Ain-Shams University (Egypt): Pre-PhD student, Linguistics. Nessma is awarded copies of CALLHOME Egyptian Arabic Speech LDC97S45 and CALLHOME Egyptian Arabic Transcripts LDC97T10 for her work in machine translation.

Soheir Elssakkout: Ain-Shams University (Egypt): PhD candidate. Soheir is awarded copies of Turkish Broadcast News and Transcripts LDC2012S06 and Middle East Technical University Turkish Microphone Speech v 1.0 LDC2006S33 for her work in speech recognition.

Metheus Franco: Witten/Herdecke University (Germany): Post-doctoral scholar, Faculty of Management, Economics and Society. Metheus is awarded a copy of Avocado Research Email Collection LDC2015T03 for his work in emotional foundations of dynamic capabilities.

Kamal Jarrar: Birzeit University (Palestine): Master’s student, Applied Statistics and Data Science Program. Kamal is awarded copies of Arabic Gigaword Fifth Edition LDC2011T11 and BOLT Arabic Discussion Forums LDC2018T10 for his work in part-of-speech tagging for dialectal Arabic.

Minkyoung Kim: Yonsei University (Korea): PhD candidate, Graduate School of Information. Minkyoung is awarded a copy of The New York Times Annotated Corpus LDC2008T19 for her work in event extraction and semantic event annotation.

Humaira Mehmood: Fatima Jinnah Women University (Pakistan): Master’s student, Computer Sciences. Humaira is awarded a copy of ARL Urdu Speech Database, Training Data LDC2007S03 for her work in machine translation.

Diyam Mousa: Birzeit University (Palestine): PhD candidate, Computer Science Department. Diyam is awarded copies of Arabic Treebank: Part 3 v. 3.2 LDC2010T08 and BOLT Egyptian Arabic Treebank – Discussion Forum LDC2018T23 for her work in morphological tagging for dialectal Arabic.

For information about the program, visit the Data Scholarships page.

New publications:

AIDA Scenario 1 Practice Topic Source Data was developed by LDC and comprises 1,511 files (text, image, and video) from English, Russian, and Ukrainian web sources. Each phase of the AIDA program centered on a specific scenario, or broad topic area, with related subtopics designated as either practice subtopics or evaluation subtopics. The Phase 1 scenario focused on political relations between Russia and Ukraine in the 2010s. This corpus constitutes the full set of topic-focused documents for Phase 1 practice subtopics. Data was collected from web sources by a combination of automatic and manual processes.

The DARPA AIDA (Active Interpretation of Disparate Alternatives) program aimed to develop a multi-hypothesis semantic engine to generate explicit alternative interpretations of events, situations and trends from a variety of unstructured sources. LDC supported AIDA by collecting, creating and annotating multimodal linguistic resources in multiple languages.

The knowledge base for entity detection and linking annotation for all AIDA Scenario 1 and 2 corpora is available separately as AIDA Scenario 1 and 2 Reference Knowledge Base (LDC2023T10).

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

AIDA Scenario 1 and 2 Reference Knowledge Base contains the English knowledge base (KB) used for all AIDA entity linking annotation in Scenario 1 (Russia-Ukraine Relations) and Scenario 2 (Crisis in Venezuela). The KB content was drawn from GeoNames, the CIA World Leaders List and the CIA World Factbook and was supplemented with manually-created KB entries developed by LDC specifically for AIDA data.

This knowledge base supported the AIDA entity detection and linking task for 13 entity types: GPE (Geo-Political Entity), LOC (Location), PER (Person), ORG (Organization), FAC (Facility), MHI (Medical/Health Issue), WEA (Weapon), SID (Side), COM (Commodity), CRM (Crime), LAW (Law), VEH (Vehicle), and BAL (Ballot).

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Friday, September 15, 2023

LDC September 2023 Newsletter

LDC data and commercial technology development

New publications:

CALLFRIEND Russian Speech

CALLFRIEND Russian Text
________________________________________________________________

LDC data and commercial technology development

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

CALLFRIEND Russian Speech was developed by LDC and consists of 48 hours of telephone conversations (100 recordings) between native speakers of Russian. The calls were recorded in 1999 as part of the CALLFRIEND collection, a project designed primarily to support research in automatic language identification. One hundred native Russian speakers living in the continental United States each made a single phone call, lasting up to 30 minutes, to a family member or friend living in the United States.

All recordings involved domestic calls routed through LDC’s automated telephone collection platform and stored as 2-channel (4-wire) 8-kHz mu-law samples taken directly from a public telephone network via a T-1 circuit. Each audio file is a FLAC-compressed MS-WAV (RIFF) format audio file containing 2-channel, 8-kHz, 16-bit PCM sample data.
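
For readers curious how 8-bit mu-law telephone samples relate to 16-bit PCM values, the following is a minimal sketch of the standard G.711 mu-law expansion. It is an illustration only and not LDC's conversion tooling; the function names are hypothetical, and the released files already contain PCM data, so no such step is needed to use the corpus.

```python
def mulaw_to_pcm16(code: int) -> int:
    """Expand one 8-bit G.711 mu-law code to a signed 16-bit PCM value."""
    inverted = ~code & 0xFF                        # mu-law bytes are stored bit-inverted
    magnitude = ((inverted & 0x0F) << 3) + 0x84    # 4-bit mantissa plus bias (0x84)
    magnitude <<= (inverted & 0x70) >> 4           # scale by the 3-bit segment (exponent)
    if inverted & 0x80:                            # top bit carries the sign
        return 0x84 - magnitude
    return magnitude - 0x84


def decode_mulaw(data: bytes) -> list:
    """Expand a sequence of mu-law bytes into a list of PCM values."""
    return [mulaw_to_pcm16(b) for b in data]
```

Each mu-law byte expands to one value in the range -32124 to 32124, which fits in the 16-bit PCM samples described above.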

This release includes call metadata, including speaker gender, the number of speakers on each channel and call duration.

Corresponding transcripts and a lexicon are available in CALLFRIEND Russian Text (LDC2023T09).

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

CALLFRIEND Russian Text contains the corresponding transcripts and a lexicon for CALLFRIEND Russian Speech, that is, 48 hours of telephone conversations (100 recordings) between native Russian speakers.

Each line of a transcript has four main fields (begin_offset, end_offset, speaker_label, transcript_text) separated by tabs. Each transcript file contains a list of time-stamped segments ordered by their begin_offset values, with no blank lines.
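
The four-field layout described above could be read with a few lines of Python. This is a minimal sketch, not an official LDC tool: the `Segment` type and function names are hypothetical, and it assumes the offsets are numeric values (the unit is not stated here) and that no header lines are present.

```python
from typing import Iterable, List, NamedTuple


class Segment(NamedTuple):
    begin_offset: float   # assumed numeric
    end_offset: float     # assumed numeric
    speaker_label: str
    transcript_text: str


def parse_transcript(lines: Iterable[str]) -> List[Segment]:
    """Parse tab-separated transcript lines into Segment records."""
    segments = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # the format has no blank lines; skip defensively
        # Split on the first three tabs only, so the text field stays whole.
        begin, end, speaker, text = line.split("\t", 3)
        segments.append(Segment(float(begin), float(end), speaker, text))
    return segments


def is_ordered(segments: List[Segment]) -> bool:
    """Check that segments appear in non-decreasing begin_offset order."""
    return all(a.begin_offset <= b.begin_offset
               for a, b in zip(segments, segments[1:]))
```

A call such as `parse_transcript(open(path, encoding="utf-8"))` would return the segments of one file, where `path` is whatever transcript file the reader has licensed; the UTF-8 encoding is likewise an assumption to check against the corpus documentation.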

The lexicon covers the word forms in the 97 transcript files. The main lexicon table contains three columns per row: Cyrillic orthography, phonetic transliteration and numeric representation of syllabic stress.

Corresponding speech data is available as CALLFRIEND Russian Speech (LDC2023S08).

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Tuesday, August 15, 2023

LDC August 2023 Newsletter

LDC at Interspeech 2023

LDC releases speech activity detector

Fall 2023 LDC Data Scholarship Program

New publications:

2019 OpenSAT Public Safety Communications Simulation

Samrómur Queries Icelandic Speech 1.0

__________________________________________________________________________

LDC at Interspeech 2023

LDC is happy to be back in person as an exhibitor and longtime supporter of Interspeech, taking place this year August 20-24 in Dublin, Ireland. Stop by Stand A2 to say hello and learn about the latest developments at the Consortium. LDC is also delighted to once again be a silver sponsor for the Young Female Researchers in Speech Workshop and to provide data in support of the CHiME-7 challenge satellite workshop and the MERLIon CCS Challenge. LDC will post conference updates via our social media platforms. We look forward to seeing you in Dublin!

LDC releases speech activity detector

LDC announces the release of the LDC Broad Phonetic Class Speech Activity Detector. Based on the broad phonetic class recognizer implemented in the HTK Speech Recognition Toolkit, LDC’s speech activity detector model runs the speech signal through a GMM-HMM recognizer to identify five broad phonetic classes: vowel, stop/affricate, fricative, nasal, and glide/liquid. The LDC Broad Phonetic Class Speech Activity Detector is available at no cost on GitHub under a GPL v3 license.

Fall 2023 LDC Data Scholarship Program

Student applications for the Fall 2023 LDC Data Scholarship program are being accepted now through September 15, 2023. This program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and letter of support from their advisor. For application requirements and program rules, visit the LDC Data Scholarships page.

New publications:

2019 OpenSAT Public Safety Communications Simulation contains 141 hours of English speech recordings and transcripts used in the NIST Open Speech Analytic Technologies (OpenSAT) 2019 evaluation's automatic speech recognition, speech activity detection, and keyword search tasks. The data is part of the SAFE-T (Speech Analysis For Emergency Response Technology) corpus created by LDC, in which speakers engage in a collaborative problem-solving activity representative of public safety communications in terms of speech content, noise types, and noise levels.

US English speakers played the board game Flash Point Fire Rescue. Background noise was played through a participant's headset during the recording session. Recording sessions consisted of two 30-minute games. The corpus is divided into training, development, and evaluation data.

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Samrómur Queries Icelandic Speech 1.0 was developed by the Language and Voice Lab, Reykjavik University in cooperation with Almannarómur, Center for Language Technology. The corpus contains 20 hours of Icelandic prompted queries from 3,809 speakers representing 17,475 utterances.

Speech data was collected between October 2019 and December 2021 using the Samrómur website which displayed prompts to participants. The prompts were mainly from The Icelandic Gigaword Corpus, which includes text from novels, news, plays, and from a list of location names in Iceland. Additional prompts were taken from the Icelandic Web of Science and others were created by combining a name followed by a question or a demand. Prompts and speaker metadata are included in the corpus.

2023 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.