Linguistic Data Consortium: BOLT Egyptian Arabic


Tuesday, November 15, 2022

LDC November 2022 Newsletter

Join LDC for membership year 2023

Fall 2022 data scholarship recipients

Spring 2023 data scholarship application deadline

30th Anniversary Highlight: CALLFRIEND

New publications:

BOLT English Translation Treebank – Egyptian Arabic SMS/Chat

Samrómur Children Icelandic Speech 1.0

Third DIHARD Challenge Development

_____________________________________________________________

Join LDC for membership year 2023

It’s time to renew your LDC membership for 2023. Current (2022) members who renew their membership before March 1, 2023 will receive a 10% discount. New or returning organizations will receive a 5% discount if they join the Consortium by March 1.

In addition to receiving new publications, current LDC members enjoy the benefit of licensing older data from our Catalog of 900+ holdings at reduced fees. Current-year for-profit members may use most data for commercial applications.

Plans for 2023 publications are in progress. Among the expected releases are:

AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts: 156 hours of Ukrainian conversational telephone speech and broadcast news with 1.2 million words of corresponding orthographic transcripts
2019 NIST SRE: audiovisual and leaderboard challenge sets based on amateur videos and Tunisian Arabic telephone speech, respectively
DEFT English ERE: English text from assorted genres annotated for entities, relations and events
Mixer 3 and Mixer 7 speech collections: thousands of hours of telephone speech and metadata from Mixer 3 (multiple languages) and Mixer 7 (Spanish, plus interviews and transcript readings)
CALLFRIEND Russian: 100 telephone conversations among native speakers, transcripts and a lexicon, released in separate speech and text data sets
REMIX Telephone Collection: English telephone speech from 385 participants in previous Mixer studies
LORELEI: representative and incident language packs containing monolingual text, bi-text, translations, annotations, supplemental resources and related tools in various languages (e.g., Indonesian, Swahili, Tagalog, Tamil, Zulu)

For full descriptions of all LDC data sets, browse our Catalog. Visit Join LDC for details on membership, user accounts and payment.

Fall 2022 LDC data scholarship recipients

LDC congratulates the following Fall 2022 data scholarship recipients:

Nelson Filipe Costa: Concordia University (Canada); PhD, Machine Learning. Nelson is awarded a copy of Penn Discourse Treebank Version 3.0 (LDC2019T05) for his work in discourse relationships and mapping.
Paul Pope: University of Eastern Finland (Finland); MA, Linguistic Data Sciences. Paul is awarded a copy of ETS Corpus of Non-Native Written English (LDC2014T06) for his research on text classification.
Abhinav Singh: Sharda University (India); PhD, Forensic Science. Abhinav is awarded a copy of TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1) for his research on forensic speech recognition.
Lucas Zheng: Deerfield Academy (USA); High School Scholar. Lucas is awarded copies of Arabic Treebank Part 1 v. 4.1 (LDC2010T13) and Arabic Treebank Part 2 v. 3.1 (LDC2011T09) for his work on analyzing syntactic and lexical similarities across MSA genres and POS-tagging for MSA.

Students can learn more about the LDC data scholarship program on the Data Scholarships page.

Spring 2023 data scholarship application deadline

Applications are now being accepted through January 15, 2023 for the Spring 2023 LDC data scholarship program, which provides university students with no-cost access to LDC data. Consult the LDC Data Scholarships page for more information about program rules and submission requirements.

30th Anniversary Highlight: CALLFRIEND

The CALLFRIEND series is a multi-language collection of unscripted telephone conversations conducted by LDC in the 1990s to support language identification technology development (Liberman & Cieri, 1998). Covered languages are American English, Canadian French, Egyptian Arabic, Farsi, German, Hindi, Japanese, Korean, Mandarin Chinese, Spanish, Tamil and Vietnamese. For English, Mandarin and Spanish, the collection includes two distinct dialects. Participants could speak with a person of their choice on any topic; most called family members and friends. All calls originated in North America.

This speech data was the foundation for NIST’s Language Recognition Evaluations conducted from 1996 to 2007. The first editions of the CALLFRIEND series, published in LDC’s Catalog in 1996, contain 60 calls evenly split into three partitions of 20 calls each: a training partition to develop language models, a development partition for parameter tuning, and an evaluation partition to test performance (Torres-Carrasquillo et al., 2004).
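The even three-way split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the call identifiers (`call_00` to `call_59`), the function name `split_calls`, and the fixed shuffle seed are all hypothetical, and the sketch does not reproduce how LDC actually assigned calls to partitions.

```python
import random


def split_calls(call_ids, seed=0):
    """Shuffle call IDs and split them evenly into train/dev/eval partitions."""
    if len(call_ids) % 3 != 0:
        raise ValueError("number of calls must be divisible by 3")
    shuffled = list(call_ids)
    # A seeded generator keeps the (hypothetical) assignment reproducible.
    random.Random(seed).shuffle(shuffled)
    size = len(shuffled) // 3
    return {
        "train": shuffled[:size],
        "dev": shuffled[size:2 * size],
        "eval": shuffled[2 * size:],
    }


# Hypothetical identifiers standing in for the 60 calls of one language.
calls = [f"call_{i:02d}" for i in range(60)]
partitions = split_calls(calls)
print({name: len(ids) for name, ids in partitions.items()})
```

Each partition receives 20 calls and no call appears in more than one partition, which is the property a held-out evaluation set depends on.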

Beginning in 2014, LDC released second editions for American English (LDC2019S21, LDC2020S08), Canadian French (LDC2019S18), Egyptian Arabic (LDC2019S04), Farsi (LDC2014S01), and Mandarin Chinese (LDC2018S09, LDC2020S06). The goal of the second editions is to facilitate continued widespread use of the data, specifically, by updating the audio files to .wav format, simplifying the directory structure, adding documentation and metadata, and combining the training, development and evaluation splits. CALLFRIEND Farsi Second Edition also includes additional telephone recordings and a separate transcripts release (LDC2014T01).

In addition to work on language identification, CALLFRIEND corpora have been used in a variety of research tasks, including subject omission in Korean (Lee, 2012), contemporary Persian vowels in casual speech (Jones, 2019), Mandarin telephone closings among familiars (Huang, 2020), and adjective constructions in English conversation (Bybee & Thompson, 2021).

To learn more about the CALLFRIEND collection or about other LDC corpora used for language identification research, search the Catalog by “recommended application” and select “language identification” from the list.

New publications:

BOLT English Translation Treebank – Egyptian Arabic SMS/Chat was developed by LDC and consists of SMS and chat text data (472 files representing 98,206 tokens) translated from Egyptian Arabic to English and annotated for part-of-speech and syntactic structure. Only the translated English text is included in the source data for this release. Part-of-speech and treebank annotation conformed to Penn Treebank II style, incorporating changes to those guidelines that were developed under the GALE (Global Autonomous Language Exploitation) program. Supplementary guidelines for English treebanks and web text are included in the corpus documentation.

2022 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Samrómur Children Icelandic Speech 1.0 was developed by the Language and Voice Lab, Reykjavik University in cooperation with Almannarómur, Center for Language Technology. The corpus contains 131 hours of Icelandic prompted speech from 3,175 speakers (children, aged 4-17 years) representing 137,597 utterances.

Speech data was collected between October 2019 and September 2021 using the Samrómur website which displayed prompts to participants. The prompts were mainly from The Icelandic Gigaword Corpus, which includes text from novels, news, plays, and from a list of location names in Iceland. Additional prompts were taken from the Icelandic Web of Science and others were created by combining a name followed by a question or a demand. Prompts and speaker metadata are included in the corpus.

2022 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.

Third DIHARD Challenge Development was developed by LDC and contains approximately 34 hours of English and Chinese speech data along with corresponding annotations used in support of the Third DIHARD Challenge.

The third DIHARD development and evaluation sets were drawn from diverse sources including monologues, map task dialogues, broadcast interviews, sociolinguistic interviews, meeting speech, speech in restaurants, clinical recordings, and amateur web videos. Annotations include diarization and segmentation.

2022 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Monday, November 15, 2021

LDC November 2021 Newsletter

Join LDC for Membership Year 2022

Spring 2022 Data Scholarship Application Deadline

New Publications:

BOLT Egyptian Arabic PropBank and Sense – Discussion Forum, SMS/Chat, and Conversational Telephone Speech

Second DIHARD Challenge Development – Eleven Sources

Second DIHARD Challenge Development – SEEDLingS

________________________________________________________________

Join LDC for Membership Year 2022

Membership Year 2022 (MY2022) is open and discounts are available for those who keep their membership current and join early. Current MY2021 members who renew their LDC membership before March 1, 2022 will receive a 10% discount off the membership fee. New or returning organizations will receive a 5% discount when joining by March 1.

In addition to receiving new publications, current LDC members also enjoy the benefit of licensing older data from our Catalog of 900 holdings at reduced fees. Current-year for-profit members may use most data for commercial applications.

Plans for MY2022 publications are in progress. Among the expected releases are:

2017 NIST OpenSAT Pilot – SSSF: real world operational English speech data, transcripts and annotation files used in the speech activity detection, automatic speech recognition, and keyword search tasks of the 2017 OpenSAT Pilot evaluation

AttImam: 2000 attribution relations applied to Arabic newswire text from Arabic Treebank: Part 1 v 4.1 LDC2010T13

Samrómur Icelandic Speech: 145 hours of Icelandic prompted speech from 8000 speakers covering text from novels, news, plays, and location names

MASRI Synthetic: 99 hours of synthesized Maltese speech from various genres with transcripts

HAVIC MED Novel Tests: thousands of hours of event and background user-generated videos with annotation and metadata used for the 2015 Multimedia Event Detection task

DIHARD Challenges: development and evaluation data from the second and third DIHARD evaluations, a set of shared tasks focusing on speech diarization for challenging data

LORELEI: representative and incident language packs containing monolingual text, bi-text, translations, annotations, supplemental resources and related tools (Kinyarwanda, Wolof)

It’s not too late to join LDC for MY2020 (through December 31, 2021) and MY2021 (through December 31, 2022). Data sets from those years include 2018 NIST Speaker Recognition Evaluation Test Set, Mixer 4 and 5 Speech, AMR Annotation Release 3.0, Penn Parsed Corpora of Historical English, RATS Speaker Identification, BOLT Egyptian Arabic and Chinese resources (treebanks, propbanks, co-reference), Global TIMIT Mandarin Chinese, and MyST Children’s Conversational Speech.

For full descriptions of all LDC data sets, browse our Catalog.

Visit Join LDC for details on membership, user accounts and payment.

Spring 2022 Data Scholarship Application Deadline

Applications are now being accepted through January 15, 2022 for the Spring 2022 LDC Data Scholarship program, which provides university students with no-cost access to LDC data. Consult the LDC Data Scholarship page for more information about program rules and submission requirements.

New publications:

(1) BOLT Egyptian Arabic PropBank and Sense – Discussion Forum, SMS/Chat, and Conversational Telephone Speech was developed by the University of Colorado Boulder - CLEAR (Computational Language and Education Research) for the DARPA BOLT program and consists of propbank annotation on Egyptian Arabic informal text and telephone speech.

Propbank annotation provides a layer of semantic annotation over treebank. In this release, it was applied to BOLT phrase structure treebank annotation and was carried out in two phases: (1) a frame file for each predicate was created, and (2) the predicate argument structure was annotated using the frame file as a reference.
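The two phases described above can be pictured with a small data-structure sketch. Everything here is an assumption made for illustration: the class names `FrameFile` and `PredicateAnnotation`, the helper `unknown_roles`, and the English predicate "give" (chosen for readability; the corpus itself annotates Egyptian Arabic). The sketch does not reproduce the corpus's actual frame-file format, and it covers numbered arguments only.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FrameFile:
    """Phase 1: the roles a predicate can take (hypothetical structure)."""
    predicate: str
    roles: Dict[str, str]  # role label -> description


@dataclass
class PredicateAnnotation:
    """Phase 2: one annotated instance of a predicate in a sentence."""
    predicate: str
    arguments: Dict[str, str]  # role label -> text span


def unknown_roles(annotation: PredicateAnnotation,
                  frames: Dict[str, FrameFile]) -> List[str]:
    """Return role labels used in the annotation but absent from the frame."""
    frame = frames[annotation.predicate]
    return [role for role in annotation.arguments if role not in frame.roles]


# Illustrative frame and annotation, not taken from the corpus.
frames = {
    "give": FrameFile(
        predicate="give",
        roles={"ARG0": "giver", "ARG1": "thing given", "ARG2": "recipient"},
    )
}
annotation = PredicateAnnotation(
    predicate="give",
    arguments={"ARG0": "the teacher", "ARG1": "a book", "ARG2": "the student"},
)
print(unknown_roles(annotation, frames))
```

Using the frame as the reference for each annotated instance, as in the second phase, makes it possible to flag any role label that the frame does not define.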

BOLT Egyptian Arabic PropBank and Sense – Discussion Forum, SMS/Chat, and Conversational Telephone Speech is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) Second DIHARD Challenge Development – Eleven Sources was developed by LDC and contains approximately 22 hours of English and Chinese speech data along with corresponding annotations used in support of the Second DIHARD Challenge.

The DIHARD Challenges are a set of shared tasks on diarization focusing on "hard" diarization; that is, speech diarization for challenging corpora where there was an expectation that existing state-of-the-art systems would fare poorly. As with the first challenge, the second development and evaluation sets were drawn from a diverse sampling of sources including monologues, map task dialogues, broadcast interviews, sociolinguistic interviews, meeting speech, speech in restaurants, clinical recordings, extended child language acquisition recordings, and amateur web videos.

Second DIHARD Challenge Development – Eleven Sources is distributed via web download.

(3) Second DIHARD Challenge Development – SEEDLingS was developed by Duke University and LDC and contains approximately two hours of English child language recordings along with corresponding annotations used in support of the Second DIHARD Challenge. The DIHARD Challenges are a set of shared tasks on diarization focusing on "hard" diarization; that is, speech diarization for challenging corpora where there was an expectation that existing state-of-the-art systems would fare poorly.

Source data is from the SEEDLingS (The Study of Environmental Effects on Developing Linguistic Skills) corpus, designed to investigate how infants' early linguistic and environmental input plays a role in their learning. Recordings were generated in the home environment of infants in the Rochester, New York area. A subset of that data was annotated by LDC for use in the first and second DIHARD Challenges.

The data in this release consists of files provided in the Second DIHARD Challenge as well as subsequently updated annotated files not provided to second challenge participants.

Second DIHARD Challenge Development – SEEDLingS is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Thursday, October 14, 2021

LDC October 2021 Newsletter

Fall 2021 data scholarship recipients

Membership Year 2022 publication preview

LDC data and commercial technology development

New Publications:

UCLA Variability Speaker Database

BOLT Egyptian Arabic Treebank – SMS/Chat

_______________________________________________________

Fall 2021 data scholarship recipients

Congratulations to the recipients of LDC's Fall 2021 data scholarships:

Sophia Minnillo: University of California, Davis (USA); PhD, Linguistics. Sophia is awarded a copy of ETS Corpus of Non-Native Written English LDC2014T06 for her research on the use of transition markers by Chinese L1 speakers.

Jagabandhu Mishra: Indian Institute of Technology Dharwad (India); Research Scholar, Electrical Engineering. Jagabandhu is awarded a copy of Mandarin-English Code-Switching in South-East Asia LDC2015S04 for his work in spoken language diarization.

Kashyap Patel: University of Texas at Dallas (USA); PhD, Electrical Engineering. Kashyap is awarded copies of CSR-I (WSJ0) Sennheiser LDC93S6B and CSR-II (WSJ1) Sennheiser LDC94S13B for his research in audio, acoustic and speech signal processing.

Yoshani Ranaweera, D. Dissanayaka, S. Sudasinghe: University of Moratuwa (Sri Lanka); Bachelors, Computer Science and Engineering. This group is awarded a copy of CALLHOME American English Speech LDC97S4 for their work in speaker diarization.

Winie Wong: University of Illinois at Chicago (USA); PhD, Electrical and Computer Engineering. Winie is awarded copies of ISI Chinese-English Automatically Extracted Parallel Text LDC2007T09 and GALE Phase 3 and 4 Chinese Broadcast News Parallel Text LDC2016T15 for her research in machine translation.

For information about the program, visit the Data Scholarships page.

Membership Year 2022 publication preview

The 2022 Membership Year is approaching and plans for next year’s publications are in progress. Among the expected releases are:

2017 NIST OpenSAT Pilot – SSSF: real world operational English speech data, transcripts and annotation files used in the speech activity detection, automatic speech recognition, and keyword search tasks of the 2017 OpenSAT Pilot evaluation
AttImam: 2000 attribution relations applied to Arabic newswire text from Arabic Treebank: Part 1 v 4.1 LDC2010T13
Samrómur Icelandic Speech: 145 hours of Icelandic prompted speech from 8000 speakers covering text from novels, news, plays, and location names
MASRI Synthetic: 99 hours of synthesized Maltese speech from various genres with transcripts
HAVIC MED Novel Tests: thousands of hours of event and background user-generated videos with annotation and metadata used for the 2015 Multimedia Event Detection task
DIHARD Challenges: development and evaluation data from the second and third DIHARD evaluations, a set of shared tasks focusing on speech diarization for challenging data
LORELEI: representative and incident language packs containing monolingual text, bi-text, translations, annotations, supplemental resources and related tools (Kinyarwanda, Wolof)

Check your inbox in the coming weeks for more information about membership renewal. 

LDC data and commercial technology development

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

(1) UCLA Variability Speaker Database was developed by UCLA Speech Processing and Auditory Perception Laboratory and comprises approximately 34 hours of English speech and orthographic transcripts. Speakers (101 female, 101 male) took part in the following tasks: vowel sounds, reading sentences, giving instructions, neutral conversation, happy conversation, a phone conversation, annoyed conversation, and responding to a video. This corpus was designed to sample variability in speaking within individual speakers and across a large number of speakers.

UCLA Variability Speaker Database is distributed via web download.

(2) BOLT Egyptian Arabic Treebank – SMS/Chat was developed by LDC and consists of Egyptian Arabic SMS/Chat data with part-of-speech annotation, morphology, and syntactic tree annotation. This release contains 349,414 tokens before clitics were split and 435,677 tree tokens after clitics were split for treebank annotation. The source data was collected by LDC from its collection platform or by donation and was manually reviewed to exclude material not in the target language or with sensitive content. Originally written in Arabizi (Romanized/Latin characters) script, the source SMS/chat text was transliterated to Arabic script and manually corrected prior to treebank annotation. Annotations followed Penn Arabic Treebank guidelines.
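The effect of clitic splitting on corpus size follows directly from the two token counts quoted above. The short calculation below is a sketch using only those figures; the variable names are illustrative.

```python
# Token counts as reported for this release.
tokens_before_split = 349_414  # source tokens before clitics were split
tokens_after_split = 435_677   # tree tokens after clitics were split

# Additional tokens introduced by separating clitics from their hosts.
added_tokens = tokens_after_split - tokens_before_split

# Average number of tree tokens produced per source token.
expansion_ratio = tokens_after_split / tokens_before_split

print(f"additional tokens: {added_tokens}")
print(f"tree tokens per source token: {expansion_ratio:.3f}")
```

On these figures, splitting clitics adds roughly one tree token for every four source tokens.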

BOLT Egyptian Arabic Treebank – SMS/Chat is distributed via web download.