Linguistic Data Consortium: Egyptian Arabic SMS/Chat


Tuesday, November 15, 2022

LDC November 2022 Newsletter

Join LDC for membership year 2023

Fall 2022 data scholarship recipients

Spring 2023 data scholarship application deadline

30th Anniversary Highlight: CALLFRIEND

New publications:

BOLT English Translation Treebank – Egyptian Arabic SMS/Chat

Samrómur Children Icelandic Speech 1.0

Third DIHARD Challenge Development

_____________________________________________________________

Join LDC for membership year 2023

It’s time to renew your LDC membership for 2023. Current (2022) members who renew their membership before March 1, 2023 will receive a 10% discount. New or returning organizations will receive a 5% discount if they join the Consortium by March 1.

In addition to receiving new publications, current LDC members enjoy the benefit of licensing older data from our Catalog of 900+ holdings at reduced fees. Current-year for-profit members may use most data for commercial applications.

Plans for 2023 publications are in progress. Among the expected releases are:

AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts: 156 hours of Ukrainian conversational telephone speech and broadcast news with 1.2 million words of corresponding orthographic transcripts
2019 NIST SRE: audiovisual and leaderboard challenge sets based on amateur videos and Tunisian Arabic telephone speech, respectively
DEFT English ERE: English text from assorted genres annotated for entities, relations and events
Mixer 3 and Mixer 7 speech collections: thousands of hours of telephone speech and metadata from Mixer 3 (multiple languages) and Mixer 7 (Spanish, plus interviews and transcript readings)
CALLFRIEND Russian: 100 telephone conversations among native speakers, transcripts and a lexicon, released in separate speech and text data sets
REMIX Telephone Collection: English telephone speech from 385 participants in previous Mixer studies
LORELEI: representative and incident language packs containing monolingual text, bi-text, translations, annotations, supplemental resources and related tools in various languages (e.g., Indonesian, Swahili, Tagalog, Tamil, Zulu)

For full descriptions of all LDC data sets, browse our Catalog. Visit Join LDC for details on membership, user accounts and payment.

Fall 2022 LDC data scholarship recipients

LDC congratulates the following Fall 2022 data scholarship recipients:

Nelson Filipe Costa: Concordia University (Canada); PhD, Machine Learning. Nelson is awarded a copy of Penn Discourse Treebank Version 3.0 (LDC2019T05) for his work in discourse relationships and mapping.
Paul Pope: University of Eastern Finland (Finland); MA, Linguistic Data Sciences. Paul is awarded a copy of ETS Corpus of Non-Native Written English (LDC2014T06) for his research on text classification.
Abhinav Singh: Sharda University (India); PhD, Forensic Science. Abhinav is awarded a copy of TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1) for his research on forensic speech recognition.
Lucas Zheng: Deerfield Academy (USA); High School Scholar. Lucas is awarded copies of Arabic Treebank Part 1 v. 4.1 (LDC2010T13) and Arabic Treebank Part 2 v. 3.1 (LDC2011T09) for his work on analyzing syntactic and lexical similarities across MSA genres and POS-tagging for MSA.
Students can learn more about the LDC data scholarship program on the Data Scholarships page.

Spring 2023 data scholarship application deadline

Applications for the Spring 2023 LDC data scholarship program, which provides university students with no-cost access to LDC data, are being accepted through January 15, 2023. Consult the LDC Data Scholarships page for more information about program rules and submission requirements.

30th Anniversary Highlight: CALLFRIEND

The CALLFRIEND series is a multi-language collection of unscripted telephone conversations conducted by LDC in the 1990s to support language identification technology development (Liberman & Cieri, 1998). Covered languages are American English, Canadian French, Egyptian Arabic, Farsi, German, Hindi, Japanese, Korean, Mandarin Chinese, Spanish, Tamil and Vietnamese. For English, Mandarin and Spanish, the collection includes two distinct dialects. Participants could speak with a person of their choice on any topic; most called family members and friends. All calls originated in North America.

This speech data was the foundation for NIST’s Language Recognition Evaluations conducted from 1996 to 2007. The first editions of the CALLFRIEND series, published in LDC’s Catalog in 1996, each contain 60 calls split evenly into three partitions of 20 calls: a training partition to develop language models, a development partition for parameter tuning, and an evaluation partition to test performance (Torres-Carrasquillo, et al., 2004).
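As a rough illustration of the partitioning described above, the following Python sketch divides 60 call identifiers evenly into training, development and evaluation sets. The identifiers, the partition names and the `split_calls` helper are hypothetical. The published partitions were fixed by LDC, and this code does not reproduce them.

```python
from typing import Dict, List, Sequence


def split_calls(call_ids: List[str],
                names: Sequence[str] = ("train", "dev", "eval")) -> Dict[str, List[str]]:
    """Split call IDs evenly, in order, into one partition per name."""
    if len(call_ids) % len(names) != 0:
        raise ValueError("calls cannot be split evenly across partitions")
    size = len(call_ids) // len(names)
    return {name: call_ids[i * size:(i + 1) * size]
            for i, name in enumerate(names)}


# Hypothetical identifiers standing in for the 60 calls of one language.
calls = [f"call_{n:02d}" for n in range(1, 61)]
partitions = split_calls(calls)
print({name: len(ids) for name, ids in partitions.items()})
```

Each partition holds 20 calls and no call appears in more than one partition, which is the property an evaluation split needs.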

Beginning in 2014, LDC released second editions for American English (LDC2019S21, LDC2020S08), Canadian French (LDC2019S18), Egyptian Arabic (LDC2019S04), Farsi (LDC2014S01), and Mandarin Chinese (LDC2018S09, LDC2020S06). The goal of the second editions is to facilitate continued widespread use of the data, specifically, by updating the audio files to .wav format, simplifying the directory structure, adding documentation and metadata, and combining the training, development and evaluation splits. CALLFRIEND Farsi Second Edition also includes additional telephone recordings and a separate transcripts release (LDC2014T01).

In addition to work on language identification, CALLFRIEND corpora have been used in a variety of research tasks, including subject omission in Korean (Lee, 2012), contemporary Persian vowels in casual speech (Jones, 2019), Mandarin telephone closings among familiars (Huang, 2020), and adjective constructions in English conversation (Bybee & Thompson, 2021).

To learn more about the CALLFRIEND collection or about other LDC corpora used for language identification research, search the Catalog by “recommended application” and select “language identification” from the list.

New publications:

BOLT English Translation Treebank – Egyptian Arabic SMS/Chat was developed by LDC and consists of SMS and chat text data (472 files representing 98,206 tokens) translated from Egyptian Arabic to English and annotated for part-of-speech and syntactic structure. Only the translated English text is included in the source data for this release. Part-of-speech and treebank annotation conformed to Penn Treebank II style, incorporating changes to those guidelines that were developed under the GALE (Global Autonomous Language Exploitation) program. Supplementary guidelines for English treebanks and web text are included in the corpus documentation.

2022 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Samrómur Children Icelandic Speech 1.0 was developed by the Language and Voice Lab, Reykjavik University in cooperation with Almannarómur, Center for Language Technology. The corpus contains 131 hours of Icelandic prompted speech from 3,175 speakers (children, aged 4-17 years) representing 137,597 utterances.

Speech data was collected between October 2019 and September 2021 using the Samrómur website which displayed prompts to participants. The prompts were mainly from The Icelandic Gigaword Corpus, which includes text from novels, news, plays, and from a list of location names in Iceland. Additional prompts were taken from the Icelandic Web of Science and others were created by combining a name followed by a question or a demand. Prompts and speaker metadata are included in the corpus.

2022 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.

Third DIHARD Challenge Development was developed by LDC and contains approximately 34 hours of English and Chinese speech data along with corresponding annotations used in support of the Third DIHARD Challenge.

The Third DIHARD development and evaluation sets were drawn from diverse sources including monologues, map task dialogues, broadcast interviews, sociolinguistic interviews, meeting speech, speech in restaurants, clinical recordings, and amateur web videos. Annotations include diarization and segmentation.

2022 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Friday, December 6, 2019

LDC 2019 December Newsletter

LDC Membership Discounts for MY2020 Still Available
Spring 2020 Data Scholarship Program – deadline approaching
Introducing LanguageARC: A Citizen Linguist Portal

New Publications:
MagicData Chinese Mandarin Conversational Speech
BOLT Egyptian Arabic-English Word Alignment -- SMS/Chat Training
TAC KBP Entity Discovery and Linking - Comprehensive Evaluation Data 2016-2017
__________________________________________________________

LDC Membership Discounts for MY2020 Still Available

Join LDC while membership savings are still available. Now through March 2, 2020, current MY2019 members who renew their LDC membership receive a 10% discount off the membership fee. New or returning member organizations receive a 5% discount through March 2. Membership remains the most economical way to access LDC releases. Visit Join LDC for details on membership options and benefits.

Spring 2020 Data Scholarship Program – deadline approaching

Students can apply for the Spring 2020 Data Scholarship Program now through January 15, 2020. The LDC Data Scholarship program provides students with no-cost access to LDC data. For more information on application requirements and program rules, please visit LDC Data Scholarships.

Introducing LanguageARC: A Citizen Linguist Portal

LanguageARC is a citizen science website for languages developed with a grant from the National Science Foundation (no. 170377). Contributors to this online community – “citizen linguists” – participate in a variety of tasks and activities that support linguistic research, such as identifying accents from audio clips, recording “tongue twisters,” and translating English sentences into other languages. Data collected from LanguageARC will be made freely available to the research community. New collection and annotation projects will be added on an ongoing basis, and researchers will soon be able to create their own LanguageARC projects with an easy-to-use Project Builder Toolkit. All are encouraged to explore the site and participate in the community. Comments, questions and suggestions are welcome via the site’s Contact page.
___________________________________________________________

New publications:

(1) Magic Data Chinese Mandarin Conversational Speech was developed by Beijing Magic Data Technology Co., Ltd. and consists of approximately 10 hours of Mandarin conversational speech from 60 speakers. Each conversation was recorded on multiple devices and is presented in multiple forms, resulting in a total of approximately 60 hours of audio with corresponding transcripts.

All participants were native speakers of Mandarin in Mainland China from accent regions across the country. Speakers were paired for conversations on a range of topics, including travel, fitness, games, sports and pets. Metadata such as topic, collection date, mobile device and speaker demographic information is available in the documentation accompanying this release.

Magic Data Chinese Mandarin Conversational Speech is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) BOLT Egyptian Arabic-English Word Alignment -- SMS/Chat Training was developed by LDC and consists of 349,414 words of Egyptian Arabic and English parallel text enhanced with linguistic tags to indicate word relations.

This release contains Egyptian Arabic source text message and chat conversations collected using two methods: new collection via LDC's collection platform, and donation of SMS or chat archives from BOLT collection participants. The source data is released as BOLT Egyptian Arabic SMS/Chat and Transliteration (LDC2017T07).

The BOLT word alignment task was built on treebank annotation. Egyptian Arabic source tree tokens were automatically extracted from tree files in LDC’s BOLT Egyptian Arabic Treebank, which had been tagged for part-of-speech and syntactically annotated. That data was then aligned and annotated for the word alignment task.

BOLT Egyptian Arabic-English Word Alignment -- SMS/Chat Training is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) TAC KBP Entity Discovery and Linking - Comprehensive Evaluation Data 2016-2017 was developed by LDC and contains training and evaluation data produced in support of the TAC KBP Entity Discovery and Linking (EDL) tasks in 2016 and 2017. This corpus includes queries, knowledge base (KB) links, equivalence class clusters for NIL entities, and entity type information for each of the queries. The EDL reference KB, to which EDL data are linked, is available separately in TAC KBP Entity Discovery and Linking - Comprehensive Training and Evaluation Data 2014-2015 (LDC2019T02).

The goal of the EDL track is to conduct end-to-end entity extraction, linking and clustering. For producing gold standard data, given a document collection, annotators (1) extract (identify and classify) entity mentions (queries), link them to nodes in a reference KB and (2) perform cross-document co-reference on within-document entity clusters that cannot be linked to the KB.
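The linking and clustering steps described above can be sketched in Python as follows. This is a toy illustration, not LDC's annotation procedure or any official EDL scorer: the `Mention` class, the `assign_clusters` helper, the `kb:001` identifier and the string-matching rule for unlinked mentions are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Mention:
    doc_id: str
    text: str
    entity_type: str        # e.g. "PER", "ORG", "GPE"
    kb_id: Optional[str]    # reference KB node, or None if unlinkable


def assign_clusters(mentions: List[Mention]) -> Dict[str, List[Mention]]:
    """Group mentions by KB node; unlinked mentions get NIL cluster IDs.

    Toy assumption: unlinked mentions with the same lowercased text and
    entity type refer to the same entity across documents.
    """
    clusters: Dict[str, List[Mention]] = defaultdict(list)
    nil_ids: Dict[Tuple[str, str], str] = {}
    for m in mentions:
        if m.kb_id is not None:
            key = m.kb_id
        else:
            nil_key = (m.text.lower(), m.entity_type)
            if nil_key not in nil_ids:
                # Number NIL clusters in order of first appearance.
                nil_ids[nil_key] = f"NIL{len(nil_ids) + 1:04d}"
            key = nil_ids[nil_key]
        clusters[key].append(m)
    return dict(clusters)


# Hypothetical mentions: two linked to one KB node, two unlinkable.
mentions = [
    Mention("doc1", "Acme Corp", "ORG", "kb:001"),
    Mention("doc2", "Acme Corp", "ORG", "kb:001"),
    Mention("doc1", "J. Doe", "PER", None),
    Mention("doc3", "j. doe", "PER", None),
]
clusters = assign_clusters(mentions)
print({key: len(group) for key, group in clusters.items()})
```

In real annotation the cross-document co-reference decision is made by human annotators. The string match here only stands in for that judgment so the data flow is visible.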

Source data for the annotations consists of Chinese, English and Spanish newswire and discussion forum text collected by LDC and is available in TAC KBP Evaluation Source Corpora 2016-2017 (LDC2019T12).

TAC KBP Entity Discovery and Linking - Comprehensive Evaluation Data 2016-2017 is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Monday, April 17, 2017

LDC April 2017 Newsletter

LDC celebrates 25 years

LDC data and commercial technology development

New publications:

2010 NIST Speaker Recognition Evaluation Test Set

BOLT Egyptian Arabic SMS/Chat and Transliteration

CHiME2 Grid

_________________________________________________________________________

LDC celebrates 25 years

April 2017 marks the beginning of LDC’s 25th year as the leader in language resource development and distribution. Founded in 1992, the Consortium has grown from a data repository to a vibrant data center that creates, shares and archives language resources. The Catalog continues to grow, boasting over 700 titles in more than 90 languages. With the support of members, licensees, sponsors and collaborators, LDC has distributed over 120,000 copies of data to more than 3,500 organizations worldwide. Our heartfelt thanks for your support as we continue our mission to provide large quantities of diverse data, research program support and high quality member services.

LDC data and commercial technology development

Any organization wishing to use LDC data to develop or test products for commercialization, or to use LDC data in any commercial product or for any commercial purpose, must first license the data as a For-Profit Member. Once the data is licensed under the For-Profit Membership, the organization retains perpetual rights to use the data for commercial technology development. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit our Licensing page for more information.

New Corpora

(1) 2010 NIST Speaker Recognition Evaluation Test Set was developed by LDC and NIST (National Institute of Standards and Technology). It contains 2,255 hours of American English telephone speech and interview speech recorded over a microphone channel used as test data in the NIST-sponsored 2010 Speaker Recognition Evaluation (SRE).

The telephone speech segments include two-channel excerpts of approximately 10 seconds and 5 minutes. There are also summed-channel excerpts in the range of 5 minutes. The microphone excerpts are 3-15 minutes in duration. As in prior evaluations, intervals of silence were not removed.

The 2010 evaluation includes not only conversational telephone speech (CTS) recorded over ordinary telephone channels for the core training and test conditions, but also CTS and conversational interview speech recorded over a room microphone channel. Unlike prior evaluations, some of the conversational telephone style speech was collected in a manner to produce particularly high, or particularly low, vocal effort on the part of the speaker of interest. In addition to evaluation data, this package also includes answer keys, trial and train files, development data and evaluation documentation.

2010 NIST Speaker Recognition Evaluation Test Set is distributed via hard drive.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) BOLT Egyptian Arabic SMS/Chat and Transliteration was developed by LDC and consists of naturally-occurring Short Message Service (SMS) and Chat (CHT) data collected through data donations and live collection involving native speakers of Egyptian Arabic. The corpus contains 5,691 conversations totaling 1,029,248 words across 262,026 messages. Messages were natively written in either Arabic orthography or romanized Arabizi. A total of 1,856 Arabizi conversations (287,022 words) were transliterated from the original romanized Arabizi script into standard Arabic orthography and then reviewed, corrected and normalized by LDC annotators according to "Conventional Orthography for Dialectal Arabic" (CODA).

The BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.

BOLT Egyptian Arabic SMS/Chat and Transliteration is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) CHiME2 Grid was developed as part of The 2nd CHiME Speech Separation and Recognition Challenge and contains approximately 120 hours of English speech from a noisy living room environment. The CHiME Challenges focus on distant-microphone automatic speech recognition (ASR) in real-world environments.

CHiME2 Grid reflects the small vocabulary track of the CHiME2 Challenge. The target utterances were taken from the Grid corpus and consist of 34 speakers reading simple 6-word sequences. The data is divided into training, development and test sets.

CHiME2 Grid is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.