Linguistic Data Consortium: Egyptian Arabic


Friday, March 13, 2020

LDC 2020 March Newsletter

Spring 2020 LDC Data Scholarship recipients
LDC data and commercial technology development

New Publications:
BOLT Egyptian Arabic-English Word Alignment -- Conversational Telephone Speech Training
EVALution
Mixer 4 and 5 Speech

__________________________________________________________________

Spring 2020 LDC Data Scholarship recipients

LDC congratulates the following Spring 2020 Data Scholarship recipients:

Zahra Azin (Istanbul Technical University, Turkey) is awarded a copy of Abstract Meaning Representation (AMR) Annotation Release 3.0 (LDC2020T02) for her work in Turkish AMR.
Spandan Dey (IIT Kharagpur, India) is awarded a copy of Multi-Language Conversational Telephone Speech – South Asian (LDC2017S14) for his research on automatic language recognition.
Jonathan Downey (University of California, Santa Barbara, US) is awarded a copy of the ETS Corpus of Non-Native Written English (LDC2014T06) for his research on second language acquisition and quantitative methodologies for educational measurements.
Nathaniel Fackler (University of Georgia, US) is awarded a copy of the ETS Corpus of Non-Native Written English (LDC2014T06) for his work on adult second language acquisition.
B. Senthil Kumar (SSN College of Engineering & Anna University, India) is awarded a copy of 2009 CoNLL Shared Task Part 2 (LDC2012T04) for his research on semantic role labeling.
Ming Li (Colorado School of Mines, US) is awarded a copy of TIDIGITS (LDC93S10) for her research on inferring speech signals from motion data in Internet of Things (IoT) security.
Jialiang Lin (Xiamen University, China) is awarded a copy of the ETS Corpus of Non-Native Written English (LDC2014T06) for his project to train and test an automated essay scoring model.

Students can learn more about the LDC Data Scholarship program and the next application cycle on the Data Scholarships page.

LDC data and commercial technology development

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

__________________________________________________________________

New publications:

(1) BOLT Egyptian Arabic-English Word Alignment -- Conversational Telephone Speech Training was developed by LDC and consists of 153,171 words of Egyptian Arabic and English parallel text enhanced with linguistic tags to indicate word relations.

The source data in this release consists of transcripts of Egyptian Arabic conversational telephone speech (CTS) from LDC's CALLHOME and CALLFRIEND collections (LDC97S45, LDC97T19, LDC2002S37, LDC2002T38, LDC96S49) that were translated into English by professional translation agencies and annotated for the word alignment task.

The BOLT word alignment task was built on treebank annotation. Egyptian Arabic source tree tokens were automatically extracted from tree files in LDC’s BOLT Egyptian Arabic Treebank, which had been tagged for part-of-speech and syntactically annotated. That data was then aligned and annotated for the word alignment task.

BOLT Egyptian Arabic-English Word Alignment -- Conversational Telephone Speech Training is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) EVALution was developed by The Hong Kong Polytechnic University. It comprises English and Mandarin Chinese data sets -- EVALution 1.0 and EVALution-MAN, respectively -- that contain semantic relations and metadata for training and evaluating distributional semantic models.

EVALution 1.0 consists of approximately 7,500 English tuples extracted from ConceptNet 5.0 and WordNet 4.0 and filtered through automatic methods and crowd-sourcing. Several semantic relations between word pairs were instantiated, including hypernymy, synonymy, antonymy and meronymy. The corpus also includes additional information that can be used to filter the pairs or to analyze the results, such as relation domain, word frequency, word part-of-speech and word semantic field.

EVALution-MAN consists of Chinese word pairs from two sources: Chinese Wordnet and humans who completed an elicitation task by supplying missing words to sentences. The human-supplied sentence word pairs were then judged by human raters for reliability.

EVALution is distributed via web download.

2020 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) Mixer 4 and 5 Speech was developed by LDC and contains approximately 14,185 hours of audio recordings of conversational telephone speech, interviews, elicitation exercises and transcript readings involving 616 distinct speakers. The material was collected in 2007 as part of the Mixer project, which supported a variety of speaker recognition research tasks, and recordings in this corpus were used in the 2008 NIST Speaker Recognition Evaluation.

The data in this release was collected by LDC at its Human Subjects Data Collection Laboratories in Philadelphia and by the International Computer Science Institute (ICSI) at the University of California, Berkeley, as a collaborative, carefully coordinated activity at both recording sites. The Mixer 4 and 5 collection contains 2,568 recordings made via the public telephone network and 2,152 sessions of multiple microphone recordings in office-room settings.

The telephone protocol connected recruited speakers through a robot operator to carry on casual conversations. In Mixer 4, 400 subjects made ten 10-minute calls; half of those subjects also visited one of the collection sites, where they made two telephone calls while also being recorded on a cross-channel platform. In Mixer 5, 300 subjects each completed ten calls and six interview sessions at either LDC or ICSI; those sessions were conducted on a cross-channel platform and included a telephone call in one of three vocal-effort conditions: normal, high and low. Nearly all Mixer participants were native English speakers; the rest were bilingual English speakers.

This release includes metadata about the calls and speakers, along with time-aligned entries for many of the component portions of the recording sessions.

Mixer 4 and 5 Speech is distributed via hard drive.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. This corpus is a members-only release and is not available for non-member licensing. Contact ldc@ldc.upenn.edu for information about membership.

Monday, April 15, 2019

LDC 2019 April Newsletter

LDC at ICASSP 2019

LDC data and commercial technology development

New Publications:
BOLT Egyptian-English Word Alignment -- Discussion Forum Training
Chinese Abstract Meaning Representation 1.0
HAVIC MED Progress Test -- Videos, Metadata and Annotation

____________________________________________________________

LDC at ICASSP 2019

LDC will be exhibiting at ICASSP 2019, held this year May 12-17 in Brighton, UK. Stop by booth 5 to learn more about recent developments at the Consortium and new publications.

LDC will post conference updates via our Twitter feed and Facebook page. We hope to see you there!

LDC data and commercial technology development

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

(1) BOLT Egyptian-English Word Alignment -- Discussion Forum Training was developed by LDC and consists of 400,448 words of Egyptian Arabic and English parallel text enhanced with linguistic tags to indicate word relations.

The source data in this release consists of discussion forum threads harvested from the Internet by LDC using a combination of manual and automatic processes and is released as BOLT Arabic Discussion Forums (LDC2018T10).

The BOLT word alignment task was built on treebank annotation. Egyptian source tree tokens for word alignment were automatically extracted from tree files of BOLT Egyptian Arabic Treebank annotation on the discussion forum data. Human annotators then followed LDC guidelines to link words and phrases in Arabic to those in English.

BOLT Egyptian-English Word Alignment -- Discussion Forum Training is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) Chinese Abstract Meaning Representation 1.0 was developed by Brandeis University and Nanjing Normal University and is comprised of semantic representations of a set of Chinese sentences from the weblog and discussion forum portions of Chinese Treebank 8.0 (LDC2013T21). Annotations were applied to 10,149 sentences, with 176 sentences unannotated.

Abstract Meaning Representation (AMR) captures "who is doing what to whom" in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree structure. Chinese AMR is based on the annotation methodology developed for English with adaptations for handling specific Chinese phenomena. The goal of the Chinese AMR project is to create a large aligned AMR corpus, of which this data set is the first release. For more information about the project, see the Chinese AMR homepage.
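As a rough illustration of the "who is doing what to whom" idea, the sketch below encodes the commonly cited English AMR example "The boy wants to go" as a small set of triples. It is only a minimal sketch: the triple layout, the function names and the sense numbers (want-01, go-02) are assumptions made for illustration and are not taken from the Chinese AMR release or its file format.

```python
from collections import defaultdict

# Illustrative triples for "The boy wants to go" (not from the Chinese AMR corpus).
# ":instance" ties a variable to its concept; other relations are semantic roles.
TRIPLES = [
    ("w", ":instance", "want-01"),
    ("b", ":instance", "boy"),
    ("g", ":instance", "go-02"),
    ("w", ":ARG0", "b"),  # the boy is the one who wants
    ("w", ":ARG1", "g"),  # the thing wanted is the going
    ("g", ":ARG0", "b"),  # the same boy is the one who goes
]


def concepts(triples):
    """Map each variable to the concept it instantiates."""
    return {src: tgt for src, rel, tgt in triples if rel == ":instance"}


def who_does_what(triples):
    """Group role fillers under each predicate, using concept names."""
    names = concepts(triples)
    roles = defaultdict(dict)
    for src, rel, tgt in triples:
        if rel != ":instance":
            roles[src][rel] = names.get(tgt, tgt)
    return {names[var]: args for var, args in roles.items()}


if __name__ == "__main__":
    for predicate, args in who_does_what(TRIPLES).items():
        print(predicate, args)
```

The variable b fills a role of both predicates, which is how a single participant can be shared across a sentence's meaning.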

Chinese Abstract Meaning Representation 1.0 is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) HAVIC MED Progress Test -- Videos, Metadata and Annotation was developed by LDC and is comprised of approximately 3,650 hours of user-generated videos with annotation and metadata.

In a collaboration with NIST (the National Institute of Standards and Technology) to advance multimodal event detection and related technologies, LDC developed a large, heterogeneous, annotated multimodal corpus for HAVIC (the Heterogeneous Audio Visual Internet Collection) that was used in the NIST-sponsored MED (Multimedia Event Detection) task for several years. HAVIC MED Progress Test is a subset of that corpus, specifically, a collection of event and background videos originally released to support the 2012-2015 MED tasks.

This release consists of videos of various events (event videos) and videos completely unrelated to events (background videos) harvested by a large team of human annotators. Each event video was manually annotated with a set of judgments describing its event properties and other salient features. Background videos were labeled with topic and genre categories.

HAVIC MED Progress Test -- Videos, Metadata and Annotation is distributed via hard drive.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. This corpus is a members-only release and is not available for non-member licensing. Contact ldc@ldc.upenn.edu for information about membership.

Friday, March 15, 2019

LDC 2019 March Newsletter

Call for Papers - LTC 2019, LREC 2020

New Publications:
CALLFRIEND Egyptian Arabic Second Edition
Penn Discourse Treebank Version 3.0
VAST Chinese Speech and Transcripts

___________________________________________________________

Call for Papers

The 9th Language & Technology Conference (LTC 2019) will take place on May 17-19, 2019 at the Adam Mickiewicz University in Poznań, Poland. LTC addresses Human Language Technologies as a challenge for computer science, linguistics and related fields. Conference papers are due next week on Wednesday, March 20, 2019 (midnight, any time zone). For more information, visit the conference webpage.

The 12th Conference on Language Resources and Evaluation (LREC 2020) will take place on May 13-15, 2020 at the Palais du Pharo in Marseille, France. LREC aims to provide an overview of the state-of-the-art, explore new R&D directions and emerging trends, and exchange information regarding language resources and their applications, evaluation methodologies and tools. Conference papers are due by November 25, 2019. For more information, including conference topics, visit the conference webpage.

New Publications:

(1) CALLFRIEND Egyptian Arabic Second Edition was developed by LDC and consists of approximately 25 hours of unscripted telephone conversations between native speakers of Egyptian Arabic. This second edition updates the audio files to wav format, simplifies the directory structure and adds documentation and metadata. The first edition is available as CALLFRIEND Egyptian Arabic (LDC96S49).

All data was collected before July 1997. Participants could speak with a person of their choice on any topic; most called family members and friends. All calls originated in North America. The recorded conversations last up to 30 minutes.

CALLFRIEND Egyptian Arabic Second Edition is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) Penn Discourse Treebank Version 3.0 is the third release in the Penn Discourse Treebank project, the goal of which is to annotate the Wall Street Journal (WSJ) section of Treebank-2 (LDC95T7) with discourse relations. Penn Discourse Treebank Version 2 (LDC2008T05) contains over 40,600 tokens of annotated relations. In Version 3, an additional 13,000 tokens were annotated, certain pairwise annotations were standardized, new senses were included and the corpus was subjected to a series of consistency checks.

This corpus contains two tools: (1) The Annotator, which is used for annotation and adjudication and can also be used for viewing the corpus; and (2) The Conversion Tool, which converts Version 2 annotation files into the Version 3 format.

The documentation directory contains a manual describing what is new in Version 3 and how Version 3 differs from Version 2; the methods and guidelines used in annotating PDTB Version 3; and a range of statistics on the tokens, including the frequency of each connective, its sense labels and its modifiers. More information about the corpus and research carried out by the developers and others using the corpus can be found on the PDTB website.

Penn Discourse Treebank Version 3.0 is distributed via web download.

(3) VAST Chinese Speech and Transcripts was developed by LDC for the VAST (Video Annotation for Speech Technologies) project and is comprised of approximately 29 hours of Mandarin Chinese audio extracted from amateur video content harvested from the web and corresponding time-aligned transcripts.

Audio files were transcribed using XTrans, which supports manual transcription across multiple channels, languages and platforms. Transcribers followed a Quick-Rich Transcription style; transcription guidelines are included in this release.

The aim of the VAST project was to collect and annotate data in several languages to support the development of speech technologies such as speech activity detection, language identification, speaker identification, and speech recognition.

VAST Chinese Speech and Transcripts is distributed via web download.