Linguistic Data Consortium: Arabic Treebank

Wednesday, February 15, 2023

LDC February 2023 Newsletter

LDC membership discounts expire March 1

30th Anniversary Highlight: Arabic Treebank

New publications:

2019 NIST Speaker Recognition Evaluation Test Set – Audio-Visual

LORELEI Tagalog Representative Language Pack

_________________________________________________________________________

LDC membership discounts expire March 1

Time is running out to save on 2023 membership fees. Renew your LDC membership, rejoin the Consortium, or become a new member by March 1 to receive a discount of up to 10%. For more information on membership benefits and options, visit Join LDC.

30th Anniversary Highlight: Arabic Treebank

The Penn/LDC Arabic Treebank (ATB) project began in 2001 with support from the DARPA TIDES program and, later, the DARPA GALE and BOLT programs. The original focus was on Modern Standard Arabic (MSA), which is not natively spoken and is not homogeneously acquired across its writing and reading community. In addition to the expected issues associated with complex data annotation, LDC encountered several challenges unique to a highly inflected language with a rich history of traditional grammar. LDC relied on traditional Arabic grammar, as well as established and modern grammatical theories of MSA -- in combination with the Penn Treebank approach to syntactic annotation -- to design an annotation system for Arabic (Maamouri et al., 2004). LDC was innovative with respect to traditional grammar when necessary and when other syntactic approaches were found to account for the data. LDC also developed a wide-coverage MSA morphological analyzer, LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1 (LDC2010L01), which greatly benefited ATB development. Revisions to the annotation guidelines during the DARPA GALE program (principally related to tokenization and syntactic annotation) improved inter-annotator agreement and parsing scores.

ATB corpora were annotated for morphology, part-of-speech, gloss, and syntactic structure. Data sets based on MSA newswire developed under the revised annotation guidelines include Arabic Treebank: Part 1 v 4.1 (LDC2010T13), Arabic Treebank: Part 2 v 3.1 (LDC2011T09) and Arabic Treebank: Part 3 v 3.2 (LDC2010T08). Other genres are represented in Arabic Treebank – Broadcast News v 1.0 (LDC2012T07) and Arabic Treebank – Weblog (LDC2016T02).

LDC’s later work on Egyptian Arabic treebanks in the DARPA BOLT program benefited from the strides in its MSA treebank annotation pipeline. As for the challenges presented by informal, dialectal material, collaborator Columbia University provided a normalized Arabic orthography to account for instances of Romanized script (Arabizi) in the data and developed a morphological analyzer (CALIMA) in parallel, working in a tight feedback loop with LDC’s annotation team. SAMA and CALIMA were synchronized in the Egyptian Arabic treebanks, the former used for MSA tokens and the latter used for Egyptian Arabic tokens. Resulting corpora include BOLT Egyptian Arabic Treebank – Discussion Forum (LDC2018T23), Conversational Telephone Speech (LDC2021T12), and SMS/Chat (LDC2021T17).

ATB corpora and their related releases are available for licensing to LDC members and nonmembers. For more information about licensing LDC data, visit Obtaining Data.

New publications:

2019 NIST Speaker Recognition Evaluation Test Set – Audio-Visual contains approximately 64 hours of English audio-visual data for development and test, answer keys, enrollment, trial files and documentation from the NIST-sponsored 2019 Speaker Recognition Evaluation (SRE).

The 2019 evaluation task was speaker detection, that is, to determine whether a specified target speaker was speaking during a segment of speech. The evaluation was conducted in two parts: (1) a leaderboard-style challenge based on conversational telephone speech and (2) a separate evaluation using audio-visual data. This release relates to the audio-visual evaluation.

The source audio-visual data was collected by LDC for the VAST (Video Annotation for Speech Technology) project. That collection focused on amateur video recordings from various online media hosting services. The recordings vary in duration from 17.5 seconds to 13 minutes; most have two audio channels (stereo), but some are monophonic (one channel).

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

LORELEI Tagalog Representative Language Pack was developed by LDC and comprises approximately 4.8 million words of Tagalog monolingual text, 341,000 words of found Tagalog-English parallel text, and 124,000 Tagalog words translated from English data. Approximately 78,000 words were annotated for named entities and over 26,000 words were annotated for entity discovery and linking and situation frames (identifying entities, needs and issues). Data was collected from discussion forums, news, reference material, social networks, and weblogs.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Tuesday, June 15, 2021

LDC June 2021 Newsletter

LDC data and commercial technology development

New Publications:
MyST Children’s Conversational Speech
BOLT Egyptian Arabic Treebank – Conversational Telephone Speech

LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:
(1) MyST Children’s Conversational Speech was developed by Boulder Learning Inc. It contains 470 hours of English speech from 1,371 students in grades 3-5 conversing with a virtual science tutor in eight areas of science instruction, along with transcripts and a pronunciation dictionary. Data was collected in two phases between 2008 and 2017. Spoken dialogs with the virtual tutor were aligned to classroom instruction using the Full Option Science System, a research-based science curriculum for grades K-8. Students conversed with the virtual science tutor for 15-20 minutes. The tutor asked open-ended questions about media presented on-screen, and students produced spoken answers.

Data was collected in 10,496 sessions for a total of 227,567 utterances. Approximately 45% of those utterances (102,433) were transcribed. Data is divided into development, test, and train partitions for use with ASR systems.

MyST Children’s Conversational Speech is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) BOLT Egyptian Arabic Treebank – Conversational Telephone Speech was developed by LDC and consists of Egyptian Arabic conversational telephone speech data with part-of-speech annotation, morphology, gloss, and syntactic tree annotation.

This release contains 153,171 tokens before clitics were split and 182,965 tree tokens after clitics were split for treebank annotation. The source data was selected from conversational telephone speech collected by LDC for the CALLHOME project that was transcribed and segmented into sentence units.
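As a rough illustration of what those two counts imply, the short Python sketch below computes how much clitic splitting expands the token count. Only the two figures come from the release description; the function name `split_expansion` and the constant names are hypothetical and are not part of any LDC tool or corpus file.

```python
# Token counts quoted in the release description above.
SOURCE_TOKENS = 153_171  # tokens before clitics were split
TREE_TOKENS = 182_965    # tree tokens after clitics were split


def split_expansion(source_tokens: int, tree_tokens: int) -> tuple[int, float]:
    """Return (extra tokens created by splitting, tree tokens per source token).

    Hypothetical helper for illustration only.
    """
    extra = tree_tokens - source_tokens
    ratio = tree_tokens / source_tokens
    return extra, ratio


extra, ratio = split_expansion(SOURCE_TOKENS, TREE_TOKENS)
print(f"{extra:,} additional tokens; {ratio:.3f} tree tokens per source token")
```

On these figures, splitting clitics adds 29,794 tokens, an increase of roughly 19% over the source token count.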

Annotations follow Penn Arabic Treebank guidelines, which consist of: (a) part-of-speech tagging that divides the text into lexical tokens and gives relevant information about each token such as lexical category, inflectional features, and a gloss; and (b) Arabic treebanking, which characterizes the constituent structures of word sequences, provides categories for each non-terminal node, and identifies null elements, co-reference, traces, and so on.

The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging, and chat -- in Chinese, Egyptian Arabic, and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking, and co-reference.

BOLT Egyptian Arabic Treebank – Conversational Telephone Speech is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Thursday, January 21, 2016

LDC 2016 January Newsletter

CFP for LREC 2016 Novel Incentives Workshop

LDC Membership Discounts for MY 2016 Still Available

New publications:

Arabic Treebank - Weblog

NewSoMe Corpus of Opinion in Blogs

GALE Phase 4 Chinese Weblog Parallel Sentences

______________________________________________________________

CFP for LREC 2016 Novel Incentives Workshop

The first workshop on novel incentives in linguistic data collection will take place on May 28, 2016 in conjunction with the Tenth International Conference on Language Resources and Evaluation (LREC2016) in Portoroz, Slovenia.

The workshop, Novel Incentives for Collecting Linguistic Data and Annotation from People: types, implementation, tasking requirements, workflow and results, opens the discussion on incentives in data collection, describing novel approaches and comparing them with traditional monetary incentives.

The workshop is accepting papers through February 6, 2016. For more information visit the workshop webpage.

LDC Membership Discounts for MY 2016 Still Available

If you are considering joining LDC for Membership Year 2016 (MY2016), there is still time to save on membership fees. Any organization which joins or renews membership for 2016 through March 1, 2016, is entitled to a 5% discount on membership fees. Organizations which held membership for MY2015 can receive a 10% discount on fees provided they renew prior to March 1, 2016. Publications planned for release in 2016 include multilingual language packs, BOLT discussion forum and DEFT narrative text corpora, HAVIC video clips and transcripts and the latest Arabic and Chinese treebanks.

New publications:

(1) Arabic Treebank - Weblog was developed by LDC and consists of Arabic weblog data with part-of-speech, morphology, gloss and syntactic tree annotation.

The ongoing Penn Arabic Treebank Project (PATB) supports research in Arabic-language natural language processing and human language technology development. Generally, the PATB consists of two distinct phases: (a) part-of-speech (POS) tagging, which divides the text into lexical tokens and gives relevant information about each token such as lexical category, inflectional features, and a gloss (referred to as POS for convenience, although it includes morphological and gloss information not traditionally included with part-of-speech annotation), and (b) Arabic treebanking, which characterizes the constituent structures of word sequences, provides categories for each non-terminal node, and identifies null elements, co-reference, traces and so on.

The data contains 243,117 source tokens before clitics were split, and 308,996 tree tokens after clitics were separated for treebank annotation. The source material is weblogs collected by LDC from various sources.

Arabic Treebank - Weblog is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) NewSoMe Corpus of Opinion in Blogs was compiled at Barcelona Media and consists of English and Spanish blogs annotated for opinions. It is part of the NewSoMe (News and Social Media) set of corpora presenting opinion annotations across several genres and covering multiple languages. NewSoMe is the result of an effort to build a unifying annotation framework for analyzing opinion in different genres, ranging from controlled text, such as news reports, to diverse types of user-generated content that includes blogs, product reviews and microblogs.

LDC has also released NewSoMe Corpus of Opinion in News Reports (LDC2015T17).

The data consists of 108 English documents and 191 Spanish documents. The annotation was carried out manually through the crowdsourcing platform CrowdFlower with seven annotations per layer that were aggregated for this data set. The layers annotated were topic, segment, cue, subjectivity, polarity and intensity.

NewSoMe Corpus of Opinion in Blogs is distributed via web download.

(3) GALE Phase 4 Chinese Weblog Parallel Sentences was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Chinese source sentences and corresponding English translations selected from newsgroup and weblog data collected by LDC and translated by LDC or under its direction.

GALE Phase 4 Chinese Weblog Parallel Sentences includes 231 source-translation document pairs, comprising 92,501 tokens of Chinese source text and its English translation.

Sentences were selected for translation in two steps. First, files were chosen using sentence selection scripts provided by GALE program participants SRI International and IBM. The output was then manually reviewed by LDC staff to eliminate problematic sentences. Selected files were reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Chinese to English translation guidelines and were provided with the full source documents containing the target sentences for their reference. Bilingual LDC staff performed quality control procedures on the completed translations.

GALE Phase 4 Chinese Weblog Parallel Sentences is distributed via web download.