Linguistic Data Consortium: Pashto

Wednesday, September 15, 2021

LDC September 2021 Newsletter

New Publications:

RATS Speaker Identification

Classical Arabic Dictionary

DiscAlign for Penn and RST Discourse Treebanks

_________________________________________________________________

New publications:

(1) RATS Speaker Identification was developed by LDC and comprises approximately 1,900 hours of Levantine Arabic, Farsi, Dari, Pashto and Urdu conversational telephone speech with annotations of speech segments. The audio was retransmitted over eight channels, for a total of 17,000 hours of speech. The corpus was created to provide training and development sets for the speaker identification task in the DARPA RATS (Robust Automatic Transcription of Speech) program.

The source audio consists of conversational telephone speech recordings collected by LDC specifically for the RATS program from Levantine Arabic, Pashto, Urdu, Farsi and Dari native speakers. Annotations on the audio files include start time, end time, speech activity detection (SAD) label, SAD provenance, speaker ID, speaker ID provenance, language ID, and language ID provenance.

RATS Speaker Identification is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) Classical Arabic Dictionary consists of approximately one hundred million words of Arabic collected from texts dating between 431 and 1104 CE, principally books and essays, along with word occurrences, source documents and related metadata.

The dictionary is presented in three formats: plain text in UTF-8 encoding, plain text in CP1256 encoding, and as an SQL database file. Source documents are presented in UTF-8 and CP1256 encodings.
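For readers who receive a CP1256 file and need UTF-8, the conversion can be sketched in Python as below. This is a minimal illustration, not part of the corpus documentation: the function names and any file paths passed to them are hypothetical, and only the two encodings are taken from the description above.

```python
from pathlib import Path


def cp1256_to_utf8(data: bytes) -> bytes:
    """Re-encode CP1256 (Windows Arabic) bytes as UTF-8."""
    # "cp1256" is the name of Python's built-in codec for Windows-1256.
    return data.decode("cp1256").encode("utf-8")


def convert_file(src: Path, dst: Path) -> None:
    """Read a CP1256 text file and write a UTF-8 copy.

    Both paths are supplied by the caller; nothing here assumes
    how the corpus names or lays out its files.
    """
    dst.write_bytes(cp1256_to_utf8(src.read_bytes()))
```

Since the dictionary already ships in both encodings, a conversion like this is only needed when a downstream tool produces or expects the other encoding.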

Classical Arabic Dictionary is distributed via web download.

(3) DiscAlign for Penn and RST Discourse Treebanks was developed by Saarland University. It consists of alignment information for the discourse annotations contained in Penn Discourse Treebank Version 2.0 (LDC2008T05) (PDTB 2.0) and RST Discourse Treebank (LDC2002T07) (RST-DT). PDTB 2.0 and RST-DT annotations overlap for 385 newspaper articles in sections 6, 11, 13, 19 and 23 of the Wall Street Journal corpus contained in Treebank-2 (LDC95T7). DiscAlign for Penn and RST Discourse Treebanks contains approximately 6,700 alignments between PDTB 2.0 and RST-DT relations.

DiscAlign for Penn and RST Discourse Treebanks is available at no cost to all licensees of PDTB 2.0 and RST-DT and appears in their download queues associated with these corpora as DiscAlign_Penn_RST_DTB_LDC2021T16.zip.

Thursday, September 15, 2016

LDC September 2016 Newsletter

New publications:

ARL Arabic Dependency Treebank

BOLT Chinese-English Word Alignment and Tagging -- Discussion Forum Training

IARPA Babel Pashto Language Pack IARPA-babel104b-v0.4bY

GALE Phase 4 Arabic Broadcast News Parallel Sentences

_________________________________________________________________________

New Corpora

(1) ARL Arabic Dependency Treebank was developed by the US Army Research Laboratory (ARL) and was derived from four LDC resources: Arabic Treebank (ATB) Part 1 v4.1 (LDC2010T13), Part 2 v3.1 (LDC2011T09), Part 3 v3.2 (LDC2010T08) and Broadcast News v1.0 (LDC2012T07).

LDC's ATB series follows the constituency, or phrase structure, approach to treebank development, in which clauses are divided into noun phrases and verb phrases, and one or more nodes in a sentence's structure may correspond to a single element. Dependency grammar, on the other hand, is based on the idea that the verb is the center of the clause structure and that other units in the sentence are connected to the verb by directed links, or dependencies. This is a one-to-one correspondence: for every element in the sentence there is exactly one node in the sentence structure that corresponds to that element. ARL Arabic Dependency Treebank was generated using constituency-to-dependency software written at ARL.

The source data in this release consists of Arabic newswire and broadcast programming collected by LDC from various news and broadcast providers.

The files are in an 11-column tab-separated format with one or more blank lines between sentences. All files are UTF-8 encoded.
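A file in this layout can be read sentence by sentence with a few lines of Python. The sketch below relies only on what is stated above (11 tab-separated columns, one or more blank lines between sentences, UTF-8). The function name is hypothetical, and because the meaning of each column is not given here, rows are returned as plain lists of strings.

```python
from typing import Iterable, Iterator, List

EXPECTED_COLUMNS = 11  # column count taken from the corpus description


def read_sentences(lines: Iterable[str]) -> Iterator[List[List[str]]]:
    """Yield each sentence as a list of rows, each row a list of 11 fields.

    Runs of one or more blank lines are treated as a single sentence break.
    """
    sentence: List[List[str]] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # Blank line: close the current sentence, if any.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        if len(fields) != EXPECTED_COLUMNS:
            raise ValueError(
                f"expected {EXPECTED_COLUMNS} columns, got {len(fields)}"
            )
        sentence.append(fields)
    # Emit the last sentence when the input does not end with a blank line.
    if sentence:
        yield sentence


if __name__ == "__main__":
    # In-memory demonstration with placeholder values, not corpus data.
    row = "\t".join(f"c{i}" for i in range(EXPECTED_COLUMNS))
    demo = [row + "\n", row + "\n", "\n", "\n", row + "\n"]
    for number, sent in enumerate(read_sentences(demo), start=1):
        print(number, len(sent))
```

To read an actual file, pass an open file handle, for example `open(path, encoding="utf-8")`, where `path` is whatever location the release was unpacked to.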

ARL Arabic Dependency Treebank is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) BOLT Chinese-English Word Alignment and Tagging -- Discussion Forum Training was developed by LDC and consists of 448,094 words of Chinese and English parallel text enhanced with linguistic tags to indicate word relations.

The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference. The material in this release consists of Chinese source discussion forum threads harvested from the Internet by LDC using a combination of manual and automatic processes. The source data is released as BOLT Chinese Discussion Forums (LDC2016T05).

BOLT Chinese-English Word Alignment and Tagging -- Discussion Forum Training is distributed via web download.

(3) IARPA Babel Pashto Language Pack IARPA-babel104b-v0.4bY was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 214 hours of Pashto conversational and scripted telephone speech collected in 2011 and 2012 along with corresponding transcripts.

The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

The Pashto speech in this release represents that spoken in four dialect regions of Afghanistan and Pakistan. The gender distribution among speakers is approximately 30% female, 70% male; speakers' ages range from 17 years to 70 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

Transcripts are available in two versions: an extended Arabic script and a modified Buckwalter transliteration scheme, both encoded in UTF-8.

IARPA Babel Pashto Language Pack IARPA-babel104b-v0.4bY is distributed via web download.

2016 Subscription Members will receive two copies of this corpus provided they have submitted a completed copy of the special license agreement. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(4) GALE Phase 4 Arabic Broadcast News Parallel Sentences was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Modern Standard Arabic source sentences and corresponding English translations selected from broadcast news data collected by LDC in 2007 and 2008 and transcribed and translated by LDC or under its direction.

GALE Phase 4 Arabic Broadcast News Parallel Sentences includes 106 source-translation document pairs, comprising 114,251 words (Arabic source) of translated data. Data is drawn from 24 distinct Arabic programs featuring news broadcasts.

GALE Phase 4 Arabic Broadcast News Parallel Sentences is distributed via web download.