Linguistic Data Consortium: Penn Discourse Treebank

Wednesday, September 15, 2021

LDC September 2021 Newsletter

New Publications:

RATS Speaker Identification

Classical Arabic Dictionary

DiscAlign for Penn and RST Discourse Treebanks

_________________________________________________________________

New publications:

(1) RATS Speaker Identification was developed by LDC and consists of approximately 1,900 hours of Levantine Arabic, Farsi, Dari, Pashto and Urdu conversational telephone speech with annotations of speech segments. The audio was retransmitted over eight channels, for a total of 17,000 hours of speech. The corpus was created to provide training and development sets for the speaker identification task in the DARPA RATS (Robust Automatic Transcription of Speech) program.

The source audio consists of conversational telephone speech recordings collected by LDC specifically for the RATS program from Levantine Arabic, Pashto, Urdu, Farsi and Dari native speakers. Annotations on the audio files include start time, end time, speech activity detection (SAD) label, SAD provenance, speaker ID, speaker ID provenance, language ID, and language ID provenance.
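
As a rough illustration of the annotation fields listed above, the following Python sketch models one annotated speech segment. The tab-separated layout, the field order, the class and function names, and the sample values are all assumptions made for this example; consult the corpus documentation for the actual file format.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One annotated segment; field names mirror the list in the text."""
    start: float                 # start time in seconds (unit assumed)
    end: float                   # end time in seconds (unit assumed)
    sad_label: str               # speech activity detection label
    sad_provenance: str
    speaker_id: str
    speaker_id_provenance: str
    language_id: str
    language_id_provenance: str

    @property
    def duration(self) -> float:
        return self.end - self.start


def parse_segment(line: str) -> Segment:
    """Parse one line of a hypothetical tab-separated annotation file."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}")
    start, end, *rest = fields
    return Segment(float(start), float(end), *rest)


# Invented sample values, not taken from the corpus.
example = parse_segment("12.50\t15.75\tS\tmanual\tspk001\tmanual\turd\tmanual")
```

Keeping each provenance value next to the label it qualifies makes it straightforward to filter segments by how a given label was produced.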

RATS Speaker Identification is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) Classical Arabic Dictionary consists of approximately one hundred million words of Arabic collected from texts dating between 431 and 1104 CE, principally books and essays, along with word occurrences, source documents and related metadata.

The dictionary is presented in three formats: plain text in UTF-8 encoding, plain text in CP1256 encoding, and as an SQL database file. Source documents are presented in UTF-8 and CP1256 encodings.
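
For readers working with both encodings, the sketch below shows one way to convert CP1256 (Windows-1256) bytes to UTF-8 using Python's built-in codecs. The function name and the sample word are illustrative assumptions and are not taken from the corpus or its documentation.

```python
def cp1256_to_utf8(raw: bytes) -> bytes:
    """Decode Windows-1256 (CP1256) bytes and re-encode them as UTF-8."""
    return raw.decode("cp1256").encode("utf-8")


# Illustrative Arabic word ("book"); not taken from the corpus.
sample = "كتاب"
cp1256_bytes = sample.encode("cp1256")   # one byte per Arabic letter
utf8_bytes = cp1256_to_utf8(cp1256_bytes)  # two bytes per Arabic letter
```

Since the release already includes both encodings, a conversion like this is mainly useful for checking that a downstream tool reads the files with the intended codec.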

Classical Arabic Dictionary is distributed via web download.

(3) DiscAlign for Penn and RST Discourse Treebanks was developed by Saarland University. It consists of alignment information for the discourse annotations contained in Penn Discourse Treebank Version 2.0 (LDC2008T05) (PDTB 2.0) and RST Discourse Treebank (LDC2002T07) (RST-DT). PDTB 2.0 and RST-DT annotations overlap for 385 newspaper articles in sections 6, 11, 13, 19 and 23 of the Wall Street Journal corpus contained in Treebank-2 (LDC95T7). DiscAlign for Penn and RST Discourse Treebanks contains approximately 6,700 alignments between PDTB 2.0 and RST-DT relations.

DiscAlign for Penn and RST Discourse Treebanks is available at no cost to all licensees of PDTB 2.0 and RST-DT and appears in the download queues associated with these corpora as DiscAlign_Penn_RST_DTB_LDC2021T16.zip.

Monday, February 15, 2021

LDC 2021 February Newsletter

2021 Membership Discounts Expire March 1

New Publications:
Althingi Parliamentary Speech
Penn Discourse Treebank 2.0 – German Translation
TAC-KBP English Surprise Slot Filling – Comprehensive Training and Evaluation Data 2010
________________________________________________________________________

2021 Membership Discounts Expire March 1

Time is running out to save on 2021 membership fees. Renew your LDC membership, rejoin the Consortium, or become a new member by March 1 to receive a discount of up to 10%. For more information on membership benefits and options, visit Join LDC.

New publications:

(1) Althingi Parliamentary Speech consists of approximately 540 hours of recorded speech from Althingi, the Icelandic Parliament, along with corresponding transcripts, a pronunciation dictionary, and language models. Speeches date from 2005 to 2016. This data set was collected in 2016 by the ASR for Althingi project at Reykjavik University in collaboration with the Althingi speech department. The purpose of that project was to develop an ASR (automatic speech recognition) system for Icelandic parliamentary speech to replace the manual transcription of delivered speeches.

The mean speech length is 6 minutes, with speeches ranging from under 1 minute up to around 30 minutes. The corpus features 197 speakers (105 male, 92 female) and is split into training, development, and evaluation sets.

Althingi Parliamentary Speech is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) Penn Discourse Treebank 2.0 – German Translation was developed at the University of Potsdam’s Applied Computational Linguistics group and consists of approximately one million tokens derived from Penn Discourse Treebank Version 2.0 (LDC2008T05) translated into German and annotated for shallow discourse relations. The aim of the Penn Discourse Treebank project is to annotate the Wall Street Journal section in Treebank-2 (LDC95T7) with discourse relations. PDTB-German is based on a subset of PDTB 2.0 used in the 2016 CoNLL Shared Task on Multilingual Shallow Discourse Parsing.

Data is in CoNLL format. Text was automatically translated with DeepL, and the annotations were projected onto the translation using word alignments produced with GIZA++.

Penn Discourse Treebank 2.0 – German Translation is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) TAC-KBP English Surprise Slot Filling – Comprehensive Training and Evaluation Data 2010 contains the training and evaluation data (queries, manual runs, final assessment results) produced by LDC to support the Surprise Slot Filling track in 2010, the only year in which the track was run.

The regular English Slot Filling track involved mining information about entities from text using a specified set of "slots" or attributes. The goal of the Surprise Slot Filling task was to support the development of information extraction systems that could rapidly adapt to new types of relations and events. Surprise Slot Filling participants were given four new slot types -- "diseases", "awards-won" and "charity-supported" for persons, and "products" for organizations -- along with annotation guidelines and training data. They were instructed to develop their systems and to run them on the source collection in four days.

The corresponding source document collections cover English newswire, broadcast material, and web text. These documents are included in TAC KBP Comprehensive English Source Corpora 2009-2014 (LDC2018T03). The corresponding Knowledge Base (KB) for much of the data - a 2008 snapshot of Wikipedia - is contained in TAC KBP Reference Knowledge Base (LDC2014T16).

TAC-KBP English Surprise Slot Filling – Comprehensive Training and Evaluation Data 2010 is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Friday, March 15, 2019

LDC 2019 March Newsletter

Call for Papers - LTC 2019, LREC 2020

New Publications:

CALLFRIEND Egyptian Arabic Second Edition

Penn Discourse Treebank Version 3.0

VAST Chinese Speech and Transcripts

___________________________________________________________

Call for Papers

The 9th Language & Technology Conference (LTC 2019) will take place on May 17-19, 2019 at the Adam Mickiewicz University in Poznań, Poland. LTC addresses Human Language Technologies as a challenge for computer science, linguistics and related fields. Conference papers are due next week on Wednesday, March 20, 2019 (midnight, any time zone). For more information, visit the conference webpage.

The 12th Conference on Language Resources and Evaluation (LREC 2020) will take place on May 13-15, 2020 at the Palais du Pharo in Marseille, France. LREC aims to provide an overview of the state of the art, explore new R&D directions and emerging trends, and exchange information regarding language resources and their applications, evaluation methodologies and tools. Conference papers are due by November 25, 2019. For more information, including conference topics, visit the conference webpage.

New Publications:

(1) CALLFRIEND Egyptian Arabic Second Edition was developed by LDC and consists of approximately 25 hours of unscripted telephone conversations between native speakers of Egyptian Arabic. This second edition updates the audio files to wav format, simplifies the directory structure and adds documentation and metadata. The first edition is available as CALLFRIEND Egyptian Arabic (LDC96S49).

All data was collected before July 1997. Participants could speak with a person of their choice on any topic; most called family members and friends. All calls originated in North America. The recorded conversations last up to 30 minutes.

CALLFRIEND Egyptian Arabic Second Edition is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) Penn Discourse Treebank Version 3.0 is the third release in the Penn Discourse Treebank project, the goal of which is to annotate the Wall Street Journal (WSJ) section of Treebank-2 (LDC95T7) with discourse relations. Penn Discourse Treebank Version 2 (LDC2008T05) contains over 40,600 tokens of annotated relations. In Version 3, an additional 13,000 tokens were annotated, certain pairwise annotations were standardized, new senses were included and the corpus was subjected to a series of consistency checks.

This corpus contains two tools: (1) the Annotator, which is used for annotation and adjudication and can also be used to view the corpus; and (2) the Conversion Tool, which converts Version 2 annotation files into the Version 3 format.

The documentation directory contains a manual describing what is new in Version 3 and how Version 3 differs from Version 2; the methods and guidelines used in annotating PDTB Version 3; and a range of statistics on the tokens, including the frequency of each connective, its sense labels and its modifiers. More information about the corpus and research carried out by the developers and others using the corpus can be found on the PDTB website.

Penn Discourse Treebank Version 3.0 is distributed via web download.

(3) VAST Chinese Speech and Transcripts was developed by LDC for the VAST (Video Annotation for Speech Technologies) project and consists of approximately 29 hours of Mandarin Chinese audio extracted from amateur video content harvested from the web, along with corresponding time-aligned transcripts.

Audio files were transcribed using XTrans, which supports manual transcription across multiple channels, languages and platforms. Transcribers followed a Quick-Rich Transcription style; transcription guidelines are included in this release.

The aim of the VAST project was to collect and annotate data in several languages to support the development of speech technologies such as speech activity detection, language identification, speaker identification, and speech recognition.

VAST Chinese Speech and Transcripts is distributed via web download.