Thursday, April 15, 2021

LDC April 2021 Newsletter

New Publications:

X-SRL: Parallel Cross-lingual Semantic Role Labeling
TAC KBP English Sentiment Slot Filling – Comprehensive Training and Evaluation Data 2013-2014
_____________________________________________________________________________

New Publications:

(1) X-SRL: Parallel Cross-lingual Semantic Role Labeling was developed by Heidelberg University, Department of Computational Linguistics and the Leibniz Institute for the German Language (IDS). It consists of approximately three million words of German, French and Spanish annotated for semantic role labeling. The texts are translations of the English portion of 2009 CoNLL Shared Task Part 2 (LDC2012T04). All sentences have annotations for verbal predicates and share the original English Propbank label set across the four languages.

The 2009 CoNLL Shared Task developed syntactic dependency annotations, including the semantic dependency model roles of both verbal and nominal predicates.

For X-SRL, the English source data was automatically translated using DeepL. Automatic tokenization, lemmatization, part-of-speech tagging and syntactic parsing were then applied to the text. The data was divided into train, development and test partitions. Semantic labels were transferred for the train and development sections, and the test sentences were validated for translation quality, alignment, label transfer, and filtering.

X-SRL: Parallel Cross-lingual Semantic Role Labeling is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) TAC KBP English Sentiment Slot Filling – Comprehensive Training and Evaluation Data 2013-2014 was developed by LDC and contains training and evaluation data produced in support of the 2013 and 2014 TAC KBP Sentiment Slot Filling tracks. The data in this release includes queries, manual runs (human-produced query responses), and assessment results for human- and system-produced query responses. Source data was English news and web text.

The regular English Slot Filling track involved mining information about entities from text using a specified set of "slots", or attributes. The goal of the Sentiment Slot Filling task was to evaluate the quality of detectors for positive and negative sentiment.

TAC KBP English Sentiment Slot Filling – Comprehensive Training and Evaluation Data 2013-2014 is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Monday, March 15, 2021

LDC 2021 March Newsletter

LDC data and commercial technology development 

New Publications:
Columbia Games Corpus
Global TIMIT Mandarin Chinese
BOLT Chinese Co-reference – Discussion Forum, SMS/Chat, and Conversational Telephone Speech

_________________________________________________________________________


LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.



New publications:

(1) Columbia Games Corpus was developed by the Spoken Language Group, Columbia University and the Department of Linguistics, Northwestern University. It consists of approximately 10 hours of spontaneous English conversation from 13 subjects playing a series of computer games that required verbal communication to achieve joint goals of identifying and moving images on the screen to reach a combined number of points. This publication also includes corresponding manually time-aligned orthographic transcripts and annotation marking discourse and turn-taking.

Columbia Games Corpus is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee. 


*

(2) Global TIMIT Mandarin Chinese was developed by LDC and Shanghai Jiao Tong University and consists of five hours of read speech from Chinese Gigaword Fifth Edition (LDC2011T13) with corresponding transcripts. Fifty speakers each read 120 sentences: 20 sentences were read by all speakers, 40 sentences were read by 10 speakers, and 60 sentences were read by one speaker, for a total of 3220 sentence types.
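
The sentence-type total follows from the reading design. A minimal Python sketch of the arithmetic, using only the figures quoted above (the variable names are illustrative and do not come from the corpus documentation):

```python
# Figures quoted in the corpus description.
speakers = 50

# (sentences read per speaker, number of speakers sharing each sentence)
design = [
    (20, 50),  # read by all speakers
    (40, 10),  # each sentence read by 10 speakers
    (60, 1),   # each sentence read by one speaker
]

# Each tier yields speakers * per_speaker readings, and every sentence in
# that tier is shared by `readers` speakers.
sentence_types = sum(speakers * per_speaker // readers for per_speaker, readers in design)
sentences_per_speaker = sum(per_speaker for per_speaker, _ in design)
total_readings = speakers * sentences_per_speaker

print(sentence_types, sentences_per_speaker, total_readings)
```

The three tiers contribute 20, 200, and 3000 distinct sentences respectively.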

The corpus was recorded at Shanghai Jiao Tong University, China. Speakers (25 female, 25 male) were students at the university and had achieved Class 2 Level 1 or better on Putonghua Shuiping Ceshi (the national standard Mandarin proficiency test).

The Global TIMIT project aimed to create a series of corpora in a variety of languages with key features similar to those of the original TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1), which was designed for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems.

Global TIMIT Mandarin Chinese is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.


*


(3) BOLT Chinese Co-reference – Discussion Forum, SMS/Chat, and Conversational Telephone Speech was developed by Raytheon BBN Technologies and consists of co-reference annotation on Chinese informal text. 

Co-reference annotation aims to fill in connections between specific mentions in the text that refer to the same entities and events in the discourse context. BOLT co-reference annotation was performed on BOLT treebank annotation (i.e., Chinese Treebank 9.0 (LDC2016T13)) and covers noun phrases (including proper nouns, nominals, pronouns, and null arguments), possessives, proper noun pre-modifiers, and verbs.

Discussion forum data was collected from the web using a combination of manual and automatic processes. SMS/Chat material was donated or collected via live platforms. Telephone speech data was taken from LDC's Chinese CALLHOME and CALLFRIEND telephone collections.

The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.

BOLT Chinese Co-reference – Discussion Forum, SMS/Chat, and Conversational Telephone Speech is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

 

                                                                          

Monday, February 15, 2021

LDC 2021 February Newsletter

2021 Membership Discounts Expire March 1 

New Publications:
Althingi Parliamentary Speech
Penn Discourse Treebank 2.0 – German Translation 
TAC-KBP English Surprise Slot Filling – Comprehensive Training and Evaluation Data 2010
________________________________________________________________________


2021 Membership Discounts Expire March 1

Time is running out to save on 2021 membership fees. Renew your LDC membership, rejoin the Consortium, or become a new member by March 1 to receive a discount of up to 10%. For more information on membership benefits and options, visit Join LDC.

New publications:


(1) Althingi Parliamentary Speech consists of approximately 540 hours of recorded speech from Althingi, the Icelandic Parliament, along with corresponding transcripts, a pronunciation dictionary, and language models. Speeches date from 2005-2016. This data set was collected in 2016 by the ASR for Althingi project at Reykjavik University in collaboration with the Althingi speech department. The purpose of that project was to develop an ASR (automatic speech recognition) system for Icelandic parliamentary speech to replace the procedure of manually transcribing performed speeches. 

The mean speech length is 6 minutes, with speeches ranging from under 1 minute up to around 30 minutes. The corpus features 197 speakers (105 male, 92 female) and is split into training, development, and evaluation sets. 

Althingi Parliamentary Speech is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*


(2) Penn Discourse Treebank 2.0 – German Translation was developed at the University of Potsdam’s Applied Computational Linguistics group and consists of approximately one million tokens derived from Penn Discourse Treebank Version 2.0 (LDC2008T05) translated into German and annotated for shallow discourse relations. The aim of the Penn Discourse Treebank project is to annotate the Wall Street Journal section in Treebank-2 (LDC95T7) with discourse relations. PDTB-German is based on a subset of PDTB 2.0 used in the 2016 CoNLL Shared Task on Multilingual Shallow Discourse Parsing.

Data is in CoNLL format. Text was automatically translated with DeepL, and the annotations were projected onto the German text using word alignments produced with GIZA++.
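
Annotation projection through word alignments can be illustrated with a small Python sketch. It is not the PDTB-German pipeline: the function name `project_labels`, the `Connective` label, the example sentences, and the hand-written alignment pairs are all hypothetical, and a real alignment would come from a tool such as GIZA++ rather than being typed in.

```python
from collections import defaultdict


def project_labels(source_labels, alignment, target_length, empty="_"):
    """Copy token-level labels from source to target via (src, tgt) index pairs."""
    candidates = defaultdict(list)
    for src_idx, tgt_idx in alignment:
        label = source_labels[src_idx]
        if label != empty:
            candidates[tgt_idx].append(label)
    # Unaligned target tokens keep the empty label; if several source labels
    # land on one target token, the first one wins (a simplifying assumption).
    return [candidates[i][0] if candidates[i] else empty for i in range(target_length)]


# Hypothetical English source with one labeled token and its German translation.
source_tokens = ["He", "stayed", "because", "it", "rained"]
source_labels = ["_", "_", "Connective", "_", "_"]
target_tokens = ["Er", "blieb", ",", "weil", "es", "regnete"]

# Hand-written (source index, target index) pairs standing in for aligner output.
alignment = [(0, 0), (1, 1), (2, 3), (3, 4), (4, 5)]

projected = project_labels(source_labels, alignment, len(target_tokens))
for token, label in zip(target_tokens, projected):
    print(token, label)
```

In this toy case the label on "because" lands on "weil", and the unaligned comma stays unlabeled.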

Penn Discourse Treebank 2.0 – German Translation is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*


(3) TAC-KBP English Surprise Slot Filling – Comprehensive Training and Evaluation Data 2010 contains the training and evaluation data (queries, manual runs, final assessment results) produced by LDC to support the 2010 Surprise Slot Filling track; 2010 was the only year in which the track was run.

The regular English Slot Filling track involved mining information about entities from text using a specified set of "slots" or attributes. The goal of the Surprise Slot Filling task was to support the development of information extraction systems that could rapidly adapt to new types of relations and events. Surprise Slot Filling participants were given four new slot types -- "diseases", "awards-won" and "charity-supported" for persons, and "products" for organizations -- along with annotation guidelines and training data. They were instructed to develop their systems and to run them on the source collection in four days.

The corresponding source document collections cover English newswire, broadcast material, and web text. These documents are included in TAC KBP Comprehensive English Source Corpora 2009-2014 (LDC2018T03). The corresponding Knowledge Base (KB) for much of the data, a 2008 snapshot of Wikipedia, is contained in TAC KBP Reference Knowledge Base (LDC2014T16).

TAC-KBP English Surprise Slot Filling – Comprehensive Training and Evaluation Data 2010 is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

 

Friday, January 15, 2021

LDC 2021 January Newsletter

Renew Your LDC Membership Today

New Publications:
LORELEI Akan Representative Language Pack
ATIS – Seven Languages
BOLT English Treebank – SMS/Chat

_____________________________________________________________________


Renew Your LDC Membership Today 
Curated language resources are more important than ever to support research and language technology development, including the expanding fields around remote work, pandemic-related technologies, and non-contact interactions. LDC members enjoy no-cost access to 30+ new corpora released annually, as well as the ability to license legacy data sets at reduced fees. Ensure that your data needs continue to be met by renewing your LDC membership or by joining the Consortium today. 

Now through March 1, 2021, 2020 members receive a 10% discount on 2021 membership, and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for more details on membership options and benefits. 


New publications:


(1) LORELEI Akan Representative Language Pack consists of Akan monolingual text, Akan-English parallel text, annotations, supplemental resources, and related software tools developed by LDC for the DARPA LORELEI program.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations like natural disasters or disease outbreaks. Linguistic resources for LORELEI include Representative Language Packs and Incident Language Packs for over two dozen low resource languages, comprising data, annotations, basic natural language processing tools, lexicons, and grammatical resources. Representative languages were selected to provide broad typological coverage, while incident languages were selected to evaluate system performance on a language whose identity was disclosed at the start of the evaluation.

Data was collected from discussion forum, news, reference, social network, and weblog sources. Data volumes are as follows:

  • Over 3.3 million words of Akan monolingual text, all of which were translated into English
  • 115,000 Akan words translated from English data


Approximately 2,300 words were annotated for named entities, full entity including nominals and pronouns, entity linking, simple semantic annotation, and situation frame annotation (identifying entities, needs, and issues). Around 2,000 words have morphological segmentation annotation.

LORELEI Akan Representative Language Pack is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*


(2) ATIS – Seven Languages was developed by Amazon Web Services, Inc. and consists of 5,871 English utterances from ATIS (Air Travel Information Services) corpora, specifically ATIS2 (LDC93S5), ATIS3 Training Data (LDC94S19), and ATIS3 Test Data (LDC95S26), translated into six languages: Spanish, German, French, Portuguese, Chinese, and Japanese.

The ATIS collection was developed to support the research and development of speech understanding systems. Participants were presented with various hypothetical travel planning scenarios and asked to solve them by interacting with partially or completely automated ATIS systems. The resulting utterances were recorded and transcribed. Data was collected in the early 1990s at five US sites: Raytheon BBN, Carnegie Mellon University, MIT Laboratory of Computer Science, National Institute of Standards and Technology, and SRI International.

The data is separated into 4,978 utterances for training and 893 utterances for testing following the original ATIS division. The source English utterances were manually translated into the six languages and are included in this release. Each utterance was annotated with named entities via table lookup; markers include city, airline, airport names, and dates.

ATIS – Seven Languages is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data at no cost.

*


(3) BOLT English Treebank – SMS/Chat was developed by LDC and consists of English SMS and text chat data with part-of-speech and syntactic structure annotation.

The source data consists of 115,667 tokens/words in 484 files of English SMS and text chat collected by LDC using two methods: new collection via LDC's collection platform and donation of SMS or chat archives from BOLT collection participants. 

All data was annotated for word-level tokenization, part-of-speech, and syntactic structure. Annotation conformed to Penn Treebank II style, incorporating changes to those guidelines that were developed under the GALE (Global Autonomous Language Exploitation) program. Those changes primarily concerned the tokenization of hyphenated words, part-of-speech and tree changes necessitated by the tokenization changes, and updates to the syntactic annotation to comply with updated annotation guidelines. Supplementary guidelines for English treebanks and web text are included with this release.

The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging, and chat -- in Chinese, Egyptian Arabic, and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.

BOLT English Treebank – SMS/Chat is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.