Monday, June 22, 2020

LDC 2020 June Newsletter

LDC Releases LORELEI Language Packs for COVID-19 Research


The COVID-19 pandemic has highlighted the importance of data-driven solutions to facilitate rapid response and humanitarian relief, and its global nature demonstrates the need for multi-language resources. To aid in this effort, LDC is releasing data it developed in the DARPA LORELEI program under a special no-cost license for COVID-19 research.

These resources are available in a single corpus:

LDC2020E21 LORELEI Language Packs for COVID-19 Research

This data set includes representative language packs and incident language packs for over two dozen low resource languages, comprising data, annotations, basic natural language processing tools, lexicons and grammatical resources.

For further information about this corpus and licensing terms, see COVID-19 Research.
___________________________________________________________________________

New publications: 

(1) CALLFRIEND Mandarin Chinese-Taiwan Dialect Second Edition was developed by LDC and consists of approximately 27 hours of unscripted telephone conversations between native speakers of the Taiwan dialect of Mandarin Chinese. This second edition updates the audio files to wav format, simplifies the directory structure and adds documentation and metadata. The first edition is available as CALLFRIEND Mandarin Chinese-Taiwan Dialect (LDC96S56).

The CALLFRIEND collection was conducted by LDC in support of language identification technology development. All data in this release was collected before July 1997. Participants could speak with a person of their choice on any topic; most called family members and friends. All calls originated in North America. The recorded conversations last up to 30 minutes.

CALLFRIEND Mandarin Chinese-Taiwan Dialect Second Edition is distributed via web download. 

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*
(2) SemTransCNC, developed by The Hong Kong Polytechnic University, is a semantic transparency dataset of Chinese nominal compounds built using a series of crowd-based experiments. It contains overall semantic transparency (OST) and constituent semantic transparency (CST) data for 1,176 dimorphemic Chinese nominal compounds, which consist of free morphemes and have mid-range frequencies.

Nominal compounds were selected from the Sinica Corpus and a modern Chinese lexicon. Crowd workers answered questionnaires that included demographic information and questions about the Chinese language. For assessing OST of selected compounds, they answered the question: "How is the sum of the meanings of A and B similar to the meaning of AB?" For assessing CST, they were asked to describe the similarity of A alone to its meaning in AB and the meaning of B alone to its meaning in AB.

SemTransCNC is distributed via web download.

2020 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*
(3) TAC KBP English Event Nugget Detection and Coreference - Comprehensive Training and Evaluation Data 2014-2015 was developed by LDC and contains training and evaluation data produced in support of the TAC KBP English Event Nugget Detection and Coreference tasks in 2014 and 2015.

This release includes source documents, gold standard event nugget annotations in multiple formats, coreference information for the nuggets, and tokenized source documents. Source data consists of English newswire and discussion forum text collected by LDC.

The goal of the Event Nugget track was to evaluate system performance on the detection and coreference of sets of attributes referencing events in unstructured text. Event Nuggets consist of a mention of the event from the text and labels to indicate event type, subtype, and realis (whether or not an event has actually occurred).

TAC KBP English Event Nugget Detection and Coreference - Comprehensive Training and Evaluation Data 2014-2015 is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

Friday, May 15, 2020

LDC 2020 May Newsletter

New Publications:
_______________________________________________________________ 

(1) LORELEI Oromo Incident Language Pack was developed by LDC and comprises approximately 3.9 million words of Oromo monolingual text, 25,000 words of English monolingual text, 135,000 words of parallel and comparable Oromo-English text, and 50,000 words of data annotated for Entity Discovery and Linking and Situation Frames. It contains all of the text data, annotations, supplemental resources and related software tools for the Oromo language that were used in the DARPA LORELEI / LoReHLT 2017 Evaluation.

The evaluation protocol was based on a scenario in which an unforeseen event triggered a need for humanitarian and logistical support in a region where the incident language had received little or no attention in NLP research. Evaluation participants provided NLP solutions, including information extraction and machine translation, with limited resources and limited development time.

Data was collected from news, social network, weblog, newsgroup, discussion forum, and reference material. Entity Detection and Linking and Situation Frame annotations identified “entities,” “needs” (such as a need for food) and “issues” (such as civil unrest) to be detected by systems for scoring purposes. Situation frame analysis was designed to extract basic information that would be useful for planning a disaster response effort. 

The knowledge base for the entity linking annotation in this corpus is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

LORELEI Oromo Incident Language Pack is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) LORELEI Entity Detection and Linking Knowledge Base was developed by LDC and contains the full LORELEI Entity Detection and Linking (EDL) Knowledge Base (KB) used for all LORELEI Representative Language and Incident Language Pack entity linking annotation. The LORELEI (Low Resource Languages for Emergent Incidents) Program was concerned with building human language technology for low resource languages in the context of emergent situations like natural disasters or disease outbreaks.

The KB in this release supported the EDL task in LORELEI for four entity types -- geo-political entities (GPE), locations (LOC), persons (PER) and organizations (ORG) -- and contains a total of 10,216,832 entities. There are four inputs to the KB, each designated by a unique "origin" code in the KB, as follows: GPE and LOC entities from a snapshot of GeoNames, PER entities from the CIA World Leaders List, ORG entities from Appendix B of the CIA World Factbook, and additional entities manually created by LDC for each of the representative and incident languages in the LORELEI Program. 

LORELEI Entity Detection and Linking Knowledge Base is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) BOLT English Translation Treebank - Chinese Discussion Forum was developed by LDC and consists of 147,432 tokens of web discussion forum data translated from Chinese to English and annotated for part-of-speech and syntactic structure. 

The source data is Chinese discussion forum web text collected by LDC in 2011 and 2012, translated into English and released in BOLT Chinese Discussion Forum Parallel Training Data (LDC2017T05). A subset of the translated text -- 148 files representing 147,432 tokens -- was selected for the treebank and annotated for word-level tokenization, part-of-speech and syntactic structure. Only the translated English text is included in the source data for this release. 

Part-of-speech and treebank annotation conformed to Penn Treebank II style, incorporating changes to those guidelines that were developed under the GALE (Global Autonomous Language Exploitation) program. Supplementary guidelines for English treebanks and web text are included with this release.

BOLT English Translation Treebank - Chinese Discussion Forum is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(4) Multi-Language Conversational Telephone Speech 2011 -- Mandarin Chinese was developed by LDC and comprises approximately 25 hours of telephone speech in Mandarin Chinese.

The data was collected primarily to support research and technology evaluation in automatic language identification, and portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation (LRE). Participants were recruited by native speakers who contacted acquaintances in their social network. Those native speakers made one call of up to 15 minutes to each acquaintance. The data was collected using LDC's telephone collection infrastructure, which comprised three computer telephony systems. Human auditors labeled calls for callee gender, dialect type and noise.

LDC has also released other corpora in the Multi-Language Conversational Telephone Speech 2011 series.

Multi-Language Conversational Telephone Speech 2011 -- Mandarin Chinese is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee. 

*