Linguistic Data Consortium: August 2020

LDC adds DOI Identifier to its Language Resources
Fall 2020 LDC Data Scholarship Program

New Publications:
LORELEI Vietnamese Representative Language Pack
DEFT Chinese Light and Rich ERE Annotation
CALLFRIEND American English – Southern Dialect Second Edition

LDC adds DOI Identifier to its Language Resources
As of July 2020, LDC’s language resources include a Digital Object Identifier (DOI), an internationally recognized identification standard for online digital material. DOIs are alphanumeric strings that correspond to URLs and metadata for specified resources. They are expressed as links that resolve to the object’s online location. For example, the DOI for Penn Parsed Corpora of Historical English LDC2020T16 is https://doi.org/10.35111/4hzx-5483, which leads users to the LDC catalog entry for this data set. To facilitate its assignment and administration of DOIs, LDC has joined DataCite, a global DOI provider for research data. (DOIs for resources released before July 2020 will be assigned through a process expected to be completed shortly.) LDC data sets now have four persistent identifiers: a unique LDC number, ISBN, ISLRN, and DOI. Adding DOIs is consistent with our aim to follow best practices for archiving and curating digital resources, as evidenced by the CoreTrustSeal certification, which recognizes the LDC Catalog as a trustworthy data repository.
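
For readers who handle DOIs in scripts, the relationship between a bare DOI and its resolvable link can be sketched as follows. This is a minimal illustration, not an LDC or DataCite tool: the function name doi_to_url and the loose validation pattern are assumptions made for this example, and the code only builds the link without contacting the resolver.

```python
import re

# Public DOI resolver; the example DOI above uses this same base URL.
DOI_RESOLVER = "https://doi.org/"

# Loose shape check (an assumption for this sketch, not the full DOI syntax):
# "10." + registrant code, a slash, then a non-empty suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_to_url(doi: str) -> str:
    """Return the resolver link for a DOI given bare, as 'doi:...', or as a URL."""
    doi = doi.strip()
    # Strip a leading resolver URL or 'doi:' label if one is present.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a recognizable DOI: {doi!r}")
    return DOI_RESOLVER + doi


# The DOI cited above for Penn Parsed Corpora of Historical English (LDC2020T16).
print(doi_to_url("10.35111/4hzx-5483"))
```

Opening the resulting link in a browser is what takes a user to the catalog entry; the sketch itself performs no network request.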

Fall 2020 LDC Data Scholarship Program
Student applications for the Fall 2020 LDC Data Scholarship program are being accepted now through September 15, 2020. This scholarship program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and letter of support from their advisor.

For application requirements and program rules, visit the LDC Data Scholarship page.

New Publications:
(1) LORELEI Vietnamese Representative Language Pack consists of Vietnamese monolingual text, Vietnamese-English parallel text, annotations, supplemental resources, and related software tools developed by LDC for the DARPA LORELEI program.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations like natural disasters or disease outbreaks. Linguistic resources for LORELEI include Representative Language Packs and Incident Language Packs for over two dozen low resource languages, comprising data, annotations, basic natural language processing tools, lexicons, and grammatical resources. Representative languages were selected to provide broad typological coverage, while incident languages were selected to evaluate system performance on a language whose identity was disclosed at the start of the evaluation.

Data was collected in the following genres: discussion forum, news, reference, social network, and weblogs. Data volumes are as follows:

Over 172 million words of Vietnamese monolingual text, approximately 325,000 words of which were translated into English
106,000 Vietnamese words translated from English data
1.9 million words of found parallel text

Approximately 75,000 words were annotated for named entities and up to 25,000 words contain additional annotation, including situation frames (identifying entities, needs, and issues) and entity linking and detection.

LORELEI Vietnamese Representative Language Pack is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) DEFT Chinese Light and Rich ERE Annotation contains Chinese discussion forum web text annotated for entities, relations, and events (ERE) using the ERE Light and ERE Rich annotation schemas developed by LDC. Light ERE annotation labels entity mentions, along with relations and events between and among those entities, for the target set of ERE types, including coreference. Rich ERE annotation expands types and tagging for ERE annotation tasks and replaces event coreference with event hopper annotation. All files in this release (157) were annotated following Light ERE guidelines; a subset (149) were also labeled with Rich ERE annotation.

DARPA’s Deep Exploration and Filtering of Text (DEFT) program aimed to address remaining capability gaps in state-of-the-art natural language processing technologies related to inference, causal relationships, and anomaly detection. LDC supported the DEFT program by collecting, creating, and annotating a variety of data sources.

DEFT Chinese Light and Rich ERE Annotation is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) CALLFRIEND American English – Southern Dialect Second Edition was developed by LDC and consists of approximately 26 hours of unscripted telephone conversations between native speakers of Southern dialects of American English. This second edition updates the audio files to wav format, simplifies the directory structure, and adds documentation and metadata. The first edition is available as CALLFRIEND American English-Southern Dialect (LDC96S47).

The CALLFRIEND collection was conducted by LDC in support of language identification technology development. All data in this release was collected before July 1997. Participants could speak with a person of their choice on any topic; most called family members and friends. All calls originated in North America. The recorded conversations last up to 30 minutes.

CALLFRIEND American English – Southern Dialect Second Edition is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Linguistic Data Consortium

Tuesday, August 18, 2020

LDC 2020 August Newsletter