LDC adds DOI Identifier to its Language Resources
Fall 2020 LDC Data Scholarship Program
New Publications:
LORELEI Vietnamese Representative Language Pack
DEFT Chinese Light and Rich ERE Annotation
CALLFRIEND American English – Southern Dialect Second Edition
LDC adds DOI Identifier to its Language Resources
As of July 2020, LDC’s language resources include a Digital Object
Identifier (DOI), an internationally recognized identification standard for
online digital material. DOIs are alphanumeric strings that correspond to URLs
and metadata for specified resources. They are expressed as links that resolve
to the object’s online location. For example, the DOI for Penn Parsed Corpora
of Historical English LDC2020T16 is https://doi.org/10.35111/4hzx-5483, which leads users to the LDC catalog entry for this
data set. To facilitate its assignment and administration of DOIs, LDC has
joined DataCite, a global DOI
provider for research data. (DOIs for resources released before July 2020 will
be assigned through a process expected to be completed shortly.) LDC data sets
now have four persistent identifiers: a unique LDC number, ISBN, ISLRN, and DOI. Adding DOIs is consistent with our aim to
follow best practices for archiving and curating digital resources, evidenced
by the CoreTrustSeal certification
which recognizes the LDC Catalog as a trustworthy data repository.
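For readers who handle catalog metadata programmatically, the relationship between a DOI and its link can be sketched in a few lines of Python. This is only an illustration, not an LDC tool: the function names are hypothetical, and the sketch assumes the link uses the doi.org resolver as in the example above. A DOI has a prefix that begins with "10." and a suffix, separated by a slash.

```python
from typing import Tuple
from urllib.parse import urlparse

# Resolver base taken from the example link in the text above.
DOI_RESOLVER = "https://doi.org/"


def doi_from_url(url: str) -> str:
    """Return the DOI contained in a doi.org link (hypothetical helper)."""
    parsed = urlparse(url)
    if parsed.netloc.lower() != "doi.org":
        raise ValueError(f"not a doi.org link: {url}")
    return parsed.path.lstrip("/")


def split_doi(doi: str) -> Tuple[str, str]:
    """Split a DOI into its prefix and suffix (hypothetical helper)."""
    prefix, _, suffix = doi.partition("/")
    if not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a well-formed DOI: {doi}")
    return prefix, suffix


def doi_to_url(doi: str) -> str:
    """Build the resolvable link for a DOI (hypothetical helper)."""
    split_doi(doi)  # validate before building the link
    return DOI_RESOLVER + doi


doi = doi_from_url("https://doi.org/10.35111/4hzx-5483")
prefix, suffix = split_doi(doi)
print(doi, prefix, suffix)
```

The sketch only manipulates strings; it does not contact the resolver or confirm that a DOI is registered.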
Fall 2020 LDC Data Scholarship Program
Student applications for the Fall 2020 LDC Data Scholarship program are
being accepted now through September 15, 2020. This scholarship program
provides eligible students with no-cost access to LDC data. Students must
complete an application consisting of a data use proposal and letter of support
from their advisor.
For application requirements and program rules, visit the LDC
Data Scholarship page.
New publications:
(1) LORELEI Vietnamese
Representative Language Pack consists of Vietnamese monolingual text,
Vietnamese-English parallel text, annotations, supplemental resources, and
related software tools developed by LDC for the DARPA LORELEI program.
The LORELEI (Low Resource Languages for Emergent Incidents) program was
concerned with building human language technology for low resource languages in
the context of emergent situations like natural disasters or disease outbreaks.
Linguistic resources for LORELEI include Representative Language Packs and
Incident Language Packs for over two dozen low resource languages, comprising
data, annotations, basic natural language processing tools, lexicons, and
grammatical resources. Representative languages were selected to provide broad
typological coverage, while incident languages were selected to evaluate system
performance on a language whose identity was disclosed at the start of the
evaluation.
Data was collected in the following genres: discussion forum, news, reference,
social network, and weblogs. Data volumes are as follows:
- Over 172 million words of Vietnamese monolingual text, approximately 325,000 words of which were translated into English
- 106,000 Vietnamese words translated from English data
- 1.9 million words of found parallel text
LORELEI Vietnamese Representative Language Pack is distributed via web download.
2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
(2) DEFT Chinese Light and Rich ERE Annotation contains Chinese discussion forum web text annotated for entities, relations, and events (ERE) using the ERE Light and ERE Rich annotation schemas developed by LDC. Light ERE annotation labels entity mentions, along with the relations and events between and among those entities, for the target set of ERE types, including coreference. Rich ERE annotation expands types and tagging for ERE annotation tasks and replaces event coreference with event hopper annotation. All 157 files in this release were annotated following Light ERE guidelines; a subset of 149 files was also labeled with Rich ERE annotation.
DARPA’s Deep Exploration and Filtering of Text (DEFT) program aimed to address remaining capability gaps in state-of-the-art natural language processing technologies related to inference, causal relationships, and anomaly detection. LDC supported the DEFT program by collecting, creating, and annotating a variety of data sources.
DEFT Chinese Light and Rich ERE Annotation is distributed via web download.
2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
(3) CALLFRIEND American English – Southern Dialect Second Edition was developed by LDC and consists of approximately 26 hours of unscripted telephone conversations between native speakers of Southern dialects of American English. This second edition updates the audio files to wav format, simplifies the directory structure, and adds documentation and metadata. The first edition is available as CALLFRIEND American English-Southern Dialect (LDC96S47).
The CALLFRIEND collection was conducted by LDC in support of language identification technology development. All data in this release was collected before July 1997. Participants could speak with a person of their choice on any topic; most called family members and friends. All calls originated in North America. The recorded conversations last up to 30 minutes.
CALLFRIEND American English – Southern Dialect Second Edition is distributed via web download.
2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.