Tuesday, September 15, 2020

LDC 2020 September Newsletter

New Publications:
BOLT English PropBank and Sense – Discussion Forum, SMS/Chat and Conversational Telephone Speech
LORELEI Tigrinya Incident Language Pack
Chinese Lexical Resources for Gender, Number, Animacy

New publications:
(1) BOLT English PropBank and Sense – Discussion Forum, SMS/Chat and Conversational Telephone Speech was developed by the University of Colorado, Boulder – CLEAR (Computational Language and Education Research) and consists of PropBank and verb sense disambiguation annotation on English discussion forum (DF), SMS/Chat, and conversational telephone speech data. Annotation was applied to each predicate verb tree in LDC’s BOLT phrase structure treebanks. PropBank annotation, which provides a layer of semantic annotation over the treebank, was performed on all three genres. DF and SMS/Chat data were also annotated for verb sense disambiguation using VerbNet 3.2 classes.

The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging, and chat -- in Chinese, Egyptian Arabic, and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking, and co-reference.

BOLT English PropBank and Sense – Discussion Forum, SMS/Chat and Conversational Telephone Speech is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) LORELEI Tigrinya Incident Language Pack was developed by LDC and is comprised of approximately 4.5 million words of Tigrinya monolingual text, 25,000 words of English monolingual text, 235,000 words of parallel and comparable Tigrinya-English text, and 50,000 words of data annotated for Entity Discovery and Linking and for Situation Frames. It contains all of the text data, annotations, supplemental resources, and related software tools for the Tigrinya language that were used in the DARPA LORELEI / LoReHLT 2017 Evaluation.

The evaluation protocol was based on a scenario in which an unforeseen event triggered a need for humanitarian and logistical support in a region where the incident language had received little or no attention in NLP research. Evaluation participants provided NLP solutions, including information extraction and machine translation, with limited resources and limited development time. 

Data was collected from news, social network, weblog, newsgroup, discussion forum, and reference material. Entity Detection and Linking and Situation Frame annotations identified “entities,” “needs” (such as a need for food), and “issues” (such as civil unrest) to be detected by systems for scoring purposes. Situation frame analysis was designed to extract basic information that would be useful for planning a disaster response effort.

The knowledge base for the entity linking annotation in this corpus is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

LORELEI Tigrinya Incident Language Pack is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) Chinese Lexical Resources for Gender, Number, Animacy was developed by LDC and consists of gender, number, and animacy lexicons produced in support of the DARPA DEFT program. Gender, number, and animacy are lexical indicators useful for named entity tagging, including the detection of person mentions in text.

This corpus was created by extracting information from newswire texts in Chinese Gigaword Fifth Edition (LDC2011T13) in the following steps: (1) segmenting source documents into sentences; (2) converting any traditional Chinese script to simplified Chinese; (3) tagging all sentences for parts-of-speech; (4) developing queries to detect patterns; and (5) building lexicons based on frequency counts and entity types.
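
As a rough illustration of steps (1), (4), and (5) above, the following Python sketch splits text into sentences, applies a simple pattern query, and tallies frequency counts. It is only a toy under stated assumptions: the honorific cues in CUES, the function names, and the sample text are hypothetical, and the newsletter does not describe the actual queries LDC used. Script conversion and part-of-speech tagging (steps 2 and 3) are omitted.

```python
import re
from collections import Counter

# Hypothetical honorific cues used as a stand-in "query"; the patterns
# actually used to build the corpus are not described in the newsletter.
CUES = {"先生": "male", "女士": "female"}


def split_sentences(text):
    """Step (1): split after Chinese sentence-final punctuation."""
    return [s for s in re.split(r"(?<=[。！？])", text) if s.strip()]


def build_lexicon(text):
    """Steps (4)-(5): match cue patterns and count (name, gender) pairs."""
    counts = Counter()
    for sentence in split_sentences(text):
        for cue, gender in CUES.items():
            # Capture one to three CJK characters directly before the cue.
            pattern = r"([\u4e00-\u9fff]{1,3})" + cue
            for name in re.findall(pattern, sentence):
                counts[(name, gender)] += 1
    return counts


sample = "李女士来了。王先生说话。李女士走了。"
lexicon = build_lexicon(sample)
print(lexicon[("李", "female")], lexicon[("王", "male")])
```

Each entry pairs a surface form and a predicted feature with its frequency, loosely mirroring the description of dictionaries that carry frequency information alongside the features in question.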

The resulting resources include dictionaries of Chinese animate nominals and names; Chinese nominals and names with gender and number predicted; and other dictionaries of Chinese nominals, names, verbs, and pronouns. Each dictionary contains frequency information as well as the features in question.

DARPA's Deep Exploration and Filtering of Text (DEFT) program aimed to address remaining capability gaps in state-of-the-art natural language processing technologies related to inference, causal relationships and anomaly detection. LDC supported the DEFT program by collecting, creating and annotating a variety of data sources.

Chinese Lexical Resources for Gender, Number, Animacy is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Tuesday, August 18, 2020

LDC 2020 August Newsletter

LDC adds DOI Identifier to its Language Resources
Fall 2020 LDC Data Scholarship Program

New Publications:
LORELEI Vietnamese Representative Language Pack
DEFT Chinese Light and Rich ERE Annotation
CALLFRIEND American English – Southern Dialect Second Edition


LDC adds DOI Identifier to its Language Resources
As of July 2020, LDC’s language resources include a Digital Object Identifier (DOI), an internationally recognized identification standard for online digital material. DOIs are alphanumeric strings that correspond to URLs and metadata for specified resources. They are expressed as links that resolve to the object’s online location. For example, the DOI for Penn Parsed Corpora of Historical English LDC2020T16 is https://doi.org/10.35111/4hzx-5483, which leads users to the LDC catalog entry for this data set. To facilitate its assignment and administration of DOIs, LDC has joined DataCite, a global DOI provider for research data. (DOIs for resources released before July 2020 will be assigned through a process expected to be completed shortly.) LDC data sets now have four persistent identifiers: a unique LDC number, ISBN, ISLRN, and DOI. Adding DOIs is consistent with our aim to follow best practices for archiving and curating digital resources, evidenced by the CoreTrustSeal certification, which recognizes the LDC Catalog as a trustworthy data repository.
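
To make the link form concrete, here is a minimal Python sketch that turns a bare DOI into its resolver link and splits it into prefix and suffix. The function names are illustrative assumptions, not part of any LDC or DataCite tool; the example DOI is the one quoted above.

```python
DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi):
    """Return the resolver link for a bare DOI string."""
    return DOI_RESOLVER + doi.strip()


def split_doi(doi):
    """Split a DOI into its registrant prefix and item suffix."""
    prefix, suffix = doi.strip().split("/", 1)
    return prefix, suffix


doi = "10.35111/4hzx-5483"
print(doi_to_url(doi))  # https://doi.org/10.35111/4hzx-5483
print(split_doi(doi))   # ('10.35111', '4hzx-5483')
```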

Fall 2020 LDC Data Scholarship Program
Student applications for the Fall 2020 LDC Data Scholarship program are being accepted now through September 15, 2020. This scholarship program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and letter of support from their advisor.

For application requirements and program rules, visit the LDC Data Scholarship page.

 


New publications:
(1) LORELEI Vietnamese Representative Language Pack consists of Vietnamese monolingual text, Vietnamese-English parallel text, annotations, supplemental resources, and related software tools developed by LDC for the DARPA LORELEI program.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations like natural disasters or disease outbreaks. Linguistic resources for LORELEI include Representative Language Packs and Incident Language Packs for over two dozen low resource languages, comprising data, annotations, basic natural language processing tools, lexicons, and grammatical resources. Representative languages were selected to provide broad typological coverage, while incident languages were selected to evaluate system performance on a language whose identity was disclosed at the start of the evaluation.

Data was collected in the following genres: discussion forum, news, reference, social network, and weblogs. Data volumes are as follows:

  • Over 172 million words of Vietnamese monolingual text, approximately 325,000 words of which were translated into English
  • 106,000 Vietnamese words translated from English data
  • 1.9 million words of found parallel text
Approximately 75,000 words were annotated for named entities and up to 25,000 words contain additional annotation, including situation frames (identifying entities, needs, and issues) and entity linking and detection.

LORELEI Vietnamese Representative Language Pack is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) DEFT Chinese Light and Rich ERE Annotation contains Chinese discussion forum web text annotated for entities, relations, and events (ERE) using the ERE Light and ERE Rich annotation schemas developed by LDC. Light ERE annotation labels entity mentions for the target set of ERE types between and among those entities, including coreference. Rich ERE annotation expands types and tagging for ERE annotation tasks and replaces event coreference with event hopper annotation. All files in this release (157) were annotated following Light ERE guidelines; a subset (149) were also labeled with Rich ERE annotation.

DARPA’s Deep Exploration and Filtering of Text (DEFT) program aimed to address remaining capability gaps in state-of-the-art natural language processing technologies related to inference, causal relationships, and anomaly detection. LDC supported the DEFT program by collecting, creating, and annotating a variety of data sources.

DEFT Chinese Light and Rich ERE Annotation is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) CALLFRIEND American English – Southern Dialect Second Edition was developed by LDC and consists of approximately 26 hours of unscripted telephone conversations between native speakers of Southern dialects of American English. This second edition updates the audio files to wav format, simplifies the directory structure, and adds documentation and metadata. The first edition is available as CALLFRIEND American English-Southern Dialect (LDC96S47).

The CALLFRIEND collection was conducted by LDC in support of language identification technology development. All data in this release was collected before July 1997. Participants could speak with a person of their choice on any topic; most called family members and friends. All calls originated in North America. The recorded conversations last up to 30 minutes.

CALLFRIEND American English – Southern Dialect Second Edition is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Wednesday, July 15, 2020

LDC 2020 July Newsletter

Penn Parsed Corpora of Historical English Now Available From LDC
Fall 2020 LDC Data Scholarship Program 


New Publications:
Speech Sentiment Annotations
Penn Parsed Corpora of Historical English
IARPA Babel Javanese Language Pack IARPA-babel402b-v1.0b
BOLT Chinese-English Word Alignment and Tagging -- Conversational Telephone Speech Training
____________________________________________________________

Penn Parsed Corpora of Historical English Now Available From LDC

LDC is pleased to announce that the Penn Parsed Corpora of Historical English (LDC2020T16) – an important community resource for 20 years – is now available for licensing in the LDC Catalog. Developed by University of Pennsylvania researchers in the Linguistics Department under the direction of Professor Anthony Kroch, this data set consists of syntactic annotation of English prose texts from the earliest Middle English documents (1100 CE) up to the period of the First World War (1914 CE) represented in three corpora:

  • The Penn-Helsinki Parsed Corpus of Middle English, second edition
  • The Penn-Helsinki Parsed Corpus of Early Modern English
  • The Penn Parsed Corpus of Modern British English, second edition
This release also includes annotation guidelines and philological information for each corpus, as well as the CorpusSearch 2 program, which allows users to search the data for words, word sequences, and syntactic structure.

In addition to being of value to students and scholars of the history of English, this data set is useful to computational linguists for domain adaptation. More information about this project is available from the Penn Parsed Corpora of Historical English homepage.

Current licensees should contact LDC’s membership office with any questions regarding access to this data set.

Fall 2020 LDC Data Scholarship Program

Student applications for the Fall 2020 LDC Data Scholarship program are being accepted now through September 15, 2020. This scholarship program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and letter of support from their advisor.

For application requirements and program rules, please visit the LDC Data Scholarship page.
____________________________________________________________

New publications:

(1) Speech Sentiment Annotations was developed by Google Inc. and consists of sentiment labels (positive, negative, neutral) for approximately 49,500 utterances covering 140 hours of audio from Switchboard-1 Release 2 (LDC97S62).

Switchboard speech files were segmented based on the start and end time of transcript turns. Annotators listened to the audio corresponding to each segment (utterance) and classified each into positive, negative or neutral categories based on the emotion and attitude of the speaker. Annotators provided a justification for positive and negative classifications using a flow chart. Further information about the methodology and annotation process is contained in the documentation accompanying this release.
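
The turn-based segmentation described above can be pictured with a short Python sketch. It assumes audio is held as a flat sequence of samples and that each turn is a (start, end) pair in seconds; the function name, the toy sample rate, and the turn times are hypothetical and are not taken from the release documentation.

```python
def segment_by_turns(samples, sample_rate, turns):
    """Return one slice of samples per (start_sec, end_sec) transcript turn."""
    segments = []
    for start_sec, end_sec in turns:
        # Convert turn boundaries from seconds to sample indices.
        start = int(round(start_sec * sample_rate))
        end = int(round(end_sec * sample_rate))
        segments.append(samples[start:end])
    return segments


# Toy data: 10 seconds of "audio" at 8 samples per second (hypothetical).
samples = list(range(80))
turns = [(0.0, 2.5), (2.5, 4.0)]
segments = segment_by_turns(samples, 8, turns)
print([len(s) for s in segments])  # [20, 12]
```

Each resulting segment corresponds to one utterance that an annotator would then label as positive, negative, or neutral.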

Switchboard-1 Release 2 (LDC97S62) consists of 260 hours of telephone speech from 543 speakers across the United States (302 male speakers, 241 female speakers). A computer-driven telephone collection platform paired two subjects for each conversation and provided a discussion topic, ensuring that no two speakers conversed together more than once and no one speaker talked more than once on a given topic.

Speech Sentiment Annotations is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) Penn Parsed Corpora of Historical English was developed at the University of Pennsylvania and consists of running texts and text samples of British English prose from the earliest Middle English documents (1100 CE) up to the period of the First World War (1914 CE). This data set contains three corpora covering traditionally recognized periods of English: 
  • The Penn-Helsinki Parsed Corpus of Middle English, second edition
  • The Penn-Helsinki Parsed Corpus of Early Modern English
  • The Penn Parsed Corpus of Modern British English, second edition
The texts are in three forms: plain text, part-of-speech tagged text, and syntactically annotated text. This release also includes annotation guidelines, philological information for each corpus and the CorpusSearch 2 program, which allows users to search the data for words, word sequences and syntactic structure.

The Penn Parsed Corpora of Historical English were designed for students and scholars of the history of English, especially the historical syntax of the language. They have also been used by computational linguists for domain adaptation. See the Penn Parsed Corpora of Historical English homepage for more information about this project.

Penn Parsed Corpora of Historical English is distributed via web download.

2020 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
* 

(3) IARPA Babel Javanese Language Pack IARPA-babel402b-v1.0b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 204 hours of Javanese conversational and scripted telephone speech collected in 2014 and 2015 along with corresponding transcripts. 

The Javanese speech in this release represents the Central, Western, and Eastern Javanese dialect regions of Indonesia. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 65 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Javanese Language Pack IARPA-babel402b-v1.0b is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

* 
(4) BOLT Chinese-English Word Alignment and Tagging -- Conversational Telephone Speech Training was developed by LDC and consists of 158,651 words of Chinese and English parallel text enhanced with linguistic tags to indicate word relations. 

The source data in this release consists of transcripts of Chinese conversational telephone speech (CTS) from LDC's CALLHOME and CALLFRIEND collections (LDC96S34, LDC96T16, LDC96S55) that were translated into English by professional translation agencies and annotated for the word alignment task.

The BOLT word alignment task was built on treebank annotation. LDC automatically extracted Chinese source tokens, including empty categories/traces, from word-segmented files provided by the BOLT Chinese Treebank annotation team at Brandeis University. The word-segmented tokens were then used to automatically generate ctb (Chinese Treebank) alignment and were also tokenized for character alignment by inserting white spaces to separate characters.
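
The character-level tokenization mentioned above (inserting white spaces to separate characters) can be sketched in a few lines of Python. This is only an illustration, not LDC's tool: the function name and sample tokens are assumptions, and the sketch does not give empty categories or traces any special treatment, since the newsletter does not say how those were handled.

```python
def char_tokenize(tokens):
    """Separate every character of the word-segmented tokens with one space."""
    return " ".join(ch for token in tokens for ch in token)


word_segmented = ["我们", "喜欢", "音乐"]
print(" ".join(word_segmented))       # 我们 喜欢 音乐
print(char_tokenize(word_segmented))  # 我 们 喜 欢 音 乐
```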

BOLT Chinese-English Word Alignment and Tagging -- Conversational Telephone Speech Training is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

Monday, June 22, 2020

LDC 2020 June Newsletter

LDC Releases LORELEI Language Packs for COVID-19 Research


LDC Releases LORELEI Language Packs for COVID-19 Research 

The COVID-19 pandemic has highlighted the importance of data-driven solutions to facilitate rapid response and humanitarian relief, and its global nature demonstrates the need for multi-language resources. To aid in this effort, LDC is releasing data it developed in the DARPA LORELEI program under a special no-cost license for COVID-19 research.

These resources are available in a single corpus:

LDC2020E21 LORELEI Language Packs for COVID-19 Research

This data set includes representative language packs and incident language packs for over two dozen low resource languages, comprising data, annotations, basic natural language processing tools, lexicons and grammatical resources.

For further information about this corpus and licensing terms, see COVID-19 Research.
___________________________________________________________________________

New publications: 

(1) CALLFRIEND Mandarin Chinese-Taiwan Dialect Second Edition was developed by LDC and consists of approximately 27 hours of unscripted telephone conversations between native speakers of the Taiwan dialect of Mandarin Chinese. This second edition updates the audio files to wav format, simplifies the directory structure and adds documentation and metadata. The first edition is available as CALLFRIEND Mandarin Chinese-Taiwan Dialect (LDC96S56).

The CALLFRIEND collection was conducted by LDC in support of language identification technology development. All data in this release was collected before July 1997. Participants could speak with a person of their choice on any topic; most called family members and friends. All calls originated in North America. The recorded conversations last up to 30 minutes.

CALLFRIEND Mandarin Chinese-Taiwan Dialect Second Edition is distributed via web download. 

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*
(2) SemTransCNC, developed by The Hong Kong Polytechnic University, is a semantic transparency dataset of Chinese nominal compounds built using a series of crowd-based experiments. It contains overall semantic transparency (OST) and constituent semantic transparency (CST) data for 1,176 dimorphemic Chinese nominal compounds, which consist of free morphemes and have mid-range frequencies.

Nominal compounds were selected from the Sinica Corpus and a modern Chinese lexicon. Crowd workers answered questionnaires that included demographic information and questions about the Chinese language. For assessing OST of selected compounds, they answered the question: "How is the sum of the meanings of A and B similar to the meaning of AB?" For assessing CST, they were asked to describe the similarity of A alone to its meaning in AB and the meaning of B alone to its meaning in AB.

SemTransCNC is distributed via web download.

2020 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*
(3) TAC KBP English Event Nugget Detection and Coreference - Comprehensive Training and Evaluation Data 2014-2015 was developed by LDC and contains training and evaluation data produced in support of the TAC KBP English Event Nugget Detection and Coreference tasks in 2014 and 2015.

This release includes source documents, gold standard event nugget annotations in multiple formats, coreference information for the nuggets, and tokenized source documents. Source data consists of English newswire and discussion forum text collected by LDC.

The goal of the Event Nugget track was to evaluate system performance on the detection and coreference of sets of attributes referencing events in unstructured text. Event Nuggets consist of a mention of the event from the text and labels to indicate event type, subtype, and realis (whether or not an event has actually occurred).

TAC KBP English Event Nugget Detection and Coreference - Comprehensive Training and Evaluation Data 2014-2015 is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

Friday, May 15, 2020

LDC 2020 May Newsletter

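New Publications:
LORELEI Oromo Incident Language Pack
LORELEI Entity Detection and Linking Knowledge Base
BOLT English Translation Treebank - Chinese Discussion Forum
Multi-Language Conversational Telephone Speech 2011 -- Mandarin Chinese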
_______________________________________________________________ 

New publications: 

(1) LORELEI Oromo Incident Language Pack was developed by LDC and is comprised of approximately 3.9 million words of Oromo monolingual text, 25,000 words of English monolingual text, 135,000 words of parallel and comparable Oromo-English text, and 50,000 words of data annotated for Entity Discovery and Linking and Situation Frames. It contains all of the text data, annotations, supplemental resources and related software tools for the Oromo language that were used in the DARPA LORELEI / LoReHLT 2017 Evaluation. 

The evaluation protocol was based on a scenario in which an unforeseen event triggered a need for humanitarian and logistical support in a region where the incident language had received little or no attention in NLP research. Evaluation participants provided NLP solutions, including information extraction and machine translation, with limited resources and limited development time.

Data was collected from news, social network, weblog, newsgroup, discussion forum, and reference material. Entity Detection and Linking and Situation Frame annotations identified “entities,” “needs” (such as a need for food) and “issues” (such as civil unrest) to be detected by systems for scoring purposes. Situation frame analysis was designed to extract basic information that would be useful for planning a disaster response effort. 

The knowledge base for the entity linking annotation in this corpus is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

LORELEI Oromo Incident Language Pack is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) LORELEI Entity Detection and Linking Knowledge Base was developed by LDC and contains the full LORELEI Entity Detection and Linking (EDL) Knowledge Base (KB) used for all LORELEI Representative Language and Incident Language Pack entity linking annotation. The LORELEI (Low Resource Languages for Emergent Incidents) Program was concerned with building human language technology for low resource languages in the context of emergent situations like natural disasters or disease outbreaks.

The KB in this release supported the EDL task in LORELEI for four entity types -- geo-political entities (GPE), locations (LOC), persons (PER) and organizations (ORG) -- and contains a total of 10,216,832 entities. There are four inputs to the KB, each designated by a unique "origin" code in the KB, as follows: GPE and LOC entities from a snapshot of GeoNames, PER entities from the CIA World Leaders List, ORG entities from Appendix B of the CIA World Factbook, and additional entities manually created by LDC for each of the representative and incident languages in the LORELEI Program. 

LORELEI Entity Detection and Linking Knowledge Base is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) BOLT English Translation Treebank - Chinese Discussion Forum was developed by LDC and consists of 147,432 tokens of web discussion forum data translated from Chinese to English and annotated for part-of-speech and syntactic structure. 

The source data is Chinese discussion forum web text collected by LDC in 2011 and 2012, translated into English and released in BOLT Chinese Discussion Forum Parallel Training Data (LDC2017T05). A subset of the translated text -- 148 files representing 147,432 tokens -- was selected for the treebank and annotated for word-level tokenization, part-of-speech and syntactic structure. Only the translated English text is included in the source data for this release. 

Part-of-speech and treebank annotation conformed to Penn Treebank II style, incorporating changes to those guidelines that were developed under the GALE (Global Autonomous Language Exploitation) program. Supplementary guidelines for English treebanks and web text are included with this release.

BOLT English Translation Treebank - Chinese Discussion Forum is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(4) Multi-Language Conversational Telephone Speech 2011 -- Mandarin Chinese was developed by LDC and is comprised of approximately 25 hours of telephone speech in Mandarin Chinese.

The data was collected primarily to support research and technology evaluation in automatic language identification, and portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation (LRE). Participants were recruited by native speakers who contacted acquaintances in their social network. Those native speakers made one call, up to 15 minutes, to each acquaintance. The data was collected using LDC's telephone collection infrastructure, comprised of three computer telephony systems. Human auditors labeled calls for callee gender, dialect type, and noise.

LDC has also released the following as part of the Multi-Language Conversational Telephone Speech 2011 series:

Multi-Language Conversational Telephone Speech 2011 -- Mandarin Chinese is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee. 

*