Monday, April 17, 2023

LDC April 2023 Newsletter

In memoriam: Christopher Cieri 1963-2023

New publications:

Penn Korean Universal Dependency Treebank


DEFT English Light and Rich ERE Annotation

______________________________________________________________

In memoriam: Christopher Cieri 1963-2023

With deep sadness, LDC announces the passing of Christopher Cieri, our Executive Director. Chris led the Consortium for over 25 years, guiding its evolution from a small data repository and research hub to a prominent global data center. 

An accomplished linguist and computer scientist and a well-read humanist, Chris embodied the best qualities for executing the wide range of duties demanded by his leadership role. He was a valued colleague and friend and will be sorely missed.

All are welcome to visit our remembrance page for Chris.


New publications:
 
Penn Korean Universal Dependency Treebank contains 5,010 sentences and 132,041 tokens annotated in dependency format under the Universal Dependencies framework. It is a conversion of Korean Treebank Annotations Version 2.0 (LDC2006T09), which was produced in constituency format.

The source text consists of newswire stories from LDC’s Korean Press Agency collection contained in Korean Newswire (LDC2000T45). Sentences were automatically converted for dependency annotation, and the output was manually checked. The corpus contains 112 files in CoNLL-U format, the Universal Dependencies standard, with a mapping to their counterparts in LDC2006T09.
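
For readers new to CoNLL-U, the following is a minimal sketch of reading such files in Python, not LDC's own tooling. It assumes only the standard CoNLL-U layout (ten tab-separated columns, `#` comment lines, blank lines between sentences). The function names and the two-sentence English sample are invented for illustration and are not taken from the corpus.

```python
from __future__ import annotations

from typing import Iterable, Iterator

# The ten columns defined by the CoNLL-U format, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def read_conllu(lines: Iterable[str]) -> Iterator[list[dict[str, str]]]:
    """Yield each sentence as a list of token rows (dicts keyed by FIELDS)."""
    sentence: list[dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments such as "# sent_id = ..." are skipped.
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        # Handle a final sentence that is not followed by a blank line.
        yield sentence


def count_sentences_and_words(
    sentences: Iterable[list[dict[str, str]]],
) -> tuple[int, int]:
    """Count sentences and rows with plain integer IDs.

    Rows with range IDs ("1-2") or decimal IDs ("1.1") are not counted.
    """
    n_sentences = n_words = 0
    for sent in sentences:
        n_sentences += 1
        n_words += sum(1 for tok in sent if tok["id"].isdigit())
    return n_sentences, n_words


# Invented sample data, used only to exercise the functions above.
SAMPLE = (
    "# sent_id = demo-1\n"
    "1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_\n"
    "2\tdog\tdog\tNOUN\t_\t_\t3\tnsubj\t_\t_\n"
    "3\tbarks\tbark\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "# sent_id = demo-2\n"
    "1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\t_\n"
)

if __name__ == "__main__":
    print(count_sentences_and_words(read_conllu(SAMPLE.splitlines())))
```

To apply the same functions to a file, pass an open text handle (for example `open(path, encoding="utf-8")`) to `read_conllu`. How the token count published for this corpus was computed is not stated here, so totals from this sketch may not match it exactly.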

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

*

DEFT English Light and Rich ERE Annotation was developed by LDC and consists of 1,190 English discussion forum, newswire, and proxy documents annotated for entities, relations, and events (ERE). Light ERE annotation labels mentions of a target set of entity types, together with the relations and events between and among those entities, including coreference. Rich ERE annotation expands the types and tagging in the entity, relation, and event annotation tasks and replaces strict event coreference with a more loosely defined event hopper annotation.
 
Of these, 902 documents were annotated following Light ERE annotation guidelines, and 288 documents were labeled with Rich ERE annotation in a second pass after being annotated for Light ERE. The source data consists of English discussion forum web text collected by LDC for the DARPA BOLT program and contained in BOLT English Discussion Forums (LDC2017T11); newswire documents published in various data sets released in the Text Analysis Conference Knowledge Base Population (TAC KBP) project; and proxy documents, intended to mimic government analysis reports of newswire content, published in DEFT Narrative Text (LDC2016T07).

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.