Linguistic Data Consortium: May 2024

Thursday, May 16, 2024

LDC May 2024 Newsletter

LDC at LREC-COLING 2024

New publications:
Call My Net 1

Automatic Content Extraction for Portuguese

______________________________________________________________________

LDC at LREC-COLING 2024
LDC will be exhibiting at LREC-COLING 2024 hosted by the European Language Resources Association (ELRA) and the International Committee on Computational Linguistics (ICCL) May 20-25 in Turin, Italy. Stop by our table to learn more about recent developments at the Consortium and the latest publications.

LDC staff members will also be presenting current work on topics including Spanless Event Annotation for Corpus-Wide Complex Event Understanding, Schema Learning Corpus: Data and Annotation Focused on Complex Events, and KoFREN: Comprehensive Korean Word Frequency Norms Derived from Large Scale Free Speech Corpora.

LDC will post conference updates via social media. We look forward to seeing you in Italy!

New publications:

Call My Net 1 was developed by LDC and contains 364 hours of conversational telephone speech in four languages (Tagalog, Cebuano, Cantonese and Mandarin) collected in 2015 from 221 native speakers located in the Philippines and China along with metadata and speaker demographic information. Recordings and data from this collection were used to support the NIST 2016 Speaker Recognition Evaluation.

Speakers made 10 telephone calls each to people within their existing social networks, using different handsets and under a variety of noise conditions. Speakers were connected through a robot operator to carry on casual conversations on topics of their choice. All recordings were manually audited to confirm language and speaker requirements. The documentation for this release includes metadata about phone type, noise conditions and call quality. Speaker demographic information on year of birth, sex and native language is also included.

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Automatic Content Extraction for Portuguese was developed at INESC TEC - Instituto de Engenharia de Sistemas e Computadores, Tecnologia e Ciência and consists of automatic Brazilian Portuguese and European Portuguese translations of the English text and annotations in ACE 2005 Multilingual Training Corpus (LDC2006T06).

ACE 2005 Multilingual Training Corpus was developed by LDC to support the Automatic Content Extraction (ACE) program, specifically by providing training data for the 2005 technology evaluation. It contains 1,800 files of mixed genre text in Arabic, English and Chinese annotated for entities, relations and events. The objective of the ACE program was to develop automatic content extraction technology to support automatic processing of human language in text form. Text genres included newswire, broadcast news, broadcast conversation, weblog, discussion forums, and conversational telephone speech.

For this translation, the English data was partitioned into training, development and test sets. The documents were split into sentences and each event mention was assigned to its sentence. Source sentences and their annotations were translated into Brazilian Portuguese using Google Translate and into European Portuguese using DeepL Translate. An alignment algorithm and a parallel corpus word aligner were used to handle mismatches between translated annotations and their translated sentences.
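As a rough illustration of the sentence-assignment step described above, the following Python sketch attaches each event mention to the sentence that contains it, using character offsets. It is not the code used to build the corpus: the naive punctuation-based splitter, the `Sentence` class, the function names and the `(start, end)` offset representation of a mention are all assumptions made for this example.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Sentence:
    start: int  # offset of the first character (inclusive)
    end: int    # offset just past the last character (exclusive)
    text: str


def split_sentences(document: str) -> list[Sentence]:
    """Naive splitter (assumption): a sentence ends at '.', '!' or '?'."""
    sentences = []
    start = 0
    for i, ch in enumerate(document):
        if ch in ".!?":
            raw = document[start:i + 1]
            stripped = raw.lstrip()
            if stripped:
                # Skip leading whitespace so offsets point at real text.
                begin = start + len(raw) - len(stripped)
                sentences.append(Sentence(begin, i + 1, stripped))
            start = i + 1
    # Keep any trailing text that lacks final punctuation.
    raw = document[start:]
    stripped = raw.strip()
    if stripped:
        begin = start + len(raw) - len(raw.lstrip())
        sentences.append(Sentence(begin, begin + len(stripped), stripped))
    return sentences


def assign_mentions(sentences, mentions):
    """Map each sentence index to the (start, end) mentions it fully contains."""
    starts = [s.start for s in sentences]
    assigned = {i: [] for i in range(len(sentences))}
    for m_start, m_end in mentions:
        # Last sentence whose start is at or before the mention start.
        idx = bisect_right(starts, m_start) - 1
        if idx >= 0 and m_end <= sentences[idx].end:
            assigned[idx].append((m_start, m_end))
    return assigned


doc = "Troops attacked the town. The mayor resigned on Friday."
sents = split_sentences(doc)
# Hypothetical event trigger spans: "attacked" and "resigned".
result = assign_mentions(sents, [(7, 15), (36, 44)])
```

In this sketch, a mention that crosses a sentence boundary is left unassigned, which is one simple way to flag cases needing separate handling.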

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.