Linguistic Data Consortium: December 2024

Monday, December 16, 2024

LDC December 2024 Newsletter

LDC 2025 membership discounts now available

Approaching deadline for Spring 2025 data scholarship applications

LDC closed for Winter Break December 25-January 1

New publications:

MATERIAL Farsi-English Language Pack
Abstract Meaning Representation 3.0 - Machine Translations

LDC 2025 membership discounts now available
Now through March 3, 2025, current 2024 members receive a 10% discount for renewing their membership, and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for details on membership options and benefits.

Approaching deadline for Spring 2025 data scholarship applications
Attention students: don’t miss out on the chance to receive no-cost access to LDC data for your research. Applications for Spring 2025 data scholarships are due January 15, 2025. For more information on requirements and program rules, see LDC Data Scholarships.

LDC closed for Winter Break December 25-January 1
LDC will be closed from Wednesday, December 25, 2024, through Wednesday, January 1, 2025, in accordance with the University of Pennsylvania Winter Break Policy. Our offices will reopen on Thursday, January 2, 2025. Requests received by the Membership Office during Winter Break will be processed when the office reopens.

New publications:
MATERIAL Farsi-English Language Pack was developed by Appen for the IARPA MATERIAL program and contains 61 hours of Farsi conversational telephone speech, transcripts, English translations, annotations, and queries. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments. Transcripts cover approximately 30% of the speech files, and approximately 3% of the speech files were translated into English. This release also includes English queries and their relevance annotations.

The MATERIAL program focused on underserved languages, with the ultimate goal of building cross-language information retrieval systems that find speech and text content using English search queries.

2024 members can access this corpus through their LDC accounts provided they have submitted a completed copy of the special license agreement. Non-members may license this data for a fee.

Abstract Meaning Representation 3.0 - Machine Translations was developed by the Center for Computational Linguistics at KU Leuven in the HORIZON2020 project SignON. It is an automatic translation of a subset of sentences from Abstract Meaning Representation (AMR) Annotation Release 3.0 (LDC2020T02) into Spanish, Irish Gaelic, and Dutch.

AMR 3.0 training, development, and test splits were translated using Google Translate. "Unsplit" directories were not translated and are not included in this release. Translations were not manually verified, but formal issues (such as unexpected line breaks) were corrected, and special tokens and encoding issues were fixed with the fix_text function from the Python library ftfy.
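For readers who want to apply similar cleanup to their own machine-translated text, the following is a minimal sketch of that kind of post-processing. It is not the script used to prepare this release, whose details are not described here. The function name clean_translation and the line-break rule are assumptions made for illustration; ftfy.fix_text is the library call named above, and the sketch skips it if ftfy is not installed.

```python
import re

try:
    import ftfy  # third-party package: pip install ftfy
except ImportError:  # the sketch still runs without it, minus the encoding repair
    ftfy = None


def clean_translation(text: str) -> str:
    """Hypothetical helper: keep one sentence per line and repair encoding damage."""
    # Assumption: a line break inside a translated sentence is unintended,
    # so any run of line breaks (plus surrounding whitespace) becomes one space.
    single_line = re.sub(r"\s*[\r\n]+\s*", " ", text).strip()
    # ftfy.fix_text repairs common encoding problems such as mojibake.
    if ftfy is not None:
        single_line = ftfy.fix_text(single_line)
    return single_line


if __name__ == "__main__":
    print(clean_translation("De jongen wil\nnaar huis gaan."))
```

Applying the helper to each translated sentence before writing it out keeps the translations aligned line by line with the English source sentences.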

AMR 3.0 is a semantic treebank of over 59,000 English natural language sentences drawn from material collected by LDC: discussion forum text from the DARPA BOLT and DARPA DEFT programs, transcripts and English translations of Mandarin Chinese broadcast news programming, Wall Street Journal text, translated Xinhua news texts, various newswire texts from NIST OpenMT evaluations, and weblog data from the DARPA GALE program.

2024 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.