LDC/Penn receives US Dept of Education research grant
Membership year 2025 publication preview
Fall 2024 data scholarship recipients
New publications:
__________________________________________________________________
LDC/Penn receives US Dept of Education research grant
LDC and Penn’s Graduate School of Education and Department of Computer and Information Science are part of a team that was recently awarded a $10 million grant from the US Department of Education to develop the Using Generative Artificial Intelligence for Reading R&D Center (U-GAIN Reading), which will explore using generative AI to improve elementary school reading instruction for English learners. Led by the education nonprofit Digital Promise, U-GAIN Reading will build on an existing research-based tutoring platform, Amira Learning, that is used by more than 1 million students each year. The LDC/Penn team will contribute expertise in computational linguistics, computer science, and learning analytics. An evaluation team at MDRC will measure learner outcomes both to improve the R&D and to benchmark its eventual impacts. Additional experts in the science of reading, ethics, and strategies for national impact will support the project’s work. Data developed in the project will be shared with the community through the LDC Catalog.
Membership year 2025 publication preview
The 2025 membership year is approaching and plans for next year’s publications are in progress. Among the expected releases are:
- Iraqi Arabic – English Lexical Database: a set of six interrelated tables (roots, lemmas, wordforms, multi-word expressions, English definitions, example phrases) presenting each Iraqi Arabic word in Arabic script and IPA format, a result of LDC’s collaboration with Georgetown University Press to enhance and update three dialectal Arabic dictionaries
- AIDA topic source data and annotations: multimodal source data and annotations in multiple languages (Russian, English, Spanish) for information and entity extraction
- 2015 NIST Language Recognition Evaluation Test Set: 164,000+ segments of conversational telephone speech and broadcast narrow band speech in six linguistic varieties (Arabic, Spanish, English, Chinese, Slavic, French) representing 20 languages, used in NIST’s 2015 language recognition evaluation
- BOLT CALLFRIEND CALLHOME CTS Audio, Transcripts and Translations: previously unpublished Chinese and Egyptian Arabic telephone conversations from the CALLFRIEND and CALLHOME collections, with transcripts and translations developed by LDC for the DARPA BOLT program
- Chinese Sentence Pattern Structure Treebank: 5,000+ sentences from ancient and modern Chinese texts with syntactic annotation based on sentence constituent analysis, developed by Beijing Normal University and Peking University
- IARPA MATERIAL language packs: conversational telephone speech, transcripts, English translations, annotations and queries in multiple languages (e.g., Georgian, Kazakh, Lithuanian)
- LORELEI: representative and incident language packs containing monolingual text, bi-text, translations, annotations, supplemental resources and related tools in various languages (e.g., Hungarian, Hindi, Amharic, Somali)
Fall 2024 data scholarship recipients
Congratulations to the recipients of LDC's Fall 2024 data scholarships:
Yomma Gamaleldin: Alexandria University (Egypt): Master’s student, Computer and Systems Engineering Department. Yomma is awarded a copy of Qatari Corpus of Argumentative Writing LDC2022T04 for her work in Arabic automated essay scoring.
Ahrane Mahaganapathy: Jaffna University (Sri Lanka): Master’s student, Department of Computer Science. Ahrane is awarded copies of IARPA Babel Tamil Language Pack LDC2017S13 and Multi-Language Telephone Speech 2011 – South Asian LDC2017S14 for her work in Tamil speech-to-text systems.
Sivashanth Suthakar: Jaffna University (Sri Lanka): Master’s student, Department of Computer Science. Sivashanth is awarded copies of CAMIO Transcription Languages LDC2022T07 and LORELEI Tamil Representative Language Pack LDC2023T03 for his work in Tamil OCR systems.
Oshan Yalegama: University of Moratuwa (Sri Lanka): BSc, Electronic and Telecommunication Engineering. Oshan is awarded copies of CSR-I (WSJ0) Complete LDC93S6A and TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1 for his work in audio signal processing.
Samer Mohammed Yaseen: Sana’a University (Yemen): PhD candidate, Faculty of Computer and Information Technology. Samer is awarded a copy of Arabic Newswire Part 1 LDC2001T55 for his work in Arabic information retrieval.
New publications
MultiTACRED was developed by the German Research Center for Artificial Intelligence (DFKI) Speech and Language Technology Lab. It is a machine translation of the TAC Relation Extraction Dataset (TACRED) (LDC2018T24) into twelve languages, with projected entity annotations. TACRED is a large-scale relation extraction dataset containing 106,264 examples built over English newswire and web text used in the NIST TAC KBP English slot filling evaluations from 2009 to 2014. The training and evaluation data for the TAC KBP slot filling tasks was developed by the Linguistic Data Consortium.