Linguistic Data Consortium: 2023 data scholarships

Monday, October 16, 2023

LDC October 2023 Newsletter

Membership Year 2024 publication preview

Fall 2023 data scholarship recipients

New publications:

AIDA Scenario 1 Practice Topic Source Data

AIDA Scenario 1 and 2 Reference Knowledge Base

_______________________________________________________________

Membership Year 2024 publication preview
The 2024 membership year is approaching and plans for next year’s publications are in progress. Among the expected releases are:

KASET: 147 hours of Sorani Kurdish and Kurmanji Kurdish conversational telephone speech and web broadcasts, 65 hours transcribed

AIDA topic source data and annotations: multimodal source data and annotations in multiple languages (Russian, Ukrainian, English, Spanish) for information and entity extraction

RATS Low Speech Density Data: 87 hours of Levantine Arabic, English, Persian, Pushto, and Urdu audio files selected from RATS speech activity detection and keyword spotting data sets, also including communications systems sounds and silence

Call My Net 1: 364 hours of conversational telephone speech recordings in Tagalog, Cebuano, Cantonese and Mandarin from speakers in the Philippines and China using various handsets under diverse noise conditions

Ravnursson Faroese Speech and Transcripts: 109 hours of read speech from 433 native speakers with transcripts

Diaspora Tibetan Speech: elicited, read and spontaneous speech from 73 native Tibetan speakers in Katmandu’s diaspora Tibetan community, some recordings transcribed

IARPA MATERIAL language packs: conversational telephone speech, transcripts, English translations, annotations and queries in multiple languages (e.g., Bulgarian, Somali, Georgian)

LORELEI: representative and incident language packs containing monolingual text, bi-text, translations, annotations, supplemental resources and related tools in various languages (e.g., Farsi, Hungarian, Hindi, Amharic)

Check your inbox in the coming weeks for more information about membership renewal. 

Fall 2023 data scholarship recipients
Congratulations to the recipients of LDC's Fall 2023 data scholarships:

Nessma Diab: Ain-Shams University (Egypt): Pre-PhD student, Linguistics. Nessma is awarded copies of CALLHOME Egyptian Arabic Speech LDC97S45 and CALLHOME Egyptian Arabic Transcripts LDC97T10 for her work in machine translation.

Soheir Elssakkout: Ain-Shams University (Egypt): PhD candidate. Soheir is awarded copies of Turkish Broadcast News and Transcripts LDC2012S06 and Middle East Technical University Turkish Microphone Speech v 1.0 LDC2006S33 for her work in speech recognition.

Metheus Franco: Witten/Herdecke University (Germany): Post-doctoral scholar, Faculty of Management, Economics and Society. Metheus is awarded a copy of Avocado Research Email Collection LDC2015T03 for his work in emotional foundations of dynamic capabilities.

Kamal Jarrar: Birzeit University (Palestine): Master’s student, Applied Statistics and Data Science Program. Kamal is awarded copies of Arabic Gigaword Fifth Edition LDC2011T11 and BOLT Arabic Discussion Forums LDC2018T10 for his work in part-of-speech tagging for dialectal Arabic.

Minkyoung Kim: Yonsei University (Korea): PhD candidate, Graduate School of Information. Minkyoung is awarded a copy of The New York Times Annotated Corpus LDC2018T19 for her work in event extraction and semantic event annotation.

Humaira Mehmood: Fatima Jinnah Women University (Pakistan): Master’s student, Computer Sciences. Humaira is awarded a copy of ARL Urdu Speech Database, Training Data LDC2007S03 for her work in machine translation.

Diyam Mousa: Birzeit University (Palestine): PhD candidate, Computer Science Department. Diyam is awarded copies of Arabic Treebank: Part 3 v. 3.2 LDC2010T08 and BOLT Egyptian Arabic Treebank – Discussion Forum LDC2018T23 for her work in morphological tagging for dialectal Arabic.

For information about the program, visit the Data Scholarships page.

New publications:

AIDA Scenario 1 Practice Topic Source Data was developed by LDC and comprises 1511 files (text, image, and video) from English, Russian, and Ukrainian web sources. Each phase of the AIDA program centered on a specific scenario, or broad topic area, with related subtopics designated as either practice subtopics or evaluation subtopics. The Phase 1 scenario focused on political relations between Russia and Ukraine in the 2010s. This corpus constitutes the full set of topic-focused documents for Phase 1 practice subtopics. Data was collected from web sources by a combination of automatic and manual processes.

The DARPA AIDA (Active Interpretation of Disparate Alternatives) program aimed to develop a multi-hypothesis semantic engine to generate explicit alternative interpretations of events, situations and trends from a variety of unstructured sources. LDC supported AIDA by collecting, creating and annotating multimodal linguistic resources in multiple languages.

The knowledge base for entity detection and linking annotation for all AIDA Scenario 1 and 2 corpora is available separately as AIDA Scenario 1 and 2 Reference Knowledge Base (LDC2023T10).

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

AIDA Scenario 1 and 2 Reference Knowledge Base contains the English knowledge base (KB) used for all AIDA entity linking annotation in Scenario 1 (Russia-Ukraine Relations) and Scenario 2 (Crisis in Venezuela). The KB content was drawn from GeoNames, the CIA World Leaders List and the CIA World Factbook and was supplemented with manually-created KB entries developed by LDC specifically for AIDA data.

This knowledge base supported the AIDA entity detection and linking task for 13 entity types: GPE (Geo-Political Entity), LOC (Location), PER (Person), ORG (Organization), FAC (Facility), MHI (Medical/Health Issue), WEA (Weapon), SID (Side), COM (Commodity), CRM (Crime), LAW (Law), VEH (Vehicle), and BAL (Ballot).

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Thursday, December 15, 2022

LDC December 2022 Newsletter

LDC 2023 membership discounts now available

Approaching deadline for Spring 2023 data scholarship applications

30th Anniversary Highlight: AMR

New publications:

CAMIO Transcription Languages

Global TIMIT Thai

Third DIHARD Challenge Evaluation

________________________________________________________________

LDC 2023 membership discounts now available

Now through March 1, 2023, current 2022 members receive a 10% discount for renewing their membership, and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for details on membership options and benefits.

Approaching deadline for Spring 2023 data scholarship applications

Attention students: don’t miss out on the chance to receive no-cost access to LDC data for your research. Applications for Spring 2023 data scholarships are due January 15, 2023. For more information on requirements and program rules, see LDC Data Scholarships.

30th Anniversary Highlight: AMR

Abstract Meaning Representation (AMR) annotation was developed by LDC, SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It is a semantic representation language that captures "who is doing what to whom" in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree-structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independent of its syntax.
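As a rough illustration of the "who is doing what to whom" idea, the sketch below encodes the well-known textbook AMR for "The boy wants to go" as plain Python data. The helper names (`instances`, `edges`, `reentrant_nodes`, `who_does_what`) are made up for this example and are not part of any LDC tool or AMR library. It shows how PropBank-style frames such as `want-01` link to their arguments, and how one variable can fill roles in two frames (within-sentence coreference).

```python
# Toy AMR for "The boy wants to go", written here as Python data.
# In PENMAN notation the same annotation reads:
#   (w / want-01
#      :ARG0 (b / boy)
#      :ARG1 (g / go-01
#               :ARG0 b))

# Each variable maps to its concept (hypothetical container names).
instances = {"w": "want-01", "b": "boy", "g": "go-01"}

# Role edges as (source variable, role, target variable).
edges = [
    ("w", ":ARG0", "b"),  # the boy is the one who wants
    ("w", ":ARG1", "g"),  # what is wanted: the going
    ("g", ":ARG0", "b"),  # the boy is also the one who goes
]


def incoming(var):
    """Return (source, role) pairs for every edge pointing at var."""
    return [(s, r) for s, r, t in edges if t == var]


def reentrant_nodes():
    """Variables reused by more than one edge (within-sentence coreference)."""
    return sorted(v for v in instances if len(incoming(v)) > 1)


def who_does_what():
    """List (agent concept, predicate concept) for each :ARG0 edge."""
    return [(instances[t], instances[s]) for s, r, t in edges if r == ":ARG0"]


if __name__ == "__main__":
    print(reentrant_nodes())
    print(who_does_what())
```

Because `b` is the target of two edges, the structure is a rooted graph that is merely written in a tree-like layout; the syntax of the English sentence (infinitive, word order) does not appear anywhere in the data.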

LDC’s Catalog contains three cumulative English AMR publications: Release 1.0 (LDC2014T12), Release 2.0 (LDC2017T10), and Release 3.0 (LDC2020T02). The combined result in AMR 3.0 is a semantic treebank of 59,255 English natural language sentences from broadcast conversations, newswire, weblogs, web discussion forums, fiction and web text and includes multi-sentence annotations.

LDC has also published Chinese Abstract Meaning Representation 1.0 (LDC2019T07) and 2.0 (LDC2021T13) developed by Brandeis University and Nanjing Normal University. These corpora contain AMR annotations for approximately 20,000 sentences from Chinese Treebank 8.0 (LDC2013T21). Chinese AMR follows the basic principles developed for English, making adaptations where necessary to accommodate Chinese phenomena.

Abstract Meaning Representation 2.0 - Four Translations (LDC2020T07), developed by the University of Edinburgh, School of Informatics, consists of Spanish, German, Italian and Chinese Mandarin translations of a subset of sentences from AMR 2.0.

Visit LDC’s Catalog for more details about these publications.

New publications:

CAMIO Transcription Languages was developed by LDC and contains nearly 70,000 images of machine printed text with corresponding annotations and transcripts in 13 languages: Arabic, Chinese, English, Farsi, Hindi, Japanese, Kannada, Korean, Russian, Tamil, Thai, Urdu, and Vietnamese. This corpus is a subset of data created for a broader effort to support the development and evaluation of optical character recognition and related technologies for 35 languages across 24 unique script types.

Most images were annotated for text localization, resulting in over 2.3M line-level bounding boxes; 1250 images per language were also annotated with orthographic transcriptions of each line plus specification of reading order, yielding over 2.4M tokens of transcribed text. The resulting annotations are represented in an XML output format defined for this corpus. Data for each language is partitioned into test, train or validation sets.

2022 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Global TIMIT Thai consists of 12 hours of read speech and time-aligned transcripts in Standard Thai from 50 speakers (33 female, 17 male) reading 120 sentences selected from the Thai National Corpus, the Thai Junior Encyclopedia, and Thai Wikipedia, for a total of 6000 utterances. Data was collected in 2016. Speakers were recruited in the Bangkok metropolitan area; they were native Thais, fluent in Standard Thai, and literate.

This data set was developed as part of LDC’s Global TIMIT project, which aims to create a series of corpora in a variety of languages with key features similar to those of the original TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1). The original TIMIT corpus was designed for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems.

2022 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Third DIHARD Challenge Evaluation was developed by LDC and contains 33 hours of English and Chinese speech data along with corresponding annotations used in support of the Third DIHARD Challenge.

The Third DIHARD development and evaluation sets were drawn from diverse sources including monologues, map task dialogues, broadcast interviews, sociolinguistic interviews, meeting speech, speech in restaurants, clinical recordings, and amateur web videos. Annotations include diarization and segmentation.

2022 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.