LDC 2023 membership discounts now available
Approaching deadline for Spring 2023 data scholarship applications
30th Anniversary Highlight: AMR
New publications:
CAMIO Transcription Languages
Global TIMIT Thai
Third DIHARD Challenge Evaluation
________________________________________________________________
LDC 2023 membership discounts now available
Now through March 1, 2023, current 2022 members receive a 10% discount for renewing their membership, and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for details on membership options and benefits.
Approaching deadline for Spring 2023 data scholarship applications
Attention students: don’t miss out on the chance to receive no-cost access to LDC data for your research. Applications for Spring 2023 data scholarships are due January 15, 2023. For more information on requirements and program rules, see LDC Data Scholarships.
30th Anniversary Highlight: AMR
Abstract Meaning Representation (AMR) annotation was developed by LDC, SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group, and the Information Sciences Institute at the University of Southern California. It is a semantic representation language that captures "who is doing what to whom" in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independent of its syntax.
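As a rough illustration of the graph idea described above, the sketch below encodes one AMR-style graph for the sentence "The boy wants the girl to believe him." The dictionary and edge-list layout, and the helper names `find_root`, `reentrant_variables`, and `to_penman`, are assumptions made for this example and are not the file format or tooling distributed with the corpora. The frame sense numbers are illustrative.

```python
from collections import Counter

# Hypothetical in-memory encoding of one AMR-style graph for
# "The boy wants the girl to believe him."
# Maps each variable to its concept; sense numbers are illustrative.
instances = {
    "w": "want-01",
    "b": "boy",
    "b2": "believe-01",
    "g": "girl",
}

# Edges as (source variable, role, target variable).
edges = [
    ("w", ":ARG0", "b"),    # the boy is the one who wants
    ("w", ":ARG1", "b2"),   # what is wanted is the believing event
    ("b2", ":ARG0", "g"),   # the girl is the believer
    ("b2", ":ARG1", "b"),   # "him" points back to the same boy node
]


def find_root(instances, edges):
    """Return the single variable that has no incoming edge."""
    targets = {target for _, _, target in edges}
    roots = [var for var in instances if var not in targets]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root, found {roots}")
    return roots[0]


def reentrant_variables(edges):
    """Return variables with more than one incoming edge (coreference)."""
    incoming = Counter(target for _, _, target in edges)
    return sorted(var for var, count in incoming.items() if count > 1)


def to_penman(var, instances, edges, seen=None):
    """Serialize the graph from `var` in a PENMAN-like bracketed form.

    A variable is expanded the first time it is visited and written as a
    bare variable name on later visits.
    """
    seen = set() if seen is None else seen
    if var in seen:
        return var
    seen.add(var)
    parts = [f"{var} / {instances[var]}"]
    for source, role, target in edges:
        if source == var:
            parts.append(f"{role} {to_penman(target, instances, edges, seen)}")
    return "(" + " ".join(parts) + ")"


root = find_root(instances, edges)
print(to_penman(root, instances, edges))
print("reentrant:", reentrant_variables(edges))
```

The variable `b` is the target of two edges, which is how this sketch expresses that "the boy" and "him" refer to the same entity without relying on word order or syntax.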
LDC’s Catalog contains three cumulative English AMR publications: Release 1.0 (LDC2014T12), Release 2.0 (LDC2017T10), and Release 3.0 (LDC2020T02). The combined result in AMR 3.0 is a semantic treebank of 59,255 English natural language sentences from broadcast conversations, newswire, weblogs, web discussion forums, fiction, and web text; it also includes multi-sentence annotations.
LDC has also published Chinese Abstract Meaning Representation 1.0 (LDC2019T07) and 2.0 (LDC2021T13), developed by Brandeis University and Nanjing Normal University. These corpora contain AMR annotations for approximately 20,000 sentences from Chinese Treebank 8.0 (LDC2013T21). Chinese AMR follows the basic principles developed for English, making adaptations where necessary to accommodate Chinese phenomena.
Abstract Meaning Representation 2.0 - Four Translations (LDC2020T07), developed by the University of Edinburgh, School of Informatics, consists of Spanish, German, Italian, and Mandarin Chinese translations of a subset of sentences from AMR 2.0.
Visit LDC’s Catalog for more details about these publications.
New publications:
CAMIO Transcription Languages was developed by LDC and contains nearly 70,000 images of machine-printed text with corresponding annotations and transcripts in 13 languages: Arabic, Chinese, English, Farsi, Hindi, Japanese, Kannada, Korean, Russian, Tamil, Thai, Urdu, and Vietnamese. This corpus is a subset of data created for a broader effort to support the development and evaluation of optical character recognition and related technologies for 35 languages across 24 unique script types.
Most images were annotated for text localization, resulting in over 2.3M line-level bounding boxes; 1,250 images per language were also annotated with orthographic transcriptions of each line plus specification of reading order, yielding over 2.4M tokens of transcribed text. The resulting annotations are represented in an XML output format defined for this corpus. Data for each language is partitioned into test, train, or validation sets.
2022 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
*
Global TIMIT Thai consists of 12 hours of read speech and time-aligned transcripts in Standard Thai from 50 speakers (33 female, 17 male), each reading 120 sentences selected from the Thai National Corpus, the Thai Junior Encyclopedia, and Thai Wikipedia, for a total of 6,000 utterances. Data was collected in 2016. Speakers were recruited in the Bangkok metropolitan area; they were native Thais, fluent in Standard Thai, and literate.
2022 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
*
Third DIHARD Challenge Evaluation was developed by LDC and contains 33 hours of English and Chinese speech data along with corresponding annotations used in support of the Third DIHARD Challenge.
The Third DIHARD development and evaluation sets were drawn from diverse sources including monologues, map task dialogues, broadcast interviews, sociolinguistic interviews, meeting speech, speech in restaurants, clinical recordings, and amateur web videos. Annotations include diarization and segmentation.
2022 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.