Linguistic Data Consortium: Chinese Treebank

Monday, June 16, 2025

LDC June 2025 Newsletter

LDC data and commercial technology development

New publications:

Chinese Sentence Pattern Structure Treebank
IWSLT 2022-2023 Shared Task Training, Development and Test Set
KAIROS Schema Learning Complex Event Annotation

_______________________________________________________________________

LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

Chinese Sentence Pattern Structure Treebank was developed at Beijing Normal University and Peking University. It contains 5,016 sentences and 119,627 tokens, syntactically annotated following the concept of sentence constituent analysis, which emphasizes sentence pattern structure. The source data consists of 27 chapters extracted from modern Mandarin and ancient Chinese works. There are three annotation layers: lexical sense and structural mode for dynamic words; syntactic structure for clauses; and inter-clause relations within complex sentences and sentence clusters. These structures can be visualized using the Jbw-viewer tool, which is included in the release.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

IWSLT 2022 - 2023 Shared Task Training, Development and Test Set was developed by LDC and contains 210 hours of Tunisian Arabic conversational telephone speech, transcripts, English translations, speaker metadata, and documentation. This material constitutes the training, development and test data used in the International Conference on Spoken Language Translation (IWSLT) Dialectal Speech Translation task (2022) and the Dialectal and Low-resource track (2023).

The telephone speech was collected by LDC in 2016-2017 from native speakers of Tunisian Arabic in Tunis. Speakers were recruited to make telephone calls to people in their social networks from a variety of noise conditions and handsets. Transcripts are orthographic following Buckwalter transliteration and cover 175 hours of the collected speech. IPA transcripts were added to a subset of the data.

All transcribed segments were translated into English.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

KAIROS Schema Learning Complex Event Annotation was developed by LDC to support the DARPA KAIROS program. It contains English and Spanish text, audio, video and image data labeled for 93 real-world complex events with event, relation and argument annotations linking to document provenance. Source data was collected from the web; 3,431 root web pages were collected and processed, yielding 1,919 text data files, 24,019 image files, 1,472 video files and 16 audio files.

The DARPA KAIROS (Knowledge-directed Artificial Intelligence Reasoning Over Schemas) program aimed to build technology capable of understanding and reasoning about complex real-world events in order to provide actionable insights to end users. KAIROS systems utilized formal event representations in the form of schema libraries that specified the steps, preconditions and constraints for an open set of complex events; schemas were then used in combination with event extraction to characterize and make predictions about real-world events in a large multilingual, multimedia corpus.

2025 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Thursday, July 15, 2021

LDC July 2021 Newsletter

LDC Submissions: a new platform for sharing data through LDC

Fall 2021 LDC Data Scholarship Program

New Publications:
Ethnobotanical Research and Language Documentation of Nahuatl
Chinese Abstract Meaning Representation 2.0
BOLT Egyptian Arabic Co-reference – Discussion Forum, SMS/Chat, and Conversational Telephone Speech

LDC Submissions: a new platform for sharing data through LDC
LDC is pleased to announce the launch of LDC Submissions, a platform that provides infrastructure and resources for sharing data through the Catalog. After registering for a user account, corpus submitters can create a submission, upload files, and communicate with LDC’s publications team during the review process. After all reviews are complete, the final, release-ready version of your data set is uploaded to the platform and enters the publications queue.

Sharing your corpus through LDC ensures access to the global research community and the permanent preservation of your data according to best practices for archiving digital language resources. Get started and register for an LDC Submissions user account today.

Fall 2021 LDC Data Scholarship Program
Student applications for the Fall 2021 LDC Data Scholarship program are being accepted now through September 15, 2021. This program provides eligible students with no-cost access to LDC data. Students must complete an application consisting of a data use proposal and letter of support from their advisor.

For application requirements and program rules, visit the LDC Data Scholarship page.

New publications:
(1) Ethnobotanical Research and Language Documentation of Nahuatl consists of approximately 190 hours of field recordings collected in the Sierra Nororiental and Sierra Norte regions of Puebla, Mexico. The corpus contains audio and video recordings of native Nahuatl speakers during the collection of particular plants; partial transcripts (Nahuatl and Spanish); a Highland Puebla Nahuat dictionary; botanical and ethnobotanical data; and speaker metadata.

Nahuatl is one of the most widely spoken indigenous languages in the Americas with approximately 1.5 million speakers in Mexico. Many distinct and sometimes mutually unintelligible varieties have been recognized. The recordings in this release were collected between 2008 and 2019 in two different municipalities: Cuetzalan del Progreso and Tepetzintla. Speech from Cuetzalan represents Highland Puebla Nahuat, and speech from Tepetzintla represents Zacatlán-Ahuacatlán-Tepetzintla Nahuatl.

Each recording consists of a speaker talking about a plant's nomenclature, classification, and use. Transcripts are included for the Cuetzalan recordings; these transcripts have been partially translated into Spanish. A Highland Puebla Nahuat dictionary is included in both text and Toolbox XML formats. Botanical and ethnobotanical information is presented as a collection of PDFs, and images as JPEGs.

Ethnobotanical Research and Language Documentation of Nahuatl is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) Chinese Abstract Meaning Representation 2.0 was developed by Brandeis University and Nanjing Normal University and comprises semantic representations of a set of approximately 20,000 Chinese sentences from Chinese Treebank (CTB) 8.0 (LDC2013T21). CAMR 2.0 includes the content of Chinese Abstract Meaning Representation 1.0 (LDC2019T07) (CTB 8.0 weblog and discussion forum sentences), plus an additional 9,933 sentences from the newswire portion of CTB 8.0.

Abstract Meaning Representation (AMR) captures "who is doing what to whom" in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree structure. Chinese AMR is constructed following the basic principles developed for English: a compact, readable, whole-sentence semantic representation, while making adaptations where necessary to handle Chinese-specific phenomena.
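As a rough illustration of the "who is doing what to whom" idea, the sketch below encodes a commonly used English AMR example ("The boy wants to go") as plain Python data. The sentence is not drawn from this corpus, and the helper names (`describe`, `reentrant`) and the dictionary/triple layout are assumptions made for illustration; the release's own file format, including its alignments, is not reproduced here.

```python
from collections import Counter

# Hypothetical encoding of the AMR for "The boy wants to go":
#   (w / want-01 :ARG0 (b / boy) :ARG1 (g / go-01 :ARG0 b))
# Each variable maps to the concept it instantiates.
instances = {"w": "want-01", "b": "boy", "g": "go-01"}

# Each relation is a (source variable, role, target variable) triple.
relations = [
    ("w", ":ARG0", "b"),  # the boy is the one who wants
    ("w", ":ARG1", "g"),  # what is wanted is the going
    ("g", ":ARG0", "b"),  # the boy is also the one who goes
]


def describe(instances, relations):
    """Render each relation as 'source-concept role target-concept'."""
    return [f"{instances[s]} {role} {instances[t]}" for s, role, t in relations]


def reentrant(relations):
    """Return variables that are the target of more than one relation."""
    counts = Counter(target for _, _, target in relations)
    return sorted(var for var, n in counts.items() if n > 1)


for line in describe(instances, relations):
    print(line)
print(reentrant(relations))  # ['b']
```

The variable `b` is reused rather than repeated, which is why the representation is a graph even though it is written in a tree-like form.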

The corpus contains 20,078 sentences from the weblog, discussion forum, and newswire portions of CTB 8.0. Three sets of files are included: the original Chinese AMR data with concept-to-word and relation-to-word alignments, a converted English AMR format, and a Chinese syntactic dependency tree format. Each set is divided into training, development and test sets.

Chinese Abstract Meaning Representation 2.0 is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) BOLT Egyptian Arabic Co-reference – Discussion Forum, SMS/Chat, and Conversational Telephone Speech was developed by Raytheon BBN Technologies. Co-reference annotation aims to fill in the connections between specific mentions in the text that refer to the same entities and events in the discourse context. BOLT co-reference annotation was performed on BOLT treebank annotation. It covers noun phrases (including proper nouns, nominals, pronouns and null arguments), possessives, proper noun pre-modifiers, and verbs.

The source discussion forum data and SMS/Chat data was collected by LDC for the DARPA BOLT program. The telephone data was taken from LDC's Egyptian Arabic CALLHOME and CALLFRIEND telephone collections.

BOLT Egyptian Arabic Co-reference – Discussion Forum, SMS/Chat, and Conversational Telephone Speech is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Thursday, June 16, 2016

LDC June 2016 Newsletter

Commercial use and LDC data

New publications:

Chinese Treebank 9.0

CHM150

GALE Phase 4 Arabic Weblog Parallel Sentences

_______________________________________________________________

Commercial use and LDC data

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit our Licensing page for more information.

New Corpora

(1) Chinese Treebank 9.0 consists of approximately two million words of annotated and parsed text from Chinese newswire, government documents, magazine articles, various broadcast news and broadcast conversation programs, web newsgroups, weblogs, discussion forums, chat messages and transcribed conversational telephone speech. This new data set in the Chinese Treebank series adds more annotated web data and two new genres – chat messages and transcribed telephone speech.

There are 3,726 text files in this release, containing 132,076 sentences, 2,084,387 words, and 3,247,331 characters (hanzi or foreign). The data is provided in UTF-8 encoding, and the annotation uses Penn Treebank-style labeled brackets. The data is provided in four formats: raw text, word-segmented, POS-tagged, and syntactically bracketed. All files were automatically verified and manually checked.
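To show how the bracketed format relates to the segmented and POS-tagged views, here is a minimal Python sketch that reads one Penn Treebank-style bracketed tree and derives the other two. The sample sentence is invented for illustration and is not taken from the corpus; the function names and the `word_TAG` rendering are assumptions, and real files may carry extra wrappers or metadata that this sketch does not handle.

```python
from typing import List


def tokenize(text: str) -> List[str]:
    # Pad parentheses with spaces so split() separates them from labels and words.
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens: List[str], pos: int = 0):
    """Parse one bracketed constituent starting at tokens[pos].

    Returns ((label, children), next_pos), where each child is either
    a nested (label, children) tuple or a bare word string.
    """
    if tokens[pos] != "(":
        raise ValueError(f"expected '(' at token {pos}")
    label = tokens[pos + 1]
    pos += 2
    children = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
        else:
            child, pos = tokens[pos], pos + 1
        children.append(child)
    return (label, children), pos + 1  # skip the closing ")"


def leaves(tree):
    """Collect (word, POS tag) pairs from preterminal nodes, left to right."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if not isinstance(child, str):  # ignore stray bare tokens
            pairs.extend(leaves(child))
    return pairs


# Invented example sentence ("I like music"), not from the corpus.
SAMPLE = "(IP (NP (PN 我)) (VP (VV 喜欢) (NP (NN 音乐))))"

tree, _ = parse(tokenize(SAMPLE))
pairs = leaves(tree)
print(" ".join(w for w, _ in pairs))           # 我 喜欢 音乐
print(" ".join(f"{w}_{t}" for w, t in pairs))  # 我_PN 喜欢_VV 音乐_NN
```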

Chinese Treebank 9.0 is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) CHM150 (Corpus Hecho en México 150) was developed by the Speech Processing Laboratory of the Faculty of Engineering at the National Autonomous University of Mexico (UNAM) and consists of approximately 1.63 hours of Mexican Spanish speech, associated transcripts, and speaker metadata. The goal of this work was to support spoken term detection and forensic speaker identification.

This corpus consists of Mexican Spanish microphone speech from 75 male speakers and 75 female speakers in a quiet office environment. Speakers could answer pre-selected open questions or describe a particular painting shown to them on a computer monitor. Speaker metadata in this release includes age, gender, place of birth, place of residence and parents' nationalities.

CHM150 is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. This data is being made available at no cost to non-member organizations under a research license.

(3) GALE Phase 4 Arabic Weblog Parallel Sentences was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Modern Standard Arabic source text and corresponding English translations, selected from newsgroup and weblog data collected by LDC and translated by LDC or under its direction.

The data includes 1,067 source-translation document pairs, comprising 68,346 words (Arabic source) of translated data.

Sentences were selected for translation in two steps. First, files were chosen using sentence selection scripts provided by GALE program participants SRI International and IBM. The output was then manually reviewed by LDC staff to eliminate problematic sentences. Selected files were reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Arabic to English translation guidelines and were provided with the full source documents containing the target sentences for their reference. Bilingual LDC staff performed quality control procedures on the completed translations.

GALE Phase 4 Arabic Weblog Parallel Sentences is distributed via web download.