Linguistic Data Consortium: Treebank

Thursday, October 14, 2021

LDC October 2021 Newsletter

Fall 2021 data scholarship recipients

Membership Year 2022 publication preview

LDC data and commercial technology development

New Publications:

UCLA Variability Speaker Database

BOLT Egyptian Arabic Treebank – SMS/Chat

_______________________________________________________

Fall 2021 data scholarship recipients

Congratulations to the recipients of LDC's Fall 2021 data scholarships:

Sophia Minnillo: University of California, Davis (USA); PhD, Linguistics. Sophia is awarded a copy of ETS Corpus of Non-Native Written English LDC2014T06 for her research on the use of transition markers by Chinese L1 speakers.

Jagabandhu Mishra: Indian Institute of Technology Dharwad (India); Research Scholar, Electrical Engineering. Jagabandhu is awarded a copy of Mandarin-English Code-Switching in South-East Asia LDC2015S04 for his work in spoken language diarization.

Kashyap Patel: University of Texas at Dallas (USA); Ph.D., Electrical Engineering. Kashyap is awarded copies of CSR-I (WSJ0) Sennheiser LDC93S6B and CSR-II (WSJ1) Sennheiser LDC94S13B for his research in audio, acoustic and speech signal processing.

Yoshani Ranaweera, D. Dissanayaka, S. Sudasinghe: University of Moratuwa (Sri Lanka); Bachelors, Computer Science and Engineering. This group is awarded a copy of CALLHOME American English Speech LDC97S42 for their work in speaker diarization.

Winie Wong: University of Illinois at Chicago (USA); PhD, Electrical and Computer Engineering. Winie is awarded copies of ISI Chinese-English Automatically Extracted Parallel Text LDC2007T09 and GALE Phase 3 and 4 Chinese Broadcast News Parallel Text LDC2016T15 for her research in machine translation.

For information about the program, visit the Data Scholarships page.

Membership Year 2022 publication preview

The 2022 Membership Year is approaching and plans for next year’s publications are in progress. Among the expected releases are:

2017 NIST OpenSAT Pilot – SSSF: real-world operational English speech data, transcripts and annotation files used in the speech activity detection, automatic speech recognition, and keyword search tasks of the 2017 OpenSAT Pilot evaluation
AttImam: 2000 attribution relations applied to Arabic newswire text from Arabic Treebank: Part 1 v 4.1 LDC2010T13
Samrómur Icelandic Speech: 145 hours of Icelandic prompted speech from 8000 speakers covering text from novels, news, plays, and location names
MASRI Synthetic: 99 hours of synthesized Maltese speech from various genres with transcripts
HAVIC MED Novel Tests: thousands of hours of event and background user-generated videos with annotation and metadata used for the 2015 Multimedia Event Detection task
DIHARD Challenges: development and evaluation data from the second and third DIHARD evaluations, a set of shared tasks focusing on speech diarization for challenging data
LORELEI: representative and incident language packs containing monolingual text, bi-text, translations, annotations, supplemental resources and related tools (Kinyarwanda, Wolof)

Check your inbox in the coming weeks for more information about membership renewal. 

LDC data and commercial technology development

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

(1) UCLA Variability Speaker Database was developed by UCLA Speech Processing and Auditory Perception Laboratory and is comprised of approximately 34 hours of English speech and orthographic transcripts. Speakers (101 female, 101 male) took part in six tasks: vowel sounds, reading sentences, giving instructions, neutral conversation, happy conversation, a phone conversation, annoyed conversation, and responding to a video. This corpus was designed to sample variability in speaking within individual speakers and across a large number of speakers.

UCLA Variability Speaker Database is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus. 2021 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) BOLT Egyptian Arabic Treebank – SMS/Chat was developed by LDC and consists of Egyptian Arabic SMS/Chat data with part-of-speech annotation, morphology, and syntactic tree annotation. This release contains 349,414 tokens before clitics were split and 435,677 tree tokens after clitics were split for treebank annotation. The source data was collected by LDC from its collection platform or by donation and was manually reviewed to exclude material not in the target language or with sensitive content. Originally written in Arabizi (Romanized/Latin characters) script, the source SMS/chat text was transliterated to Arabic script and manually corrected prior to treebank annotation. Annotations followed Penn Arabic Treebank guidelines.

BOLT Egyptian Arabic Treebank – SMS/Chat is distributed via web download.

Friday, May 15, 2020

LDC 2020 May Newsletter

New Publications:

LORELEI Oromo Incident Language Pack

LORELEI Entity Detection and Linking Knowledge Base

BOLT English Translation Treebank - Chinese Discussion Forum

Multi-Language Conversational Telephone Speech 2011 -- Mandarin Chinese

_______________________________________________________________

New publications:

(1) LORELEI Oromo Incident Language Pack was developed by LDC and is comprised of approximately 3.9 million words of Oromo monolingual text, 25,000 words of English monolingual text, 135,000 words of parallel and comparable Oromo-English text, and 50,000 words of data annotated for Entity Discovery and Linking and Situation Frames. It contains all of the text data, annotations, supplemental resources and related software tools for the Oromo language that were used in the DARPA LORELEI / LoReHLT 2017 Evaluation.

The evaluation protocol was based on a scenario in which an unforeseen event triggered a need for humanitarian and logistical support in a region where the incident language had received little or no attention in NLP research. Evaluation participants provided NLP solutions, including information extraction and machine translation, with limited resources and limited development time.

Data was collected from news, social network, weblog, newsgroup, discussion forum, and reference material. Entity Detection and Linking and Situation Frame annotations identified “entities,” “needs” (such as a need for food) and “issues” (such as civil unrest) to be detected by systems for scoring purposes. Situation frame analysis was designed to extract basic information that would be useful for planning a disaster response effort.

The knowledge base for the entity linking annotation in this corpus is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

LORELEI Oromo Incident Language Pack is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) LORELEI Entity Detection and Linking Knowledge Base was developed by LDC and contains the full LORELEI Entity Detection and Linking (EDL) Knowledge Base (KB) used for all LORELEI Representative Language and Incident Language Pack entity linking annotation. The LORELEI (Low Resource Languages for Emergent Incidents) Program was concerned with building human language technology for low resource languages in the context of emergent situations like natural disasters or disease outbreaks.

The KB in this release supported the EDL task in LORELEI for four entity types -- geo-political entities (GPE), locations (LOC), persons (PER) and organizations (ORG) -- and contains a total of 10,216,832 entities. There are four inputs to the KB, each designated by a unique "origin" code in the KB, as follows: GPE and LOC entities from a snapshot of GeoNames, PER entities from the CIA World Leaders List, ORG entities from Appendix B of the CIA World Factbook, and additional entities manually created by LDC for each of the representative and incident languages in the LORELEI Program.

LORELEI Entity Detection and Linking Knowledge Base is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) BOLT English Translation Treebank - Chinese Discussion Forum was developed by LDC and consists of 147,432 tokens of web discussion forum data translated from Chinese to English and annotated for part-of-speech and syntactic structure.

The source data is Chinese discussion forum web text collected by LDC in 2011 and 2012, translated into English and released in BOLT Chinese Discussion Forum Parallel Training Data (LDC2017T05). A subset of the translated text -- 148 files representing 147,432 tokens -- was selected for the treebank and annotated for word-level tokenization, part-of-speech and syntactic structure. Only the translated English text is included in the source data for this release.

Part-of-speech and treebank annotation conformed to Penn Treebank II style, incorporating changes to those guidelines that were developed under the GALE (Global Autonomous Language Exploitation) program. Supplementary guidelines for English treebanks and web text are included with this release.
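For readers unfamiliar with Penn Treebank-style bracketing, the following is a minimal Python sketch of how a bracketed tree can be read and its part-of-speech-tagged tokens listed. The sample sentence and the function names (parse_tree, tagged_tokens) are invented for illustration and are not taken from this release; the actual file layout and any corpus-specific conventions (such as trace nodes) are described in the documentation distributed with the corpus.

```python
def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]          # constituent label or POS tag
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1                     # consume the closing ")"
        return (label, children)

    return node()


def tagged_tokens(tree):
    """Yield (word, pos_tag) pairs from the tree, left to right."""
    label, children = tree
    for child in children:
        if isinstance(child, str):
            yield (child, label)
        else:
            yield from tagged_tokens(child)


# Hypothetical example sentence, not drawn from the corpus.
SAMPLE = "(S (NP (PRP I)) (VP (VBP agree) (PP (IN with) (NP (PRP you)))) (. .))"

tree = parse_tree(SAMPLE)
pairs = list(tagged_tokens(tree))
for word, tag in pairs:
    print(f"{word}\t{tag}")
```

Counting the pairs produced this way over a set of files is one simple way to arrive at a token total for a treebank, under the assumption that every leaf counts as one token.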

BOLT English Translation Treebank - Chinese Discussion Forum is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(4) Multi-Language Conversational Telephone Speech 2011 -- Mandarin Chinese was developed by LDC and is comprised of approximately 25 hours of telephone speech in Mandarin Chinese.

The data were collected primarily to support research and technology evaluation in automatic language identification, and portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation (LRE). Participants were recruited by native speakers who contacted acquaintances in their social network. Those native speakers made one call, up to 15 minutes, to each acquaintance. The data were collected using LDC's telephone collection infrastructure, comprised of three computer telephony systems. Human auditors labeled calls for callee gender, dialect type and noise.

LDC has also released the following as part of the Multi-Language Conversational Telephone Speech 2011 series:

Slavic Group (LDC2016S11)
Turkish (LDC2017S09)
South Asian (LDC2017S14)
Central Asian (LDC2018S03)
Central European (LDC2018S08)
Spanish (LDC2018S12)
Arabic (LDC2019S02)
English (LDC2019S06)
East Asian (LDC2019S15)

Multi-Language Conversational Telephone Speech 2011 -- Mandarin Chinese is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.