Linguistic Data Consortium: word alignment

Friday, March 13, 2020

LDC 2020 March Newsletter

Spring 2020 LDC Data Scholarship recipients
LDC data and commercial technology development

New Publications:
BOLT Egyptian Arabic-English Word Alignment -- Conversational Telephone Speech Training
EVALution
Mixer 4 and 5 Speech

__________________________________________________________________

Spring 2020 LDC Data Scholarship recipients

LDC congratulates the following Spring 2020 Data Scholarship recipients:

Zahra Azin (Istanbul Technical University, Turkey) is awarded a copy of Abstract Meaning Representation (AMR) Annotation Release 3.0 (LDC2020T02) for her work in Turkish AMR.
Spandan Dey (IIT Kharagpur, India) is awarded a copy of Multi-Language Conversational Telephone Speech – South Asian (LDC2017S14) for his research on automatic language recognition.
Jonathan Downey (University of California, Santa Barbara, US) is awarded a copy of the ETS Corpus of Non-Native Written English (LDC2014T06) for his research on second language acquisition and quantitative methodologies for educational measurements.
Nathaniel Fackler (University of Georgia, US) is awarded a copy of the ETS Corpus of Non-Native Written English (LDC2014T06) for his work on adult second language acquisition.
B. Senthil Kumar (SSN College of Engineering & Anna University, India) is awarded a copy of 2009 CoNLL Shared Task Part 2 (LDC2012T04) for his research on semantic role labeling.
Ming Li (Colorado School of Mines, US) is awarded a copy of TIDIGITS (LDC93S10) for her research on inferring speech signals from motion data in Internet of Things (IoT) security.
Jialiang Lin (Xiamen University, China) is awarded a copy of the ETS Corpus of Non-Native Written English (LDC2014T06) for his project to train and test an automated essay scoring model.

Students can learn more about the LDC Data Scholarship program and the next application cycle on the Data Scholarships page.

LDC data and commercial technology development

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

__________________________________________________________________

New publications:

(1) BOLT Egyptian Arabic-English Word Alignment -- Conversational Telephone Speech Training was developed by LDC and consists of 153,171 words of Egyptian Arabic and English parallel text enhanced with linguistic tags to indicate word relations.

The source data in this release consists of transcripts of Egyptian Arabic conversational telephone speech (CTS) from LDC's CALLHOME and CALLFRIEND collections (LDC97S45, LDC97T19, LDC2002S37, LDC2002T38, LDC96S49) that were translated into English by professional translation agencies and annotated for the word alignment task.

The BOLT word alignment task was built on treebank annotation. Egyptian Arabic source tree tokens were automatically extracted from tree files in LDC’s BOLT Egyptian Arabic Treebank, which had been tagged for part-of-speech and syntactically annotated. That data was then aligned and annotated for the word alignment task.

BOLT Egyptian Arabic-English Word Alignment -- Conversational Telephone Speech Training is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) EVALution was developed by The Hong Kong Polytechnic University. It comprises English and Mandarin Chinese data sets -- EVALution 1.0 and EVALution-MAN, respectively -- that contain semantic relations and metadata for training and evaluating distributional semantic models.

EVALution 1.0 consists of approximately 7,500 English tuples extracted from ConceptNet 5.0 and WordNet 4.0 and filtered through automatic methods and crowd-sourcing. Several semantic relations between word pairs were instantiated, including hypernymy, synonymy, antonymy and meronymy. The corpus also includes additional information that can be used to filter the pairs or to analyze the results, such as relation domain, word frequency, word part-of-speech and word semantic field.

EVALution-MAN consists of Chinese word pairs from two sources: Chinese Wordnet and humans who completed an elicitation task by supplying missing words to sentences. The human-supplied sentence word pairs were then judged by human raters for reliability.

EVALution is distributed via web download.

2020 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) Mixer 4 and 5 Speech was developed by LDC and contains approximately 14,185 hours of audio recordings of conversational telephone speech, interviews, elicitation exercises and transcript readings involving 616 distinct speakers. The material was collected in 2007 as part of the Mixer project – which supported speaker recognition for a variety of research tasks – and recordings in this corpus were used in the 2008 NIST Speaker Recognition Evaluation.

The data in this release was collected by LDC at its Human Subjects Data Collection Laboratories in Philadelphia and by the International Computer Science Institute (ICSI) at the University of California, Berkeley, as a collaborative, carefully coordinated activity at both recording sites. The Mixer 4 and 5 collection contains 2,568 recordings made via the public telephone network and 2,152 sessions of multiple microphone recordings in office-room settings.

The telephone protocol connected recruited speakers through a robot operator to carry on casual conversations. In Mixer 4, 400 subjects made ten 10-minute calls; half of those subjects also visited one of the collection sites where they made two telephone calls while also being recorded on a cross-channel platform. In Mixer 5, 300 subjects each completed ten calls and six interview sessions at either LDC or ICSI; those sessions were conducted on a cross-channel platform and included a telephone call in one of three vocal-effort conditions: normal, high and low. Mixer participants were nearly all native English speakers, the rest being bilingual English speakers.

This release includes metadata about the calls and speakers, along with time-aligned entries for many of the component portions of the recording sessions.

Mixer 4 and 5 Speech is distributed via hard drive.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. This corpus is a members-only release and is not available for non-member licensing. Contact ldc@ldc.upenn.edu for information about membership.

Friday, December 6, 2019

LDC 2019 December Newsletter

LDC Membership Discounts for MY2020 Still Available
Spring 2020 Data Scholarship Program – deadline approaching
Introducing LanguageARC: A Citizen Linguist Portal

New Publications:
Magic Data Chinese Mandarin Conversational Speech
BOLT Egyptian Arabic-English Word Alignment -- SMS/Chat Training
TAC KBP Entity Discovery and Linking - Comprehensive Evaluation Data 2016-2017
__________________________________________________________

LDC Membership Discounts for MY2020 Still Available

Join LDC while membership savings are still available. Now through March 2, 2020, current MY2019 members who renew their LDC membership receive a 10% discount off the membership fee. New or returning member organizations receive a 5% discount through March 2. Membership remains the most economical way to access LDC releases. Visit Join LDC for details on membership options and benefits.

Spring 2020 Data Scholarship Program – deadline approaching

Students can apply for the Spring 2020 Data Scholarship Program now through January 15, 2020. The LDC Data Scholarship program provides students with no-cost access to LDC data. For more information on application requirements and program rules, please visit LDC Data Scholarships.

Introducing LanguageARC: A Citizen Linguist Portal

LanguageARC is a citizen science website for languages developed with a grant from the National Science Foundation (no. 170377). Contributors to this online community – “citizen linguists” – participate in a variety of tasks and activities that support linguistic research, such as identifying accents from audio clips, recording “tongue twisters,” and translating English sentences into other languages. Data collected from LanguageARC will be made freely available to the research community. New collection and annotation projects will be added on an ongoing basis, and researchers will soon be able to create their own LanguageARC projects with an easy-to-use Project Builder Toolkit. All are encouraged to explore the site and participate in the community. Comments, questions and suggestions are welcome via the site’s Contact page.
___________________________________________________________

New publications:

(1) Magic Data Chinese Mandarin Conversational Speech was developed by Beijing Magic Data Technology Co., Ltd. and consists of approximately 10 hours of Mandarin conversational speech from 60 speakers. Each conversation was recorded on multiple devices and is presented in multiple forms, resulting in a total of approximately 60 hours of audio with corresponding transcripts.

All participants were native speakers of Mandarin in Mainland China from accent regions across the country. Speakers were paired for conversations on a range of topics, including travel, fitness, games, sports and pets. Metadata such as topic, collection date, mobile device and speaker demographic information is available in the documentation accompanying this release.

Magic Data Chinese Mandarin Conversational Speech is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) BOLT Egyptian Arabic-English Word Alignment -- SMS/Chat Training was developed by LDC and consists of 349,414 words of Egyptian Arabic and English parallel text enhanced with linguistic tags to indicate word relations.

This release contains Egyptian Arabic source text message and chat conversations collected using two methods: new collection via LDC's collection platform, and donation of SMS or chat archives from BOLT collection participants. The source data is released as BOLT Egyptian Arabic SMS/Chat and Transliteration (LDC2017T07).

The BOLT word alignment task was built on treebank annotation. Egyptian Arabic source tree tokens were automatically extracted from tree files in LDC’s BOLT Egyptian Arabic Treebank, which had been tagged for part-of-speech and syntactically annotated. That data was then aligned and annotated for the word alignment task.

BOLT Egyptian Arabic-English Word Alignment -- SMS/Chat Training is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) TAC KBP Entity Discovery and Linking - Comprehensive Evaluation Data 2016-2017 was developed by LDC and contains training and evaluation data produced in support of the TAC KBP Entity Discovery and Linking (EDL) tasks in 2016 and 2017. This corpus includes queries, knowledge base (KB) links, equivalence class clusters for NIL entities, and entity type information for each of the queries. The EDL reference KB, to which EDL data are linked, is available separately in TAC KBP Entity Discovery and Linking - Comprehensive Training and Evaluation Data 2014-2015 (LDC2019T02).

The goal of the EDL track is to conduct end-to-end entity extraction, linking and clustering. To produce gold standard data for a given document collection, annotators (1) extract (identify and classify) entity mentions (queries) and link them to nodes in a reference KB, and (2) perform cross-document co-reference on within-document entity clusters that cannot be linked to the KB.

Source data for the annotations consists of Chinese, English and Spanish newswire and discussion forum text collected by LDC and is available in TAC KBP Evaluation Source Corpora 2016-2017 (LDC2019T12).

TAC KBP Entity Discovery and Linking - Comprehensive Evaluation Data 2016-2017 is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Tuesday, September 17, 2019

LDC 2019 September Newsletter

LDC at Interspeech 2019

New Publications:
CALLFRIEND Canadian French Second Edition
BOLT Chinese-English Word Alignment and Tagging -- SMS/Chat Training
Machine Reading Phase 1 NFL Scoring Training Data

_____________________________________________________________________

LDC at Interspeech 2019

LDC is exhibiting at Interspeech 2019, September 15-19 in Graz, Austria. Stop by Booth F16 to learn more about recent developments at the Consortium and new publications.

Be on the lookout for The Second DIHARD Speech Diarization Challenge (DIHARD II), a special session co-organized by LDC, and the following presentations featuring LDC work:

The Second DIHARD Diarization Challenge: Dataset, task, and baselines
Neville Ryant, Christopher Cieri, Mark Liberman (LDC), Kenneth Church (Baidu, USA), Alejandrina Cristia (Laboratoire de Sciences Cognitives et Psycholinguistique), Jun Du (University of Science and Technology of China), Sriram Ganapathy (Indian Institute of Science)
Oral Session, Tuesday September 17, 10:00 – 10:20, Hall 3

Automatic Detection of Prosodic Focus in American English
Sunghye Cho and Mark Liberman (LDC), Yong-cheol Lee (Cheongju University)
Poster Session, Wednesday September 18, 16:00 – 18:00, Gallery B

Automatic detection of ASD in children using acoustic and text features from brief natural conversations
Sunghye Cho, Mark Liberman, Neville Ryant (LDC), Meredith Cola, Robert T. Schultz, Julia Parish-Morris (Children's Hospital of Philadelphia)
Oral Session, Wednesday September 18, 16:45 – 17:00, Hall 3

LDC will post conference updates via our Twitter feed and Facebook page. We hope to see you there!

New publications:

(1) CALLFRIEND Canadian French Second Edition was developed by LDC and consists of approximately 26 hours of unscripted telephone conversations between native speakers of Canadian French. This second edition updates the audio files to wav format, simplifies the directory structure and adds documentation and metadata. The first edition is available as CALLFRIEND Canadian French (LDC96S48).

All data was collected before July 1997. Participants could speak with a person of their choice on any topic; most called family members and friends. All calls originated in North America. The recorded conversations last up to 30 minutes.

CALLFRIEND Canadian French Second Edition is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) BOLT Chinese-English Word Alignment and Tagging -- SMS/Chat Training was developed by LDC for the DARPA BOLT (Broad Operational Language Translation) program and consists of 388,027 words of Chinese and English parallel text enhanced with linguistic tags to indicate word relations.

This release consists of Chinese source text and chat conversations collected using two methods: new collection via LDC's collection platform and donation of SMS and chat archives from BOLT collection participants. The source data is released as BOLT Chinese SMS/Chat (LDC2018T15).

The BOLT word alignment task was built on treebank annotation. LDC automatically extracted Chinese source tokens, including empty categories/traces, from word-segmented files provided by the BOLT Chinese Treebank annotation team at Brandeis University. The word-segmented tokens were then used to automatically generate ctb (Chinese Treebank) alignment, and were also tokenized for character alignment by inserting white spaces to separate characters.
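The character tokenization step described above can be illustrated with a minimal Python sketch. It is not LDC's actual tooling: the function name and the sample tokens are hypothetical, and the sketch assumes the input holds only ordinary word tokens, so it does not cover empty categories/traces or non-Chinese tokens.

```python
def to_character_tokens(word_tokens):
    """Split word-segmented tokens into single characters separated by spaces.

    Hypothetical helper for illustration only; every token is assumed to be
    an ordinary word with no empty-category or trace markers.
    """
    # Joining a string with " " puts a space between each of its characters;
    # the outer join keeps a single space between consecutive words.
    return " ".join(" ".join(token) for token in word_tokens)


# Hypothetical word-segmented input (three words, five characters).
words = ["我们", "明天", "见"]
print(to_character_tokens(words))
```

The same word-segmented input could then feed both views: the word tokens for ctb alignment and the spaced characters for character alignment.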

BOLT Chinese-English Word Alignment and Tagging -- SMS/Chat Training is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) Machine Reading Phase 1 NFL Scoring Training Data was developed by LDC for use in the DARPA (Defense Advanced Research Projects Agency) Machine Reading program. It contains 110 U.S. NFL (National Football League) scoring source documents and 110 standoff annotation files, manually annotated for instances of NFL Scoring annotation categories defined with respect to an NFL Scoring ontology.

The Machine Reading program aimed to develop automated reading systems to bridge the gap between knowledge contained in natural language texts and knowledge accessible to formal reasoning systems. The reading systems designed by program participants were required to extract and reason about facts from text in multiple domains.

The data in this release constitutes the training data for the NFL Scoring Use Cases evaluation, which tested the sports domain by extracting information about scoring events and game outcomes and aligning that information with an NFL Scoring ontology.

Machine Reading Phase 1 NFL Scoring Training Data is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Wednesday, September 16, 2015

LDC 2015 September Newsletter

New Publications

GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 4

GALE Phase 3 and 4 Arabic Newswire Parallel Text

NewSoMe Corpus of Opinion in News Reports

_______________________________________________________________

New Publications

(1) GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 4 was developed by LDC and contains 243,038 tokens of word-aligned Chinese and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.

Some approaches to statistical machine translation incorporate linguistic knowledge into word-aligned text as a means to improve automatic word alignment and machine translation quality.

This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. A set of word tags and alignment link tags are designed in the tagging scheme to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.

This release consists of Chinese source broadcast conversation (BC) and broadcast news (BN) programming collected by LDC in 2008 and 2009. The distribution by genre, words, character tokens and segments appears below:

Language	Genre	Files	Words	CharTokens	Segments
Chinese	BC	69	67,782	101,674	2,276
Chinese	BN	29	94,242	141,364	3,152
Total		98	162,024	243,038	5,428

Note that all token counts are based on the Chinese data only. One token is equivalent to one character and one word is equivalent to 1.5 characters.
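As a quick consistency check on the table and the conversion in the note above, the figures can be recomputed with a short Python sketch. The variable and function names are illustrative, and rounding down is an assumption of this sketch: it reproduces the per-genre word counts in the table, but the source does not say how the counts were derived.

```python
# Counts copied from the table above (Chinese data only).
rows = {
    "BC": {"files": 69, "words": 67_782, "char_tokens": 101_674, "segments": 2_276},
    "BN": {"files": 29, "words": 94_242, "char_tokens": 141_364, "segments": 3_152},
}


def estimated_words(char_tokens: int) -> int:
    """Estimate words from character tokens at 1.5 characters per word.

    Dividing by 1.5 equals multiplying by 2/3; integer arithmetic avoids
    floating-point error. Rounding down is an assumption of this sketch.
    """
    return char_tokens * 2 // 3


# Sum each column across genres to compare against the Total row.
columns = ("files", "words", "char_tokens", "segments")
totals = {col: sum(row[col] for row in rows.values()) for col in columns}

for genre, row in rows.items():
    print(genre, row["words"], estimated_words(row["char_tokens"]))
print("Total", totals)
```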

The Chinese word alignment tasks consisted of the following components:

Identifying, aligning, and tagging eight different types of links
Identifying, attaching, and tagging local-level unmatched words
Identifying and tagging sentence/discourse-level unmatched words
Identifying and tagging all instances of Chinese 的 (DE) except when they were a part of a semantic link

GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 4 is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus on disc. 2015 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) GALE Phase 3 and 4 Arabic Newswire Parallel Text was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phases 3 and 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Modern Standard Arabic source text and corresponding English translations selected from newswire data collected by LDC in 2007 and 2008 and transcribed and translated by LDC or under its direction.

This data includes 551 source-translation document pairs, comprising 156,775 tokens of Arabic source text and its English translation. Data is drawn from seven distinct Arabic newswire sources: Agence France Presse, Al Ahram, Al Hayat, Al-Quds Al-Arabi, An Nahar, Asharq Al-Awsat and Assabah.

The files in this release were transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with the Quick Rich Transcription guidelines developed by LDC. The transcribed and segmented files were reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Arabic to English translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations. Source data and translations are distributed in TDF format.

GALE Phase 3 and 4 Arabic Newswire Parallel Text is distributed via web download.

(3) NewSoMe Corpus of Opinion in News Reports was compiled at Barcelona Media and consists of Spanish, Catalan and Portuguese news reports annotated for opinions. It is part of the NewSoMe (News and Social Media) set of corpora presenting opinion annotations across several genres and covering multiple languages. NewSoMe is the result of an effort to build a unifying annotation framework for analyzing opinion in different genres, ranging from controlled text, such as news reports, to diverse types of user-generated content that includes blogs, product reviews and microblogs.

The source data in this release was obtained from various newspaper websites and consists of approximately 200 documents in each of Spanish, Catalan and Portuguese. The annotation was carried out manually through the crowdsourcing platform CrowdFlower with seven annotations per layer that were aggregated for this data set. The layers annotated were topic, segment, cue, subjectivity, polarity and intensity.

NewSoMe Corpus of Opinion in News Reports is distributed via web download.