Linguistic Data Consortium: English Discussion Forums

Tuesday, December 15, 2020

LDC 2020 December Newsletter

LDC 2021 Membership Discounts Now Available
Approaching Deadline for Spring 2021 Data Scholarship Applications
LDC Closed for Winter Break Dec. 24-Jan. 5

New Publications:
BOLT English Co-reference – Discussion Forum, SMS/Chat, and Conversational Telephone Speech
Phonemes of Arabic
Global TIMIT Mandarin Chinese – Guanzhong Dialect

_______________________________________________________________________

LDC 2021 Membership Discounts Now Available
Now through March 1, 2021, current 2020 members receive a 10% discount for renewing their membership and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for details on membership options and benefits.

Approaching Deadline for Spring 2021 Data Scholarship Applications
Attention students: don’t miss out on the chance to receive no-cost access to LDC data for your research. Applications for Spring 2021 data scholarships are due January 15, 2021. For more information on requirements and program rules, see LDC Data Scholarships.

LDC Closed for Winter Break Dec. 24-Jan. 5
LDC will be closed from Thursday, December 24, 2020 through Tuesday, January 5, 2021 in accordance with the University of Pennsylvania Winter Break Policy. Our offices will reopen on Wednesday, January 6, 2021. Requests received by the Membership Office during Winter Break will be processed when the office reopens.

New publications:
(1) BOLT English Co-reference – Discussion Forum, SMS/Chat, and Conversational Telephone Speech was developed by Raytheon BBN Technologies for the BOLT co-reference task and consists of co-reference annotation of English discussion forum, SMS/chat, and conversational telephone speech data.

Co-reference annotation aims to fill in connections between specific mentions in the text that refer to the same entities and events in the discourse context. BOLT co-reference annotation was performed on BOLT treebank annotation and covers noun phrases (including proper nouns, nominals, pronouns, and null arguments), possessives, proper noun pre-modifiers, and verbs.

The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging, and chat -- in Chinese, Egyptian Arabic, and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking, and co-reference.

BOLT English Co-reference – Discussion Forum, SMS/Chat, and Conversational Telephone Speech is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) Phonemes of Arabic was developed at the Florida Institute of Technology. It contains approximately one hour of speech from native Arabic speakers that includes all Arabic sounds (consonants and vowels) and 24 words with specific consonant-vowel patterns.

Arabic has three short vowels, three long vowels, and 28 consonants. Speakers recorded all sounds, repeating each sound three times. Each speaker also recorded 24 Arabic words with a specified consonant-vowel pattern and repeated each word three times. The speakers (19 male) were from the following countries: Egypt, Iraq, Lebanon, Libya, Morocco, Saudi Arabia, and Syria.

Phonemes of Arabic is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) Global TIMIT Mandarin Chinese – Guanzhong Dialect was developed by LDC and Xi’an Jiaotong University and consists of approximately five hours of read speech and transcripts in the Guanzhong dialect of Mandarin Chinese as spoken in Shaanxi province. It contains recordings of 50 speakers, each reading 120 sentences from Chinese Gigaword Fifth Edition (LDC2011T13). Of each speaker's 120 sentences, 20 were read by all speakers, 40 were read by 10 speakers, and 60 were read by that speaker alone, for a total of 3220 sentence types.
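The total of 3220 sentence types can be checked with a short calculation. The sketch below is illustrative only and is not part of the corpus release. It assumes every speaker reads exactly 20 + 40 + 60 sentences and that each sentence in the middle group is shared by exactly 10 speakers. The function name and parameters are hypothetical.

```python
# Illustrative check of the sentence-type count described above.
# Assumptions (not from the corpus documentation): every speaker reads
# exactly 20 + 40 + 60 sentences, and sharing in the middle group is exact.

def sentence_types(speakers: int,
                   read_by_all: int = 20,
                   read_by_ten: int = 40,
                   read_by_one: int = 60,
                   group_size: int = 10) -> int:
    """Count distinct sentences (types) across all speakers."""
    # One common set, read by every speaker.
    common = read_by_all
    # Each of these sentences is reused by `group_size` speakers,
    # so the distinct count is the total readings divided by that size.
    shared = speakers * read_by_ten // group_size
    # Each of these sentences is read by exactly one speaker.
    unique = speakers * read_by_one
    return common + shared + unique


print(sentence_types(50))  # 20 + 200 + 3000 = 3220
```

Under the same assumptions, the number of recorded utterances is 50 × 120 = 6000, far more than the 3220 distinct sentences, because shared sentences are read more than once.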

The corpus was recorded at Xi’an Jiaotong University, Xi’an, China. Speakers (25 female, 25 male) were born in Weinan, Shaanxi and spoke the Guanzhong dialect.

The Global TIMIT project aimed to create a series of corpora in a variety of languages with key features similar to those of the original TIMIT data set, which was designed for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. Specifically, those features include:

A large number of fluently read sentences, containing a representative sample of phonetic, lexical, syntactic, semantic, and pragmatic patterns
A relatively large number of speakers
Time-aligned lexical and phonetic transcription of all utterances
Some sentences read by all speakers, others read by a few speakers, and others read by just one speaker

Global TIMIT Mandarin Chinese – Guanzhong Dialect is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Tuesday, October 15, 2019

LDC 2019 October Newsletter

Membership Year 2020 Publication Preview

LDC data and commercial technology development

New Publications:
BOLT English Treebank - Discussion Forum
Polish Speech Database
2016 NIST Speaker Recognition Evaluation Test Set
______________________________________________________________

Membership Year 2020 Publication Preview

The 2020 Membership Year is just around the corner and plans for next year’s publications are in progress. Among the expected releases are:

Abstract Meaning Representation (AMR) Annotation Release 3.0: semantic treebank of over 59,000 English natural language sentences from broadcast conversations, newswire, weblogs and web discussion forums; updates the second version (LDC2017T10) with new annotations

TAC KBP: English sentiment slot filling, surprise slot filling, nugget detection and coreference, and event argument data in all languages (English, Chinese and Spanish)

DEFT Chinese ERE: Chinese discussion forum data annotated for entities, relations and events

LibriVox Spanish: 73 hours of Spanish audiobook read speech and transcripts

IARPA Babel Language Packs (telephone speech and transcripts): languages include Dholuo, Javanese and Mongolian

HAVIC Med Training data: web video, metadata, and annotations for developing multimedia systems

RATS Speaker Identification: conversational telephone speech in Levantine Arabic, Pashto, Urdu, Farsi and Dari on degraded audio signals with annotation of speech segments for speaker identification

BOLT: discussion forums, SMS/chat, conversational telephone speech, word-aligned, tagged and co-reference data in all languages (Chinese, Egyptian Arabic, and English)

Check your inbox in the coming weeks for more information about membership renewal.

LDC data and commercial technology development

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

______________________________________________________________

New publications:

(1) BOLT English Treebank - Discussion Forum was developed by LDC and consists of 268,907 tokens of English web discussion forum data with part-of-speech and syntactic structure annotations collected for the DARPA BOLT (Broad Operational Language Translation) program.

Part-of-speech and treebank annotation conformed to Penn Treebank II style, incorporating changes to those guidelines that were developed under the GALE (Global Autonomous Language Exploitation) program. Supplementary guidelines for English treebanks and web text are included with this release.

The source data is English discussion forum web text collected by LDC in 2011 and 2012. A subset of that data -- 702 files representing 268,907 tokens -- was selected for the treebank and annotated for word-level tokenization, part-of-speech and syntactic structure. The unannotated English source data is released as BOLT English Discussion Forums (LDC2017T11).

BOLT English Treebank - Discussion Forum is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) Polish Speech Database was developed by VoiceLab and consists of 263,424 utterances of Polish speech data from 200 speakers, totaling approximately 280 hours, and corresponding transcripts.

Data collection was performed in Poland. Speakers were asked to record themselves reading text on a website for at least 60 minutes from their home computer while using a headset. The text consisted of sentences covering most speech sounds in Polish.

This release includes speaker metadata. There were 103 male speakers and 97 female speakers, ranging from 15 – 60 years of age; most speakers were in the 15 – 30 years age range.

Polish Speech Database is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) 2016 NIST Speaker Recognition Evaluation Test Set was developed by LDC and NIST (National Institute of Standards and Technology) and contains approximately 340 hours of short segments of Tagalog, Cantonese, Cebuano and Mandarin telephone speech used as development and test data in the NIST-sponsored 2016 Speaker Recognition Evaluation (SRE).

As in previous evaluations, SRE16 focused on telephone speech recorded over a variety of handset types for the training and test conditions. In addition to development and evaluation data, this corpus also contains trial lists, their associated keys, tables containing metadata information, and evaluation documentation.

The telephone speech data was drawn from the Call My Net 2015 Corpus collected by LDC. Native speakers of Tagalog, Cantonese, Cebuano or Mandarin (220 unique speakers) made a total of ten telephone calls each to people within their existing social networks. Speakers were encouraged to use different telephone instruments in a variety of acoustic settings and were instructed to talk for 8-10 minutes per call on a topic of their choice. All conversations were collected outside North America.

2016 NIST Speaker Recognition Evaluation Test Set is distributed via web download.

Wednesday, July 19, 2017

LDC July 2017 Newsletter

LDC at ACL 2017

Fall 2017 Data Scholarship Program

New corpora:

BOLT English Discussion Forums

IARPA Babel Tamil Language Pack IARPA-babel204b-v1.1b
KSUEmotions
Metalogue Multi-Issue Bargaining Dialogue
_________________________________________________________________________

LDC at ACL 2017: July 31-August 2, Vancouver, Canada

ACL has returned to North America and LDC is taking this opportunity to interact with top HLT researchers gathering in Vancouver, Canada. Stop by our exhibition table to learn more about recent developments at the Consortium and new publications.

Fall 2017 Data Scholarship Program

Student applications for the Fall 2017 LDC Data Scholarship program are being accepted now through Friday, September 15, 2017, 11:59PM EST. The LDC Data Scholarship program provides university students with access to LDC data at no cost. Students must complete an application which consists of a data use proposal and letter of support from their advisor.

For more information on application requirements and program rules, please visit the LDC Data Scholarship page.

Applicants can email their materials to the LDC Data Scholarship program.

New corpora

(1) BOLT English Discussion Forums was developed by LDC and consists of 830,440 discussion forum threads in English harvested from the Internet using a combination of manual and automatic processes.

The BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.

The material in this release represents the unannotated English source data in the discussion forum genre. Collection was seeded based on the results of manual data scouting by native speaker annotators. When multiple threads from a forum were submitted, the entire forum was automatically harvested and added to the collection. Only a small portion of the threads included in this release were manually reviewed, and it is expected that there may be some offensive or otherwise undesired content as well as some threads that contain a large amount of non-English content. Language identification was performed on all threads in this corpus (using CLD2).

BOLT English Discussion Forums is distributed via web download.

2017 Subscription Members will automatically receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) IARPA Babel Tamil Language Pack IARPA-babel204b-v1.1b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains 200 hours of Tamil conversational and scripted telephone speech collected in 2012 and 2013 along with corresponding transcripts.

The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

The Tamil speech in this release represents that spoken in the Northern, Central, Southern and Western dialect regions of the Indian state of Tamil Nadu. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 65 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Tamil Language Pack IARPA-babel204b-v1.1b is distributed via web download.

2017 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) KSUEmotions was developed by King Saud University (KSU) and contains approximately five hours of emotional Modern Standard Arabic (MSA) speech from 23 subjects. Speakers were from three countries: Yemen, Saudi Arabia and Syria.

Subjects read MSA sentences from newswire text in the following emotions: neutral, anger, sadness, happiness, surprise, and interrogative (asking a question). Human reviewers then listened to the recordings to identify the emotion they heard. Audio was recorded in each participant's home.

KSUEmotions is distributed via web download.

(4) Metalogue Multi-Issue Bargaining Dialogue was developed by the Metalogue Consortium under the European Community's Seventh Framework Programme for Research and Technological Development. This release consists of approximately 2.5 hours of semantically annotated English dialogue data that includes speech and transcripts.

The goal of the Metalogue project was to develop a dialogue system whose flexible dialogue management supports the system's behavior in setting goals, choosing strategies and monitoring various processes. Six unique subjects (undergraduates between 19 and 25 years of age) were involved in a multi-issue bargaining scenario in which a representative of a city council and a representative of small business owners negotiated the implementation of new anti-smoking regulations. The negotiation involved four issues, each with four or five options. Participants received a preference profile for each scenario and negotiated for an agreement with the highest value based on their preference information. Negotiators were not allowed to accept an agreement with a negative value or to share their preference profiles with other participants.

The dialogue speech was captured with two headset microphones and saved in 16 kHz, 16-bit mono linear PCM FLAC format. Transcripts were produced semi-automatically, using an automatic speech recognizer followed by manual correction.

Metalogue Multi-Issue Bargaining Dialogue is distributed via web download.