Friday, November 15, 2019

LDC 2019 November Newsletter

Join LDC for Membership Year 2020
Spring 2020 Data Scholarship Program
_________________________________________________________________________ 

Join LDC for Membership Year 2020 

Membership Year 2020 (MY2020) is open, and discounts are available for those who keep their membership current and join early in the year. Current MY2019 members who renew their LDC membership by March 2, 2020 will receive a 10% discount off the membership fee. New or returning organizations will receive a 5% discount through March 2.

In addition to receiving new publications, current LDC members also enjoy the benefit of licensing older data at reduced costs from our Catalog of over 800 holdings. Current-year for-profit members may use most data for commercial applications.

Plans for MY2020 publications are in progress. Among the expected releases are: 

Abstract Meaning Representation (AMR) Annotation Release 3.0: semantic treebank of over 59,000 English natural language sentences from broadcast conversations, newswire, weblogs and web discussion forums; updates the second version (LDC2017T10) with new annotations 
TAC KBP: English sentiment slot filling, surprise slot filling, nugget detection and coreference, and event argument data in all languages (English, Chinese and Spanish) 
DEFT Chinese ERE: Chinese discussion forum data annotated for entities, relations and events 
LibriVox Spanish: 73 hours of Spanish audiobook read speech and transcripts
IARPA Babel Language Packs (telephone speech and transcripts): languages include Dholuo, Javanese and Mongolian
HAVIC Med Training data: web video, metadata, and annotations for developing multimedia systems 
RATS Speaker Identification: conversational telephone speech in Levantine Arabic, Pashto, Urdu, Farsi and Dari on degraded audio signals with annotation of speech segments for speaker identification 
BOLT: discussion forums, SMS/chat, conversational telephone speech, word-aligned, tagged and co-reference data in all languages (Chinese, Egyptian Arabic, and English) 

It’s also not too late to join for MY2018 (through December 31, 2019) and MY2019 (through December 31, 2020). Data sets from those years include Concretely Annotated New York Times and English Gigaword, DIRHA English WSJ Audio, BOLT English Treebank – Discussion Forum, First DIHARD Challenge Development and Evaluation releases, Penn Discourse Treebank Version 3.0, and 2016 NIST Speaker Recognition Evaluation Test Set. 

For full descriptions of all LDC data sets, browse our Catalog.

Visit Join LDC for details on membership, user accounts and payment. 

Spring 2020 Data Scholarship Program 

Applications for the Spring 2020 LDC Data Scholarship program, which provides university students with no-cost access to LDC data, are being accepted through January 15, 2020. Consult the LDC Data Scholarship page for more information about program rules and submission requirements.
_________________________________________________________________________  

New publications: 

(1) DEFT English Committed Belief Annotation was developed by LDC and consists of approximately 950,000 words of English discussion forum text annotated for "committed belief," which marks the level of commitment displayed by the author to the truth of the propositions expressed in the text.

DARPA's Deep Exploration and Filtering of Text (DEFT) program aimed to address remaining capability gaps in state-of-the-art natural language processing technologies related to inference, causal relationships and anomaly detection. LDC supported the DEFT program by collecting, creating and annotating a variety of data sources.

DEFT English Committed Belief Annotation is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) CALLFRIEND American English-Non-Southern Dialect Second Edition was developed by LDC and consists of approximately 26 hours of unscripted telephone conversations between native speakers of non-Southern dialects of American English. This second edition updates the audio files to wav format, simplifies the directory structure and adds documentation and metadata. The first edition is available as CALLFRIEND American English-Non-Southern Dialect (LDC96S46).

All data was collected before July 1997. Participants could speak with a person of their choice on any topic; most called family members and friends. All calls originated in North America. The recorded conversations last up to 30 minutes. 

CALLFRIEND American English-Non-Southern Dialect Second Edition is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee. 

*

(3) TAC KBP Cold Start - Comprehensive Evaluation Data 2012-2017 was developed by LDC and contains Chinese, English and Spanish data produced in support of the TAC KBP Cold Start evaluation track conducted from 2012 to 2017. This corpus includes source documents, queries, assessments, manual runs and final assessments. 

In the Cold Start track, systems were evaluated on their ability to construct a new knowledge base (KB) from information provided in a text collection in combination with technologies developed in other TAC KBP tracks -- slot filling, information extraction, question answering and entity discovery and linking. Cold Start systems were required to find all entities in the text, and the resulting KB ideally had to include every person, organization, and geo-political entity as well as all the targeted relations between them. To facilitate the evaluation of those KBs, LDC annotators created sets of queries, human-generated responses to the queries, and assessments of both human and system responses.

The source data in this release consists of English and Spanish newswire and web text collected by LDC for the 2012, 2014 and 2015 evaluations and the 2016 pilot collection. The source collections for the 2016 and 2017 evaluations, which include Chinese data, are available in TAC KBP Evaluation Source Corpora 2016-2017 (LDC2019T12). The archived 2013 Cold Start source data collection is available from NIST upon request.

TAC KBP Cold Start - Comprehensive Evaluation Data 2012-2017 is distributed via web download. 

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee. 

*

(4) IARPA Babel Amharic Language Pack IARPA-babel307b-v1.0b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 204 hours of Amharic conversational and scripted telephone speech collected in 2014 along with corresponding transcripts.

The Amharic speech in this release represents the Addis Ababa, Shewa, and Gondar dialect regions of Ethiopia. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 60 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Amharic Language Pack IARPA-babel307b-v1.0b is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Tuesday, October 15, 2019

LDC 2019 October Newsletter

Membership Year 2020 Publication Preview
LDC data and commercial technology development 

New Publications: 
BOLT English Treebank - Discussion Forum 
Polish Speech Database 
2016 NIST Speaker Recognition Evaluation Test Set 
______________________________________________________________ 

Membership Year 2020 Publication Preview 

The 2020 Membership Year is just around the corner and plans for next year’s publications are in progress. Among the expected releases are:

Abstract Meaning Representation (AMR) Annotation Release 3.0: semantic treebank of over 59,000 English natural language sentences from broadcast conversations, newswire, weblogs and web discussion forums; updates the second version (LDC2017T10) with new annotations
TAC KBP: English sentiment slot filling, surprise slot filling, nugget detection and coreference, and event argument data in all languages (English, Chinese and Spanish)
DEFT Chinese ERE: Chinese discussion forum data annotated for entities, relations and events
LibriVox Spanish: 73 hours of Spanish audiobook read speech and transcripts
IARPA Babel Language Packs (telephone speech and transcripts): languages include Dholuo, Javanese and Mongolian
HAVIC Med Training data: web video, metadata, and annotations for developing multimedia systems
RATS Speaker Identification: conversational telephone speech in Levantine Arabic, Pashto, Urdu, Farsi and Dari on degraded audio signals with annotation of speech segments for speaker identification
BOLT: discussion forums, SMS/chat, conversational telephone speech, word-aligned, tagged and co-reference data in all languages (Chinese, Egyptian Arabic, and English)

Check your inbox in the coming weeks for more information about membership renewal. 

LDC data and commercial technology development 

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.
______________________________________________________________

New publications:  

(1) BOLT English Treebank - Discussion Forum was developed by LDC and consists of 268,907 tokens of English web discussion forum data with part-of-speech and syntactic structure annotations collected for the DARPA BOLT (Broad Operational Language Translation) program.

Part-of-speech and treebank annotation conformed to Penn Treebank II style, incorporating changes to those guidelines that were developed under the GALE (Global Autonomous Language Exploitation) program. Supplementary guidelines for English treebanks and web text are included with this release.

The source data is English discussion forum web text collected by LDC in 2011 and 2012. A subset of that data -- 702 files representing 268,907 tokens -- was selected for the treebank and annotated for word-level tokenization, part-of-speech and syntactic structure. The unannotated English source data is released as BOLT English Discussion Forums (LDC2017T11).

BOLT English Treebank - Discussion Forum is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

* 

(2) Polish Speech Database was developed by VoiceLab and consists of 263,424 utterances of Polish speech data from 200 speakers, totaling approximately 280 hours, and corresponding transcripts.

Data collection was performed in Poland. Speakers were asked to record themselves reading text on a website for at least 60 minutes from their home computer while using a headset. The read text consisted of sentences covering most speech sounds in Polish.

This release includes speaker metadata. There were 103 male speakers and 97 female speakers, ranging from 15 to 60 years of age; most speakers were between 15 and 30 years old.

Polish Speech Database is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

* 

(3) 2016 NIST Speaker Recognition Evaluation Test Set was developed by LDC and NIST (National Institute of Standards and Technology) and contains approximately 340 hours of short segments of Tagalog, Cantonese, Cebuano and Mandarin telephone speech used as development and test data in the NIST-sponsored 2016 Speaker Recognition Evaluation (SRE). 

As in previous evaluations, SRE16 focused on telephone speech recorded over a variety of handset types for the training and test conditions. In addition to development and evaluation data, this corpus also contains trial lists, their associated keys, tables containing metadata information, and evaluation documentation.

The telephone speech data was drawn from the Call My Net 2015 Corpus collected by LDC. Native speakers of Tagalog, Cantonese, Cebuano or Mandarin (220 unique speakers) each made ten telephone calls to people within their existing social networks. Speakers were encouraged to use different telephone instruments in a variety of acoustic settings and were instructed to talk for 8 to 10 minutes per call on a topic of their choice. All conversations were collected outside North America.
 
2016 NIST Speaker Recognition Evaluation Test Set is distributed via web download.  

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Tuesday, September 17, 2019

LDC 2019 September Newsletter

LDC at Interspeech 2019
_____________________________________________________________________

LDC at Interspeech 2019 

LDC is exhibiting at Interspeech 2019, September 15-19 in Graz, Austria. Stop by Booth F16 to learn more about recent developments at the Consortium and new publications.

Be on the lookout for The Second DIHARD Speech Diarization Challenge (DIHARD II), a special session co-organized by LDC, and the following presentations featuring LDC work: 

The Second DIHARD Diarization Challenge: Dataset, task, and baselines
Neville Ryant, Christopher Cieri, Mark Liberman (LDC), Kenneth Church (Baidu, USA), Alejandrina Cristia (Laboratoire de Sciences Cognitives et Psycholinguistique), Jun Du (University of Science and Technology of China), Sriram Ganapathy (Indian Institute of Science)
Oral Session, Tuesday September 17, 10:00 – 10:20, Hall 3 

Automatic Detection of Prosodic Focus in American English 
Sunghye Cho and Mark Liberman (LDC), Yong-cheol Lee (Cheongju University)
Poster Session, Wednesday September 18, 16:00 – 18:00, Gallery B 

Automatic detection of ASD in children using acoustic and text features from brief natural conversations 
Sunghye Cho, Mark Liberman, Neville Ryant (LDC), Meredith Cola, Robert T. Schultz, Julia Parish-Morris (Children's Hospital of Philadelphia)
Oral Session, Wednesday September 18, 16:45 – 17:00, Hall 3

LDC will post conference updates via our Twitter feed and Facebook page. We hope to see you there!

New publications: 

(1) CALLFRIEND Canadian French Second Edition was developed by LDC and consists of approximately 26 hours of unscripted telephone conversations between native speakers of Canadian French. This second edition updates the audio files to wav format, simplifies the directory structure and adds documentation and metadata. The first edition is available as CALLFRIEND Canadian French (LDC96S48).

All data was collected before July 1997. Participants could speak with a person of their choice on any topic; most called family members and friends. All calls originated in North America. The recorded conversations last up to 30 minutes. 

CALLFRIEND Canadian French Second Edition is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) BOLT Chinese-English Word Alignment and Tagging -- SMS/Chat Training was developed by LDC for the DARPA BOLT (Broad Operational Language Translation) program and consists of 388,027 words of Chinese and English parallel text enhanced with linguistic tags to indicate word relations.

This release consists of Chinese source text and chat conversations collected using two methods: new collection via LDC's collection platform and donation of SMS and chat archives from BOLT collection participants. The source data is released as BOLT Chinese SMS/Chat (LDC2018T15).

The BOLT word alignment task was built on treebank annotation. LDC automatically extracted Chinese source tokens, including empty categories/traces, from word-segmented files provided by the BOLT Chinese Treebank annotation team at Brandeis University. The word-segmented tokens were then used to automatically generate ctb (Chinese Treebank) alignment. They were also tokenized for character alignment by inserting white space to separate characters.
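As a rough illustration of the character tokenization step described above, the following Python sketch splits word-segmented tokens into whitespace-separated characters. It is not LDC's actual tooling: the function names are hypothetical, and the rule that a token beginning with "*" is an empty category/trace to be kept whole is an assumption made only for this example.

```python
def is_trace(token):
    # Assumption for illustration only: empty categories/traces are
    # written with a leading "*" (for example "*pro*").
    return token.startswith("*")


def to_character_tokens(word_tokens):
    """Turn word-segmented tokens into character-level tokens."""
    chars = []
    for token in word_tokens:
        if is_trace(token):
            chars.append(token)   # keep the trace as a single unit
        else:
            chars.extend(token)   # a str iterates one character at a time
    return chars


# Hypothetical word-segmented input, not taken from the corpus.
words = ["我们", "*pro*", "明天", "见"]
print(" ".join(to_character_tokens(words)))
```

Joining the result with single spaces yields the character-separated line, while the word-level tokens remain available for the ctb-style alignment.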

BOLT Chinese-English Word Alignment and Tagging -- SMS/Chat Training is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) Machine Reading Phase 1 NFL Scoring Training Data was developed by LDC for use in the DARPA (Defense Advanced Research Projects Agency) Machine Reading program. It contains 110 U.S. NFL (National Football League) scoring source documents and 110 standoff annotation files, manually annotated for instances of NFL Scoring annotation categories defined with respect to an NFL Scoring ontology.

The Machine Reading program aimed to develop automated reading systems to bridge the gap between knowledge contained in natural language texts and knowledge accessible to formal reasoning systems. The reading systems designed by program participants were required to extract and reason about facts from text in multiple domains.

The data in this release constitutes the training data for the NFL Scoring Use Cases evaluation, which tested the sports domain by extracting information about scoring events and game outcomes and aligning that information with an NFL Scoring ontology. 

Machine Reading Phase 1 NFL Scoring Training Data is distributed via web download. 

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Thursday, August 15, 2019

LDC 2019 August Newsletter

Fall 2019 LDC Data Scholarship Program 

New Publications:

TAC KBP Evaluation Source Corpora 2016-2017 
__________________________________________________________________ 

Fall 2019 LDC Data Scholarship Program 

Students can apply for the Fall 2019 LDC Data Scholarship program now through September 15, 2019. This scholarship program provides eligible students with access to LDC data at no cost. For application requirements and program rules, please visit the LDC Data Scholarship page. 


New publications: 

(1) Corpus of Conversational Persian Transcripts contains transcripts from approximately 20 hours of naturally occurring informal conversations in the Tehrani dialect of Iranian Persian.

This data set is extracted from 1,201 minutes of conversations among 22 participants (12 male and 10 female) who recorded their daily phone calls and face-to-face interactions in a variety of informal settings. Conversations represent various interaction types (dialogue and group conversation), settings (home, office, car, café and restaurant), types of relationship (family, couple, friend, acquaintance), and various communicative goals (joking, explaining, arguing, and complaining, among others). The corresponding speech is not included in this release.

The transcripts were annotated for gender, age, and recording method and setting.

Corpus of Conversational Persian Transcripts is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) TAC KBP Evaluation Source Corpora 2016-2017 was developed by LDC and contains the 180,003 Chinese, English and Spanish source documents used in support of all TAC KBP evaluation tracks conducted in 2016 and 2017.

The source data consists of Chinese, English and Spanish discussion forum and newswire text collected by LDC. Also provided are a series of lists and tables to aid in the recreation of specific test sets.

Text Analysis Conference (TAC) is a series of workshops organized by the National Institute of Standards and Technology (NIST), developed to encourage research in natural language processing and related applications. The Knowledge Base Population (KBP) track of TAC encourages the development of systems that can match entities mentioned in natural texts with those appearing in a knowledge base and extract novel information about entities from a document collection and add it to a new or existing knowledge base. 

TAC KBP Evaluation Source Corpora 2016-2017 is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) Multi-Language Conversational Telephone Speech 2011 -- East Asian was developed by LDC and consists of approximately 19 hours of telephone speech in two distinct languages of East Asia: Thai and Lao.

The data were collected primarily to support research and technology evaluation in automatic language identification, and portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation (LRE). Participants were recruited by native speakers who contacted acquaintances in their social network. Those native speakers made one call, up to 15 minutes, to each acquaintance. Calls are labeled by human auditors for callee gender, dialect type, and noise.  

LDC has also released other corpora as part of the Multi-Language Conversational Telephone Speech 2011 series.

Multi-Language Conversational Telephone Speech 2011 -- East Asian is distributed via web download. 

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(4) IARPA Babel Igbo Language Pack IARPA-babel306b-v2.0c was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 207 hours of Igbo conversational and scripted telephone speech collected in 2014 and 2015 along with corresponding transcripts.

The Igbo speech in this release represents the Owerri, Onitsha, and Ngwa dialects spoken in Nigeria. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 67 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Igbo Language Pack IARPA-babel306b-v2.0c is distributed via web download.

2019 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*