Thursday, February 14, 2019

LDC 2019 February Newsletter

Only two weeks left to enjoy 2019 membership discounts

Spring 2019 LDC Data Scholarship recipients

LDC’s new language game

New publications:

___________________________________________________________

Only two weeks left to enjoy 2019 membership discounts
There is still time to save on 2019 membership fees. Through March 1, all organizations receive a discount on the 2019 membership fee (up to 10%) when they choose to join or renew. For more information on membership benefits, visit Join LDC.

Spring 2019 LDC Data Scholarship recipients
Congratulations to the recipients of LDC's Spring 2019 Data Scholarships:

Colin Annand: University of Cincinnati (USA); Ph.D. Psychology. Colin is awarded a copy of Switchboard-1 Release 2 for his research involving the relationship between speech patterns and conversation content.

Si Chen: Huazhong University of Science and Technology (China); B.S. Communication Engineering. Si is awarded a copy of ACE 2005 Multilingual Training Corpus for his work on event extraction. 

Noor-e-Hira: Fatima Jinnah Women University (Pakistan); MSc Computer Sciences. Noor is awarded a copy of NIST 2008 Open Machine Translation (OpenMT) Evaluation for her research in machine translation.

Matthew Roddy: Trinity College Dublin (Ireland); Ph.D. Electrical Engineering. Matthew is awarded copies of 2000 HUB5 English Evaluation Speech and Transcripts for his work in spoken dialogue systems.

Ammara Zafar: Fatima Jinnah Women University (Pakistan); MSc Computer Sciences. Ammara is awarded a copy of NIST 2009 Open Machine Translation (OpenMT) Evaluation for her research in machine translation.

For information about the program, visit the Data Scholarship page.

LDC’s new language game
LDC’s new language game, NameThatLanguage, tests your skill at recognizing the language spoken in short audio clips. The game includes thousands of clips to prevent memorization and offers a real challenge that increases as you progress. In addition to being fun, the game provides useful data on language confusability and linguistic diversity. Game results will be shared freely for research. New clips and more languages continue to be added, providing ongoing challenges and new research data. Help support language research by playing! https://namethatlanguage.org

New publications:

(1) DEFT Chinese Committed Belief Annotation was developed by LDC and consists of approximately 83,000 tokens of Chinese discussion forum text annotated for "committed belief," which marks the level of commitment displayed by the author to the truth of the propositions expressed in the text.

DARPA's Deep Exploration and Filtering of Text (DEFT) program aimed to address remaining capability gaps in state-of-the-art natural language processing technologies related to inference, causal relationships and anomaly detection. LDC supported the DEFT program by collecting, creating and annotating a variety of data sources.

DEFT Chinese Committed Belief Annotation is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.  

*

(2) IARPA Babel Lithuanian Language Pack IARPA-babel304b-v1.0b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 210 hours of Lithuanian conversational and scripted telephone speech collected in 2013 and 2014 along with corresponding transcripts.

The Lithuanian speech in this release represents that spoken in the Aukštaitian and Samogitian dialect regions of Lithuania. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 71 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Lithuanian Language Pack IARPA-babel304b-v1.0b is distributed via web download.

2019 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) Multi-Language Conversational Telephone Speech 2011 -- Arabic Group was developed by LDC and is comprised of approximately 117 hours of telephone speech in distinct dialects of colloquial Arabic: Iraqi, Levantine and Maghrebi.

The data were collected primarily to support research and technology evaluation in automatic language identification, and portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation (LRE). LRE 2011 focused on language pair discrimination for 24 languages/dialects, some of which could be considered mutually intelligible or closely related.

Multi-Language Conversational Telephone Speech 2011 -- Arabic Group is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee. 

*

(4) Multilingual ATIS was developed by Google Inc. and consists of 5,871 utterances from ATIS2 (LDC93S5), ATIS3 Training Data (LDC94S19), and ATIS3 Test Data (LDC95S26) annotated and translated into Hindi and Turkish.

The ATIS (Air Travel Information Services) collection was developed to support the research and development of speech understanding systems. Participants were presented with various hypothetical travel planning scenarios and asked to solve them by interacting with partially or completely automated ATIS systems. The resulting utterances were recorded and transcribed. Data was collected in the early 1990s at five US sites: Raytheon BBN, Carnegie Mellon University, MIT Laboratory for Computer Science, National Institute of Standards and Technology and SRI International.

The original English utterances were manually translated into Hindi and Turkish. For each utterance, this release also includes the original English text and a machine translation of the manual Hindi or Turkish translation back into English. Each utterance is annotated with named entities via table lookup; markers include city, airline and airport names, and dates.

Multilingual ATIS is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data at no cost.  

Tuesday, January 15, 2019

LDC 2019 January Newsletter

Renew Your LDC Membership Today

New publications:
_________________________________________________________

Renew Your LDC Membership Today
Join LDC while membership savings are still available. Now through March 1, 2019, all organizations receive a discount on the 2019 membership fee (up to 10%) when they choose to join the Consortium or renew their membership. This year’s planned publications include Multi-Language Conversational Telephone Speech (telephone speech in languages/dialects considered mutually intelligible or closely related), IARPA Babel Language Packs (telephone speech and transcripts in underserved languages), Chinese Abstract Meaning Representation Corpus, SRI Speech-Based Collaborative Learning Corpus, data from BOLT, HAVIC, DEFT, TAC KBP and more. Membership remains the most economical way to access LDC releases. Visit Join LDC for details on membership options and benefits.

New publications:
(1) BOLT Arabic Discussion Forum Parallel Training Data was developed by LDC and consists of 1,169,599 tokens of Egyptian Arabic discussion forum data collected for the DARPA BOLT program along with their corresponding English translations.

LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.

The source data in this release consists of discussion forum threads harvested from the Internet by LDC using a combination of manual and automatic processes. The full source data collection is released as BOLT Arabic Discussion Forums (LDC2018T10).

Data was manually selected for translation according to several criteria, including linguistic features and topic features. The files were then segmented into sentence units, formatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's BOLT translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations.

BOLT Arabic Discussion Forum Parallel Training Data is available as a web download. 

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) SRI Speech-Based Collaborative Learning Corpus was developed by SRI International and is comprised of approximately 120 hours of English speech from 134 US middle school students working collaboratively. The data set also contains orthographic transcriptions, manual annotation of collaboration, log files, and supporting documentation.

This collection was part of a project investigating the utility of a speech-based learning analytics approach to collaborative learning. The goal was to determine whether detectable patterns exist in student speech that correlate with collaborative learning indicators and to provide a means of assessing collaboration quality. The participants were students in middle schools (grades six, seven and eight) located in California. Students worked in groups of three on sets of short mathematics problems based on the "cloze" task: each student was assigned one blank, and each problem required the students to work together and talk to each other to coordinate their three answers. The problems were presented on iPads with a custom software application, and the audio data was captured by both head-mounted and table-top microphones.

SRI Speech-Based Collaborative Learning Corpus is available as a web download. 

2019 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) TAC KBP Entity Discovery and Linking - Comprehensive Training and Evaluation Data 2014-2015 was developed by LDC and contains training and evaluation data produced in support of the TAC KBP Entity Discovery and Linking (EDL) tasks in 2014 and 2015. It includes queries, knowledge base (KB) links, equivalence class clusters for NIL entities, and entity type information for each of the queries. Also included in this data set are all necessary source documents as well as BaseKB - the second reference KB that was adopted for use by EDL in 2015. The first EDL reference KB to which 2014 EDL data are linked is available separately as TAC KBP Reference Knowledge Base (LDC2014T16).

The goal of the EDL track is to conduct end-to-end entity extraction, linking and clustering. To produce gold standard data from a document collection, annotators (1) extract (identify and classify) entity mentions (queries) and link them to nodes in a reference KB, and (2) perform cross-document co-reference on within-document entity clusters that cannot be linked to the KB.

Source data consists of Chinese, English and Spanish newswire and web text collected by LDC. The EDL 2014 task involved English data only. Chinese and Spanish data were added in the 2015 task. 

TAC KBP Entity Discovery and Linking - Comprehensive Training and Evaluation Data 2014-2015 is available as a web download. 

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.