Thursday, February 14, 2019

LDC 2019 February Newsletter

Only two weeks left to enjoy 2019 membership discounts

Spring 2019 LDC Data Scholarship recipients

LDC’s new language game

New publications:

___________________________________________________________

Only two weeks left to enjoy 2019 membership discounts
There is still time to save on 2019 membership fees. Through March 1, all organizations receive a discount on the 2019 membership fee (up to 10%) when they choose to join or renew. For more information on membership benefits, visit Join LDC.

Spring 2019 LDC Data Scholarship recipients
Congratulations to the recipients of LDC's Spring 2019 Data Scholarships:

Colin Annand: University of Cincinnati (USA); Ph.D. Psychology. Colin is awarded a copy of Switchboard-1 Release 2 for his research involving the relationship between speech patterns and conversation content.

Si Chen: Huazhong University of Science and Technology (China); B.S. Communication Engineering. Si is awarded a copy of ACE 2005 Multilingual Training Corpus for his work on event extraction. 

Noor-e-Hira: Fatima Jinnah Women University (Pakistan); MSc. Computer Sciences. Noor is awarded a copy of NIST 2008 Open Machine Translation (OpenMT) Evaluation for her research in machine translation.

Matthew Roddy: Trinity College Dublin (Ireland); Ph.D. Electrical Engineering. Matthew is awarded copies of 2000 HUB5 English Evaluation Speech and Transcripts for his work in spoken dialogue systems.

Ammara Zafar: Fatima Jinnah Women University (Pakistan); MSc. Computer Sciences. Ammara is awarded a copy of NIST 2009 Open Machine Translation (OpenMT) Evaluation for her research in machine translation.

For information about the program, visit the Data Scholarship page.

LDC’s new language game
LDC’s new language game, NameThatLanguage, tests your skill at recognizing the language spoken in short audio clips. The game includes thousands of clips to prevent memorization and offers a real challenge that increases as you progress. In addition to being fun, the game provides useful data on language confusability and linguistic diversity. Game results will be shared freely for research. New clips and more languages continue to be added, providing ongoing challenges and new research data. Help support language research by playing! https://namethatlanguage.org

New publications:

(1) DEFT Chinese Committed Belief Annotation was developed by LDC and consists of approximately 83,000 tokens of Chinese discussion forum text annotated for "committed belief," which marks the level of commitment displayed by the author to the truth of the propositions expressed in the text.

DARPA's Deep Exploration and Filtering of Text (DEFT) program aimed to address remaining capability gaps in state-of-the-art natural language processing technologies related to inference, causal relationships and anomaly detection. LDC supported the DEFT program by collecting, creating and annotating a variety of data sources.

DEFT Chinese Committed Belief Annotation is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.  

*

(2) IARPA Babel Lithuanian Language Pack IARPA-babel304b-v1.0b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 210 hours of Lithuanian conversational and scripted telephone speech collected in 2013 and 2014 along with corresponding transcripts.

The Lithuanian speech in this release represents that spoken in the Aukštaitian and Samogitian dialect regions of Lithuania. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 71 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Lithuanian Language Pack IARPA-babel304b-v1.0b is distributed via web download.

2019 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) Multi-Language Conversational Telephone Speech 2011 -- Arabic Group was developed by LDC and is comprised of approximately 117 hours of telephone speech in distinct dialects of colloquial Arabic: Iraqi, Levantine and Maghrebi.

The data were collected primarily to support research and technology evaluation in automatic language identification, and portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation (LRE). LRE 2011 focused on language pair discrimination for 24 languages/dialects, some of which could be considered mutually intelligible or closely related.

Multi-Language Conversational Telephone Speech 2011 -- Arabic Group is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee. 

*

(4) Multilingual ATIS was developed by Google Inc. and consists of 5,871 utterances from ATIS2 (LDC93S5), ATIS3 Training Data (LDC94S19), and ATIS3 Test Data (LDC95S26) annotated and translated into Hindi and Turkish.

The ATIS (Air Travel Information Services) collection was developed to support the research and development of speech understanding systems. Participants were presented with various hypothetical travel planning scenarios and asked to solve them by interacting with partially or completely automated ATIS systems. The resulting utterances were recorded and transcribed. Data was collected in the early 1990s at five US sites: Raytheon BBN, Carnegie Mellon University, MIT Laboratory for Computer Science, National Institute of Standards and Technology and SRI International.

The original English utterances were manually translated into Hindi and Turkish. This release also includes the original English utterance and the machine translation back into English of the manual target language utterance translation. Each utterance is annotated with named entities via table lookup; markers include city, airline, airport names, and dates.

Multilingual ATIS is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data at no cost.  

Tuesday, January 15, 2019

LDC 2019 January Newsletter

Renew Your LDC Membership Today

New publications:
_________________________________________________________

Renew Your LDC Membership Today
Join LDC while membership savings are still available. Now through March 1, 2019, all organizations receive a discount on the 2019 membership fee (up to 10%) when they choose to join the Consortium or renew their membership. This year’s planned publications include Multilanguage Conversational Telephone Speech (telephone speech in languages/dialects considered mutually intelligible or closely related), IARPA Babel Language Packs (telephone speech and transcripts in underserved languages), Chinese Abstract Meaning Representation Corpus, SRI Speech-Based Collaborative Learning Corpus, data from BOLT, HAVIC, DEFT, TAC KBP and more. Membership remains the most economical way to access LDC releases. Visit Join LDC for details on membership options and benefits.

New publications:
(1) BOLT Arabic Discussion Forum Parallel Training Data was developed by LDC and consists of 1,169,599 tokens of Egyptian Arabic discussion forum data collected for the DARPA BOLT program along with their corresponding English translations.

LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.

The source data in this release consists of discussion forum threads harvested from the Internet by LDC using a combination of manual and automatic processes. The full source data collection is released as BOLT Arabic Discussion Forums (LDC2018T10).

Data was manually selected for translation according to several criteria, including linguistic features and topic features. The files were then segmented into sentence units, formatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's BOLT translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations.

BOLT Arabic Discussion Forum Parallel Training Data is available as a web download. 

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) SRI Speech-Based Collaborative Learning Corpus was developed by SRI International and is comprised of approximately 120 hours of English speech from 134 US middle school students working collaboratively. The data set also contains orthographic transcriptions, manual annotation of collaboration, log files, and supporting documentation.

This collection was part of a project investigating the utility of a speech-based learning analytics approach to collaborative learning. The goal was to determine whether detectable patterns exist in student speech that correlate with collaborative learning indicators and to provide a means of assessing collaboration quality. The participants were students in middle schools (grades six, seven and eight) located in California. Students worked in groups of three on sets of short mathematics problems based on the "cloze" task in which each student was assigned one blank and each problem required the students to work together and talk to each other to coordinate their three answers. The problems were presented on iPads with a custom software application and the audio data was captured by both head-mounted and table-top microphones.

SRI Speech-Based Collaborative Learning Corpus is available as a web download. 

2019 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) TAC KBP Entity Discovery and Linking - Comprehensive Training and Evaluation Data 2014-2015 was developed by LDC and contains training and evaluation data produced in support of the TAC KBP Entity Discovery and Linking (EDL) tasks in 2014 and 2015. It includes queries, knowledge base (KB) links, equivalence class clusters for NIL entities, and entity type information for each of the queries. Also included in this data set are all necessary source documents as well as BaseKB - the second reference KB that was adopted for use by EDL in 2015. The first EDL reference KB to which 2014 EDL data are linked is available separately as TAC KBP Reference Knowledge Base (LDC2014T16).

The goal of the EDL track is to conduct end-to-end entity extraction, linking and clustering. For producing gold standard data, given a document collection, annotators (1) extract (identify and classify) entity mentions (queries), link them to nodes in a reference KB and (2) perform cross-document co-reference on within-document entity clusters that cannot be linked to the KB. 

Source data consists of Chinese, English and Spanish newswire and web text collected by LDC. The EDL 2014 task involved English data only. Chinese and Spanish data were added in the 2015 task. 

TAC KBP Entity Discovery and Linking - Comprehensive Training and Evaluation Data 2014-2015 is available as a web download. 

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee. 

Monday, December 17, 2018

LDC 2018 December Newsletter

LDC Membership Discounts for MY2019 Still Available

Spring 2019 LDC Data Scholarship Program - deadline approaching

New publications:
_________________________________________________________

LDC Membership Discounts for MY2019 Still Available
Join LDC while membership savings are still available. Now through March 1, 2019, renewing MY2018 members will receive a 10% discount off the membership fee. New or non-consecutive member organizations will receive a 5% discount. Membership remains the most economical way to access LDC releases. Visit Join LDC for details on membership options and benefits.

Spring 2019 LDC Data Scholarship Program - deadline approaching 
Students can apply for the Spring 2019 Data Scholarship Program now through January 15, 2019. The LDC Data Scholarship program provides students with access to LDC data at no cost. For more information on application requirements and program rules, please visit LDC Data Scholarships.

New publications:
(1) HUB5 Mandarin Telephone Speech and Transcripts Second Edition was developed by LDC in support of US government projects for language recognition and Large Vocabulary Conversational Speech Recognition (LVCSR). The first edition was released by LDC in two data sets, HUB5 Mandarin Telephone Speech Corpus (LDC98S69) and HUB5 Mandarin Transcripts (LDC98T26). This second edition merges the speech and transcript releases, updates the audio format, and adds Pinyin transcripts, forced alignment, and updated documentation and metadata.

This corpus contains (1) approximately 19 hours of Mandarin speech from 42 unscripted telephone conversations between native speakers of Mandarin from CALLFRIEND Mandarin Chinese-Mainland Dialect (LDC96S55), which has also been released in a second, updated edition (LDC2018S09), and (2) associated transcripts of contiguous 5-30 minute segments from those telephone conversations.

Participants could speak with a person of their choice on any topic; most called family members and friends. The recorded conversations lasted up to 30 minutes. Transcripts were created manually by native Mandarin speakers in the GB2312 encoding. This release includes Pinyin transcripts and the original transcripts, both in UTF-8 format.
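For readers who still hold first-edition transcript files, moving from GB2312 to UTF-8 can be sketched as below. This is only an illustration and not LDC's actual conversion tooling; the function name gb2312_to_utf8 and the sample string are hypothetical, and the sketch relies only on the "gb2312" codec in Python's standard library.

```python
def gb2312_to_utf8(raw: bytes) -> bytes:
    """Decode GB2312-encoded bytes and re-encode the same text as UTF-8."""
    # errors="strict" (the default) raises UnicodeDecodeError on bytes that
    # are not valid GB2312, so damaged input is reported rather than hidden.
    text = raw.decode("gb2312")
    return text.encode("utf-8")


# Hypothetical sample: a two-character Mandarin greeting stored as GB2312.
sample = "你好".encode("gb2312")
converted = gb2312_to_utf8(sample)
print(converted.decode("utf-8"))
```

Decoding strictly is a deliberate choice in this sketch: a transcript with stray bytes fails loudly instead of producing silently corrupted characters.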

HUB5 Mandarin Telephone Speech and Transcripts Second Edition is available via web download. 

2018 Subscription Members will automatically receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.


*

(2) Nautilus Speaker Characterization was developed at the Technical University of Berlin and is comprised of approximately 155 hours of conversational speech from 300 German speakers aged 18 to 35 years (126 males and 174 females) with no marked dialect or accent, recorded in an acoustically-isolated room. The corpus was designed to support research on the detection of speaker social characteristics, such as personality, charisma, and voice attractiveness.

Four scripted and four semi-spontaneous dialogs simulating telephone call inquiries were elicited from the speakers. Additionally, spontaneous neutral and emotional speech utterances (predominantly excitement or frustration) and questions were produced.

Speech corresponding to one of the semi-spontaneous dialogs was evaluated with respect to 34 continuous numeric labels of perceived interpersonal speaker characteristics (such as likable, attractive, competent, childish). For a set of 20 selected "extreme" speakers evaluated for their warmth-attractiveness, 34 naive voice descriptions (such as bright, creaky, articulate, melodious) were also evaluated. The corpus contains all labels, together with the speech recordings and the speakers' metadata (e.g., age, gender, place of birth, chronological places of residence and duration of stay, parents' place of birth, self-assessed personality).

Nautilus Speaker Characterization is available via web download. 

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data at no cost.

*

(3) TAC Relation Extraction Dataset (TACRED) was developed by The Stanford NLP Group and is a large-scale relation extraction dataset with 106,264 examples built over English newswire and web text used in the NIST TAC KBP English slot filling evaluations during the period 2009-2014. The annotations were derived from TAC KBP relation types (see the guidelines), from human annotations developed by LDC and from crowdsourcing using Mechanical Turk.

Source corpora used for this dataset were TAC KBP Comprehensive English Source Corpora 2009-2014 (LDC2018T03) and TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation Data 2009-2014 (LDC2018T22). For detailed information about the dataset and benchmark results, please refer to the TACRED paper.

TAC Relation Extraction Dataset is available via web download. 

2018 Subscription Members will automatically receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Thursday, November 15, 2018

LDC 2018 November Newsletter

Join LDC for Membership Year 2019

Spring 2019 Data Scholarship Program

Commercial use and LDC data

New publications:
_________________________________________________________________

Join LDC for Membership Year 2019
Membership Year 2019 (MY2019) is open and discounts are available for those who keep their membership current and join early in the year. Through March 1, 2019, current MY2018 members who renew their LDC membership will receive a 10% discount off the membership fee. New or returning organizations will receive a 5% discount through the same date.

In addition to receiving new publications, current LDC members also enjoy the benefit of licensing older data at reduced costs from our Catalog of over 750 holdings. Current-year for-profit members may use most data for commercial applications. 

Plans for MY2019 publications are in progress. Among the expected releases are:

  • SRI Speech-Based Collaborative Learning Corpus: speech from over 100 US middle school students performing collaborative learning tasks, includes audio recordings, orthographic transcriptions, manual annotation of collaboration, and related documentation
  • Chinese Abstract Meaning Representation (AMR): developed by Nanjing Normal University and Brandeis University, semantic representation of approximately 10,000 Chinese sentences following the basic principles of AMR using web source data from Chinese Treebank 8.0 (LDC2013T21)
  • Multilanguage conversational telephone speech: developed to support language identification research in related languages (Arabic, East Asian, English, Mandarin)
  • TAC KBP: English entity discovery and linking, nugget detection and event argument data, Chinese slot-filling data
  • CALLFRIEND Second Edition: updated releases with .wav format audio, simplified directory structure and enhanced documentation and metadata (English, Egyptian Arabic, Mandarin Chinese-Taiwan)
  • HAVIC Med Progress Test data: English web video, metadata, and annotations for developing multimedia systems
  • IARPA Babel Language Packs (telephone speech and transcripts): languages include Amharic, Guarani, Igbo, and Lithuanian
  • BOLT: discussion forums, SMS, word-aligned and tagged data in all languages (Chinese, Egyptian Arabic, English)

And it’s not too late to join for MY2017 (through December 31, 2018) and MY2018 (through December 31, 2019). Data sets from those years include 2010 NIST Speaker Recognition Evaluation Test Set, RATS Keyword Spotting and Language Identification releases, CHiME, Noisy TIMIT Speech, Concretely Annotated New York Times and English Gigaword, DIRHA English WSJ Audio, LORELEI Amharic and Somali Language Packs and DEFT Spanish Treebank. For full descriptions of all LDC data sets, browse our Catalog.

Visit Join LDC for details on membership, user accounts and payment.

Spring 2019 Data Scholarship Program
Applications are now being accepted through January 15, 2019, for the Spring 2019 LDC Data Scholarship program, which provides university students with no-cost access to LDC data. Consult the LDC Data Scholarship page for more information about program rules and submission requirements.

Commercial use and LDC data
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:
(1) AISHELL-1 was developed by Beijing Shell Shell Technology Co., Ltd. It contains approximately 520 hours of Chinese Mandarin speech from 400 speakers recorded simultaneously on three different devices with associated transcripts.

The goal of the collection was to support speech recognition system development in 11 domains, including smart homes, autonomous driving, entertainment, finance and science and technology. Participants read 500 sentences covering the domains; sentences were chosen for their speech and phonetic characteristics. The speech was recorded in a quiet indoor environment on a high-fidelity microphone and two mobile phones (Android and iOS).

Speakers were recruited from different accent areas across China, including North, South and Yue-Gui-Min regions. There were 214 female speakers and 186 male speakers. Additional demographic information about the participants is included in this release.

AISHELL-1 is distributed via hard drive.

2018 Subscription Members will automatically receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) Avatar Education Portuguese was developed by the University of Pernambuco and consists of approximately 80 minutes of Brazilian Portuguese microphone speech with phonetic and orthographic transcriptions. The data was developed for Avatar Education, an animated virtual assistant designed to enhance communication and interaction in educational contexts, such as online learning.

The corpus contains 1,400 speakers (700 male, 700 female) who generated 1,400 utterances from read and spontaneous speech. Utterances were transcribed at the word level (without time alignments) and at the phoneme level (with time alignment labels).

Avatar Education Portuguese is distributed via web download. 

2018 Subscription Members will automatically receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) BOLT Egyptian Arabic Treebank - Discussion Forum was developed by LDC and consists of Egyptian Arabic web discussion forum data with part-of-speech annotation, morphology, gloss and syntactic tree annotation collected for the DARPA Broad Operational Language Translation (BOLT) Program. 

The annotations in this release follow Penn Arabic Treebank (PATB) annotation guidelines. There are two kinds of morphological analysis synchronized in the corpus. LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1 (LDC2010L01) was used for Modern Standard Arabic tokens, and CALIMA (Columbia Arabic Language and dIalect Morphological Analyzer) was used for Egyptian Arabic tokens.

This release contains 440,448 tokens before clitics were split and 508,548 tree tokens after clitics were split for treebank annotation. The source material is web discussion forums collected by LDC from various sources.

The unannotated Egyptian Arabic source data is released as BOLT Arabic Discussion Forums (LDC2018T10).

BOLT Egyptian Arabic Treebank - Discussion Forum is distributed via web download. 

2018 Subscription Members will automatically receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(4) IARPA Babel Telugu Language Pack IARPA-babel303b-v1.0a was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 201 hours of Telugu conversational and scripted telephone speech collected in 2013 and 2014 along with corresponding transcripts.

The Telugu speech in this release represents that spoken in the Central, East, South and North Telugu dialect regions of India. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 65 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Telugu Language Pack IARPA-babel303b-v1.0a is available via web download.

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.




Monday, October 15, 2018

LDC 2018 October Newsletter

In this newsletter: 

Fall 2018 LDC Data Scholarship Recipients
Membership Year 2019 Publication Preview

New Publications:
Concretely Annotated English Gigaword
TRAD Arabic-French Parallel Text -- Newswire
TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation Data 2009-2014
__________________________________________________________________________

Fall 2018 LDC Data Scholarship Recipients

Congratulations to the recipients of LDC's Fall 2018 Data Scholarships:

Utkrist Adhikari: University of Bonn (Germany); M.Sc, Computer Science. Utkrist is awarded a copy of Treebank-2 for his research in named entity recognition, super sense tagging, and semantic role labeling. 

Vitaliya Remneva: Higher School of Economics, National Research University (Russia); M.Sc, System and Software Engineering. Vitaliya is awarded a copy of ETS Corpus of Non-Native Written English for her work in author profiling through natural language processing.

Tian Xiaoyu: Shanghai International Studies University (China); MA, Linguistics. Tian is awarded a copy of Tagged Chinese Gigaword Version 2.0 for her research in causative construction variations in Mainland Chinese, Taiwan Chinese, and Singapore Chinese. 

W. Victor H. Yarlott: Florida International University (US); Ph.D., School of Computing and Information Sciences. Victor is awarded a copy of ACE 2005 Multilingual Training Corpus for his research in relation extraction.

For information about the program, visit the Data Scholarship page. 

Membership Year 2019 Publication Preview

The 2019 Membership Year is fast approaching and plans for next year’s publications are in progress. Among the expected releases are:

  • SRI Speech-Based Collaborative Learning Corpus: speech from over 100 US middle school students performing collaborative learning tasks, includes audio recordings, orthographic transcriptions, manual annotation of collaboration, and related documentation
  • Chinese Abstract Meaning Representation (AMR): developed by Nanjing Normal University and Brandeis University, semantic representation of approximately 10,000 Chinese sentences following the basic principles of AMR using web source data from Chinese Treebank 8.0 (LDC2013T21)
  • Multilanguage conversational telephone speech: developed to support language identification research in related languages (Arabic, East Asian, English, Mandarin)
  • TAC KBP: English entity discovery and linking, nugget detection and event argument data, Chinese slot-filling data
  • IARPA Babel Language Packs (telephone speech and transcripts): languages include Amharic, Guarani, Igbo, and Lithuanian
  • HAVIC Med Progress Test data: web video, metadata, and annotations for developing multimedia systems
  • BOLT: discussion forums, SMS, word-aligned and tagged data in all languages (Chinese, Egyptian Arabic, English)

Check your inbox in the coming weeks for more information about membership renewal.  

New publications:

(1) Concretely Annotated English Gigaword was developed by Johns Hopkins University's Human Language Technology Center of Excellence. It adds multiple kinds and instances of automatically-generated syntactic, semantic, and coreference annotations to English Gigaword Fifth Edition (LDC2011T07). Concrete is a schema for representing structured, hierarchical, and overlapping linguistic annotations. This release provides multiple tool outputs producing the same annotation types as different annotation theories under a shared tokenization.

Concretely Annotated English Gigaword contains the nearly ten million documents (over four billion words) of the original English Gigaword Fifth Edition, which consists of newswire stories from seven sources collected by LDC between 1994 and 2010.

Concretely Annotated English Gigaword is distributed via hard drive.

2018 Subscription Members will automatically receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Any organization that licensed English Gigaword Fifth Edition (LDC2011T07) or Annotated English Gigaword (LDC2012T21) may request a copy of Concretely Annotated English Gigaword for a media fee. Non-members may license this data for a fee.


*

(2) TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation Data 2009-2014 was developed by LDC and contains training and evaluation data produced in support of the TAC KBP Slot Filling evaluation track conducted from 2009 to 2014.

Text Analysis Conference (TAC) is a series of workshops organized by the National Institute of Standards and Technology (NIST). TAC was developed to encourage research in natural language processing and related applications by providing a large test collection, common evaluation procedures, and a forum for researchers to share their results.

The regular English Slot Filling evaluation track involved mining information about entities from text. In completing the task, participating systems and LDC annotators searched a corpus for information on certain attributes (slots) of person and organization entities and attempted to return all valid answers (slot fillers) in the source collection. For more information about English Slot Filling, please refer to the 2014 track home page.

This release contains queries, the 'manual runs' (human-produced responses to the queries), and the final rounds of assessment results. 

TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation Data 2009-2014 is distributed via web download.

2018 Subscription Members will automatically receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.


*

(3) TRAD Arabic-French Parallel Text -- Newswire was developed by ELDA as part of the PEA-TRAD project. It contains French translations of a subset of approximately 20,000 Arabic words from NIST 2008 Open Machine Translation (OpenMT) Evaluation (LDC2010T21). The purpose of the PEA-TRAD project (Translation as a Support for Document Analysis) was to develop speech-to-speech translation technology for multiple languages (e.g., Arabic, Chinese, Pashto) from a variety of domains.

This release consists of 813 segments (translations units) from 74 documents. The Arabic source file contains 19,902 words and the French reference translation contains 29,104 words.  The source data is Arabic newswire text collected and translated into English by LDC. Information about the ELDA translation team, translation guidelines, and validation results is contained in the documentation accompanying this release.

TRAD Arabic-French Parallel Text -- Newswire is distributed via web download.

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.


Monday, September 17, 2018

LDC 2018 September Newsletter


In this newsletter:

New Publications:



__________________________________________________________________________

New publications:

(1) BOLT Information Retrieval Comprehensive Training and Evaluation was developed by LDC and consists of all data produced in support of the Information Retrieval (IR) task within the DARPA Broad Operational Language Translation (BOLT) Program, including annotations, source documents and scoring software.

The BOLT IR task sought to support development of systems that could take as input a natural language English query sentence, return relevant responses to that query from a large corpus of informal documents in the three BOLT languages (Arabic, Chinese, and English) and translate responses from non-English documents into English. This release contains (1) natural-language IR queries, system responses to queries, and manually-generated assessment judgments for system responses; (2) discussion forum source documents in Arabic, Chinese and English; (3) scoring software for each evaluation phase; and (4) experimental data developed in Phase 2.

BOLT Information Retrieval Comprehensive Training and Evaluation is distributed via web download.

2018 Subscription Members will automatically receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation was developed by LDC and comprises approximately 53 hours of user-generated videos with annotation and metadata. To advance multimodal event detection and related technologies, LDC developed, in collaboration with NIST (the National Institute of Standards and Technology), a large, heterogeneous, annotated multimodal corpus for HAVIC (the Heterogeneous Audio Visual Internet Collection) that was used in the NIST-sponsored MED (Multimedia Event Detection) task for several years. HAVIC MED Event E051-E060 is a subset of that corpus: a collection of event videos for the HAVIC Project, originally released to support the 2016 Multimedia Event Detection task.

The data consists of videos of various events (event videos) and videos completely unrelated to events (background videos) harvested by a large team of human annotators. Each event video was manually annotated with a set of judgments describing its event properties and other salient features. Background videos were labeled with topic and genre categories.

HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation is distributed via web download.

2018 Subscription Members will automatically receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) Multi-Language Conversational Telephone Speech 2011 -- Spanish was developed by LDC and comprises approximately 23 hours of telephone speech in Spanish.

The data were collected primarily to support research and technology evaluation in automatic language identification, and portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation (LRE). Participants were recruited by native speakers who contacted acquaintances in their social network. Each native speaker made one call of up to 15 minutes to each acquaintance. Human auditors labeled the calls for callee gender, dialect type, and noise.

LDC has also released other corpora as part of the Multi-Language Conversational Telephone Speech 2011 series.

Multi-Language Conversational Telephone Speech 2011 -- Spanish is distributed via web download.

2018 Subscription Members will automatically receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(4) IARPA Babel Kazakh Language Pack IARPA-babel302b-v1.0a was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 203 hours of Kazakh conversational and scripted telephone speech collected in 2013 and 2014 along with corresponding transcripts.

The Kazakh speech in this release represents that spoken in the Northeastern and Southern dialect regions of Kazakhstan. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 64 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Kazakh Language Pack IARPA-babel302b-v1.0a is available via web download.

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.