Tuesday, September 17, 2019

LDC 2019 September Newsletter


New Publications:
_____________________________________________________________________

LDC at Interspeech 2019

LDC is exhibiting at Interspeech 2019, September 15-19 in Graz, Austria. Stop by Booth F16 to learn more about recent developments at the Consortium and new publications.

Be on the lookout for The Second DIHARD Speech Diarization Challenge (DIHARD II), a special session co-organized by LDC, and the following presentations featuring LDC work:

The Second DIHARD Diarization Challenge: Dataset, task, and baselines
Neville Ryant, Christopher Cieri, Mark Liberman (LDC), Kenneth Church (Baidu, USA), Alejandrina Cristia (Laboratoire de Sciences Cognitives et Psycholinguistique), Jun Du (University of Science and Technology of China), Sriram Ganapathy (Indian Institute of Science)
Oral Session, Tuesday September 17, 10:00 – 10:20, Hall 3

Automatic Detection of Prosodic Focus in American English
Sunghye Cho and Mark Liberman (LDC), Yong-cheol Lee (Cheongju University)
Poster Session, Wednesday September 18, 16:00 – 18:00, Gallery B

Automatic detection of ASD in children using acoustic and text features from brief natural conversations
Sunghye Cho, Mark Liberman, Neville Ryant (LDC), Meredith Cola, Robert T. Schultz, Julia Parish-Morris (Children's Hospital of Philadelphia)
Oral Session, Wednesday September 18, 16:45 – 17:00, Hall 3

LDC will post conference updates via our Twitter feed and Facebook page. We hope to see you there!  

New publications:

(1) CALLFRIEND Canadian French Second Edition was developed by LDC and consists of approximately 26 hours of unscripted telephone conversations between native speakers of Canadian French. This second edition updates the audio files to wav format, simplifies the directory structure and adds documentation and metadata. The first edition is available as CALLFRIEND Canadian French (LDC96S48).

All data was collected before July 1997. Participants could speak with a person of their choice on any topic; most called family members and friends. All calls originated in North America. The recorded conversations last up to 30 minutes. 

CALLFRIEND Canadian French Second Edition is distributed via web download.  

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) BOLT Chinese-English Word Alignment and Tagging -- SMS/Chat Training was developed by LDC for the DARPA BOLT (Broad Operational Language Translation) program and consists of 388,027 words of Chinese and English parallel text enhanced with linguistic tags to indicate word relations. 

This release consists of Chinese source text and chat conversations collected using two methods: new collection via LDC's collection platform and donation of SMS and chat archives from BOLT collection participants. The source data is released as BOLT Chinese SMS/Chat (LDC2018T15).

The BOLT word alignment task was built on treebank annotation. LDC automatically extracted Chinese source tokens, including empty categories/traces, from word-segmented files provided by the BOLT Chinese Treebank annotation team at Brandeis University. The word-segmented tokens were then used to automatically generate ctb (Chinese Treebank) alignment and were also tokenized for character alignment by inserting white spaces between characters.
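The character tokenization step described above can be illustrated with a minimal sketch. This is not LDC's actual tooling: the function name and the sample tokens are hypothetical, and the sketch assumes each input token is plain text. Empty categories/traces would need to be kept whole in practice, which is not handled here.

```python
def to_character_tokens(word_tokens):
    """Insert a single space between the characters of each word token.

    Hypothetical helper for illustration only; it does not treat
    empty categories/traces specially.
    """
    return [" ".join(token) for token in word_tokens]


# Made-up word-segmented input, not taken from the corpus.
word_tokens = ["我们", "明天", "见"]
char_tokens = to_character_tokens(word_tokens)
print(" ".join(char_tokens))  # prints: 我 们 明 天 见
```

Keeping the output grouped by word token, as the list does here, means the character-level units can still be traced back to the word-segmented tokens they came from.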

BOLT Chinese-English Word Alignment and Tagging -- SMS/Chat Training is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) Machine Reading Phase 1 NFL Scoring Training Data was developed by LDC for use in the DARPA (Defense Advanced Research Projects Agency) Machine Reading program. It contains 110 U.S. NFL (National Football League) scoring source documents and 110 standoff annotation files, manually annotated for instances of NFL Scoring annotation categories defined with respect to an NFL Scoring ontology.

The Machine Reading program aimed to develop automated reading systems to bridge the gap between knowledge contained in natural language texts and knowledge accessible to formal reasoning systems. The reading systems designed by program participants were required to extract and reason about facts from text in multiple domains.

The data in this release constitutes the training data for the NFL Scoring Use Cases evaluation, which tested the sports domain by extracting information about scoring events and game outcomes and aligning that information with an NFL Scoring ontology.

Machine Reading Phase 1 NFL Scoring Training Data is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Thursday, August 15, 2019

LDC 2019 August Newsletter

Fall 2019 LDC Data Scholarship Program 

New Publications:

TAC KBP Evaluation Source Corpora 2016-2017 
__________________________________________________________________ 

Fall 2019 LDC Data Scholarship Program 

Students can apply for the Fall 2019 LDC Data Scholarship program now through September 15, 2019. This scholarship program provides eligible students with access to LDC data at no cost. For application requirements and program rules, please visit the LDC Data Scholarship page. 


New publications: 

(1) Corpus of Conversational Persian Transcripts contains transcripts from approximately 20 hours of naturally occurring informal conversations in the Tehrani dialect of Iranian Persian.

This data set is extracted from 1,201 minutes of conversations among 22 participants (12 male and 10 female) who recorded their daily phone calls and face-to-face interactions in a variety of informal settings. Conversations represent various interaction types (dialogue and group conversation), settings (home, office, car, café and restaurant), types of relationship (family, couple, friend, acquaintance), and various communicative goals (joking, explaining, arguing, and complaining, among others). The corresponding speech is not included in this release.

The transcripts were annotated for gender, age, and recording method and setting.

Corpus of Conversational Persian Transcripts is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) TAC KBP Evaluation Source Corpora 2016-2017 was developed by LDC and contains the 180,003 Chinese, English and Spanish source documents used in support of all TAC KBP evaluation tracks conducted in 2016 and 2017.

The source data consists of Chinese, English and Spanish discussion forum and newswire text collected by LDC. Also provided are a series of lists and tables to aid in the recreation of specific test sets.

The Text Analysis Conference (TAC) is a series of workshops organized by the National Institute of Standards and Technology (NIST), developed to encourage research in natural language processing and related applications. The Knowledge Base Population (KBP) track of TAC encourages the development of systems that can match entities mentioned in natural texts with those appearing in a knowledge base, extract novel information about entities from a document collection, and add it to a new or existing knowledge base.

TAC KBP Evaluation Source Corpora 2016-2017 is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) Multi-Language Conversational Telephone Speech 2011 -- East Asian was developed by LDC and consists of approximately 19 hours of telephone speech in two distinct languages of East Asia: Thai and Lao.

The data were collected primarily to support research and technology evaluation in automatic language identification, and portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation (LRE). Participants were recruited by native speakers who contacted acquaintances in their social network. Those native speakers made one call, up to 15 minutes, to each acquaintance. Calls are labeled by human auditors for callee gender, dialect type, and noise.  

LDC has also released other corpora as part of the Multi-Language Conversational Telephone Speech 2011 series.

Multi-Language Conversational Telephone Speech 2011 -- East Asian is distributed via web download. 

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(4) IARPA Babel Igbo Language Pack IARPA-babel306b-v2.0c was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 207 hours of Igbo conversational and scripted telephone speech collected in 2014 and 2015 along with corresponding transcripts.

The Igbo speech in this release represents the Owerri, Onitsha, and Ngwa dialects spoken in Nigeria. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 67 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Igbo Language Pack IARPA-babel306b-v2.0c is distributed via web download.

2019 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

Monday, July 15, 2019

LDC 2019 July Newsletter

In this newsletter:

Fall 2019 LDC Data Scholarship Program
 

LDC data and commercial technology development

New Publications:  

The DKU-JNU-EMA Electromagnetic Articulography Database
Phrase Detectives Corpus Version 2
First DIHARD Challenge Evaluation - Nine Sources
First DIHARD Challenge Evaluation – SEEDLingS
__________________________________________________________
 

Fall 2019 LDC Data Scholarship Program

Student applications for the Fall 2019 LDC Data Scholarship program are being accepted now through September 15, 2019. This scholarship program provides eligible students with access to LDC data at no cost. Students must complete an application consisting of a data use proposal and letter of support from their advisor.

For application requirements and program rules, please visit the LDC Data Scholarship page.

LDC data and commercial technology development

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.
 
__________________________________________________________

New publications:

(1) The DKU-JNU-EMA Electromagnetic Articulography Database was developed by Duke Kunshan University and Jinan University and contains approximately 10 hours of articulography and speech data in Mandarin, Cantonese, Hakka, and Teochew Chinese from two to seven native speakers for each dialect.

Articulatory measurements were made using the NDI electromagnetic articulography wave research system to capture real-time vocal tract variable trajectories. Six sensors were placed at various locations in each subject's mouth, and one reference sensor was placed on the bridge of the nose. For simultaneous recording of speech signals, subjects also wore a head-mounted close-talk microphone.

Speakers took part in four types of recording sessions: one in which they read complete sentences or short texts, and three in which they read sets of related words sharing a specific consonant, vowel or tone.

DKU-JNU-EMA Electromagnetic Articulography Database is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $1000.

*

(2) Phrase Detectives Corpus Version 2 was developed by the School of Computer Science and Electronic Engineering at the University of Essex and consists of approximately 407,000 tokens across 537 documents anaphorically annotated through the Phrase Detectives Game, an online interactive "game-with-a-purpose" (GWAP) designed to collect data about English anaphoric coreference.

This release constitutes a new version of the Phrase Detectives Corpus (LDC2017T08), adding significantly more annotated tokens to the data set and supplying players’ judgments and a silver label annotation based on the probabilistic aggregation method for anaphoric information for each markable.

The documents in the corpus are taken from Wikipedia articles and from narrative text in Project Gutenberg. The annotation is a simplified form of the coding scheme used in The ARRAU Corpus of Anaphoric Information (LDC2013T22).

Phrase Detectives Corpus Version 2 is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data at no cost.
 
*

(3) First DIHARD Challenge Evaluation - Nine Sources was developed by LDC and contains approximately 18 hours of English and Chinese speech data along with corresponding annotations used in support of the First DIHARD Challenge.

The First DIHARD Challenge was an attempt to reinvigorate work on diarization through a shared task focusing on "hard" diarization; that is, speech diarization for challenging corpora where there was an expectation that existing state-of-the-art systems would fare poorly. As such, it included speech from a wide sampling of domains representing diversity in number of speakers, speaker demographics, interaction style, recording quality, and environmental conditions as follows (all sources are in English unless otherwise indicated):
 
  • Autism Diagnostic Observation Schedule (ADOS) interviews
  • Conversations in Restaurants
  • DCIEM/HCRC map task (LDC96S38)
  • Audiobook recordings from LibriVox
  • Meeting speech collected by LDC in 2001 for the ROAR project (see, e.g., ISL Meeting Speech Part 1 (LDC2004S05))
  • 2001 U.S. Supreme Court oral arguments
  • Mixer 6 Speech (LDC2013S02)
  • Chinese video collected by LDC as part of the Video Annotation for Speech Technologies (VAST) project
  • YouthPoint radio interviews

This release, when combined with First DIHARD Challenge Evaluation - SEEDLingS (LDC2019S13), contains the evaluation set audio data and annotation as well as the official scoring tool. The development data for the First DIHARD Challenge is also available from LDC as First DIHARD Challenge Development - Eight Sources (LDC2019S09) and First DIHARD Challenge Development - SEEDLingS (LDC2019S10).

First DIHARD Challenge Evaluation - Nine Sources is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $300.

* 

(4) First DIHARD Challenge Evaluation – SEEDLingS was developed by Duke University and LDC and contains approximately two hours of English child language recordings along with corresponding annotations used in support of the First DIHARD Challenge.

The source data was drawn from the SEEDLingS (The Study of Environmental Effects on Developing Linguistic Skills) corpus, designed to investigate how infants' early linguistic and environmental input plays a role in their learning. Recordings for SEEDLingS were generated in the home environment of 44 infants from 6-18 months of age in the Rochester, New York area. A subset of that data was annotated by LDC for use in the First DIHARD Challenge.

This release, when combined with First DIHARD Challenge Evaluation - Nine Sources (LDC2019S12), contains the evaluation set audio data and annotation as well as the official scoring tool. The development data for the First DIHARD Challenge is also available from LDC as First DIHARD Challenge Development - Eight Sources (LDC2019S09) and First DIHARD Challenge Development - SEEDLingS (LDC2019S10).

First DIHARD Challenge Evaluation – SEEDLingS is distributed via web download.

2019 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $50.

*

Monday, June 17, 2019

LDC 2019 June Newsletter

In this newsletter:

New Publications:
USC-SFI MALACH Interviews and Transcripts English – Speech Recognition Edition
First DIHARD Challenge Development - Eight Sources
_____________________________________________________________________

New publications:  

(1) DEFT Spanish Committed Belief Annotation was developed by LDC and consists of approximately 67,000 tokens of Spanish discussion forum text annotated for "committed belief," which marks the level of commitment displayed by the author to the truth of the propositions expressed in the text.

DARPA's Deep Exploration and Filtering of Text (DEFT) program aimed to address remaining capability gaps in state-of-the-art natural language processing technologies related to inference, causal relationships and anomaly detection. LDC supported the DEFT program by collecting, creating and annotating a variety of data sources. 

DEFT Spanish Committed Belief Annotation is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) USC-SFI MALACH Interviews and Transcripts English – Speech Recognition Edition was developed by IBM as part of the MALACH (Multilingual Access to Large Spoken ArCHives) Project and contains approximately 168 hours of interviews from 682 Holocaust witnesses along with transcripts, a lexicon and other documentation. This release augments USC-SFI MALACH Interviews and Transcripts English (LDC2012S05) by modifying and updating a subset of the original corpus for use with speech recognition systems, such as the Kaldi toolkit.

Specifically, the audio data has been converted from unsegmented mpeg files to segmented files in the compressed flac format. The speaker-turn, time-stamped transcripts have been updated to an utterance-by-utterance format. A lexicon mapping words to phonemes is provided, and the data is divided into development and training sets.

The goal of the MALACH project was to develop methods for improved access to large multinational spoken archives in order to advance the state of the art of automatic speech recognition and information retrieval. The characteristics of the USC-SFI collection -- unconstrained, natural speech filled with disfluencies, heavy accents, age-related coarticulations, un-cued speaker and language switching and emotional speech -- were considered well-suited for that task. 

USC-SFI MALACH Interviews and Transcripts English – Speech Recognition Edition is distributed via web download. 

2019 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data at no cost.

*

(3) First DIHARD Challenge Development - Eight Sources was developed by LDC and contains approximately 17 hours of English and Chinese speech data along with corresponding annotations used in support of the First DIHARD Challenge. This release, when combined with First DIHARD Challenge Development - SEEDLingS (LDC2019S10), contains the development set audio data and annotation (diarization, segmentation) as well as the official scoring tool.


The First DIHARD Challenge was an attempt to reinvigorate work on diarization through a shared task focusing on "hard" diarization; that is, speech diarization for challenging corpora where there was an expectation that existing state-of-the-art systems would fare poorly. As such, it included speech from a wide sampling of domains representing diversity in number of speakers, speaker demographics, interaction style, recording quality, and environmental conditions as follows (all sources are in English unless otherwise indicated):
  • Autism Diagnostic Observation Schedule (ADOS) interviews
  • DCIEM/HCRC map task (LDC96S38)
  • Audiobook recordings from LibriVox
  • Meeting speech from 2004 Spring NIST Rich Transcription (RT-04S) Development (LDC2007S11) and Evaluation (LDC2007S12) releases
  • 2001 U.S. Supreme Court oral arguments
  • Sociolinguistic interviews from SLX Corpus of Classic Sociolinguistic Interviews (LDC2003T15)
  • Chinese video collected by LDC as part of the Video Annotation for Speech Technologies (VAST) project
  • YouthPoint radio interviews

First DIHARD Challenge Development - Eight Sources is distributed via web download. 

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(4) First DIHARD Challenge Development - SEEDLingS was developed by Duke University and LDC and contains approximately two hours of English child language recordings along with corresponding annotations used in support of the First DIHARD Challenge. This release, when combined with First DIHARD Challenge Development - Eight Sources (LDC2019S09), contains the development set audio data and annotation (diarization, segmentation) as well as the official scoring tool.

The source data was drawn from the SEEDLingS (The Study of Environmental Effects on Developing Linguistic Skills) corpus, designed to investigate how infants' early linguistic and environmental input plays a role in their learning. Recordings for SEEDLingS were generated in the home environment of 44 infants from 6-18 months of age in the Rochester, New York area. A subset of that data was annotated by LDC for use in the First DIHARD Challenge.

The First DIHARD Challenge was an attempt to reinvigorate work on diarization through a shared task focusing on "hard" diarization; that is, speech diarization for challenging corpora where there was an expectation that existing state-of-the-art systems would fare poorly. As such, it included speech from a wide sampling of domains representing diversity in number of speakers, speaker demographics, interaction style, recording quality, and environmental conditions.

First DIHARD Challenge Development – SEEDLingS is distributed via web download.

2019 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

Monday, May 20, 2019

LDC 2019 May Newsletter

New Publications:
Multi-Language Conversational Telephone Speech 2011 -- English Group  
IARPA Babel Guarani Language Pack IARPA-babel305b-v1.0c
__________________________________________________________

New publications: 

(1) Multi-Language Conversational Telephone Speech 2011 -- English Group was developed by LDC and consists of approximately 18 hours of telephone speech in two general varieties of English: American and South Asian.

The data were collected primarily to support research and technology evaluation in automatic language identification, and portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation (LRE). Participants were recruited by native speakers who contacted acquaintances in their social network. Those native speakers made one call, up to 15 minutes, to each acquaintance. Calls are labeled by human auditors for callee gender, dialect type, and noise.   

LDC has also released other corpora as part of the Multi-Language Conversational Telephone Speech 2011 series.

Multi-Language Conversational Telephone Speech 2011 -- English Group is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) TAC KBP Chinese Regular Slot Filling - Comprehensive Training and Evaluation Data 2014 was developed by LDC and contains training and evaluation data produced in support of the TAC KBP Chinese Regular Slot Filling evaluation track conducted in 2014. This release includes queries, the 'manual runs' (human-produced responses to the queries), the final rounds of assessment results, and the complete set of Chinese source documents.

The regular Chinese Slot Filling evaluation track involved mining information about entities from text. In completing the task, participating systems and LDC annotators searched a corpus for information on certain attributes (slots) of person and organization entities and attempted to return all valid answers (slot fillers) in the source collection.

TAC KBP Chinese Regular Slot Filling - Comprehensive Training and Evaluation Data 2014 is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) CIEMPIESS Experimentation (Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social) was developed by the Facultad de Ingeniería at the National Autonomous University of Mexico (UNAM) and consists of approximately 22 hours of Mexican Spanish broadcast and read speech with associated transcripts. The goal of this work was to create acoustic models for automatic speech recognition. For more information and documentation see the CIEMPIESS-UNAM Project website.

CIEMPIESS Experimentation comprises three data sets: Complementary, Fem, and Test. Complementary is a phonetically-balanced corpus of isolated Spanish words spoken in Central Mexico. Fem contains broadcast speech from 21 female speakers, collected to provide gender balance against the recordings from male speakers in other CIEMPIESS collections. Test consists of 10 hours of broadcast speech and transcripts and is intended for use as a standard test data set alongside other CIEMPIESS corpora.

Most of the speech recordings in Fem and Test were collected from Radio-IUS, a UNAM radio station. Other recordings were taken from IUS Canal Multimedia and Centro Universitario de Estudios Jurídicos (CUEJ UNAM). Those two channels feature videos with speech about legal issues and topics related to UNAM. The Complementary recordings consist of read speech collected for that corpus.

LDC has also released other data sets in the CIEMPIESS series.

CIEMPIESS Experimentation is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data at no cost.

*

(4) IARPA Babel Guarani Language Pack IARPA-babel305b-v1.0c was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. This corpus contains approximately 198 hours of Guarani conversational and scripted telephone speech collected in 2014 and 2015 along with corresponding transcripts.

The Guarani speech in this release represents that spoken in Paraguay. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 67 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Guarani Language Pack IARPA-babel305b-v1.0c is distributed via web download.

2019 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

Monday, April 15, 2019

LDC 2019 April Newsletter

LDC at ICASSP 2019

LDC data and commercial technology development


New Publications:
BOLT Egyptian-English Word Alignment -- Discussion Forum Training
Chinese Abstract Meaning Representation 1.0
HAVIC MED Progress Test -- Videos, Metadata and Annotation
____________________________________________________________

LDC at ICASSP 2019

LDC will be exhibiting at ICASSP 2019, held this year May 12-17 in Brighton, UK. Stop by booth 5 to learn more about recent developments at the Consortium and new publications.

LDC will post conference updates via our Twitter feed and Facebook page. We hope to see you there!

LDC data and commercial technology development

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

(1) BOLT Egyptian-English Word Alignment -- Discussion Forum Training was developed by LDC and consists of 400,448 words of Egyptian Arabic and English parallel text enhanced with linguistic tags to indicate word relations.

The source data in this release consists of discussion forum threads harvested from the Internet by LDC using a combination of manual and automatic processes and is released as BOLT Arabic Discussion Forums (LDC2018T10).

The BOLT word alignment task was built on treebank annotation. Egyptian source tree tokens for word alignment were automatically extracted from tree files of BOLT Egyptian Arabic Treebank annotation on the discussion forum data. Human annotators then followed LDC guidelines to link words and phrases in Arabic to those in English.

BOLT Egyptian-English Word Alignment -- Discussion Forum Training is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.


*

(2) Chinese Abstract Meaning Representation 1.0 was developed by Brandeis University and Nanjing Normal University and consists of semantic representations of a set of Chinese sentences from the weblog and discussion forum portions of Chinese Treebank 8.0 (LDC2013T21). Annotations were applied to 10,149 sentences, with 176 sentences unannotated.

Abstract Meaning Representation (AMR) captures "who is doing what to whom" in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree structure. Chinese AMR is based on the annotation methodology developed for English with adaptations for handling specific Chinese phenomena. The goal of the Chinese AMR project is to create a large aligned AMR corpus, of which this data set is the first release. For more information about the project, see the Chinese AMR homepage.

Chinese Abstract Meaning Representation 1.0 is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.


*

(3) HAVIC MED Progress Test -- Videos, Metadata and Annotation was developed by LDC and consists of approximately 3,650 hours of user-generated videos with annotation and metadata.

In a collaboration with NIST (the National Institute of Standards and Technology) to advance multimodal event detection and related technologies, LDC developed a large, heterogeneous, annotated multimodal corpus for HAVIC (the Heterogeneous Audio Visual Internet Collection) that was used in the NIST-sponsored MED (Multimedia Event Detection) task for several years. HAVIC MED Progress Test is a subset of that corpus, specifically, a collection of event and background videos originally released to support the 2012-2015 MED tasks.

This release consists of videos of various events (event videos) and videos completely unrelated to events (background videos) harvested by a large team of human annotators. Each event video was manually annotated with a set of judgments describing its event properties and other salient features. Background videos were labeled with topic and genre categories.

HAVIC MED Progress Test -- Videos, Metadata and Annotation is distributed via hard drive.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. This corpus is a members-only release and is not available for non-member licensing. Contact ldc@ldc.upenn.edu for information about membership.