Monday, October 15, 2018

LDC 2018 October Newsletter

In this newsletter: 

Fall 2018 LDC Data Scholarship Recipients
Membership Year 2019 Publication Preview

New Publications:
Concretely Annotated English Gigaword
TRAD Arabic-French Parallel Text -- Newswire
TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation Data 2009-2014
__________________________________________________________________________

Fall 2018 LDC Data Scholarship Recipients

Congratulations to the recipients of LDC's Fall 2018 Data Scholarships:

Utkrist Adhikari: University of Bonn (Germany); M.Sc, Computer Science. Utkrist is awarded a copy of Treebank-2 for his research in named entity recognition, super sense tagging, and semantic role labeling. 

Vitaliya Remneva: Higher School of Economics, National Research University (Russia); M.Sc, System and Software Engineering. Vitaliya is awarded a copy of ETS Corpus of Non-Native Written English for her work in author profiling through natural language processing.

Tian Xiaoyu: Shanghai International Studies University (China); MA, Linguistics. Tian is awarded a copy of Tagged Chinese Gigaword Version 2.0 for her research in causative construction variations in Mainland Chinese, Taiwan Chinese, and Singapore Chinese. 

W. Victor H. Yarlott: Florida International University (US); Ph.D., School of Computing and Information Sciences. Victor is awarded a copy of ACE2005 Multilingual Training Corpus for his research in relation extraction. 

For information about the program, visit the Data Scholarship page. 

Membership Year 2019 Publication Preview

The 2019 Membership Year is fast approaching and plans for next year’s publications are in progress. Among the expected releases are:

SRI Speech-Based Collaborative Learning Corpus: speech from over 100 US middle school students performing collaborative learning tasks, includes audio recordings, orthographic transcriptions, manual annotation of collaboration, and related documentation
Chinese Abstract Meaning Representation (AMR): developed by Nanjing Normal University and Brandeis University, semantic representation of approximately 10,000 Chinese sentences following the basic principles of AMR using web source data from Chinese Treebank 8.0 (LDC2013T21)
Multilanguage conversational telephone speech: developed to support language identification research in related languages (Arabic, East Asian, English, Mandarin)
TAC KBP: English entity discovery and linking, nugget detection and event argument data, Chinese slot-filling data
IARPA Babel Language Packs (telephone speech and transcripts): languages include Amharic, Guarani, Igbo, and Lithuanian
HAVIC Med Progress Test data: web video, metadata, and annotations for developing multimedia systems
BOLT: discussion forums, SMS, word-aligned and tagged data in all languages (Chinese, Egyptian Arabic, English)

Check your inbox in the coming weeks for more information about membership renewal.  

New publications:

(1) Concretely Annotated English Gigaword was developed by Johns Hopkins University's Human Language Technology Center of Excellence. It adds multiple kinds and instances of automatically generated syntactic, semantic, and coreference annotations to English Gigaword Fifth Edition (LDC2011T07). Concrete is a schema for representing structured, hierarchical, and overlapping linguistic annotations. This release provides the outputs of multiple tools that produce the same annotation types, represented as different annotation theories under a shared tokenization.

Concretely Annotated English Gigaword contains the nearly ten million documents (over four billion words) of the original English Gigaword Fifth Edition, which consists of newswire stories from seven sources collected by LDC between 1994 and 2010.

Concretely Annotated English Gigaword is distributed via hard drive.

2018 Subscription Members will automatically receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Any organization that licensed English Gigaword Fifth Edition (LDC2011T07) or Annotated English Gigaword (LDC2012T21) may request a copy of Concretely Annotated English Gigaword for a media fee. Non-members may license this data for a fee.


*

(2) TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation Data 2009-2014 was developed by LDC and contains training and evaluation data produced in support of the TAC KBP Slot Filling evaluation track conducted from 2009 to 2014.

Text Analysis Conference (TAC) is a series of workshops organized by the National Institute of Standards and Technology (NIST). TAC was developed to encourage research in natural language processing and related applications by providing a large test collection, common evaluation procedures, and a forum for researchers to share their results.

The regular English Slot Filling evaluation track involved mining information about entities from text. In completing the task, participating systems and LDC annotators searched a corpus for information on certain attributes (slots) of person and organization entities and attempted to return all valid answers (slot fillers) in the source collection. For more information about English Slot Filling, please refer to the 2014 track home page.

This release contains queries, the 'manual runs' (human-produced responses to the queries), and the final rounds of assessment results. 

TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation Data 2009-2014 is distributed via web download.

2018 Subscription Members will automatically receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.


*

(3) TRAD Arabic-French Parallel Text -- Newswire was developed by ELDA as part of the PEA-TRAD project. It contains French translations of a subset of approximately 20,000 Arabic words from NIST 2008 Open Machine Translation (OpenMT) Evaluation (LDC2010T21). The purpose of the PEA-TRAD project (Translation as a Support for Document Analysis) was to develop speech-to-speech translation technology for multiple languages (e.g., Arabic, Chinese, Pashto) from a variety of domains.

This release consists of 813 segments (translation units) from 74 documents. The Arabic source file contains 19,902 words and the French reference translation contains 29,104 words. The source data is Arabic newswire text collected and translated into English by LDC. Information about the ELDA translation team, translation guidelines, and validation results is contained in the documentation accompanying this release.

TRAD Arabic-French Parallel Text -- Newswire is distributed via web download.

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.


Monday, September 17, 2018

LDC 2018 September Newsletter


In this newsletter:

New Publications:



__________________________________________________________________________

New publications:

(1) BOLT Information Retrieval Comprehensive Training and Evaluation was developed by LDC and consists of all data produced in support of the Information Retrieval (IR) task within the DARPA Broad Operational Language Translation (BOLT) Program, including annotations, source documents and scoring software.

The BOLT IR task sought to support development of systems that could take as input a natural language English query sentence, return relevant responses to that query from a large corpus of informal documents in the three BOLT languages (Arabic, Chinese, and English) and translate responses from non-English documents into English. This release contains (1) natural-language IR queries, system responses to queries, and manually-generated assessment judgments for system responses; (2) discussion forum source documents in Arabic, Chinese and English; (3) scoring software for each evaluation phase; and (4) experimental data developed in Phase 2.

BOLT Information Retrieval Comprehensive Training and Evaluation is distributed via web download.

2018 Subscription Members will automatically receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation was developed by LDC and is comprised of approximately 53 hours of user-generated videos with annotation and metadata. To advance multimodal event detection and related technologies, LDC developed, in collaboration with NIST (the National Institute of Standards and Technology), a large, heterogeneous, annotated multimodal corpus for HAVIC (the Heterogeneous Audio Visual Internet Collection) that was used in the NIST-sponsored MED (Multimedia Event Detection) task for several years. HAVIC MED Event E051-E060 is a subset of that corpus, specifically, a collection of event videos for the HAVIC Project originally released to support the 2016 Multimedia Event Detection task.

The data consists of videos of various events (event videos) and videos completely unrelated to events (background videos) harvested by a large team of human annotators. Each event video was manually annotated with a set of judgments describing its event properties and other salient features. Background videos were labeled with topic and genre categories.

HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation is distributed via web download.

2018 Subscription Members will automatically receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) Multi-Language Conversational Telephone Speech 2011 -- Spanish was developed by LDC and is comprised of approximately 23 hours of telephone speech in Spanish.

The data were collected primarily to support research and technology evaluation in automatic language identification, and portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation (LRE). Participants were recruited by native speakers who contacted acquaintances in their social network. Those native speakers made one call, up to 15 minutes, to each acquaintance. Human auditors labeled the calls for callee gender, dialect type, and noise.

LDC has also released the following as part of the Multi-Language Conversational Telephone Speech 2011 series:

Multi-Language Conversational Telephone Speech 2011 -- Spanish is distributed via web download.

2018 Subscription Members will automatically receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(4) IARPA Babel Kazakh Language Pack IARPA-babel302b-v1.0a was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 203 hours of Kazakh conversational and scripted telephone speech collected in 2013 and 2014 along with corresponding transcripts.

The Kazakh speech in this release represents that spoken in the Northeastern and Southern dialect regions of Kazakhstan. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 64 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Kazakh Language Pack IARPA-babel302b-v1.0a is available via web download.

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.


Thursday, August 16, 2018

LDC 2018 August Newsletter

LDC at Interspeech 2018

Fall 2018 LDC Data Scholarship Program

New Publications:
_______________________________________________________________________

LDC at Interspeech 2018
LDC will participate in various ways at Interspeech 2018, held this year in Hyderabad, India, September 2-6. It is co-organizing the special session, The First DIHARD Speech Diarization Challenge, on September 3 and is a sponsor of the September 1 pre-conference workshop, Young Female Researchers in Speech Science & Technology (YFRSW). Results of recent work will be presented during the poster session on September 3, “Global TIMIT: Acoustic Phonetic Datasets for the World’s Languages.”

Fall 2018 LDC Data Scholarship Program
Students can apply for the Fall 2018 Data Scholarship Program now through September 15, 2018. The LDC Data Scholarship program provides students with access to LDC data at no cost. For more information on application requirements and program rules, please visit LDC Data Scholarships.

New publications:
(1) BOLT English SMS/Chat was developed by LDC and consists of naturally-occurring Short Message Service (SMS) and Chat (CHT) data collected through data donations and live collection from native English speakers. The corpus contains 18,429 conversations totaling 3,674,802 words across 375,967 messages.

The BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging, and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.

BOLT English SMS/Chat is available via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
(2) CIEMPIESS Balance (Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social) was developed by the Development of Speech Technologies program at the School of Engineering at the National Autonomous University of Mexico (UNAM) and consists of approximately 18 hours of Mexican Spanish broadcast speech with associated transcripts. The goal of this work was to create acoustic models for automatic speech recognition. For more information and documentation see the CIEMPIESS-UNAM Project website.

CIEMPIESS Balance is a companion corpus to CIEMPIESS Light, released by LDC as LDC2017S23. It was developed so that the data sets together constitute a gender-balanced corpus. The gender breakdown in CIEMPIESS Light is approximately 75% male and 25% female. In CIEMPIESS Balance, the gender breakdown is approximately 25% male and 75% female.
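
The balancing claim can be checked with a weighted average. The sketch below is illustrative only: it assumes the two corpora are the same size (this newsletter states only that CIEMPIESS Balance is approximately 18 hours) and that the percentages are measured in the same unit as the sizes. The function name is hypothetical.

```python
def combined_male_share(size_light, size_balance,
                        male_light=0.75, male_balance=0.25):
    """Male share of the two corpora combined.

    Sizes may be in any single unit (hours, utterances, ...).
    The default shares are the approximate figures quoted above.
    """
    total = size_light + size_balance
    return (male_light * size_light + male_balance * size_balance) / total


# Assumption: both corpora are about 18 hours.
print(combined_male_share(18, 18))  # 0.5
```

If the two corpora differ in size, the combined share shifts toward the larger one, so the 50/50 result holds only under the equal-size assumption.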

The majority of the speech recordings were collected from Radio-IUS, a UNAM radio station. Other recordings were taken from IUS Canal Multimedia and Centro Universitario de Estudios Jurídicos (CUEJ UNAM). These two channels feature videos with speech around legal issues and topics related to UNAM.

CIEMPIESS Balance is available via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data at no cost.
*
(3) 2011 NIST Language Recognition Evaluation Test Set contains selected training data and the evaluation test set for the 2011 NIST Language Recognition Evaluation. It consists of approximately 204 hours of conversational telephone speech and broadcast audio collected by LDC between 2009 and 2011 in the following 24 languages and dialects: Arabic (Iraqi), Arabic (Levantine), Arabic (Maghrebi), Arabic (Standard), Bengali, Czech, Dari, English (American), English (Indian), Farsi, Hindi, Lao, Mandarin, Panjabi, Pashto, Polish, Russian, Slovak, Spanish, Tamil, Thai, Turkish, Ukrainian, and Urdu.

The 2011 evaluation emphasized the language pair condition and involved both conversational telephone speech (CTS) and broadcast narrow-band speech (BNBS).

This release includes training data for nine language varieties that had not been represented in prior LRE cycles -- Arabic (Iraqi), Arabic (Levantine), Arabic (Maghrebi), Arabic (Standard), Czech, Lao, Panjabi, Polish and Slovak -- contained in 893 audited segments of roughly 30 seconds duration and in 400 full-length CTS recordings. The evaluation test set comprises a total of 29,511 audio files, all manually audited at LDC for language and divided equally into three different test conditions according to the nominal amount of speech content per segment.

LDC released the prior LREs as:
  • 2003 NIST Language Recognition Evaluation (LDC2006S31)
  • 2005 NIST Language Recognition Evaluation (LDC2008S05)
  • 2007 NIST Language Recognition Evaluation Test Set (LDC2009S04)
  • 2007 NIST Language Recognition Evaluation Supplemental Training Set (LDC2009S05)
  • 2009 NIST Language Recognition Evaluation Test Set (LDC2014S06)

2011 NIST Language Recognition Evaluation Test Set is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Monday, July 16, 2018

LDC 2018 July Newsletter


Fall 2018 Data Scholarship Program

New Publications:
_________________________________________________________________________

Fall 2018 LDC Data Scholarship Program

Student applications for the Fall 2018 LDC Data Scholarship program are being accepted now through September 15, 2018. This scholarship program provides university students with access to LDC data at no cost. Students must complete an application which consists of a data use proposal and letter of support from their advisor.

For application requirements and program rules, please visit the LDC Data Scholarship page.

New publications:

(1) CALLFRIEND Mandarin Chinese-Mainland Dialect Second Edition was developed by LDC and consists of approximately 24 hours of unscripted telephone conversations between native speakers of the Mandarin Chinese dialect spoken in mainland China. This second edition updates the audio files to wav format, simplifies the directory structure and adds documentation and metadata. The first edition is available as CALLFRIEND Mandarin Chinese-Mainland Dialect (LDC96S55).

All data was collected before July 1997. Participants could speak with a person of their choice on any topic; most called family members and friends. All calls originated in North America. The recorded conversations last up to 30 minutes.

CALLFRIEND Mandarin Chinese-Mainland Dialect Second Edition is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) RATS Language Identification was developed by LDC and is comprised of approximately 5,400 hours of Levantine Arabic, Farsi, Dari, Pashto and Urdu conversational telephone speech with annotation of speech segments. The corpus was created to provide training, development and initial test sets for the Language Identification (LID) task in the DARPA RATS (Robust Automatic Transcription of Speech) program.

The source audio consists of: (1) conversational telephone speech (CTS) recordings, taken either from previous LDC CTS corpora or from CTS data collected specifically for the RATS program from Levantine Arabic, Pashto, Urdu, Farsi and Dari native speakers; and (2) portions of VOA broadcast news recordings, taken from data used in the 2009 NIST Language Recognition Evaluation. The 2009 LRE Test Set is available from LDC as LDC2014S06.

CTS recordings were audited by annotators who listened to short segments and determined whether the audio was in the target language. Annotations on the audio files include start time, end time, speech activity detection (SAD) label, SAD provenance, language ID and LID provenance.

RATS Language Identification is distributed via hard drive.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) TRAD Chinese-French Parallel Text -- Broadcast News was developed by ELDA as part of the PEA-TRAD project. It contains French translations of a subset of approximately 30,000 Chinese characters from GALE Phase 1 Chinese Broadcast News Parallel Text - Part 3 (LDC2008T18). The purpose of the PEA-TRAD project (Translation as a Support for Document Analysis) was to develop speech-to-speech translation technology for multiple languages (e.g., Arabic, Chinese, Pashto) from a variety of domains.

This release consists of 977 segments (translation units) from 139 documents. The Chinese source file contains 33,571 characters and the French reference translation contains 22,424 words.  The source data is Chinese broadcast news collected and translated into English by LDC for the DARPA GALE (Global Autonomous Language Exploitation) program. Information about the ELDA translation team, translation guidelines and validation results is contained in the documentation accompanying this release.

TRAD Chinese-French Parallel Text -- Broadcast News is distributed via web download.

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Monday, June 18, 2018

LDC 2018 June Newsletter

LDC Catalog certified as CoreTrustSeal data repository

LDC data and commercial technology development

New Publications:
IARPA Babel Cebuano Language Pack IARPA-babel301b-v2.0b
__________________________________________________________________________

LDC Catalog certified as CoreTrustSeal data repository
LDC is pleased to announce that the Catalog has been awarded the CoreTrustSeal for recognition as a trustworthy data repository. This means that the Catalog meets a series of standards covering data access, rights management, curation, and storage developed by the ICSU World Data System and the Data Seal of Approval. LDC joins the other 136 certified repositories around the globe in the commitment to promote sustainable and trustworthy data infrastructures.

LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

(1) BOLT Chinese SMS/Chat was developed by LDC and consists of naturally-occurring Short Message Service (SMS) and Chat (CHT) data collected through data donations and live collection involving native speakers of Chinese. The corpus contains 14,877 conversations totaling 3,005,810 words across 497,543 messages.

The BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources – discussion forums, text messaging, and chat – in Chinese, Egyptian Arabic, and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking, and co-reference. The data in this release was collected using two methods: new collection via LDC's collection platform, and donation of SMS or chat archives from BOLT collection participants.

BOLT Chinese SMS/Chat is distributed via web download.

2018 Subscription Members will automatically receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) Multi-Language Conversational Telephone Speech 2011 -- Central European was developed by LDC and is comprised of approximately 44 hours of telephone speech in two distinct language varieties of Central Europe: Czech and Slovak.

The data were collected primarily to support research and technology evaluation in automatic language identification, and portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation (LRE). Participants were recruited by native speakers who contacted acquaintances in their social network. Those native speakers made one call, up to 15 minutes, to each acquaintance. Human auditors labeled the calls for callee gender, dialect type, and noise.

LDC has also released the following as part of the Multi-Language Conversational Telephone Speech 2011 series:
  • Slavic Group (LDC2016S11)
  • Turkish (LDC2017S09)
  • South Asian (LDC2017S14)
  • Central Asian (LDC2018S03)

Multi-Language Conversational Telephone Speech 2011 -- Central European is distributed via web download.

2018 Subscription Members will automatically receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) TAC KBP English Entity Linking - Comprehensive Training and Evaluation Data 2009-2013 was developed by LDC and contains training and evaluation data produced in support of the TAC KBP English Entity Linking tasks in 2009, 2010, 2011, 2012, and 2013. It includes queries and gold standard entity type information, Knowledge Base links, and equivalence class clusters for NIL entities. Also included are the source documents for the queries, specifically, English newswire, discussion forum, and web data. The corresponding knowledge base is available as TAC KBP Reference Knowledge Base (LDC2014T16). Also included in this package are the results of an Entity Linking IAA (Inter-Annotator Agreement) study conducted in 2010.

TAC KBP encourages the development of systems that can match entities mentioned in natural texts with those appearing in a knowledge base and extract novel information about entities from a document collection and add it to a new or existing knowledge base. English Entity Linking was first conducted as part of the 2009 TAC KBP evaluations. Its goal is to measure systems' ability to determine whether an entity, specified by a query, has a matching node in a reference knowledge base (KB) and, if so, to create a link between the two. If there is no matching node for a query entity in the KB, EL systems are required to cluster the mention together with others referencing the same entity.

TAC KBP English Entity Linking - Comprehensive Training and Evaluation Data 2009-2013 is distributed via web download.

2018 Subscription Members will automatically receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(4) IARPA Babel Cebuano Language Pack IARPA-babel301b-v2.0b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 191 hours of Cebuano conversational and scripted telephone speech collected in 2013 and 2014 along with corresponding transcripts.

The Cebuano speech in this release represents that spoken in the Cebu-North Kana, Sialo, and Mindanao dialect regions of the Philippines. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 75 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Cebuano Language Pack IARPA-babel301b-v2.0b is available via web download.

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.


Tuesday, May 15, 2018

LDC 2018 May Newsletter


New Publications:
__________________________________________________________

New publications:

(1) Rhythm and Pitch contains approximately 27 minutes of spontaneous English conversations and radio news stories annotated with the Rhythm and Pitch (RaP) scheme. Speech data for annotation was taken from two corpora released by LDC, CALLHOME American English Speech (LDC97S42) and Boston University Radio Speech Corpus (LDC96S36).

The RaP system permits the capture of both intonational and rhythmic aspects of speech. Four labeling tiers are used for annotating speech prosody. These tiers carry information about the syllabic organization and orthography of the speech, its rhythmic structure, tonal patterns, and other information. More information about the RaP system is available on the RaP homepage.

Speech data are presented as flac compressed 16-bit wav files. The Boston data are one channel 16kHz files, while the CALLHOME data are either one or two channel 8kHz files. Annotations are UTF-8 encoded Praat TextGrids.

Rhythm and Pitch is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*

(2) GALE Phase 4 Arabic Broadcast News Speech was developed by LDC and is comprised of approximately 37 hours of Arabic broadcast news speech collected in 2008 and 2009 by LDC and MediaNet, Tunis, Tunisia and MTC, Rabat, Morocco during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

Corresponding transcripts are released as GALE Phase 4 Arabic Broadcast News Transcripts (LDC2018T14).

The recordings in this release feature news broadcasts focusing principally on current events from the following sources: Abu Dhabi TV, a television station based in Abu Dhabi, United Arab Emirates; Al Arabiya, a news television station based in Dubai; Al Baghdadya, an Iraqi broadcast programmer; Alhurra, a U.S. government-funded regional broadcaster; Al Iraqiyah, an Iraqi television station; Aljazeera, a regional broadcaster located in Doha, Qatar; Al Ordiniyah, a national broadcast station in Jordan; Kuwait TV, a national broadcast station based in Kuwait; Radio Sawa, a U.S. government-funded regional broadcaster; Saudi TV, a national television station based in Saudi Arabia; Syria TV, the national television station in Syria; and Yemen TV, a television station based in Yemen.

This release contains 51 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Arabic speaker following Audit Procedure Specification Version 2.0 which is included in this release.
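
For a rough sense of scale, the stated format implies the uncompressed size computed below. This is a back-of-the-envelope sketch, not a figure from the release: it assumes exactly 37 hours of audio, ignores file headers and the savings from FLAC compression, and uses an illustrative function name.

```python
def pcm_bytes(hours, sample_rate_hz=16000, channels=1, bits_per_sample=16):
    """Approximate size in bytes of uncompressed PCM audio (headers ignored)."""
    seconds = hours * 3600
    bytes_per_second = sample_rate_hz * channels * bits_per_sample // 8
    return int(seconds * bytes_per_second)


# 16000 samples/s * 2 bytes * 1 channel = 32000 bytes per second.
print(pcm_bytes(37))  # 4262400000, i.e. about 4.26 GB before compression

# Average length per file in minutes, assuming 37 hours across 51 files.
average_minutes = 37 * 60 / 51
print(round(average_minutes, 1))
```

The FLAC files as distributed will be smaller than this estimate, since FLAC is a lossless compression format.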

GALE Phase 4 Arabic Broadcast News Speech is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) GALE Phase 4 Arabic Broadcast News Transcripts was developed by LDC and contains transcriptions of approximately 37 hours of Arabic broadcast news speech collected in 2008 and 2009 by the Linguistic Data Consortium (LDC), MediaNet, Tunis, Tunisia and MTC, Rabat, Morocco during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) program.

Corresponding audio data is released as GALE Phase 4 Arabic Broadcast News Speech (LDC2018S05).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 204,735 tokens. The transcripts were created with the LDC tool XTrans, which supports manual transcription and annotation of audio recordings.
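As an illustration of working with tab-delimited, UTF-8 transcript files, the sketch below counts whitespace-separated tokens in one column. The column layout is not described here, so the column index, the header row, and the ";;" comment prefix are all assumptions to be checked against the corpus documentation; the sample rows are invented for the demonstration. A whitespace count of this kind will not necessarily reproduce the 204,735 figure, which may rest on a different tokenization.

```python
import csv
import io


def count_tokens(stream, text_column, comment_prefix=";;", has_header=True):
    """Count whitespace-separated tokens in one column of a tab-delimited stream.

    text_column, comment_prefix and has_header are assumptions about the
    file layout; adjust them to match the actual documentation.
    """
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    total = 0
    for index, row in enumerate(reader):
        if has_header and index == 0:
            continue  # skip the assumed header row
        if not row or row[0].startswith(comment_prefix):
            continue  # skip blank lines and assumed comment lines
        if len(row) <= text_column:
            continue  # skip rows too short to hold the text column
        total += len(row[text_column].split())
    return total


# Hypothetical sample data, not taken from the corpus.
SAMPLE = (
    "file\tstart\tend\ttranscript\n"
    ";; hypothetical comment line\n"
    "demo_file\t0.00\t2.50\tمرحبا بكم في نشرة الأخبار\n"
    "demo_file\t2.50\t4.00\tأهلا وسهلا\n"
)

print(count_tokens(io.StringIO(SAMPLE), text_column=3))
```

To read a file from disk instead, open it with `open(path, encoding="utf-8", newline="")` and pass the file object as `stream`.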

GALE Phase 4 Arabic Broadcast News Transcripts is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.


Friday, April 13, 2018

LDC 2018 April Newsletter


LDC at ICASSP 2018

LDC at the Philadelphia Science Carnival

New Publications:
_____________________________________________________________________
LDC at ICASSP 2018
LDC will be exhibiting at ICASSP 2018, held this year April 15-20 in Calgary, Canada. Stop by booth B2 to learn more about recent developments at the Consortium and new publications.

Also, be on the lookout for the following presentations featuring LDC work:

Enhancement and Analysis of Conversational Speech: JSALT 2017
Tuesday, April 17, 16:00 - 18:00
Session: Speech Analysis

Leveraging LSTM Models for Overlap Detection in Multi-Party Meetings
Wednesday, April 18, 13:30 - 15:30
Session: Speaker Diarization & Identification

A Novel LSTM-based Speech Preprocessor for Speaker Diarization in Realistic Mismatch Conditions
Wednesday, April 18, 13:30 - 15:30
Session: Speaker Diarization & Identification

LDC will post conference updates via our Twitter feed and Facebook page. We hope to see you there!  

LDC at the Philadelphia Science Carnival
LDC will share the fun of language with the community on Saturday, April 28, with a booth at the Philadelphia Science Carnival. Visitors will enjoy three language-oriented educational activities that include a language identification game and Chinese character recognition.

The Philadelphia Science Carnival is an annual event organized by Philadelphia’s Franklin Institute to acquaint children and adults with the joys of science.


New Publications:

(1) Concretely Annotated New York Times was developed by Johns Hopkins University's Human Language Technology Center of Excellence. It adds multiple kinds and instances of automatically generated syntactic, semantic, and coreference annotations to The New York Times Annotated Corpus (LDC2008T19). Concrete is a schema for representing structured, hierarchical, and overlapping linguistic annotations. Where several tools produce the same annotation type, this release stores their outputs as different annotation theories under a shared tokenization. Concretely Annotated New York Times contains all of the 1.8 million articles in The New York Times Annotated Corpus.

Concretely Annotated New York Times is distributed via hard drive.

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Any organization that licensed The New York Times Annotated Corpus (LDC2008T19) may request a copy of Concretely Annotated New York Times (LDC2018T12) for a $250 media fee. Non-members may license this data for a fee.

*

(2) H2, E2, ERK1 Children's Writing was developed by the Cooperative State University Baden-Württemberg, University of Education. It consists of approximately 2,000 texts written over four months by 173 German school children aged six through eleven. The data in this corpus was collected by elementary schools in Baden-Württemberg, Germany, and digitized at the Cooperative State University during the 2016/2017 school year. Three second-, third-, and fourth-grade classrooms participated in the collection. Texts were written within regular class settings. The students were presented with a picture and were asked to write a story to describe the picture or, if unable to write a text, to list what they saw in the picture.

Of the 173 participants, 100 were multilingual, and further metadata is available for 166 of the children. The following is included for each text in the database: school week of collection; school type; age; gender; grade/classroom; language spoken at home; and school materials used.

LDC has also released H1 Children's Writing (LDC2016T01).

H2, E2, ERK1 Children's Writing is distributed via web download.

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) TRAD Arabic-French Parallel Text -- Newsgroup was developed by ELDA as part of the PEA-TRAD project. It contains French translations of a subset of approximately 10,000 Arabic words from GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 (LDC2009T03). The PEA-TRAD project (Translation as a Support for Document Analysis) was supported by the French Ministry of Defense (DGA). Its purpose was to develop speech-to-speech translation technology for multiple languages (e.g., Arabic, Chinese, Pashto) from a variety of domains. This release consists of 398 segments (translation units) from 17 documents. The source data is Arabic newsgroup text collected and translated into English by LDC for the DARPA GALE (Global Autonomous Language Exploitation) program.

LDC has also released TRAD Chinese-French Parallel Text -- Blog (LDC2018T02).

TRAD Arabic-French Parallel Text -- Newsgroup is distributed via web download.

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.