
Monday, July 15, 2019

LDC 2019 July Newsletter

In this newsletter:

Fall 2019 LDC Data Scholarship Program
 

LDC data and commercial technology development

New Publications:  

The DKU-JNU-EMA Electromagnetic Articulography Database
Phrase Detectives Corpus Version 2
First DIHARD Challenge Evaluation - Nine Sources
First DIHARD Challenge Evaluation – SEEDLingS
__________________________________________________________
 

Fall 2019 LDC Data Scholarship Program

Student applications for the Fall 2019 LDC Data Scholarship program are being accepted now through September 15, 2019. This scholarship program provides eligible students with access to LDC data at no cost. Students must complete an application consisting of a data use proposal and letter of support from their advisor.

For application requirements and program rules, please visit the LDC Data Scholarship page.

LDC data and commercial technology development

For-profit organizations are reminded that an LDC membership is a pre-requisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.
 
__________________________________________________________

New publications:

(1) The DKU-JNU-EMA Electromagnetic Articulography Database was developed by Duke Kunshan University and Jinan University and contains approximately 10 hours of articulography and speech data in Mandarin, Cantonese, Hakka, and Teochew Chinese from two to seven native speakers for each dialect.

Articulatory measurements were made using the NDI electromagnetic articulography wave research system to capture real-time vocal tract variable trajectories. Subjects had six sensors placed in various locations in their mouth and one reference sensor was placed on the bridge of their nose. For simultaneous recording of speech signals, subjects also wore a head-mounted close-talk microphone.

Speakers engaged in four types of recording sessions: one in which they read complete sentences or short texts, and three in which they read sets of related words sharing a specific consonant, vowel, or tone.
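
For readers who want to inspect the articulatory trajectories programmatically, the short Python sketch below simply loads one recording and reports the range of motion of a single sensor. The array layout, file name, and sensor ordering are illustrative assumptions, not the corpus's actual directory structure or file format, so consult the release documentation before adapting it.

import numpy as np

# Illustrative sketch only: assumes a trajectory exported to a NumPy array of
# shape (n_frames, n_sensors, 3); the actual corpus file format may differ.
traj = np.load("speaker01_sentence001_ema.npy")   # hypothetical file name
tongue_tip = traj[:, 0, :]                        # hypothetical sensor index 0
print("frames:", traj.shape[0], "sensors:", traj.shape[1])
print("tongue-tip vertical range:", tongue_tip[:, 2].min(), "to", tongue_tip[:, 2].max())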

The DKU-JNU-EMA Electromagnetic Articulography Database is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $1000.

*

(2) Phrase Detectives Corpus Version 2 was developed by the School of Computer Science and Electronic Engineering at the University of Essex and consists of approximately 407,000 tokens across 537 documents anaphorically-annotated by the Phrase Detectives Game, an online interactive "game-with-a-purpose" (GWAP) designed to collect data about English anaphoric coreference.

This release constitutes a new version of the Phrase Detectives Corpus (LDC2017T08), adding significantly more annotated tokens to the data set and supplying players’ judgments and a silver label annotation based on the probabilistic aggregation method for anaphoric information for each markable.
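
Because each markable carries multiple player judgments plus an aggregated silver label, a natural first experiment is to compare a simple aggregation of the raw judgments against the supplied silver labels. The sketch below uses plain majority voting over hypothetical judgments; the corpus's own silver labels come from a probabilistic aggregation model, so this is only a rough baseline, and the markable IDs and label names are made up.

from collections import Counter

# Hypothetical per-markable player judgments (label names are illustrative,
# e.g. DO = discourse-old, DN = discourse-new); the corpus's silver labels
# are produced by probabilistic aggregation, not by this majority vote.
judgments = {
    "markable_017": ["DO", "DO", "DN", "DO"],
    "markable_018": ["DN", "DN", "NR", "DN"],
}

for markable, labels in judgments.items():
    label, count = Counter(labels).most_common(1)[0]
    print(f"{markable}: {label} ({count}/{len(labels)} players agree)")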

The documents in the corpus are taken from Wikipedia articles and from narrative text in Project Gutenberg. The annotation is a simplified form of the coding scheme used in The ARRAU Corpus of Anaphoric Information (LDC2013T22).

Phrase Detectives Corpus Version 2 is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data at no cost.
 
*

(3) First DIHARD Challenge Evaluation - Nine Sources was developed by LDC and contains approximately 18 hours of English and Chinese speech data along with corresponding annotations used in support of the First DIHARD Challenge.

The First DIHARD Challenge was an attempt to reinvigorate work on diarization through a shared task focusing on "hard" diarization; that is, speaker diarization for challenging corpora where there was an expectation that existing state-of-the-art systems would fare poorly. As such, it included speech from a wide sampling of domains representing diversity in number of speakers, speaker demographics, interaction style, recording quality, and environmental conditions as follows (all sources are in English unless otherwise indicated):
 
  • Autism Diagnostic Observation Schedule (ADOS) interviews
  • Conversations in Restaurants
  • DCIEM/HCRC map task (LDC96S38)
  • Audiobook recordings from LibriVox
  • Meeting speech collected by LDC in 2001 for the ROAR project (see, e.g., ISL Meeting Speech Part 1 (LDC2004S05))
  • 2001 U.S. Supreme Court oral arguments
  • Mixer 6 Speech (LDC2013S02)
  • Chinese video collected by LDC as part of the Video Annotation for Speech Technologies (VAST) project
  • YouthPoint radio interviews
This release, when combined with First DIHARD Challenge Evaluation - SEEDLingS (LDC2019S13), contains the evaluation set audio data and annotation as well as the official scoring tool. The development data for the First DIHARD Challenge is also available from LDC as Eight Sources (LDC2019S09) and SEEDLingS (LDC2019S10).
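
Diarization references and system outputs in this evaluation framework are conventionally exchanged as Rich Transcription Time Marked (RTTM) files, which is also the input the scoring tool expects. As a minimal, hedged sketch (the file name is hypothetical and the standard ten-column RTTM layout is assumed), the snippet below tallies speech time per speaker from one such file.

from collections import defaultdict

# Assumes standard RTTM columns: type, file, channel, onset, duration,
# NA, NA, speaker, NA, NA. The file name is hypothetical.
totals = defaultdict(float)
with open("DH_EVAL_0001.rttm", encoding="utf-8") as f:
    for line in f:
        fields = line.split()
        if not fields or fields[0] != "SPEAKER":
            continue
        totals[fields[7]] += float(fields[4])

for speaker, secs in sorted(totals.items()):
    print(f"{speaker}: {secs:.1f} s of speech")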

First DIHARD Challenge Evaluation - Nine Sources is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $300.

* 

(4) First DIHARD Challenge Evaluation – SEEDLingS was developed by Duke University and LDC and contains approximately two hours of English child language recordings along with corresponding annotations used in support of the First DIHARD Challenge.

The source data was drawn from the SEEDLingS (The Study of Environmental Effects on Developing Linguistic Skills) corpus, designed to investigate how infants' early linguistic and environmental input plays a role in their learning. Recordings for SEEDLingS were generated in the home environment of 44 infants from 6-18 months of age in the Rochester, New York area. A subset of that data was annotated by LDC for use in the First DIHARD Challenge.

This release, when combined with First DIHARD Challenge Evaluation - Nine Sources (LDC2019S12), contains the evaluation set audio data and annotation as well as the official scoring tool. The development data for the First DIHARD Challenge is also available from LDC as Eight Sources (LDC2019S09) and SEEDLingS (LDC2019S10).

First DIHARD Challenge Evaluation – SEEDLingS is distributed via web download.

2019 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for $50. 
 *

Monday, May 16, 2016

LDC May 2016 Newsletter

LDC at LREC 2016

New publications:
GALE Phase 4 Chinese Broadcast Conversation Speech
GALE Phase 4 Chinese Broadcast Conversation Transcripts 
_______________________________________________________________

LDC at LREC 2016

LDC will attend the 10th Language Resources and Evaluation Conference (LREC 2016), hosted by ELRA, the European Language Resources Association. The conference will be held in Portorož, Slovenia from May 23-28 and features a broad range of sessions on language resources and human language technologies research. Seven LDC staff members will be presenting current work on topics including trends in HLT research, building language resources for autism spectrum disorders, data management plans, rapid development of morphological analyzers for typologically diverse languages, selection criteria for low resource language programs, multi-language speech collection for NIST LRE, novel incentives for collecting data and annotation from people, and more.

Following the conference, the papers and posters presented by LDC staff will be available on LDC's Papers Page.


New Corpora

(1) SDP 2014 & 2015: Broad Coverage Semantic Dependency Parsing consists of data, tools, system results, and publications associated with the 2014 and 2015 tasks on Broad-Coverage Semantic Dependency Parsing (SDP) conducted in conjunction with the International Workshop on Semantic Evaluation (SemEval) and was developed by the SDP task organizers.

SemEval is an ongoing series of evaluations of computational semantic analysis systems intended to explore the nature of meaning in language. It evolved from the Senseval word sense disambiguation series to include semantic analysis tasks outside of word sense disambiguation.

This release is based on English, Chinese and Czech data from the following resources: Treebank-2 (LDC95T17), Proposition Bank I (LDC2004T14), NomBank v 1.0 (LDC2008T23) and CCGbank (LDC2005T13) (English); Chinese Treebank (e.g., Chinese Treebank 8.0 (LDC2013T21)) (Chinese); and Prague Dependency Treebank (e.g., Prague Dependency Treebank 2.0 (LDC2006T01)) (Czech).

The results are presented as graphs in three target representations: MRS-Derived Semantic Dependencies (DM), Enju Predicate–Argument Structures (PAS), and Prague Semantic Dependencies (PSD). As a fourth target representation, CCGbank was converted to semantic dependency graphs (in the subdirectory 'ccd').
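
The SDP graphs are distributed in a tab-separated, CoNLL-style format in which each token row flags whether the token is a top node or a predicate and lists its argument roles in one column per predicate. The sketch below extracts (predicate, argument, role) triples from one sentence block; the assumed column positions (FORM at index 1, TOP at 4, PRED at 5, argument columns from 6) follow the 2014 task layout and should be checked against the format notes included with the data.

# Rough sketch: (predicate, argument, role) triples from one sentence block in
# an SDP-style file. Column positions are assumptions; verify against the
# format description shipped with the task data.
def read_triples(sentence_lines):
    rows = [line.rstrip("\n").split("\t") for line in sentence_lines]
    forms = [r[1] for r in rows]
    pred_ids = [i for i, r in enumerate(rows) if r[5] == "+"]  # predicate rows
    triples = []
    for i, r in enumerate(rows):
        for k, role in enumerate(r[6:]):
            if role != "_":
                triples.append((forms[pred_ids[k]], forms[i], role))
    return triples

example = ["1\tDogs\tdog\tNNS\t-\t-\tARG1", "2\tbark\tbark\tVBP\t+\t+\t_"]
print(read_triples(example))  # [('bark', 'Dogs', 'ARG1')]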


SDP 2014 & 2015: Broad Coverage Semantic Dependency Parsing is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus.  2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(2) GALE Phase 4 Chinese Broadcast Conversation Speech was developed by LDC and is comprised of approximately 172 hours of Mandarin Chinese broadcast conversation speech collected in 2008 by LDC and Hong Kong University of Science and Technology during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

Corresponding transcripts are released as GALE Phase 4 Chinese Broadcast Conversation Transcripts (LDC2016T12).

The broadcast conversation recordings in this release feature interviews, call-in programs and roundtable discussions focusing principally on current events and are contained in 236 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Chinese speaker following Audit Procedure Specification Version 2.0 which is included in this release.
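
Since the audio is delivered as FLAC-compressed, 16 kHz, single-channel, 16-bit PCM, it can be opened directly with common audio libraries. The sketch below uses the Python soundfile package to confirm those parameters for one file; the file name is hypothetical.

import soundfile as sf

path = "example_broadcast_recording.flac"  # hypothetical file name
info = sf.info(path)
print(info.format, info.samplerate, info.channels, info.subtype)
# expected for this release: FLAC 16000 1 PCM_16

audio, rate = sf.read(path)
print(f"duration: {len(audio) / rate / 60:.1f} minutes")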


GALE Phase 4 Chinese Broadcast Conversation Speech is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(3) GALE Phase 4 Chinese Broadcast Conversation Transcripts was developed by LDC and contains transcriptions of approximately 172 hours of Chinese broadcast conversation speech collected in 2008 by LDC and Hong Kong University of Science and Technology during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

Corresponding audio data is released as GALE Phase 4 Chinese Broadcast Conversation Speech (LDC2016S03).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 2,259,952 tokens.

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC’s quick transcription guidelines (QTR) and quick rich transcription specification (QRTR). QTR transcription consists of quick (near-) verbatim, time-aligned transcripts plus speaker identification with minimal additional mark-up. QRTR adds additional structural information such as topic boundaries and manual sentence unit annotation.
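
Because the transcripts are plain-text, tab-delimited (TDF) files with one segment per line, they can be read with any spreadsheet or scripting tool. The sketch below pulls start time, end time, speaker, and transcript text from each row; the column positions used here are assumptions about the TDF layout, so confirm them against the header line and the documentation included with the corpus.

import csv

# Assumed TDF column positions: start=2, end=3, speaker=4, transcript=7.
# The file name is hypothetical; check the release docs for the real layout.
with open("example_transcript.tdf", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="\t"):
        if len(row) < 8 or row[0].endswith(";unicode"):  # skip header/short rows
            continue
        start, end, speaker, text = row[2], row[3], row[4], row[7]
        print(f"[{start}-{end}] {speaker}: {text}")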


GALE Phase 4 Chinese Broadcast Conversation Transcripts is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus.  2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.


Wednesday, December 16, 2015

LDC 2015 December Newsletter

Renew your LDC membership today

Spring 2016 LDC Data Scholarship Program - deadline approaching

LDC at LSA 2016

LDC to close for Winter Break

New publications
________________________________________________________________________

Renew your LDC membership today
Membership Year 2016 (MY2016) discounts are available for those who keep their membership current and join early in the year. Check here for further information including our planned publications for MY2016.

Now is also a good time to consider joining LDC for the current and open membership years, MY2015 and MY2014.  MY2015 includes data such as RATS Speech Activity Detection and updates to Penn Treebank. MY2014 remains open through the end of the 2015 calendar year and its publications include UN speech data, 2009 NIST LRE test set, 2007 ACE multilingual data, and multi-channel WSJ audio. For full descriptions of these data sets, visit our Catalog.

Spring 2016 LDC Data Scholarship Program - deadline approaching
The deadline for the Spring 2016 LDC Data Scholarship Program is right around the corner! Student applications are being accepted now through January 15, 2016, 11:59PM EST. The LDC Data Scholarship program provides university students with access to LDC data at no cost. This program is open to students pursuing both undergraduate and graduate studies in an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay.

Students will need to complete an application which consists of a data use proposal and letter of support from their adviser. For further information on application materials and program rules, please visit the LDC Data Scholarship page.

Students can email their applications to the LDC Data Scholarships program. Decisions will be sent by email from the same address.

LDC at LSA 2016
LDC will be exhibiting at the Annual Meeting of the Linguistic Society of America, held January 7-10, 2016 in Washington, DC. Stop by booth 110 to learn more about recent developments at the Consortium and new publications. Also, be on the lookout for the following presentations:

Satellite Workshop: Preparing Your Corpus for Archival Storage
Malcah Yaeger-Dror (University of Arizona) and Christopher Cieri (LDC)
Thursday, January 7, 2016 - 8:00am to 3:00pm, Salon 4

Broadening connections among researchers in linguistics and human language technologies
Jeff Good (University at Buffalo) and Christopher Cieri (LDC)
Friday, January 8, 2016 - 7:30am to 9:00am, Salon 1

Diachronic development of pitch contrast in Seoul Korean
Sunghye Cho (UPenn), Yong-cheol Lee (Cheongju University) and Mark Liberman (LDC)
Friday, January 8, 2016 - 2:00pm to 5:00pm, Salon 1

LDC will post conference updates via our Twitter feed and Facebook page. We hope to see you there!  

LDC to close for Winter Break
LDC will be closed from Friday, December 25, 2015 through Friday, January 1, 2016 in accordance with the University of Pennsylvania Winter Break Policy. Our offices will reopen on Monday, January 4, 2016. Requests received for membership renewals and corpora during the Winter Break will be processed at that time.

New publications
(1) 2006 CoNLL Shared Task - Arabic & Czech consists of Arabic and Czech dependency treebanks used as part of the CoNLL 2006 shared task on multi-lingual dependency parsing.

This corpus is cross listed with ELRA as ELRA-W0087.

The Conference on Computational Natural Language Learning (CoNLL) is accompanied every year by a shared task intended to promote natural language processing applications and evaluate them in a standard setting. In 2006, the shared task was devoted to the parsing of syntactic dependencies using corpora from up to thirteen languages. The task aimed to define and extend the then-current state of the art in dependency parsing, a technology that complemented previous tasks by producing a different kind of syntactic description of input text. More information about the 2006 shared task is available on the CoNLL-X web page.

The source data in this release consists principally of news and journal texts. The individual data sets are subsets of the following:

2006 CoNLL Shared Task - Arabic & Czech is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus. 2015 Standard Members may request a copy as part of their 16 free membership corpora. This data is being made available at no-cost for non-member organizations under a research license.


*
(2) 2006 CoNLL Shared Task - Ten Languages consists of dependency treebanks in ten languages used as part of the CoNLL 2006 shared task on multi-lingual dependency parsing. The languages covered in this release are: Bulgarian, Danish, Dutch, German, Japanese, Portuguese, Slovene, Spanish, Swedish and Turkish.

This corpus is cross listed and jointly released with ELRA as ELRA-W0086.

The Conference on Computational Natural Language Learning (CoNLL) is accompanied every year by a shared task intended to promote natural language processing applications and evaluate them in a standard setting. In 2006, the shared task was devoted to the parsing of syntactic dependencies using corpora from up to thirteen languages. The task aimed to define and extend the then-current state of the art in dependency parsing, a technology that complemented previous tasks by producing a different kind of syntactic description of input text. More information about the 2006 shared task is available on the CoNLL-X web page.

The source data in the treebanks in this release consists principally of various texts (e.g., textbooks, news, literature) annotated in dependency format. In general, dependency grammar is based on the idea that the verb is the center of the clause structure and that other units in the sentence are connected to the verb as directed links or dependencies. 
The individual data sets are:

2015 Subscription Members will automatically receive two copies of this corpus. 2015 Standard Members may request a copy as part of their 16 free membership corpora. This data is being made available at no-cost for non-member organizations under a research license.
*
(3) GALE Phase 3 Chinese Broadcast News Speech was developed by LDC and is comprised of approximately 150 hours of Mandarin Chinese broadcast news speech collected in 2007 and 2008 by LDC and the Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

Corresponding transcripts are released as GALE Phase 3 Chinese Broadcast News Transcripts (LDC2015T25).

The broadcast news recordings in this release feature news broadcasts focusing principally on current events from the following sources: Anhui TV, China Central TV (CCTV), Phoenix TV and Voice of America (VOA).

This release contains 279 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Chinese speaker following Audit Procedure Specification Version 2.0 which is included in this release.

GALE Phase 3 Chinese Broadcast News Speech is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus. 2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.
*
(4) GALE Phase 3 Chinese Broadcast News Transcripts was developed by LDC and contains transcriptions of approximately 150 hours of Chinese broadcast news speech collected in 2007 and 2008 by LDC and the Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

Corresponding audio data is released as GALE Phase 3 Chinese Broadcast News Speech (LDC2015S13).

The broadcast news recordings for transcription feature news broadcasts focusing principally on current events from the following sources: Anhui TV,  China Central TV (CCTV), Phoenix TV and Voice of America (VOA).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 1,933,695 tokens. The transcripts were created with the LDC-developed transcription tool, XTrans, a multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings. 

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and quick rich transcription specification (QRTR) both of which are included in the documentation with this release.

GALE Phase 3 Chinese Broadcast News Transcripts is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus. 2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.


Tuesday, February 17, 2015

LDC 2015 February Newsletter


Only two weeks left to enjoy 2015 membership savings 

New publications:
Avocado Research Email Collection
GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 3
RATS Speech Activity Detection
_________________________________________________________________________

Only two weeks left to enjoy 2015 membership savings 

There’s still time to save on 2015 membership fees. Now through March 2, all organizations will receive a 5% discount when they join for MY2015. MY2014 members are eligible for an additional 5% off the fee when they renew before March 2.  

Don’t miss this savings opportunity. Secure your membership today for access to new corpora as well as discounts on our existing catalog of over 600 holdings. 2015 publications include the following:

  • CIEMPIESS - Mexican Spanish radio broadcast audio and transcripts     
  • GALE Phase 3 and 4 data – all tasks and languages
  • Mandarin Chinese Phonetic Segmentation and Tone Corpus - phonetic segmentation and tone labels  
  • RATS Speech Activity Detection  – multilanguage audio for robust speech detection and language identification
  • SEAME - Mandarin-English code-switching speech
To join, create or sign into your LDC user account, select your preferred membership type from the Catalog, add the item to your bin and follow the check-out process. The Membership Office will apply any discounts. Alternatively, if you have already received a renewal invoice from LDC, you can simply pay against that.

For more information on the benefits of membership, visit the Join LDC page.


New publications

(1) Avocado Research Email Collection consists of emails and attachments taken from 279 accounts of a defunct information technology company referred to as "Avocado". Most of the accounts are those of Avocado employees; the remainder represent shared accounts such as "Leads", or system accounts such as "Conference Room Upper Canada".

The collection consists of the processed personal folders of these accounts with metadata describing folder structure, email characteristics and contacts, among others. It is expected to be useful for social network analysis, e-discovery and related fields.


The source data for the collection consisted of Personal Storage Table (PST) files for 282 accounts. A PST file is used by MS Outlook to store emails, calendar entries, contact details, and related information. Data was extracted from the PST files using libpst version 0.6.54. Three files produced no output and are not included in the collection. Each account is referred to as a "custodian" although some of the accounts do not correspond to humans.

The collection is divided into metadata and text. The metadata is represented in XML, with a single top-level XML file listing the custodians, and then one XML file per custodian listing all items extracted from that custodian's PST files. The full XML tree can be read by loading the top-level file with an XML parser that handles directives. All XML metadata files are encoded in UTF-8. The text contains the extracted text of the items in the custodians' folders, with the extracted text for each item being held in a separate file. The text files are then zipped into a zip file per custodian.
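
Given that layout, a typical first step is to pair a custodian's metadata file with its zip of extracted text. The sketch below does exactly that for one custodian; the directory names, custodian ID, and the idea that the custodian XML's children correspond to items are assumptions about the release structure, and the top-level XML (which uses directives) may need a parser such as lxml rather than the standard library.

import zipfile
import xml.etree.ElementTree as ET

custodian = "custodian_001"  # hypothetical custodian ID and paths
meta = ET.parse(f"metadata/{custodian}.xml").getroot()
print(custodian, "items listed:", len(list(meta)))

with zipfile.ZipFile(f"text/{custodian}.zip") as zf:
    first = zf.namelist()[0]
    body = zf.read(first).decode("utf-8", errors="replace")
    print(first, "->", body[:200])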

Avocado Research Email Collection is distributed on 1 DVD-ROM.

2015 Subscription Members will automatically receive two copies of this corpus provided that they have completed the license agreement. 2015 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 3 was developed by LDC and contains 242,020 tokens of word aligned Chinese and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program. Some approaches to statistical machine translation include the incorporation of linguistic knowledge in word aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. 

Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. A set of word tags and alignment link tags are designed in the tagging scheme to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.

This release consists of Chinese source broadcast conversation (BC) and broadcast news (BN) programming collected by LDC in 2008 and 2009. The distribution by genre, words, character tokens and segments appears below:



Language    Genre    Files    Words      CharTokens    Segments
Chinese     BC       92       67,354     101,032       2,714
Chinese     BN       34       93,992     140,988       3,314
Total                126      161,346    242,020       6,028


Note that all token counts are based on the Chinese data only. One token is equivalent to one character and one word is equivalent to 1.5 characters.
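
As a quick check of those figures, 161,346 words at 1.5 characters per word comes to roughly 242,019 characters, which matches the 242,020 character tokens in the table above.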

The Chinese word alignment tasks consisted of the following components:

  • Identifying, aligning, and tagging eight different types of links
  • Identifying, attaching, and tagging local-level unmatched words
  • Identifying and tagging sentence/discourse-level unmatched words
  • Identifying and tagging all instances of Chinese 的 (DE) except when they were a part of a semantic link
GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 3 is distributed via web download. 2015 Subscription Members will automatically receive two copies of this corpus.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(3) RATS Speech Activity Detection was developed by LDC and is comprised of approximately 3,000 hours of Levantine Arabic, English, Farsi, Pashto, and Urdu conversational telephone speech with automatic and manual annotation of speech segments. The corpus was created to provide training, development and initial test sets for the Speech Activity Detection (SAD) task in the DARPA RATS (Robust Automatic Transcription of Speech) program.

The goal of the RATS program was to develop human language technology systems capable of performing speech detection, language identification, speaker identification and keyword spotting on the severely degraded audio signals that are typical of various radio communication channels, especially those employing various types of handheld portable transceiver systems. To support that goal, LDC assembled a system for the transmission, reception and digital capture of audio data that allowed a single source audio signal to be distributed and recorded over eight distinct transceiver configurations simultaneously. 

Those configurations included three frequencies -- high, very high and ultra high -- variously combined with amplitude modulation, frequency hopping spread spectrum, narrow-band frequency modulation, single-side-band or wide-band frequency modulation. Annotations on the clear source audio signal, e.g., time boundaries for the duration of speech activity, were projected onto the corresponding eight channels recorded from the radio receivers.

The source audio consists of conversational telephone speech recordings collected by LDC: (1) data collected for the RATS program from Levantine Arabic, Farsi, Pashto and Urdu speakers; and (2) material from the Fisher English (LDC2004S13, LDC2005S13), and Fisher Levantine Arabic telephone studies (LDC2007S02), as well as from CALLFRIEND Farsi (LDC2014S01).

Annotation was performed in three steps. LDC's automatic speech activity detector was run against the audio data to produce a speech segmentation for each file. Manual first pass annotation was then performed as a quick correction of the automatic speech activity detection output. Finally, in a manual second pass annotation step, annotators reviewed first pass output and made adjustments to segments as needed.
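
Once the manual passes are complete, each audio file has a segmentation marking speech and non-speech regions. The sketch below computes the fraction of annotated time labeled as speech for one file; the tab-separated layout and the "S" speech label assumed here are illustrative, so check the annotation format description in the corpus documentation.

# Assumed layout: tab-separated rows whose first three fields are start time,
# end time, and a label, with "S" marking speech. File name is hypothetical.
speech = total = 0.0
with open("example_sad_annotation.tab", encoding="utf-8") as f:
    for line in f:
        start, end, label = line.rstrip("\n").split("\t")[:3]
        dur = float(end) - float(start)
        total += dur
        if label == "S":
            speech += dur

print(f"speech fraction: {speech / total:.1%}")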

All audio files are presented as single-channel, 16-bit PCM, 16000 samples per second; lossless FLAC compression is used on all files; when uncompressed, the files have typical "MS-WAV" (RIFF) file headers.

RATS Speech Activity Detection is distributed on 1 hard drive.  2015 Subscription Members will automatically receive one copy of this corpus.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

Tuesday, January 20, 2015

LDC 2015 January Newsletter

LDC Membership Discounts for MY 2015 Still Available

New publications:


LDC Membership Discounts for MY 2015 Still Available
If you are considering joining LDC for Membership Year 2015 (MY2015), there is still time to save on membership fees. Any organization which joins or renews membership for 2015 through Monday, March 2, 2015, is entitled to a 5% discount on membership fees. Organizations which held membership for MY2014 can receive a 10% discount on fees provided they renew prior to March 2, 2015. For further information on planned publications for MY2015, please visit the LDC website or contact LDC.

New publications

GALE Phase 2 Arabic Broadcast News Speech Part 2 was developed by LDC and is comprised of approximately 170 hours of Arabic broadcast news speech collected in 2007 by LDC, MediaNet (Tunis, Tunisia), and MTC (Rabat, Morocco) during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. Corresponding transcripts are released as GALE Phase 2 Arabic Broadcast News Transcripts Part 2 (LDC2015T01).

Broadcast audio for the GALE program was collected at LDC’s Philadelphia, PA USA facilities and at three remote collection sites: Hong Kong University of Science and Technology, Hong Kong (Chinese), Medianet (Tunis, Tunisia) (Arabic), and MTC (Rabat, Morocco) (Arabic). The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources for a total of over 30,000 hours of collected broadcast audio over the life of the program.

The broadcast recordings in this release feature news programs focusing principally on current events from the following sources: Abu Dhabi TV, a television station based in Abu Dhabi, United Arab Emirates; Al Alam News Channel, based in Iran; Aljazeera, a regional broadcaster located in Doha, Qatar; Al Ordiniyah, a national broadcast station in Jordan; Dubai TV, based in Dubai, United Arab Emirates; Al Iraqiyah, a television network based in Iraq; Kuwait TV, a national television station based in Kuwait; Lebanese Broadcasting Corporation, a Lebanese television station; Nile TV, a broadcast programmer based in Egypt; Saudi TV, a national television station based in Saudi Arabia; and Syria TV, the national television station in Syria.

This release contains 204 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Arabic speaker following Audit Procedure Specification Version 2.0 which is included in this release. The broadcast auditing process served three principal goals: as a check on the operation of the broadcast collection system equipment by identifying failed, incomplete or faulty recordings; as an indicator of broadcast schedule changes by identifying instances when the incorrect program was recorded; and as a guide for data selection by retaining information about a program’s genre, data type and topic.

GALE Phase 2 Arabic Broadcast News Speech Part 2 is distributed on 3 DVD-ROMs.

2015 Subscription Members will automatically receive two copies of this corpus.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

GALE Phase 2 Arabic Broadcast News Transcripts Part 2 was developed by LDC and contains transcriptions of approximately 170 hours of Arabic broadcast news speech collected in 2007 by LDC, MediaNet (Tunis, Tunisia), and MTC (Rabat, Morocco) during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) program. Corresponding audio data is released as GALE Phase 2 Arabic Broadcast News Speech Part 2 (LDC2015S01).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 920,730 tokens. The transcripts were created with the LDC-developed transcription tool, XTrans, a multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings.

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and quick rich transcription specification (QRTR) both of which are included in the documentation with this release. QTR transcription consists of quick (near-)verbatim, time-aligned transcripts plus speaker identification with minimal additional mark-up. It does not include sentence unit annotation. QRTR annotation adds structural information such as topic boundaries and manual sentence unit annotation to the core components of a quick transcript. Files with QTR as part of the filename were developed using QTR transcription. Files with QRTR in the filename indicate QRTR transcription.

GALE Phase 2 Arabic Broadcast News Transcripts Part 2 is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

SenSem (Sentence Semantics) Databank was developed by GRIAL, the Linguistic Applications Inter-University Research Group that includes the following Spanish institutions: the Universitat Autonoma de Barcelona, the Universitat de Barcelona, the Universitat de Lleida and the Universitat Oberta de Catalunya. It contains syntactic and semantic annotation for over 35,000 sentences, approximately one million words of Spanish and approximately 700,000 words of Catalan translated from the Spanish. GRIAL's work focuses on resources for applied linguistics, including lexicography, translation and natural language processing.

Each sentence in SenSem Databank was labeled according to the verb sense it exemplifies, the type of complement it takes (arguments or adjuncts) and the syntactic category and function. Each argument was also labeled with a semantic role. Further information about the SenSem project can be obtained from the GRIAL website.

The Spanish source data includes texts from news journals (30,000 sentences) and novels (5,299 sentences). Those sentences represent around 1,000 different verb meanings that correspond to the 250 most frequent Spanish verbs. Verb frequencies were retrieved from a quantitative analysis of around 13 million words.

The Catalan corpus was developed by translating the news journal portion of the Spanish data set, resulting in a resource of over 700,000 words from which 391,267 words were annotated. Sentences were automatically translated and manually post-edited; some were re-annotated for sentence complements. Semantic information was the same for both languages. The Catalan sentences represent close to 1,300 different verbs.

SenSem Databank is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus on disc.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee. This data is made available to LDC not-for-profit members and all non-members under the Creative Commons Attribution-Noncommercial Share Alike 3.0 license and to LDC for-profit members under the terms of the For-Profit Membership Agreement.

Friday, August 15, 2014

LDC 2014 August Newsletter


Fall 2014 LDC Data Scholarship program - September 15 deadline approaching
Neural Engineering Data Consortium publishes first release

New publications:



Fall 2014 LDC Data Scholarship program - September 15 deadline approaching!
Student applications for the Fall 2014 LDC Data Scholarship program are being accepted now through Monday, September 15, 2014, 11:59PM EST.  The LDC Data Scholarship program provides university students with access to LDC data at no cost.  This program is open to students pursuing both undergraduate and graduate studies in an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay.  

Students will need to complete an application which consists of a data use proposal and letter of support from their adviser. For further information on application materials and program rules, please visit the LDC Data Scholarship page.

Applicants can email their materials to the LDC Data Scholarship program. Decisions will be sent by email from the same address.

Neural Engineering Data Consortium publishes first release
The Neural Engineering Data Consortium (NEDC) has announced its first release, the Temple University Hospital Electroencephalogram (TUH EEG) corpus. The TUH EEG corpus is a database of over 20,000 EEG recordings which will aid the development of technology to automatically interpret EEG scans. NEDC, directed by Professors Iyad Obeid and Joe Picone of Temple University, Philadelphia, PA USA, designs, collects and distributes data and resources in support of neural engineering research.

NEDC is surveying community needs to help set priorities for future effort. You can complete the survey here.

New publications
(1) GALE Phase 2 Arabic Broadcast News Speech Part 1 was developed by LDC and is comprised of approximately 165 hours of Arabic broadcast news speech collected in 2006 and 2007 by LDC, MediaNet (Tunis, Tunisia), and MTC (Rabat, Morocco) during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. Corresponding transcripts are released as GALE Phase 2 Arabic Broadcast News Transcripts Part 1 (LDC2014T17).

Broadcast audio for the GALE program was collected at LDC’s Philadelphia, PA USA facilities and at three remote collection sites: Hong Kong University of Science and Technology, Hong Kong (Chinese), Medianet (Tunis, Tunisia) (Arabic), and MTC (Rabat, Morocco) (Arabic). The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources for a total of over 30,000 hours of collected broadcast audio over the life of the program.

The broadcast recordings in this release feature news programs focusing principally on current events from the following sources: Abu Dhabi TV, a television station based in Abu Dhabi, United Arab Emirates; Al Alam News Channel, based in Iran; Alhurra, a U.S. government-funded regional broadcaster; Aljazeera, a regional broadcaster located in Doha, Qatar; Dubai TV, a broadcast station in the United Arab Emirates; Al Iraqiyah, an Iraqi television station; Kuwait TV, a national broadcast station in Kuwait; Lebanese Broadcasting Corporation, a Lebanese television station; Nile TV, a broadcast programmer based in Egypt; Saudi TV, a national television station based in Saudi Arabia; and Syria TV, the national television station in Syria.

This release contains 200 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Arabic speaker following Audit Procedure Specification Version 2.0 which is included in this release. The broadcast auditing process served three principal goals: as a check on the operation of the broadcast collection system equipment by identifying failed, incomplete or faulty recordings; as an indicator of broadcast schedule changes by identifying instances when the incorrect program was recorded; and as a guide for data selection by retaining information about a program’s genre, data type and topic.

GALE Phase 2 Arabic Broadcast News Speech Part 1 is distributed on three DVD-ROMs.

2014 Subscription Members will automatically receive two copies of this data.  2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*
(2) GALE Phase 2 Arabic Broadcast News Transcripts Part 1 was developed by LDC and contains transcriptions of approximately 165 hours of Arabic broadcast news speech collected in 2006 and 2007 by LDC, MediaNet (Tunis, Tunisia), and MTC (Rabat, Morocco) during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) program. Corresponding audio data is released as GALE Phase 2 Arabic Broadcast News Speech Part 1 (LDC2014S07).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 897,868 tokens. The transcripts were created with the LDC-developed transcription tool, XTrans, a multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings. 

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and quick rich transcription specification (QRTR) both of which are included in the documentation with this release. QTR transcription consists of quick (near-)verbatim, time-aligned transcripts plus speaker identification with minimal additional mark-up. It does not include sentence unit annotation. QRTR annotation adds structural information such as topic boundaries and manual sentence unit annotation to the core components of a quick transcript. Files with QTR as part of the filename were developed using QTR transcription. Files with QRTR in the filename indicate QRTR transcription.

GALE Phase 2 Arabic Broadcast News Transcripts Part 1 is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc.  2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*
(3) TAC KBP Reference Knowledge Base was developed by LDC in support of the NIST-sponsored TAC-KBP evaluation series. It is a knowledge base built from English Wikipedia articles and their associated infoboxes and covers over 800,000 entities.

TAC (Text Analysis Conference) is a series of workshops organized by NIST (the National Institute of Standards and Technology) to encourage research in natural language processing and related applications by providing a large test collection, common evaluation procedures, and a forum for researchers to share their results. TAC's KBP track (Knowledge Base Population) encourages the development of systems that can match entities mentioned in natural texts with those appearing in a knowledge base and extract novel information about entities from a document collection and add it to a new or existing knowledge base.

Consult the LDC TAC-KBP project page for further information about LDC's resource development for the TAC-KBP program.

The source data, Wikipedia infoboxes and articles, was taken from an October 2008 snapshot of Wikipedia.

TAC KBP Reference Knowledge Base contains a set of entities, each with a canonical name and title for the Wikipedia page, an entity type, an automatically parsed version of the data from the infobox in the entity's Wikipedia article, and a stripped version of the text of the Wiki article. Each entity is assigned one of four types: PER (person), ORG (organization), GPE (geo-political entity) and UKN (unknown). All data files are presented as UTF-8 encoded XML.
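
To get a feel for the knowledge base, it helps to walk one of the UTF-8 XML data files and print each entity with its parsed infobox facts. The element and attribute names in the sketch below (entity, facts, fact) and the file path are assumptions about the schema; verify them against the DTD and documentation in the release.

import xml.etree.ElementTree as ET

root = ET.parse("data/kb_part-0001.xml").getroot()  # hypothetical path
for entity in root.iter("entity"):
    print(entity.get("id"), entity.get("type"), entity.get("name"))
    facts = entity.find("facts")
    if facts is not None:
        for fact in facts.iter("fact"):
            value = "".join(fact.itertext()).strip()
            print("   ", fact.get("name"), "=", value)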

TAC KBP Reference Knowledge Base is distributed on one DVD-ROM.


2014 Subscription Members will automatically receive two copies of this data.  2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.