Linguistic Data Consortium: membership discounts

Monday, February 17, 2020

LDC 2020 February Newsletter

Only two weeks left to enjoy 2020 membership discounts

LREC Workshop on Citizen Linguistics - Deadline Extended

New Publications:

TAC KBP English Event Argument - Training and Evaluation Data 2014-2015
Chinese CogBank
Machine Reading Phase 1 IC Training Data
IARPA Babel Dholuo Language Pack IARPA-babel403b-v1.0b

__________________________________________________________________________
Only two weeks left to enjoy 2020 membership discounts

There is still time to save on 2020 membership fees. Through March 2, all organizations receive a discount on the 2020 membership fee (up to 10%) when they choose to join or renew. For more information on membership benefits, visit Join LDC.

LREC Workshop on Citizen Linguistics - Deadline Extended

LDC researchers and their colleagues are organizing a workshop on Citizen Linguistics and Language Resource Development at LREC 2020 (Language Resources and Evaluation Conference), to take place on May 16, 2020. The workshop includes an open call for papers in language-related citizen science, a tutorial on using the new LanguageARC.org citizen linguistics portal, and a special session on best papers using LanguageARC. The Call for Papers deadline has been extended until February 24, 2020.

_______________________________________________________________________

New publications:

(1) TAC KBP English Event Argument - Training and Evaluation Data 2014-2015 was developed by LDC and contains training and evaluation data produced in support of the 2014 TAC KBP English Event Argument Extraction Pilot and Evaluation tasks and the 2015 English Event Argument Extraction and Linking Training and Evaluation tasks.

The Event Argument Extraction and Linking task required systems to extract event arguments (entities or attributes playing a role in an event) from unstructured text, indicate the role they play in an event, and link the arguments appearing in the same event to each other. Since the extracted information must be suitable as input to a knowledge base, systems constructed tuples indicating the event type, the role played by the entity in the event, and the most canonical mention of the entity from the source document. The event types and roles were drawn from an externally-specified ontology of 31 event types, which included financial transactions, communication events, and attacks.
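
As a rough illustration of the tuple structure described above, the sketch below models one extracted argument in Python. The class name `EventArgument`, its field names, the `event_id` linking key and all example values are hypothetical; they are not taken from the TAC KBP task specification or from the file formats in this corpus.

```python
from typing import NamedTuple


class EventArgument(NamedTuple):
    """One extracted event argument (hypothetical field names)."""
    event_type: str         # an event type from the task ontology
    role: str               # role the entity plays in the event
    canonical_mention: str  # most canonical mention in the source document
    event_id: str           # illustrative key for linking arguments of one event


def group_by_event(arguments):
    """Link arguments that belong to the same event by grouping on event_id."""
    events = {}
    for arg in arguments:
        events.setdefault(arg.event_id, []).append(arg)
    return events


# Invented example values, for illustration only.
arguments = [
    EventArgument("attack", "attacker", "the rebels", "e1"),
    EventArgument("attack", "target", "the convoy", "e1"),
    EventArgument("communication", "speaker", "the minister", "e2"),
]
events = group_by_event(arguments)
```

Grouping on a shared key is only one simple way to picture the linking step; the actual evaluation format may represent linked arguments differently.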

This corpus includes source documents, manual runs, assessments, and event hoppers, a form of identity coreference for events (2015 only). Source data is English newswire and discussion forum text collected by LDC.

TAC KBP English Event Argument - Training and Evaluation Data 2014-2015 is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) Chinese CogBank is a database of cognitive properties of Chinese words intended for use in metaphor understanding and generation. It consists of 232,497 "word-property" pairs covering 83,104 words and 100,195 properties. Each "word-property" type also has an associated frequency, which can serve as a functional measure of the importance of a property.

The data was collected via the Chinese search engine Baidu.com. The original collection consisted of 1,258,430 types (5,637,500 tokens) of "word-adjective" pairs that were reduced in Chinese CogBank to 232,497 "word-property" pairs after a series of manual checks.

Chinese CogBank is distributed via web download.

(3) Machine Reading Phase 1 IC Training Data was developed by LDC for use in the DARPA (Defense Advanced Research Projects Agency) Machine Reading program. It contains 248 English source documents and 116 standoff annotation files, annotated with instances of explicit relations and their arguments, as well as some non-explicit relations.

The Machine Reading program aimed to develop automated reading systems to bridge the gap between knowledge contained in natural language texts and knowledge accessible to formal reasoning systems. The reading systems designed by program participants were required to extract and reason about facts from text in multiple domains.

The data in this release constitutes the training data for the IC (Core Domain) task, which tested the core domain by extracting information about Entities (people, organizations, geopolitical entities) and their involvement in four types of Relations (Attack Relations, Biographical Relations, Affiliation Relations and Family Relations), as described in newswire text. This information was then aligned with an IC Use Cases ontology that would allow automated reasoning about the extracted Entities and Relations.

Machine Reading Phase 1 IC Training Data is distributed via web download.

(4) IARPA Babel Dholuo Language Pack IARPA-babel403b-v1.0b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 204 hours of Dholuo conversational and scripted telephone speech collected in 2014 and 2015 along with corresponding transcripts.

The Dholuo speech in this release represents the South Nyanza and Trans-Yala dialect regions of Kenya. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 65 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Dholuo Language Pack IARPA-babel403b-v1.0b is distributed via web download.

2020 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Wednesday, January 15, 2020

LDC 2020 January Newsletter

Renew Your LDC Membership Today
LREC Workshop for Citizen Linguistics – Call for Papers

New Publications:
Abstract Meaning Representation (AMR) Annotation Release 3.0
Database of Word Level Statistics – Mandarin
LibriVox Spanish
________________________________________________________________________

Renew Your LDC Membership Today

Join LDC for MY2020 while membership savings are still available. Now through March 2, 2020, renewing MY2019 members receive a 10% discount off the 2020 membership fee. New or returning member organizations receive a 5% discount. This year’s planned publications include Mixer 4 and 5 Speech (English telephone speech and interviews), IARPA Babel Language Packs (telephone speech and transcripts in underserved languages), and data from BOLT, DEFT, RATS, TAC KBP and more. Membership remains the most economical way to access LDC releases. Visit Join LDC for details on membership options and benefits.

LREC Workshop on Citizen Linguistics

________________________________________________________________________

New publications:

(1) Abstract Meaning Representation (AMR) Annotation Release 3.0 was developed by LDC, SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of over 59,255 English natural language sentences from broadcast conversations, newswire, weblogs, web discussion forums, fiction and web text. This release updates Abstract Meaning Representation 2.0 (LDC2017T10) with new data, more annotations on new and prior data, new or improved PropBank-style frames, enhanced quality control, and multi-sentence annotations.

AMR captures "who is doing what to whom" in a sentence. Each sentence is paired with a graph that represents its whole-sentence meaning in a tree-structure. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independent of its syntax.
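
To make the "who is doing what to whom" idea concrete, here is a minimal Python sketch that stores the AMR for the textbook sentence "The boy wants to go" as a list of triples. The sentence and the variable names (`w`, `b`, `g`) are illustrative and are not drawn from this release; the helper `who_does_what` is a hypothetical name introduced for this example.

```python
# Each triple is (source variable, relation, target).
# ":instance" ties a variable to its concept (a PropBank frame or a plain word).
triples = [
    ("w", ":instance", "want-01"),
    ("b", ":instance", "boy"),
    ("g", ":instance", "go-01"),
    ("w", ":ARG0", "b"),  # the boy is the one who wants
    ("w", ":ARG1", "g"),  # what is wanted is the going
    ("g", ":ARG0", "b"),  # the boy is also the one who goes
]

# Map each variable to the concept it stands for.
concepts = {src: tgt for src, rel, tgt in triples if rel == ":instance"}


def who_does_what(triples, concepts):
    """Return (agent concept, predicate concept) pairs read off :ARG0 edges."""
    return [
        (concepts[tgt], concepts[src])
        for src, rel, tgt in triples
        if rel == ":ARG0"
    ]


pairs = who_does_what(triples, concepts)
```

The variable `b` is the target of two edges, so the same participant fills a role in both predicates without being repeated.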

Abstract Meaning Representation (AMR) Annotation Release 3.0 is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) Database of Word Level Statistics – Mandarin was developed by The Hong Kong Polytechnic University. It provides lexical characteristics of a descriptive and statistical nature for words and nonwords of Mandarin Chinese. It is designed for researchers particularly concerned with language processing of isolated words. Invariant characteristics include each item's lexicality, SAMPA, pinyin, IPA transcription, lexical tone, syllable structure, syllable length, pinyin length, segment length, dominant PoS, lexical frequency of the dominant PoS, percent of that dominant PoS, and other PoSes associated with the given item.

Database of Word Level Statistics – Mandarin is distributed via web download.

(3) LibriVox Spanish consists of approximately 73 hours of Spanish read speech and transcripts. The audio data was taken from Spanish audiobooks developed by LibriVox, a non-profit project that creates audiobooks from public domain works. The transcripts were developed for this release.

The audio consists of sentences from 300 books read by 154 speakers (77 men and 77 women), representing native and non-native Spanish read speech. Audio files were manually segmented and are between three and ten seconds in length. Native Spanish speakers transcribed the audio data.

LibriVox Spanish is distributed via web download.

2020 Subscription Members will automatically receive copies of this corpus. 2020 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Wednesday, December 16, 2015

LDC 2015 December Newsletter

Renew your LDC membership today

Spring 2016 LDC Data Scholarship Program - deadline approaching

LDC at LSA 2016

LDC to close for Winter Break

New publications

2006 CoNLL Shared Task - Arabic & Czech

2006 CoNLL Shared Task - Ten Languages

GALE Phase 3 Chinese Broadcast News Speech

GALE Phase 3 Chinese Broadcast News Transcripts

________________________________________________________________________

Renew your LDC membership today

Membership Year 2016 (MY2016) discounts are available for those who keep their membership current and join early in the year. Check here for further information including our planned publications for MY2016.

Now is also a good time to consider joining LDC for the current and open membership years, MY2015 and MY2014. MY2015 includes data such as RATS Speech Activity Detection and updates to Penn Treebank. MY2014 remains open through the end of the 2015 calendar year and its publications include UN speech data, 2009 NIST LRE test set, 2007 ACE multilingual data, and multi-channel WSJ audio. For full descriptions of these data sets, visit our Catalog.

Spring 2016 LDC Data Scholarship Program - deadline approaching

The deadline for the Spring 2016 LDC Data Scholarship Program is right around the corner! Student applications are being accepted now through January 15, 2016, 11:59PM EST. The LDC Data Scholarship program provides university students with access to LDC data at no cost. This program is open to students pursuing both undergraduate and graduate studies in an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay.

Students will need to complete an application which consists of a data use proposal and letter of support from their adviser. For further information on application materials and program rules, please visit the LDC Data Scholarship page.

Students can email their applications to the LDC Data Scholarships program. Decisions will be sent by email from the same address.

LDC at LSA 2016

LDC will be exhibiting at the Annual Meeting of the Linguistic Society of America, held January 7-10, 2016 in Washington, DC. Stop by booth 110 to learn more about recent developments at the Consortium and new publications. Also, be on the lookout for the following presentations:

Satellite Workshop: Preparing Your Corpus for Archival Storage
Malcah Yaeger-Dror (University of Arizona) and Christopher Cieri (LDC)
Thursday, January 7, 2016 - 8:00am to 3:00pm, Salon 4

Broadening connections among researchers in linguistics and human language technologies
Jeff Good (University at Buffalo) and Christopher Cieri (LDC)
Friday, January 8, 2016 - 7:30am to 9:00am, Salon 1

Diachronic development of pitch contrast in Seoul Korean
Sunghye Cho (UPenn), Yong-cheol Lee (Cheongju University) and Mark Liberman (LDC)
Friday, January 8, 2016 - 2:00pm to 5:00pm, Salon 1

LDC will post conference updates via our Twitter feed and Facebook page. We hope to see you there!

LDC to close for Winter Break

LDC will be closed from Friday, December 25, 2015 through Friday, January 1, 2016 in accordance with the University of Pennsylvania Winter Break Policy. Our offices will reopen on Monday, January 4, 2016. Requests received for membership renewals and corpora during the Winter Break will be processed at that time.

New publications

(1) 2006 CoNLL Shared Task - Arabic & Czech consists of Arabic and Czech dependency treebanks used as part of the CoNLL 2006 shared task on multi-lingual dependency parsing.

This corpus is cross listed with ELRA as ELRA-W0087.

The Conference on Computational Natural Language Learning (CoNLL) is accompanied every year by a shared task intended to promote natural language processing applications and evaluate them in a standard setting. In 2006, the shared task was devoted to the parsing of syntactic dependencies using corpora from up to thirteen languages. The task aimed to define and extend the then-current state of the art in dependency parsing, a technology that complemented previous tasks by producing a different kind of syntactic description of input text. More information about the 2006 shared task is available on the CoNLL-X web page.

The source data in this release consists principally of news and journal texts. The individual data sets are subsets of the following:

2006 CoNLL Shared Task - Arabic & Czech is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus. 2015 Standard Members may request a copy as part of their 16 free membership corpora. This data is being made available at no cost for non-member organizations under a research license.

(2) 2006 CoNLL Shared Task - Ten Languages consists of dependency treebanks in ten languages used as part of the CoNLL 2006 shared task on multi-lingual dependency parsing. The languages covered in this release are: Bulgarian, Danish, Dutch, German, Japanese, Portuguese, Slovene, Spanish, Swedish and Turkish.

This corpus is cross listed and jointly released with ELRA as ELRA-W0086.

The Conference on Computational Natural Language Learning (CoNLL) is accompanied every year by a shared task intended to promote natural language processing applications and evaluate them in a standard setting. In 2006, the shared task was devoted to the parsing of syntactic dependencies using corpora from up to thirteen languages. The task aimed to define and extend the then-current state of the art in dependency parsing, a technology that complemented previous tasks by producing a different kind of syntactic description of input text. More information about the 2006 shared task is available on the CoNLL-X web page.

The source data in the treebanks in this release consists principally of various texts (e.g., textbooks, news, literature) annotated in dependency format. In general, dependency grammar is based on the idea that the verb is the center of the clause structure and that other units in the sentence are connected to the verb as directed links or dependencies.
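
The directed links described above can be pictured with a small Python sketch that reads a sentence laid out in the ten tab-separated columns used for the CoNLL-X shared task. The three-token sentence, its tags and its relation labels are invented for illustration and do not come from any treebank in this release; label inventories differ from treebank to treebank. The function name `read_dependencies` is hypothetical.

```python
# Invented three-token sentence in a CoNLL-X-style layout (tab-separated).
# Columns: ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL PHEAD PDEPREL
SAMPLE = (
    "1\tJohn\tjohn\tN\tNNP\t_\t2\tSBJ\t_\t_\n"
    "2\tsaw\tsee\tV\tVBD\t_\t0\tROOT\t_\t_\n"
    "3\tMary\tmary\tN\tNNP\t_\t2\tOBJ\t_\t_\n"
)


def read_dependencies(block):
    """Return (dependent form, relation, head form) for each token."""
    rows = [line.split("\t") for line in block.strip().split("\n")]
    forms = {row[0]: row[1] for row in rows}
    forms["0"] = "<root>"  # HEAD 0 marks the root of the sentence
    return [(row[1], row[7], forms[row[6]]) for row in rows]


links = read_dependencies(SAMPLE)
```

In this toy sentence both nouns point to the verb as their head, and the verb itself attaches to the artificial root.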

The individual data sets are:

BulTreeBank (Bulgarian)
The Danish Dependency Treebank (Danish)
The Alpino Treebank (Dutch)
The TIGER Corpus (German)
Treebank Tuba-J/S (Japanese)
Floresta Sinta(c)tica (Portuguese)
Slovene Dependency Treebank, SDT V0.1 (Slovene)
Cast3LB (Spanish)
Talbanken05 (Swedish)
METU-Sabanci Turkish Treebank (Turkish)

(3) GALE Phase 3 Chinese Broadcast News Speech was developed by LDC and comprises approximately 150 hours of Mandarin Chinese broadcast news speech collected in 2007 and 2008 by LDC and Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

Corresponding transcripts are released as GALE Phase 3 Chinese Broadcast News Transcripts (LDC2015T25).

The broadcast news recordings in this release feature news broadcasts focusing principally on current events from the following sources: Anhui TV, China Central TV (CCTV), Phoenix TV and Voice of America (VOA).

This release contains 279 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Chinese speaker following Audit Procedure Specification Version 2.0 which is included in this release.

GALE Phase 3 Chinese Broadcast News Speech is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus. 2015 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(4) GALE Phase 3 Chinese Broadcast News Transcripts was developed by LDC and contains transcriptions of approximately 150 hours of Chinese broadcast news speech collected in 2007 and 2008 by LDC and Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

Corresponding audio data is released as GALE Phase 3 Chinese Broadcast News Speech (LDC2015S13).

The broadcast news recordings for transcription feature news broadcasts focusing principally on current events from the following sources: Anhui TV, China Central TV (CCTV), Phoenix TV and Voice of America (VOA).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 1,933,695 tokens. The transcripts were created with the LDC-developed transcription tool, XTrans, a multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings.

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and quick rich transcription specification (QRTR), both of which are included in the documentation with this release.

GALE Phase 3 Chinese Broadcast News Transcripts is distributed via web download.

Tuesday, February 17, 2015

LDC 2015 February Newsletter

Only two weeks left to enjoy 2015 membership savings

New publications:
Avocado Research Email Collection
GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 3
RATS Speech Activity Detection
_________________________________________________________________________

Only two weeks left to enjoy 2015 membership savings

There’s still time to save on 2015 membership fees. Now through March 2, all organizations will receive a 5% discount when they join for MY2015. MY2014 members are eligible for an additional 5% off the fee when they renew before March 2.

Don’t miss this savings opportunity. Secure your membership today for access to new corpora as well as discounts on our existing catalog of over 600 holdings. 2015 publications include the following:

CIEMPIESS - Mexican Spanish radio broadcast audio and transcripts
GALE Phase 3 and 4 data – all tasks and languages
Mandarin Chinese Phonetic Segmentation and Tone Corpus - phonetic segmentation and tone labels
RATS Speech Activity Detection – multilanguage audio for robust speech detection and language identification
SEAME - Mandarin-English code-switching speech

To join, create or sign into your LDC user account, select your preferred membership type from the Catalog, add the item to your bin and follow the check-out process. The Membership Office will apply any discounts. Alternatively, if you have already received a renewal invoice from LDC, you can simply pay against that.

For more information on the benefits of membership, visit Join LDC.

New publications

(1) Avocado Research Email Collection consists of emails and attachments taken from 279 accounts of a defunct information technology company referred to as "Avocado". Most of the accounts are those of Avocado employees; the remainder represent shared accounts such as "Leads", or system accounts such as "Conference Room Upper Canada".

The collection consists of the processed personal folders of these accounts with metadata describing folder structure, email characteristics and contacts, among others. It is expected to be useful for social network analysis, e-discovery and related fields.

The source data for the collection consisted of Personal Storage Table (PST) files for 282 accounts. A PST file is used by MS Outlook to store emails, calendar entries, contact details, and related information. Data was extracted from the PST files using libpst version 0.6.54. Three files produced no output and are not included in the collection. Each account is referred to as a "custodian" although some of the accounts do not correspond to humans.

The collection is divided into metadata and text. The metadata is represented in XML, with a single top-level XML file listing the custodians, and then one XML file per custodian listing all items extracted from that custodian's PST files. The full XML tree can be read by loading the top-level file with an XML parser that handles directives. All XML metadata files are encoded in UTF-8. The text contains the extracted text of the items in the custodians' folders, with the extracted text for each item being held in a separate file. The text files are then zipped into a zip file per custodian.

Avocado Research Email Collection is distributed on 1 DVD-ROM. 2015 Subscription Members will automatically receive two copies of this corpus provided that they have completed the license agreement. 2015 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 3 was developed by LDC and contains 242,020 tokens of word aligned Chinese and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program. Some approaches to statistical machine translation include the incorporation of linguistic knowledge in word aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging.

Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. A set of word tags and alignment link tags are designed in the tagging scheme to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.

This release consists of Chinese source broadcast conversation (BC) and broadcast news (BN) programming collected by LDC in 2008 and 2009. The distribution by genre, words, character tokens and segments appears below:

Language	Genre	Files	Words	CharTokens	Segments
Chinese	BC	92	67,354	101,032	2,714
Chinese	BN	34	93,992	140,988	3,314
Total		126	161,346	242,020	6,028

Note that all token counts are based on the Chinese data only. One token is equivalent to one character and one word is equivalent to 1.5 characters.
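
The relationship between the Words and CharTokens columns can be checked with a few lines of Python. This is a sketch under one assumption not stated in the text: that the word counts were obtained by dividing the character counts by 1.5 and rounding down. The function name `words_from_chars` is hypothetical.

```python
# Character-token counts per genre, copied from the table above.
char_tokens = {"BC": 101_032, "BN": 140_988}


def words_from_chars(chars):
    """Apply the 1.5 characters-per-word ratio, rounding down (assumption)."""
    return chars * 2 // 3  # integer form of chars / 1.5


words = {genre: words_from_chars(n) for genre, n in char_tokens.items()}
total_chars = sum(char_tokens.values())
total_words = sum(words.values())
```

Under that assumption the computed values match the table: 67,354 and 93,992 words per genre, 161,346 words and 242,020 character tokens in total.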

The Chinese word alignment tasks consisted of the following components:

Identifying, aligning, and tagging eight different types of links
Identifying, attaching, and tagging local-level unmatched words
Identifying and tagging sentence/discourse-level unmatched words
Identifying and tagging all instances of Chinese 的 (DE) except when they were a part of a semantic link

GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 3 is distributed via web download. 2015 Subscription Members will automatically receive two copies of this corpus. 2015 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) RATS Speech Activity Detection was developed by LDC and comprises approximately 3,000 hours of Levantine Arabic, English, Farsi, Pashto, and Urdu conversational telephone speech with automatic and manual annotation of speech segments. The corpus was created to provide training, development and initial test sets for the Speech Activity Detection (SAD) task in the DARPA RATS (Robust Automatic Transcription of Speech) program.

The goal of the RATS program was to develop human language technology systems capable of performing speech detection, language identification, speaker identification and keyword spotting on the severely degraded audio signals that are typical of various radio communication channels, especially those employing various types of handheld portable transceiver systems. To support that goal, LDC assembled a system for the transmission, reception and digital capture of audio data that allowed a single source audio signal to be distributed and recorded over eight distinct transceiver configurations simultaneously.

Those configurations included three frequencies -- high, very high and ultra high -- variously combined with amplitude modulation, frequency hopping spread spectrum, narrow-band frequency modulation, single-side-band or wide-band frequency modulation. Annotations on the clear source audio signal, e.g., time boundaries for the duration of speech activity, were projected onto the corresponding eight channels recorded from the radio receivers.

The source audio consists of conversational telephone speech recordings collected by LDC: (1) data collected for the RATS program from Levantine Arabic, Farsi, Pashto and Urdu speakers; and (2) material from the Fisher English (LDC2004S13, LDC2005S13), and Fisher Levantine Arabic telephone studies (LDC2007S02), as well as from CALLFRIEND Farsi (LDC2014S01).

Annotation was performed in three steps. LDC's automatic speech activity detector was run against the audio data to produce a speech segmentation for each file. Manual first pass annotation was then performed as a quick correction of the automatic speech activity detection output. Finally, in a manual second pass annotation step, annotators reviewed first pass output and made adjustments to segments as needed.

All audio files are presented as single-channel, 16-bit PCM, 16000 samples per second; lossless FLAC compression is used on all files; when uncompressed, the files have typical "MS-WAV" (RIFF) file headers.

RATS Speech Activity Detection is distributed on 1 hard drive. 2015 Subscription Members will automatically receive one copy of this corpus. 2015 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.