Linguistic Data Consortium: Amharic

Friday, November 15, 2019

LDC 2019 November Newsletter

Join LDC for Membership Year 2020
Spring 2020 Data Scholarship Program

New Publications:
DEFT English Committed Belief Annotation
CALLFRIEND American English-Non-Southern Dialect Second Edition
TAC KBP Cold Start - Comprehensive Evaluation Data 2012-2017
IARPA Babel Amharic Language Pack IARPA-babel307b-v1.0b

_________________________________________________________________________

Join LDC for Membership Year 2020

Membership Year 2020 (MY2020) is open and discounts are available for those who keep their membership current and join early in the year. Through March 2, 2020, current MY2019 members who renew their LDC membership will receive a 10% discount off the membership fee. New or returning organizations will receive a 5% discount through March 2.

In addition to receiving new publications, current LDC members also enjoy the benefit of licensing older data at reduced costs from our Catalog of over 800 holdings. Current-year for-profit members may use most data for commercial applications.

Plans for MY2020 publications are in progress. Among the expected releases are:

Abstract Meaning Representation (AMR) Annotation Release 3.0: semantic treebank of over 59,000 English natural language sentences from broadcast conversations, newswire, weblogs and web discussion forums; updates the second version (LDC2017T10) with new annotations
TAC KBP: English sentiment slot filling, surprise slot filling, nugget detection and coreference, and event argument data in all languages (English, Chinese and Spanish)
DEFT Chinese ERE: Chinese discussion forum data annotated for entities, relations and events
LibriVox Spanish: 73 hours of Spanish audiobook read speech and transcripts
IARPA Babel Language Packs (telephone speech and transcripts): languages include Dholuo, Javanese and Mongolian
HAVIC Med Training data: web video, metadata, and annotations for developing multimedia systems
RATS Speaker Identification: conversational telephone speech in Levantine Arabic, Pashto, Urdu, Farsi and Dari on degraded audio signals with annotation of speech segments for speaker identification
BOLT: discussion forums, SMS/chat, conversational telephone speech, word-aligned, tagged and co-reference data in all languages (Chinese, Egyptian Arabic, and English)

It’s also not too late to join for MY2018 (through December 31, 2019) and MY2019 (through December 31, 2020). Data sets from those years include Concretely Annotated New York Times and English Gigaword, DIRHA English WSJ Audio, BOLT English Treebank – Discussion Forum, First DIHARD Challenge Development and Evaluation releases, Penn Discourse Treebank Version 3.0, and 2016 NIST Speaker Recognition Evaluation Test Set.

For full descriptions of all LDC data sets, browse our Catalog.

Visit Join LDC for details on membership, user accounts and payment.

Spring 2020 Data Scholarship Program

Applications are being accepted through January 15, 2020 for the Spring 2020 LDC Data Scholarship program, which provides university students with no-cost access to LDC data. Consult the LDC Data Scholarship page for more information about program rules and submission requirements.

_________________________________________________________________________

New publications:

(1) DEFT English Committed Belief Annotation was developed by LDC and consists of approximately 950,000 words of English discussion forum text annotated for "committed belief," which marks the level of commitment displayed by the author to the truth of the propositions expressed in the text.

DARPA's Deep Exploration and Filtering of Text (DEFT) program aimed to address remaining capability gaps in state-of-the-art natural language processing technologies related to inference, causal relationships and anomaly detection. LDC supported the DEFT program by collecting, creating and annotating a variety of data sources.

DEFT English Committed Belief Annotation is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) CALLFRIEND American English-Non-Southern Dialect Second Edition was developed by LDC and consists of approximately 26 hours of unscripted telephone conversations between native speakers of non-Southern dialects of American English. This second edition updates the audio files to wav format, simplifies the directory structure and adds documentation and metadata. The first edition is available as CALLFRIEND American English-Non-Southern Dialect (LDC96S46).

All data was collected before July 1997. Participants could speak with a person of their choice on any topic; most called family members and friends. All calls originated in North America. The recorded conversations last up to 30 minutes.

CALLFRIEND American English-Non-Southern Dialect Second Edition is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) TAC KBP Cold Start - Comprehensive Evaluation Data 2012-2017 was developed by LDC and contains Chinese, English and Spanish data produced in support of the TAC KBP Cold Start evaluation track conducted from 2012 to 2017. This corpus includes source documents, queries, assessments, manual runs and final assessments.

In the Cold Start track, systems were evaluated on their ability to construct a new knowledge base (KB) from information provided in a text collection in combination with technologies developed in other TAC KBP tracks: slot filling, information extraction, question answering and entity discovery and linking. Cold Start systems were required to find all entities in the text, and the resulting KB would ideally include every person, organization, and geo-political entity as well as all the targeted relations between them. To facilitate the evaluation of those KBs, LDC annotators created sets of queries, human-generated responses to the queries, and assessments of both human and system responses.

The source data in this release consists of English and Spanish newswire and web text collected by LDC for the 2012, 2014 and 2015 evaluations and the 2016 pilot collection. The source collections for the 2016 and 2017 evaluations, which include Chinese data, are available in TAC KBP Evaluation Source Corpora 2016-2017 (LDC2019T12). The archived 2013 Cold Start source data collection is available from NIST upon request.

TAC KBP Cold Start - Comprehensive Evaluation Data 2012-2017 is distributed via web download.

2019 Subscription Members will automatically receive copies of this corpus. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(4) IARPA Babel Amharic Language Pack IARPA-babel307b-v1.0b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 204 hours of Amharic conversational and scripted telephone speech collected in 2014 along with corresponding transcripts.

The Amharic speech in this release represents the Addis Ababa, Shewa, and Gondar dialect regions of Ethiopia. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 60 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Amharic Language Pack IARPA-babel307b-v1.0b is distributed via web download.

2019 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2019 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Friday, February 16, 2018

LDC February 2018 Newsletter

Only two weeks left to enjoy 2018 membership discounts

Spring 2018 LDC Data Scholarship recipients

LDC data and commercial technology development

New Publications:
Multi-Language Conversational Telephone Speech 2011 -- Central Asian
LORELEI Amharic Representative Language Pack - Monolingual and Parallel Text
TAC KBP Comprehensive English Source Corpora 2009-2014
IARPA Babel Tok Pisin Language Pack IARPA-babel207b-v1.0e
______________________________________________________________

Only two weeks left to enjoy 2018 membership discounts

There is still time to save on 2018 membership fees. Through March 1, all organizations receive a discount on the 2018 membership fee (up to 10%) when they choose to join or renew.

For more information on membership benefits, visit Join LDC.

Spring 2018 LDC Data Scholarship recipients

Congratulations to the recipients of LDC's Spring 2018 Data Scholarship:

Margarida Madaleno: London School of Economics, PhD Economic Geography. Madaleno is awarded a copy of Treebank 3 for her research in emotional well-being.

Gary Munnelly: Trinity College Dublin, PhD Computer Science and Statistics. Munnelly is awarded a copy of the New York Times Annotated Corpus for his research in named entity recognition and disambiguation in cultural heritage data sets.

Barlian Henryanu Prasetio: University of Miyazaki, PhD Environmental Robotics. Prasetio is awarded copies of SUSAS and SUSAS Transcripts for his work in voice stress recognition systems.

For information about the program, visit the Data Scholarship page.

LDC data and commercial technology development

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

(1) Multi-Language Conversational Telephone Speech 2011 – Central Asian was developed by LDC and consists of approximately 37 hours of telephone speech in three distinct language varieties of Central Asia: Dari, Farsi and Pashto.

The data were collected primarily to support research and technology evaluation in automatic language identification, and portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation (LRE). Participants were recruited by native speakers who contacted acquaintances in their social network. Those native speakers made one call of up to 15 minutes to each acquaintance.

LDC has also released the following as part of the Multi-Language Conversational Telephone Speech 2011 series:

Slavic Group (LDC2016S11)
Turkish (LDC2017S09)
South Asian (LDC2017S14)

Multi-Language Conversational Telephone Speech 2011 – Central Asian is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) LORELEI Amharic Representative Language Pack - Monolingual and Parallel Text was developed by LDC and consists of approximately 25 million words of monolingual Amharic text, approximately 600,000 of which are translated into English. Another 80,000 words are translated from English into Amharic. The LORELEI (Low Resource Languages for Emergent Incidents) Program is concerned with building human language technology for low resource languages in the context of emergent situations like natural disasters or disease outbreaks.

Data was collected in the following genres: discussion forums, news, reference, social network and weblog. Both monolingual text collection and parallel text creation involved a combination of manual and automatic methods, which are detailed in the included documentation. All harvested content was initially converted from its original HTML form into a relatively uniform XML format. Also included in this release are two tools: one to recreate original source data from the processed XML material and the other to condition text data users download from Twitter.

LORELEI Amharic Representative Language Pack - Monolingual and Parallel Text is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) TAC KBP Comprehensive English Source Corpora 2009-2014 was developed by LDC and contains the 3,877,207 English source documents used in support of the TAC KBP tasks from 2009 to 2014. The Text Analysis Conference (TAC) is a series of workshops organized by the National Institute of Standards and Technology (NIST). TAC was developed to encourage research in natural language processing and related applications by providing a large test collection, common evaluation procedures, and a forum for researchers to share their results. Through its various evaluations, the Knowledge Base Population (KBP) track of TAC encourages the development of systems that can match entities mentioned in natural texts with those appearing in a knowledge base, extract novel information about entities from a document collection, and add it to a new or existing knowledge base.

The source data consists of newswire, broadcast material, and web text collected by LDC. Documents are released as a collection of zip files for overall compactness, and ease and efficiency of use. When unpacked, the documents are all UTF-8 text files with a basic markup structure.
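For readers who want to work with the documents without unpacking them to disk, the following is a minimal Python sketch of reading UTF-8 text files directly from a zip archive. It is not part of the corpus documentation: the function names, the demo member name and the demo markup are all hypothetical, and the actual archive layout and markup are described only in the documentation shipped with the release.

```python
import io
import zipfile


def iter_documents(archive):
    """Yield (member_name, text) for each file in a zip archive.

    `archive` may be a path or a file-like object. Every member is
    assumed to be a UTF-8 text file, as the release notes describe.
    """
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # skip directory entries
            yield info.filename, zf.read(info).decode("utf-8")


def make_demo_archive():
    """Build a tiny in-memory zip so the sketch runs without the corpus.

    The member name and markup are invented for illustration and do
    not reflect the real corpus layout.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(
            "demo/doc_0001.xml",
            '<DOC id="doc_0001">\n<TEXT>caf\u00e9</TEXT>\n</DOC>\n',
        )
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for name, text in iter_documents(make_demo_archive()):
        print(name, len(text))
```

To use it on a real file, pass the path of one of the downloaded zip files to `iter_documents` in place of the demo archive. Reading members lazily keeps memory use low when an archive holds many documents.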

TAC KBP Comprehensive English Source Corpora 2009-2014 is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(4) IARPA Babel Tok Pisin Language Pack IARPA-babel207b-v1.0e was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 200 hours of Tok Pisin conversational and scripted telephone speech collected in 2013 along with corresponding transcripts.

The Tok Pisin speech in this release represents that spoken in the Papuan dialect region of Papua New Guinea. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 65 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Tok Pisin Language Pack IARPA-babel207b-v1.0e is available via web download.

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.