Linguistic Data Consortium: March 2018

New Publications:

______________________________________________________________________

(1) BOLT Arabic Discussion Forums was developed by LDC and consists of 813,080 discussion forum threads in Egyptian Arabic harvested from the Internet using a combination of manual and automatic processes. The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. The material in this release represents the unannotated Arabic source data in the discussion forum genre.

Collection was seeded with the results of manual data scouting by native-speaker annotators. Scouts were instructed to seek content in Egyptian Arabic that was original, interactive and informal. Upon locating an appropriate thread, scouts submitted the URL and some simple judgments about it to a database via a web browser plug-in. The scale of the collection precluded manual review of all data. Only a small portion of the threads in this release were manually reviewed, so the release may contain some offensive or otherwise undesired content, as well as some threads with a large amount of non-Arabic content. Even threads that are primarily Arabic may mix Egyptian with other varieties of Arabic.
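Because some threads contain a large amount of non-Arabic content, users may want a rough screen before processing. The following is a minimal sketch, not part of the release: the function names and the 0.5 threshold are assumptions chosen for illustration, and the check only looks at the basic Arabic Unicode block (U+0600 to U+06FF). It cannot distinguish Egyptian Arabic from other varieties.

```python
def arabic_ratio(text: str) -> float:
    """Return the fraction of alphabetic characters in the basic Arabic block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    # U+0600..U+06FF is the basic Arabic block; extended blocks are ignored here.
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return arabic / len(letters)


def mostly_arabic(text: str, threshold: float = 0.5) -> bool:
    """Hypothetical screen: keep text whose Arabic-letter share meets the threshold."""
    return arabic_ratio(text) >= threshold
```

A thread could then be kept or set aside by calling `mostly_arabic` on its extracted text; the threshold should be tuned against a manually checked sample.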

BOLT Arabic Discussion Forums is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) LORELEI Somali Representative Language Pack - Monolingual and Parallel Text was developed by LDC and comprises approximately 13 million words of monolingual Somali text, approximately 800,000 words of which are translated into English. Another 100,000 words are translated from English into Somali. The LORELEI (Low Resource Languages for Emergent Incidents) Program is concerned with building Human Language Technology for low resource languages in the context of emergent situations like natural disasters or disease outbreaks.

Data was collected in the following genres: discussion forums, news, reference, social network and weblog. Both monolingual text collection and parallel text creation involved a combination of manual and automatic methods, which are detailed in the included documentation. All harvested content was initially converted from its original HTML form into a relatively uniform XML format. Also included in this release are two tools: one to recreate original source data from the processed XML material, and the other to condition text data that users download from Twitter.

LORELEI Somali Representative Language Pack - Monolingual and Parallel Text is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) SPADE (Syntactic Phrase Alignment Dataset for Evaluation) consists of annotated parse trees and phrase alignments for English sentential paraphrases extracted from machine translation evaluation corpora and separated into development and test sets.

Reference translations from machine translation evaluation corpora were used as sentential paraphrases. They were sourced from the following data sets released by LDC from the NIST (National Institute of Standards and Technology) open machine translation evaluation series (OpenMT): LDC2010T14, LDC2010T17, LDC2010T21, and LDC2013T03.

Reference translations of 10 to 30 words were randomly extracted for annotation from NIST OpenMT corpora. Gold standard HPSG (head-driven phrase structure grammar) trees and phrase alignments were then annotated, resulting in 20,276 phrases extracted from 201 sentential paraphrases and 15,721 paraphrase alignments.
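The length-based selection described above can be sketched roughly as follows. This is an illustration only: the function name, the whitespace-based word count and the fixed random seed are assumptions, since the announcement does not say how words were counted or how the random extraction was carried out.

```python
import random


def sample_by_length(sentences, k, min_words=10, max_words=30, seed=0):
    """Randomly pick up to k sentences whose word count lies in [min_words, max_words]."""
    # Assumption: words are counted by splitting on whitespace.
    eligible = [s for s in sentences if min_words <= len(s.split()) <= max_words]
    rng = random.Random(seed)  # seeded so the sample is reproducible
    return rng.sample(eligible, min(k, len(eligible)))
```

Given a list of reference translations, `sample_by_length(refs, 100)` would return at most 100 sentences in the 10 to 30 word range.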

SPADE is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Linguistic Data Consortium

Thursday, March 15, 2018

LDC March 2018 Newsletter