Linguistic Data Consortium: June 2018

LDC Catalog certified as CoreTrustSeal data repository

LDC data and commercial technology development

New Publications:

BOLT Chinese SMS/Chat

Multi-Language Conversational Telephone Speech 2011 -- Central European

TAC KBP English Entity Linking - Comprehensive Training and Evaluation Data 2009-2013

IARPA Babel Cebuano Language Pack IARPA-babel301b-v2.0b
__________________________________________________________________________

LDC Catalog certified as CoreTrustSeal data repository

LDC is pleased to announce that the Catalog has been awarded the CoreTrustSeal, recognizing it as a trustworthy data repository. This means that the Catalog meets a series of standards covering data access, rights management, curation, and storage developed by the ICSU World Data System and the Data Seal of Approval. LDC joins the other 136 certified repositories around the globe in the commitment to promote sustainable and trustworthy data infrastructures.

LDC data and commercial technology development

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

(1) BOLT Chinese SMS/Chat was developed by LDC and consists of naturally occurring Short Message Service (SMS) and Chat (CHT) data collected through data donations and live collection involving native speakers of Chinese. The corpus contains 14,877 conversations totaling 3,005,810 words across 497,543 messages.

The BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources – discussion forums, text messaging, and chat – in Chinese, Egyptian Arabic, and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking, and co-reference. The data in this release was collected using two methods: new collection via LDC's collection platform, and donation of SMS or chat archives from BOLT collection participants.

BOLT Chinese SMS/Chat is distributed via web download.

2018 Subscription Members will automatically receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) Multi-Language Conversational Telephone Speech 2011 -- Central European was developed by LDC and comprises approximately 44 hours of telephone speech in two distinct language varieties of Central Europe: Czech and Slovak.

The data were collected primarily to support research and technology evaluation in automatic language identification, and portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation (LRE). Participants were recruited by native speakers who contacted acquaintances in their social network. Those native speakers made one call of up to 15 minutes to each acquaintance. Human auditors labeled the calls for callee gender, dialect type, and noise.

LDC has also released the following as part of the Multi-Language Conversational Telephone Speech 2011 series:

· Slavic Group (LDC2016S11)

· Turkish (LDC2017S09)

· South Asian (LDC2017S14)

· Central Asian (LDC2018S03)

Multi-Language Conversational Telephone Speech 2011 -- Central European is distributed via web download.

(3) TAC KBP English Entity Linking - Comprehensive Training and Evaluation Data 2009-2013 was developed by LDC and contains training and evaluation data produced in support of the TAC KBP English Entity Linking tasks in 2009, 2010, 2011, 2012, and 2013. It includes queries and gold standard entity type information, Knowledge Base links, and equivalence class clusters for NIL entities. Also included are the source documents for the queries, specifically, English newswire, discussion forum, and web data. The corresponding knowledge base is available as TAC KBP Reference Knowledge Base (LDC2014T16). Also included in this package are the results of an Entity Linking IAA (Inter-Annotator Agreement) study conducted in 2010.

TAC KBP encourages the development of systems that can match entities mentioned in natural texts with those appearing in a knowledge base, extract novel information about entities from a document collection, and add it to a new or existing knowledge base. English Entity Linking (EL) was first conducted as part of the 2009 TAC KBP evaluations. Its goal is to measure systems' ability to determine whether an entity, specified by a query, has a matching node in a reference knowledge base (KB) and, if so, to create a link between the two. If there is no matching node for a query entity in the KB, EL systems are required to cluster the mention together with others referencing the same entity.
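The link-or-cluster decision described above can be illustrated with a minimal Python sketch. It is not the TAC KBP scoring code or any participant's system: the toy knowledge base, the query IDs, the NIL ID format, and the exact-match-on-normalized-name rule are all assumptions made for illustration only.

```python
# Hypothetical toy knowledge base: node ID -> known name strings.
TOY_KB = {
    "E001": {"university of pennsylvania", "penn", "upenn"},
    "E002": {"philadelphia"},
}


def normalize(name):
    """Lowercase a mention and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def link_queries(queries, kb):
    """Map each query ID to a KB node ID or to a NIL cluster ID.

    queries: list of (query_id, mention_string) pairs.
    Exact match on the normalized name is a stand-in for real linking.
    """
    # Build a lookup from every normalized KB name to its node ID.
    name_to_node = {}
    for node_id, names in kb.items():
        for name in names:
            name_to_node[normalize(name)] = node_id

    nil_clusters = {}  # normalized mention -> NIL cluster ID (assumed format)
    links = {}
    for query_id, mention in queries:
        key = normalize(mention)
        if key in name_to_node:
            # A matching KB node exists: link the query to it.
            links[query_id] = name_to_node[key]
        else:
            # No KB node: group the mention with other identical NIL mentions.
            if key not in nil_clusters:
                nil_clusters[key] = f"NIL{len(nil_clusters) + 1:04d}"
            links[query_id] = nil_clusters[key]
    return links


example_queries = [
    ("Q1", "Penn"),
    ("Q2", "Acme  Widgets"),
    ("Q3", "acme widgets"),
    ("Q4", "Philadelphia"),
]
example_links = link_queries(example_queries, TOY_KB)
```

In this sketch, Q1 and Q4 resolve to KB nodes, while Q2 and Q3 have no KB entry and end up in the same NIL cluster. Real systems would replace the exact-match rule with context-aware disambiguation.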

TAC KBP English Entity Linking - Comprehensive Training and Evaluation Data 2009-2013 is distributed via web download.

(4) IARPA Babel Cebuano Language Pack IARPA-babel301b-v2.0b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 191 hours of Cebuano conversational and scripted telephone speech collected in 2013 and 2014 along with corresponding transcripts.

The Cebuano speech in this release represents that spoken in the Cebu-North Kana, Sialo, and Mindanao dialect regions of the Philippines. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 75 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Cebuano Language Pack IARPA-babel301b-v2.0b is available via web download.

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Linguistic Data Consortium

Monday, June 18, 2018

LDC 2018 June Newsletter