Linguistic Data Consortium: Tamil

Wednesday, March 15, 2023

LDC March 2023 Newsletter

LDC’s 30th anniversary year ends

LDC data and commercial technology development

New publications:

Mixer 3 Speech

LORELEI Tamil Representative Language Pack

________________________________________________________________

LDC’s 30th anniversary year ends

We hope you enjoyed the monthly data spotlights in celebration of LDC’s 30th anniversary year, April 2022-March 2023. We would not have achieved this milestone without the continued support and collaboration of our members, friends, and the community. We are grateful. As we enter our fourth decade, we pledge to continue to serve the community and our members by distributing high-quality, diverse data and by providing top-notch member services and research program support.

LDC data and commercial technology development

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit the Licensing page for further information.

New publications:

Mixer 3 Speech contains 3,200 hours of conversational telephone speech involving 3,875 speakers, 19,595 telephone recordings and 26 distinct languages. This material was collected by LDC from 2005 to 2007 as part of the Mixer project, and recordings in this corpus were used in NIST Speaker Recognition Evaluation and NIST Language Recognition Evaluation corpora, including 2006 SRE and 2007 LRE.

Recordings were generated using LDC's computer telephony system. Recruited speakers were connected through a robot operator to carry on casual conversations lasting up to 10 minutes. Subjects fluent in languages other than English were asked to complete at least one non-English call. Metadata includes the number of calls per subject and language as well as speaker demographic information.

2023 members can access this corpus through their LDC accounts. This corpus is a members-only release and is not available for non-member licensing. Contact ldc@ldc.upenn.edu for information about membership.

LORELEI Tamil Representative Language Pack consists of over 41 million words of Tamil monolingual text, 680,000 words of found Tamil-English parallel text, and 226,000 Tamil words translated from English data. Approximately 78,000 words were annotated for named entities and over 24,000 words were annotated for entity discovery and linking and situation frames (identifying entities, needs and issues). Data was collected from discussion forum, news, reference, social network, and weblog sources.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.

The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

Wednesday, July 19, 2017

LDC July 2017 Newsletter

LDC at ACL 2017

Fall 2017 Data Scholarship Program

New corpora:

BOLT English Discussion Forums

IARPA Babel Tamil Language Pack IARPA-babel204b-v1.1b

KSUEmotions

Metalogue Multi-Issue Bargaining Dialogue
_________________________________________________________________________

LDC at ACL 2017: July 31-August 2, Vancouver, Canada

ACL has returned to North America and LDC is taking this opportunity to interact with top HLT researchers gathering in Vancouver, Canada. Stop by our exhibition table to learn more about recent developments at the Consortium and new publications.

Fall 2017 Data Scholarship Program

Student applications for the Fall 2017 LDC Data Scholarship program are being accepted now through Friday, September 15, 2017, 11:59 PM EST. The LDC Data Scholarship program provides university students with access to LDC data at no cost. Students must complete an application that consists of a data use proposal and a letter of support from their advisor.

For more information on application requirements and program rules, please visit the LDC Data Scholarship page.

Applicants can email their materials to the LDC Data Scholarship program.

New corpora

(1) BOLT English Discussion Forums was developed by LDC and consists of 830,440 discussion forum threads in English harvested from the Internet using a combination of manual and automatic processes.

The BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.

The material in this release represents the unannotated English source data in the discussion forum genre. Collection was seeded based on the results of manual data scouting by native speaker annotators. When multiple threads from a forum were submitted, the entire forum was automatically harvested and added to the collection. Only a small portion of the threads included in this release were manually reviewed, and it is expected that there may be some offensive or otherwise undesired content as well as some threads that contain a large amount of non-English content. Language identification was performed on all threads in this corpus (using CLD2).

BOLT English Discussion Forums is distributed via web download.

2017 Subscription Members will automatically receive copies of this corpus. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) IARPA Babel Tamil Language Pack IARPA-babel204b-v1.1b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains 200 hours of Tamil conversational and scripted telephone speech collected in 2012 and 2013 along with corresponding transcripts.

The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

The Tamil speech in this release represents that spoken in the Northern, Central, Southern and Western dialect regions of the Indian state of Tamil Nadu. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 65 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

IARPA Babel Tamil Language Pack IARPA-babel204b-v1.1b is distributed via web download.

2017 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2017 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(3) KSUEmotions was developed by King Saud University (KSU) and contains approximately five hours of emotional Modern Standard Arabic (MSA) speech from 23 subjects. Speakers were from three countries: Yemen, Saudi Arabia and Syria.

Subjects read MSA sentences from newswire text in the following emotions: neutral, anger, sadness, happiness, surprise, and interrogative (asking a question). Human reviewers then listened to the recordings to identify the emotion they heard. Audio was recorded in each participant's home.

KSUEmotions is distributed via web download.

(4) Metalogue Multi-Issue Bargaining Dialogue was developed by the Metalogue Consortium under the European Community's Seventh Framework Programme for Research and Technological Development. This release consists of approximately 2.5 hours of semantically annotated English dialogue data that includes speech and transcripts.

The goal of the Metalogue project was to develop a dialogue system with flexible dialogue management that supports the system in setting goals, choosing strategies and monitoring various processes. Six unique subjects (undergraduates between 19 and 25 years of age) were involved in a multi-issue bargaining scenario in which a representative of a city council and a representative of small business owners negotiated the implementation of new anti-smoking regulations. The negotiation involved four issues, each with four or five options. Participants received a preference profile for each scenario and negotiated for an agreement with the highest value based on their preference information. Negotiators were not allowed to accept an agreement with a negative value or to share their preference profiles with other participants.
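To make the scoring rule concrete, here is a minimal Python sketch of how one negotiator might value an agreement. It is an illustration only: the issue names, option labels, and numbers are invented, and the assumption that an agreement's value is the sum of the per-issue option values is ours, not something the corpus documentation states.

```python
from typing import Dict

# Hypothetical preference profile for ONE negotiator: issue -> option -> value.
# Issue names, option labels, and values are invented for illustration.
# Four issues, each with four or five options, as in the scenario description.
Profile = Dict[str, Dict[str, int]]

PROFILE: Profile = {
    "issue_1": {"A": 4, "B": 2, "C": 0, "D": -3},
    "issue_2": {"A": -2, "B": 1, "C": 3, "D": 4},
    "issue_3": {"A": 3, "B": 1, "C": -1, "D": -4, "E": -5},
    "issue_4": {"A": -4, "B": -1, "C": 2, "D": 3},
}


def agreement_value(profile: Profile, agreement: Dict[str, str]) -> int:
    """Value of a complete agreement (one chosen option per issue).

    Assumption: the total is the sum of the per-issue option values.
    """
    if set(agreement) != set(profile):
        raise ValueError("agreement must pick exactly one option per issue")
    return sum(profile[issue][option] for issue, option in agreement.items())


def is_acceptable(profile: Profile, agreement: Dict[str, str]) -> bool:
    """Negotiators may not accept an agreement with a negative value."""
    return agreement_value(profile, agreement) >= 0
```

With this made-up profile, the agreement `{"issue_1": "A", "issue_2": "D", "issue_3": "A", "issue_4": "D"}` is worth 14 and is acceptable, while `{"issue_1": "D", "issue_2": "A", "issue_3": "E", "issue_4": "A"}` is worth -14 and would have to be rejected. Each participant would hold a different profile, which is why the two sides have to bargain.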

The dialogue speech was captured with two headset microphones and saved in 16 kHz, 16-bit mono linear PCM FLAC format. Transcripts were produced semi-automatically, using an automatic speech recognizer followed by manual correction.
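For a rough sense of scale, the stated audio parameters translate into an uncompressed data rate as sketched below. The helper name is hypothetical, and the result is only an upper bound for a single mono stream: FLAC compresses losslessly, so the distributed files will be smaller, and the newsletter does not say how the 2.5 hours are divided between the two microphones.

```python
# Audio parameters as stated for the corpus: 16 kHz, 16-bit, mono linear PCM.
SAMPLE_RATE_HZ = 16_000
BITS_PER_SAMPLE = 16
CHANNELS = 1


def uncompressed_pcm_bytes(duration_seconds: float) -> int:
    """Size in bytes of raw linear PCM audio of the given duration.

    Hypothetical helper for illustration; ignores container headers
    and FLAC compression.
    """
    bytes_per_sample = BITS_PER_SAMPLE // 8
    return round(duration_seconds * SAMPLE_RATE_HZ * bytes_per_sample * CHANNELS)


# One second of audio is 32,000 bytes; 2.5 hours (9,000 s) is 288,000,000 bytes.
one_second = uncompressed_pcm_bytes(1)
whole_release = uncompressed_pcm_bytes(2.5 * 3600)
```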

Metalogue Multi-Issue Bargaining Dialogue is distributed via web download.