Showing posts with label newswire. Show all posts

Friday, December 15, 2023

LDC December 2023 Newsletter

LDC 2024 membership discounts now available  

Approaching deadline for Spring 2024 data scholarship applications

LDC closed for Winter Break Dec. 25-Jan. 1

New publications:

Kasdi-Merbah (University) Emotional Database in Arabic Speech

TAC-KBP Belief and Sentiment – Comprehensive Training and Evaluation Data 2016-2017
______________________________________________________________

LDC 2024 membership discounts now available 

Now through March 1, 2024, current 2023 members receive a 10% discount for renewing their membership, and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC for details on membership options and benefits. 

Approaching deadline for Spring 2024 data scholarship applications

Attention students: don’t miss out on the chance to receive no-cost access to LDC data for your research. Applications for Spring 2024 data scholarships are due January 15, 2024. For more information on requirements and program rules, see LDC Data Scholarships.

LDC closed for Winter Break Dec. 25-Jan. 1 

LDC will be closed from Monday, December 25, 2023 through Monday, January 1, 2024 in accordance with the University of Pennsylvania Winter Break Policy. Our offices will reopen on Tuesday, January 2, 2024. Requests received by the Membership Office during Winter Break will be processed when the office reopens. 

New publications:
 
Kasdi-Merbah (University) Emotional Database in Arabic Speech was developed by the University of Kasdi Merbah Ouargla and contains two hours of Modern Standard Arabic prompted speech from 500 speakers (254 female, 246 male) representing 5,000 utterances. Each speaker read ten sentences, with two sentences each for five different emotions (sadness, fear, anger, happiness, neutral).

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.

*
 
TAC-KBP Belief and Sentiment – Comprehensive Training and Evaluation Data 2016-2017 includes all training and evaluation data developed by LDC for the Belief and Sentiment tracks: source documents (Chinese, English, and Spanish newswire and discussion forums); gold standard entity, relation, and event annotation; and belief and sentiment annotation.

The goal of the TAC-KBP Belief and Sentiment track was to provide information about beliefs and sentiments held by entities toward other entities, as well as toward events and relations. The gold standard set of labeled entities, relations, and events was used to create a system for automatically labeling belief and sentiment about each possible target (entity, relation or event) and for identifying the entity holding the belief or sentiment. 

2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee. 

Monday, October 15, 2018

LDC 2018 October Newsletter

In this newsletter: 

Fall 2018 LDC Data Scholarship Recipients
Membership Year 2019 Publication Preview

New Publications:
Concretely Annotated English Gigaword
TRAD Arabic-French Parallel Text -- Newswire
TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation Data 2009-2014
__________________________________________________________________________

Fall 2018 LDC Data Scholarship Recipients

Congratulations to the recipients of LDC's Fall 2018 Data Scholarships:

Utkrist Adhikari: University of Bonn (Germany); M.Sc, Computer Science. Utkrist is awarded a copy of Treebank-2 for his research in named entity recognition, super sense tagging, and semantic role labeling. 

Vitaliya Remneva: Higher School of Economics, National Research University (Russia); M.Sc, System and Software Engineering. Vitaliya is awarded a copy of ETS Corpus of Non-Native Written English for her work in author profiling through natural language processing.

Tian Xiaoyu: Shanghai International Studies University (China); MA, Linguistics. Tian is awarded a copy of Tagged Chinese Gigaword Version 2.0 for her research in causative construction variations in Mainland Chinese, Taiwan Chinese, and Singapore Chinese. 

W. Victor H. Yarlott: Florida International University (US); Ph.D., School of Computing and Information Sciences. Victor is awarded a copy of ACE2005 Multilingual Training Corpus for his research in relation extraction. 

For information about the program, visit the Data Scholarship page. 

Membership Year 2019 Publication Preview

The 2019 Membership Year is fast approaching and plans for next year’s publications are in progress. Among the expected releases are:

SRI Speech-Based Collaborative Learning Corpus: speech from over 100 US middle school students performing collaborative learning tasks, includes audio recordings, orthographic transcriptions, manual annotation of collaboration, and related documentation
Chinese Abstract Meaning Representation (AMR): developed by Nanjing Normal University and Brandeis University, semantic representation of approximately 10,000 Chinese sentences following the basic principles of AMR using web source data from Chinese Treebank 8.0 (LDC2013T21)
Multilanguage conversational telephone speech: developed to support language identification research in related languages (Arabic, East Asian, English, Mandarin)
TAC KBP: English entity discovery and linking, nugget detection and event argument data, Chinese slot-filling data
IARPA Babel Language Packs (telephone speech and transcripts): languages include Amharic, Guarani, Igbo, and Lithuanian
HAVIC Med Progress Test data: web video, metadata, and annotations for developing multimedia systems
BOLT: discussion forums, SMS, word-aligned and tagged data in all languages (Chinese, Egyptian Arabic, English)

Check your inbox in the coming weeks for more information about membership renewal.  

New publications:

(1) Concretely Annotated English Gigaword was developed by Johns Hopkins University's Human Language Technology Center of Excellence. It adds multiple kinds and instances of automatically generated syntactic, semantic, and coreference annotations to English Gigaword Fifth Edition (LDC2011T07). Concrete is a schema for representing structured, hierarchical, and overlapping linguistic annotations. This release provides the outputs of multiple tools that produce the same annotation types, represented as different annotation theories under a shared tokenization.

Concretely Annotated English Gigaword contains the nearly ten million documents (over four billion words) of the original English Gigaword Fifth Edition, which consists of newswire stories from seven sources collected by LDC between 1994 and 2010.

Concretely Annotated English Gigaword is distributed via hard drive.

2018 Subscription Members will automatically receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Any organization that licensed English Gigaword Fifth Edition (LDC2011T07) or Annotated English Gigaword (LDC2012T21) may request a copy of Concretely Annotated English Gigaword for a media fee. Non-members may license this data for a fee.


*

(2) TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation Data 2009-2014 was developed by LDC and contains training and evaluation data produced in support of the TAC KBP Slot Filling evaluation track conducted from 2009 to 2014.

Text Analysis Conference (TAC) is a series of workshops organized by the National Institute of Standards and Technology (NIST). TAC was developed to encourage research in natural language processing and related applications by providing a large test collection, common evaluation procedures, and a forum for researchers to share their results.

The regular English Slot Filling evaluation track involved mining information about entities from text. In completing the task, participating systems and LDC annotators searched a corpus for information on certain attributes (slots) of person and organization entities and attempted to return all valid answers (slot fillers) in the source collection. For more information about English Slot Filling, please refer to the 2014 track home page.

This release contains queries, the 'manual runs' (human-produced responses to the queries), and the final rounds of assessment results. 

TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation Data 2009-2014 is distributed via web download.

2018 Subscription Members will automatically receive copies of this corpus. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.


*

(3) TRAD Arabic-French Parallel Text -- Newswire was developed by ELDA as part of the PEA-TRAD project. It contains French translations of a subset of approximately 20,000 Arabic words from NIST 2008 Open Machine Translation (OpenMT) Evaluation (LDC2010T21). The purpose of the PEA-TRAD project (Translation as a Support for Document Analysis) was to develop speech-to-speech translation technology for multiple languages (e.g., Arabic, Chinese, Pashto) from a variety of domains.

This release consists of 813 segments (translation units) from 74 documents. The Arabic source file contains 19,902 words and the French reference translation contains 29,104 words. The source data is Arabic newswire text collected and translated into English by LDC. Information about the ELDA translation team, translation guidelines, and validation results is contained in the documentation accompanying this release.

TRAD Arabic-French Parallel Text -- Newswire is distributed via web download.

2018 Subscription Members will receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2018 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.


Tuesday, November 17, 2015

LDC 2015 November Newsletter

Invitation to Join for Membership Year (MY) 2016

Commercial use and LDC data

Spring 2016 Data Scholarship Program

LDC closed for Thanksgiving Break

New publications:

Invitation to Join for Membership Year (MY) 2016

Membership Year (MY) 2016 is open for joining.  We would like to invite all current and previous members of LDC to renew their membership as well as welcome new organizations to join the Consortium.  For MY2016, LDC is pleased to maintain membership fees at last year’s rates – membership fees will not increase.  Additionally, LDC will extend discounts on membership fees to members who keep their membership current and who join early in the year.

The details of our early renewal discounts for MY2016 are as follows:
  • Organizations who joined for MY2015 will receive a 10% discount when renewing before March 1, 2016. After March 1, 2016, MY2015 members are eligible for a 5% discount when renewing through the end of the year.
  • New members as well as organizations who did not join for MY2015, but who held membership in any of the previous MYs (1993-2014), will also be eligible for a 5% discount provided that they join/renew before March 1, 2016.

Publications for MY2016 are still being planned, but expected releases include the following:
  • Arabic Treebank - Weblog ~ part-of-speech/morphological annotation and syntactic tree annotation of web text from various sources
  • BOLT ~ all phases, languages, genres, tasks
  • DEFT ~ Spanish and Chinese resources
  • Digital Archive of Southern Speech - NLP Version ~ colloquial speech in the Southern United States; NLP version normalizes filenames and formats
  • GALE Phase 3 and 4 data ~ all tasks and languages    
  • HAVIC ~ amateur video and transcripts
  • NewSoMe Corpus of Opinion in Blogs ~ opinion annotated English and Spanish blogs


Commercial use and LDC data

For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit our Licensing page for further information.

Spring 2016 Data Scholarship Program

Applications are now being accepted through Friday, January 15, 2016 for the Spring 2016 LDC Data Scholarship program. The LDC Data Scholarship program provides university students with access to LDC data at no cost. This program is open to students pursuing both undergraduate and graduate studies in an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay.

The application consists of two parts:

(1) Data Use Proposal. Applicants must submit a proposal describing their intended use of the data. The proposal should state which data the student plans to use and how the data will benefit their research project as well as information on the proposed methodology or algorithm.

(2) Letter of Support. Applicants must submit one letter of support from their thesis adviser or department chair. The letter must verify the student's need for data and confirm that the department or university lacks the funding to pay the full nonmember fee for the data or to join the Consortium.

For further information on application materials and program rules, please visit the LDC Data Scholarship page.

Students can email their applications to the LDC Data Scholarship program. Decisions will be sent by email from the same address.

The deadline for the Spring 2016 program cycle is January 15, 2016.

LDC closed for Thanksgiving Break

LDC will be closed on Thursday, November 26, 2015 and Friday, November 27, 2015 in observance of the US Thanksgiving Holiday.  Our offices will reopen on Monday, November 30, 2015.

New publications
(1) Articulation Index LSCP was developed by researchers at Laboratoire de Sciences Cognitives et Psycholinguistique (LSCP), Ecole Normale Supérieure. It revises and enhances a subset of Articulation Index (AIC) (LDC2005S22), a corpus of persons speaking English syllables. Changes include the addition of forced alignment to sound files, time alignment of syllable utterances and format conversions. AIC consists of 20 American English speakers (12 males, 8 females) pronouncing syllables, some of which form actual words, but most of which are nonsense syllables. All possible Consonant-Vowel (CV) and Vowel-Consonant (VC) combinations were recorded for each speaker twice, once in isolation and once within a carrier-sentence, for a total of 25768 recorded syllables.
Articulation Index LSCP alters AIC in the following ways.
  1. Time-alignments for the onset and offset of each word and syllable were generated through forced-alignment with a standard HMM-GMM (Hidden Markov Model-Gaussian Mixture Model) ASR system.
  2. The time-alignments for the beginning and end of the syllables (whether in isolation or within a carrier sentence) were manually adjusted. The time-alignments for the other words in carrier sentences were not manually adjusted.
  3. The recordings of isolated syllables were cut according to the manual time-alignments to remove the silent portions at the beginning and end, and the time-alignments were altered to correspond to the cut recordings.
  4. The file naming scheme was slightly altered for compatibility with the Kaldi speech recognition toolkit.
  5. AIC contains a wide-band (16 kHz, 16-bit PCM) and a narrow-band (8 kHz, 8-bit u-law) version of the recordings distributed in SPHERE format. The LSCP version contains only the wide-band version, distributed as wave files.
Articulation Index LSCP is distributed via web download.
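
As a rough illustration of the wide-band format noted in item 5, the sketch below uses Python's standard `wave` module to check that a wave file is 16 kHz, 16-bit PCM. The function name and any file path passed to it are hypothetical and are not part of the corpus documentation.

```python
import wave


def is_wideband_pcm(path):
    """Return True if the wave file at `path` is 16 kHz, 16-bit PCM.

    The standard-library `wave` module reads PCM wave files, so opening
    the file successfully already implies PCM encoding.
    """
    with wave.open(path, "rb") as wav:
        # getsampwidth() reports bytes per sample: 2 bytes = 16 bits.
        return wav.getframerate() == 16000 and wav.getsampwidth() == 2
```

A check like this could be run over a directory of recordings before feeding them to a toolkit that expects a fixed sampling rate.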

2015 Subscription Members will automatically receive two copies of this corpus. 2015 Standard Members may request a copy as part of their 16 free membership corpora. This data is being made available at no cost for non-member organizations under a research license.
*
(2) GALE Phase 4 Chinese Newswire Parallel Sentences was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Chinese source sentences and corresponding English translations selected from newswire data collected by LDC in 2008 and translated by LDC or under its direction.

GALE Phase 4 Chinese Newswire Parallel Sentences includes 627 source-translation document pairs, comprising 90,434 tokens of Chinese source text and its English translation. Data is drawn from six distinct Chinese newswire sources.

Sentences were selected for translation in two steps. First, files were chosen using sentence selection scripts.  Selected files were reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Chinese to English translation guidelines and were provided with the full source documents containing the target sentences for their reference. Bilingual LDC staff performed quality control procedures on the completed translations.

GALE Phase 4 Chinese Newswire Parallel Sentences is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*
(3) KHATT: Handwritten Arabic Text was developed by King Fahd University of Petroleum & Minerals, Technical University of Dortmund and Braunschweig University of Technology. It comprises scanned Arabic handwriting from 1,000 distinct male and female writers representing diverse countries, age groups, handedness and education levels. Participants produced text on a topic of their choice in an unrestricted style. KHATT was designed to promote research in areas such as text recognition and writer identification.

The majority of participants were natives of Saudi Arabia; the next largest group was from a collection of regional countries (Egypt, Jordan, Kuwait, Morocco, Palestine, Tunisia and Yemen). Most writers were between 16 and 25 years of age with high school or university qualifications.

KHATT: Handwritten Arabic Text is distributed on one USB drive.

2015 Subscription Members will automatically receive a copy of this corpus.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.


Friday, October 16, 2015

LDC 2015 October Newsletter

In this newsletter:
Fall 2015 LDC Data Scholarship recipients

New publications:
ACE 2007 Spanish DevTest - Pilot Evaluation
GALE Phase 4 Chinese Broadcast News Parallel Sentences
Karlsruhe Children's Text


Fall 2015 LDC Data Scholarship recipients

Congratulations to the recipients of LDC's Fall 2015 data scholarships:
Anthony Beylerian - Keio University (Japan), MSc, Informatics and Computer Science.  Anthony has been awarded a copy of OntoNotes for his work in word sense disambiguation.
Siti Binte Faizal - Newcastle University (UK), PhD candidate, Speech and Language Sciences.  Siti has been awarded a copy of Levantine Arabic QT Training Speech and Text for her work in psycholinguistics.
Sara El-Kafrawy - Ain Shams University (Egypt), MSc candidate, Computer and Information Sciences.  Sara has been awarded a copy of GALE Arabic English Word Alignment and Arabic Gigaword for her work in machine translation.
Marwa Hadj Salah - University of Sfax (Tunisia), PhD candidate, Computer Science.  Marwa has been awarded a copy of Arabic English Parallel News and Arabic News Translation Text for her work in machine translation.
Tomoaki Goto - University of Tokyo (Japan), PhD candidate, Linguistics.  Tomoaki has been awarded a copy of Arabic Newswire English Translation for his work in syntax.
Richard Metzger - Pennsylvania State University (USA), PhD candidate, Electrical Engineering.  Richard has been awarded a copy of 2008 NIST Speaker Recognition Training Part 2 and Test for his work in speaker recognition.
Jun Ren - Massey University (New Zealand), PhD, Engineering.  Jun has been awarded a copy of TORGO Dysarthric Articulation for his work in speaker recognition.
Gozde Sahin - Istanbul Technical University (Turkey), PhD candidate, Computer Engineering and Informatics.  Gozde has been awarded a copy of 2009 CoNLL Parts 1 and 2 for her work in semantic role labeling.
Alexey Sholokhov - University of Eastern Finland (Finland), PhD candidate, Computer Sciences.  Alexey has been awarded a copy of RATS Speech Activity Detection for his work in speaker verification.
Stefan Watson - University of the West Indies (Jamaica),  PhD candidate, Physics.  Stefan has been awarded a copy of CMU Kids for his work in phonology and speech recognition.
For program information visit the Data Scholarship page. 

New publications         
(1) ACE 2007 Spanish DevTest - Pilot Evaluation was developed by LDC. This publication contains the complete set of Spanish development and test data to support the 2007 Automatic Content Extraction (ACE) technology evaluation, namely, newswire data annotated for entities and temporal expressions.

The objective of the ACE program was to develop automatic content extraction technology to support automatic processing of human language in text form from a variety of sources including newswire, broadcast programming and weblogs. In the 2007 evaluation, participants were tested on system performance for the recognition of entities, values, temporal expressions, relations, and events in Chinese and English and for the recognition of entities and temporal expressions in Arabic and Spanish. LDC's work in the ACE program is described in more detail on the LDC ACE project pages.

LDC has also released ACE 2007 Multilingual Training Corpus (LDC2014T18) which contains the Arabic and Spanish training data used in the 2007 evaluation.

The data consists of newswire material published in May 2005 from the following sources: Agence France-Presse, The Associated Press and Xinhua News Agency.

All files were annotated by two human annotators working independently. Discrepancies between the two annotations were adjudicated by a senior team member resulting in a gold standard file.

There are three annotation directories for each newswire story that contain an identical copy of the source text in SGML format and two associated annotated versions in XML format and tab delimited format. All text is UTF-8 encoded.

ACE 2007 Spanish DevTest - Pilot Evaluation is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*
(2) GALE Phase 4 Chinese Broadcast News Parallel Sentences was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Chinese source sentences and corresponding English translations selected from broadcast news data collected by LDC in 2008 and transcribed and translated by LDC or under its direction.

GALE Phase 4 Chinese Broadcast News Parallel Sentences includes 40 source-translation document pairs, comprising 156,429 tokens of Chinese source text and its English translation. Data is drawn from eight distinct Chinese programs broadcast in 2008 from China Central TV, a national and international broadcaster in Mainland China; and Voice of America, a U.S. government-funded broadcast programmer. The programs in this release feature news programs on current events topics.

The data was transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with the Quick Rich Transcription guidelines developed by LDC. Transcribers indicated sentence boundaries in addition to transcribing the text. Sentences were selected for translation in two steps. First, files were chosen using sentence selection scripts provided by GALE program participants SRI International and IBM. The output was then manually reviewed by LDC staff to eliminate problematic sentences. Selected files were reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Chinese to English translation guidelines and were provided with the full source documents containing the target sentences for their reference. Bilingual LDC staff performed quality control procedures on the completed translations.

Source data and translations are distributed in TDF format. TDF files are tab-delimited files containing one segment of text along with meta information about that segment. Each field in the TDF file is described in TDF_format.txt. All data are encoded in UTF-8.
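
To make the tab-delimited layout concrete, here is a minimal Python sketch that reads such a file into rows of fields. It assumes only what is stated above (one segment per line, tab-separated fields, UTF-8); the function name is hypothetical, and the meaning and order of the fields must be taken from TDF_format.txt rather than from this example.

```python
import csv


def read_tdf_rows(path):
    """Read a tab-delimited, UTF-8 file and return a list of field lists.

    No field names are assumed here; consult TDF_format.txt for what
    each column means.
    """
    rows = []
    with open(path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quotation marks inside segment text untouched.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for fields in reader:
            if fields:  # skip completely empty lines
                rows.append(fields)
    return rows
```

Each returned row is a plain list of strings, so the source text and its translation can be paired by whichever shared fields the format description defines.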

GALE Phase 4 Chinese Broadcast News Parallel Sentences is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(3) Karlsruhe Children's Text was developed by the Cooperative State University Baden-Württemberg, University of Education and Karlsruhe Institute of Technology. It consists of over 14,000 freely written German sentences from more than 1,700 school children in grades one through eight.

The data collection was conducted in 2011-2013 at elementary and secondary schools in and around Karlsruhe, Germany. Students were asked to write as verbose a text as possible. Those in grades one to four were read two stories and were then asked to write their own stories. Students in grades five through eight were instructed to write on a specific theme, such as "Imagine the world in 20 years. What has changed?" The goal of the collection was to use the data to develop a spelling error classification system.

Annotators converted the handwritten text into digital form with all errors committed by the writers; they also created an orthographically correct version of every sentence. Metadata about the text was gathered, including the circumstances under which it was collected, information about the student writer and background about spelling lessons in the particular class. In a second step, the students' spelling errors were annotated into general groupings: grapheme level, syllable level, morphology and syntax. The files were anonymized in a third step.

This release also contains metadata regarding the writers’ language biography, teaching methodology, age, gender and school year. The average age of the participants was 11 years, and the gender distribution was nearly equal. Original handwriting is presented as JPEG format image files and the converted annotated text as UTF-8 plain text. Metadata is contained within each text file.

Karlsruhe Children's Text is distributed via web download.

2015 Subscription Members will automatically receive two copies of this corpus.  2015 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

Monday, September 22, 2014

LDC 2014 September Newsletter


LDC at Interspeech 2014, Singapore

New publications:


LDC at Interspeech 2014, Singapore

LDC is off to Singapore to participate in Interspeech 2014. This year’s conference will be held from September 14-18 at Singapore’s Max Atria at the Expo Center. Please stop by LDC’s exhibition booth to learn more about recent developments at the Consortium and new publications. LDC will continue to post conference updates via our Facebook page. We hope to see you there!   
 
New publications

(1) ACE 2007 Multilingual Training Corpus was developed by LDC and contains the complete set of Arabic and Spanish training data for the 2007 Automatic Content Extraction (ACE) technology evaluation, specifically, Arabic and Spanish newswire data and Arabic weblogs annotated for entities and temporal expressions. The objective of the ACE program was to develop automatic content extraction technology to support automatic processing of human language in text form from a variety of sources including newswire, broadcast programming and weblogs. In the 2007 evaluation, participants were tested on system performance for the recognition of entities, values, temporal expressions, relations, and events in Chinese and English and for the recognition of entities and temporal expressions in Arabic and Spanish. LDC's work in the ACE program is described in more detail on the LDC ACE project pages.

The Arabic data is composed of newswire (60%) published in October 2000-December 2000 and weblogs (40%) published during the period November 2004-February 2005. The Spanish data set consists entirely of newswire material from multiple sources published in January 2005-April 2005. A document pool was established for each language based on genre and epoch requirements. Humans reviewed the pool to select individual documents suitable for ACE annotation, such as documents that were representative of their genre and contained targeted ACE entity types. One annotator completed the entity and temporal expression (TIMEX2) markup in the first pass annotation. This work was reviewed in the second pass by a senior annotator. TIMEX2 values were normalized by an annotator specifically trained for that task.

The table below describes the amount of data included in the current release and its annotation status. Corpus content for each language and data type is represented in the three stages of annotation: first pass annotation (1P), second pass annotation (2P) and TIMEX2 normalization and additional quality control (NORM).

Arabic
              Words                           Files
         1P        2P        NORM        1P     2P     NORM
NW       58,015    58,015    58,015      257    257    257
WL       40,338    40,338    40,338      121    121    121
Total    98,353    98,353    98,353      378    378    378

Spanish
              Words                           Files
         1P        2P        NORM        1P     2P     NORM
NW       100,401   100,401   100,401     352    352    352
Total    100,401   100,401   100,401     352    352    352

For a given document, there is a source .sgm file together with the .ag.xml and .apf.xml annotation files in each of the three directories "1p", "2p" and "timex2norm". In other words, for each newswire story or weblog entry, the three annotation directories each contain an identical copy of the source text (SGML .sgm file) along with distinct versions of the associated annotations (XML .ag.xml and .apf.xml files and plain text .tab files). All files are presented in UTF-8.
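
The layout described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the corpus's documented structure: it assumes the three stage directories sit directly under a single root and that each file is named with the document ID followed by its extension. The function name, root path, and document ID are hypothetical.

```python
from pathlib import Path

# Directory and extension names are taken from the description above.
STAGE_DIRS = ("1p", "2p", "timex2norm")
EXTENSIONS = (".sgm", ".ag.xml", ".apf.xml", ".tab")


def expected_files(root, doc_id):
    """Map each annotation stage to the files expected for one document.

    Assumes (hypothetically) that files are named <doc_id><extension>
    and live directly inside each stage directory under `root`.
    """
    root = Path(root)
    return {
        stage: [root / stage / (doc_id + ext) for ext in EXTENSIONS]
        for stage in STAGE_DIRS
    }
```

Comparing the returned paths against the files actually present on disk is one way to confirm that a document has its source and annotation files at every stage.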

ACE 2007 Multilingual Training Corpus is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*
(2) GALE Arabic-English Word Alignment -- Broadcast Training Part 1 was developed by LDC and contains 267,257 tokens of word aligned Arabic and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.

Some approaches to statistical machine translation include the incorporation of linguistic knowledge in word aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. A set of word tags and alignment link tags are designed in the tagging scheme to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.

This release consists of Arabic source broadcast news and broadcast conversation data collected by LDC from 2007-2009. The distribution by genre, words, tokens and segments appears below:

Language   Genre   Files   Words     Tokens    Segments
Arabic     BC      231     79,485    103,816   4,114
Arabic     BN      92      131,789   163,441   7,227
Totals             323     211,274   267,257   11,341

Note that word count is based on the untokenized Arabic source, and token count is based on the tokenized Arabic source.
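
To make the difference between the two counts concrete, the short Python sketch below computes the average number of tokens per untokenized word from the figures in the distribution table for this release. The variable and function names are illustrative only; the numbers are copied from the table.

```python
# (words, tokens) per genre, copied from the distribution table for this release.
COUNTS = {
    "BC": (79_485, 103_816),
    "BN": (131_789, 163_441),
}


def tokens_per_word(words: int, tokens: int) -> float:
    """Average number of tokens produced per untokenized source word."""
    return tokens / words


total_words = sum(words for words, _ in COUNTS.values())
total_tokens = sum(tokens for _, tokens in COUNTS.values())

for genre, (words, tokens) in COUNTS.items():
    print(f"{genre}: {tokens_per_word(words, tokens):.2f} tokens per word")
print(f"Total: {tokens_per_word(total_words, total_tokens):.2f} tokens per word")
```

For both genres the ratio falls between roughly 1.2 and 1.3, which reflects how tokenization splits some Arabic words into more than one token.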

The Arabic word alignment tasks consisted of the following components:
  • Normalizing tokenized tokens as needed
  • Identifying different types of links
  • Identifying sentence segments not suitable for annotation
  • Tagging unmatched words attached to other words or phrases

GALE Arabic-English Word Alignment -- Broadcast Training Part 1 is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc.  2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*
(3) GALE Phase 2 Chinese Newswire Parallel Text Part 2 was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains 117,895 tokens of Chinese source text and corresponding English translations selected from newswire data collected by LDC in 2007 and translated by LDC or under its direction.

This release includes 177 source-translation document pairs, comprising 117,895 tokens of translated data. Data is drawn from four distinct Chinese newswire sources: China News Service, Guangming Daily, People's Daily and People's Liberation Army Daily.

Data was manually selected for translation according to several criteria, including linguistic features and topic features. The files were formatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Chinese to English translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations.

Source data and translations are distributed in TDF format. TDF files are tab-delimited files containing one segment of text along with meta information about that segment. Each field in the TDF file is described in TDF_format.text. All data are encoded in UTF-8.
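
For readers who want to inspect such files programmatically, here is a minimal Python sketch that reads a UTF-8 tab-delimited file into lists of fields. It deliberately makes no assumption about the names or order of the fields, which are documented in TDF_format.text; the function name is hypothetical, and whether a given file contains header or comment lines should be checked against that documentation.

```python
import csv
from pathlib import Path
from typing import Iterator, List


def read_tab_delimited(path: Path) -> Iterator[List[str]]:
    """Yield each non-empty line of a UTF-8 tab-delimited file as a list of fields."""
    with path.open(encoding="utf-8", newline="") as handle:
        # QUOTE_NONE treats quotation marks in the text as ordinary characters.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for fields in reader:
            if fields:  # skip blank lines
                yield fields
```

Each yielded list holds the raw field strings for one line, so a caller can map positions to field names after consulting TDF_format.text.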

GALE Phase 2 Chinese Newswire Parallel Text Part 2 is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

Thursday, May 15, 2014

LDC May 2014 Newsletter

LDC at LREC 2014

New publications:
GALE Arabic-English Word Alignment Training Part 2 -- Newswire  
Hispanic-English Database  
HyTER Networks of Selected OpenMT08/09 Progress Set Sentences  



LDC at LREC 2014
LDC will attend the 9th Language Resources and Evaluation Conference (LREC 2014), hosted by ELRA, the European Language Resources Association. The conference will be held in Reykjavik, Iceland from May 26-31 and features a broad range of sessions on language resources and human language technologies research. Ten LDC staff members will be presenting current work on topics including the language application grid project, collecting natural SMS and chat conversations in multiple languages, incorporating alternate translations into English translation treebanks, supporting HLT research with degraded audio data, developing an Egyptian Arabic Treebank and more.

Following the conference, LDC’s presented papers and posters will be available on LDC’s Papers Page.


New publications
(1) GALE Arabic-English Word Alignment Training Part 2 -- Newswire was developed by LDC and contains 162,359 tokens of word aligned Arabic and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.

Some approaches to statistical machine translation include the incorporation of linguistic knowledge in word aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. A set of word tags and alignment link tags are designed in the tagging scheme to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.

This release consists of Arabic source newswire collected by LDC in 2004-2006 and 2008. The distribution by genre, words, character tokens and segments appears below:

Language   Genre   Files   Words     CharTokens   Segments
Arabic     NW      1,126   112,318   162,359      5,349

Note that word count is based on the untokenized Arabic source, and character token count (CharTokens) is based on the tokenized Arabic source.

The Arabic word alignment tasks consisted of the following components:
  • Identifying and correcting incorrectly tokenized tokens
  • Identifying different types of links
  • Identifying sentence segments not suitable for annotation, such as those that were blank, incorrectly-segmented or containing other languages
  • Tagging unmatched words attached to other words or phrases
GALE Arabic-English Word Alignment Training Part 2 -- Newswire is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.


*

(2) Hispanic-English Database contains approximately 30 hours of English and Spanish conversational and read speech with transcripts (24 hours) and metadata collected from 22 non-native English speakers between 1996 and 1998. The corpus was developed by Entropic Research Laboratory, Inc., a developer of speech recognition and speech synthesis software toolkits that was acquired by Microsoft in 1999.

Participants were adult native speakers of Spanish as spoken in Central America and South America who resided in the Palo Alto, California area, had lived in the United States for at least one year and demonstrated a basic ability to understand, read and speak English. They read a total of 2,200 sentences: each of the 22 speakers read 50 sentences in Spanish and 50 in English. The Spanish sentence prompts were a subset of the materials in LATINO-40 Spanish Read News, and the English sentence prompts were taken from the TIMIT database. Conversations were task-oriented, drawing on exercises similar to those used in English as a second language instruction and designed to engage the speakers in collaborative, problem-solving activities.

Read speech was recorded on two wideband channels with a Shure SM10A head-mounted microphone in a quiet laboratory environment. The conversational speech was simultaneously recorded on four channels, two of which were used to place phone calls to each subject in two separate offices and to record the incoming speech of the two channels into separate files. The audio was originally saved in the Entropic Audio (ESPS) format using a 16 kHz sampling rate and 16-bit samples. Audio files were converted from the ESPS format to FLAC-compressed .wav files. The ESPS headers were removed from the audio and are presented in this release as *.hdr files that include demographic and technical data.

Transcripts were developed with the Entropic Annotator tool and are time-aligned with speaker turns. The transcription conventions were based on those used in the LDC Switchboard and CALLHOME collections. Transcript files are denoted with a .lab extension.

Hispanic-English Database is distributed on 1 DVD-ROM.

2014 Subscription Members will automatically receive two copies of this data. 2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(3) HyTER Networks of Selected OpenMT08/09 Progress Set Sentences was developed by SDL and contains HyTER (Hybrid Translation Edit Rate) networks for 102 selected source Arabic and Chinese sentences from OpenMT08 and OpenMT09 Progress Set data. HyTER is an evaluation metric based on large reference networks created by an annotation tool that allows users to develop an exponential number of correct translations for a given sentence. Reference networks can be used as a foundation for developing improved machine translation evaluation metrics and for automating the evaluation of human translation efficiency.
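
The way a compact network can stand for a very large set of translations can be illustrated with a toy Python sketch. This is not the HyTER network format or the annotation tool described above; it is a simplified, invented example in which each position in a sentence offers a few interchangeable phrasings, so the number of full sentences is the product of the number of choices at each position.

```python
from itertools import product
from math import prod
from typing import List

# Toy "network": each slot lists interchangeable phrasings.
# The phrases are invented for illustration and are not taken from the corpus.
TOY_NETWORK: List[List[str]] = [
    ["the talks", "the negotiations"],
    ["resumed", "restarted", "began again"],
    ["on Monday", "Monday"],
]


def count_translations(network: List[List[str]]) -> int:
    """Number of distinct sentences the network encodes (product of slot sizes)."""
    return prod(len(slot) for slot in network)


def expand(network: List[List[str]]) -> List[str]:
    """Enumerate every sentence by picking one phrasing per slot."""
    return [" ".join(choice) for choice in product(*network)]
```

Here seven stored phrases encode 2 x 3 x 2 = 12 sentences, and every additional slot multiplies the total, which is why the number of encoded translations grows exponentially with sentence length.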

The source material consists of Arabic and Chinese newswire and web data collected by LDC in 2007. Annotators created meaning-equivalent annotations under three annotation protocols. In the first protocol, foreign language native speakers built English networks starting from foreign language sentences. In the second, English native speakers built English networks from the best translation of a foreign language sentence as identified by NIST (National Institute of Standards and Technology). In the third protocol, English native speakers built English networks starting from the best translation, but those annotators also had access to three additional, independently produced human translations. Networks created by different annotators for each sentence were combined and evaluated.

This release includes the source sentences and four human reference translations produced by LDC in XML format, along with five machine translation system outputs representing a variety of system architectures and performance, and the human post-edited output of those systems also presented in XML.

HyTER Networks of Selected OpenMT08/09 Progress Set Sentences is distributed via web download.

2014 Subscription Members will automatically receive two copies of this data on disc. 2014 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.