Monday, April 15, 2013

LDC April 2013 Newsletter



  Checking in with LDC Data Scholarship Recipients

The LDC Data Scholarship program provides college and university students with access to LDC data at no cost. Students are asked to complete an application which consists of a proposal describing their intended use of the data, as well as a letter of support from their thesis adviser. LDC introduced the Data Scholarship program during the Fall 2010 semester. Since that time, more than thirty individual students and student research groups have been awarded no-cost copies of LDC data for their research endeavors. Here is an update on the work of a few of the student recipients:
  • Leili Javadpour - Louisiana State University (USA), Engineering Science. Leili was awarded a copy of BBN Pronoun Coreference and Entity Type Corpus (LDC2005T33) and Message Understanding Conference (MUC) 7 (LDC2001T02) for her work in pronominal anaphora resolution. Leili's research involves a learning approach for pronominal anaphora resolution in unstructured text. She evaluated her approach on the BBN Pronoun Coreference and Entity Type Corpus and obtained an encouraging accuracy of 89%. In this approach, machine learning is applied to a set of new features selected from other computational linguistic research. Leili's future plans involve evaluating the approach on Message Understanding Conference (MUC) 7 as well as on other genres of annotated text such as stories and conversation transcripts.
  • Olga Nickolaevna Ladoshko - National Technical University of Ukraine “KPI” (Ukraine), graduate student, Acoustics and Acoustoelectronics. Olga was awarded copies of NTIMIT (LDC93S2) and STC-TIMIT 1.0 (LDC2008S03) for her research in automatic speech recognition for Ukrainian. Olga used NTIMIT in the first phase of her research; one problem she investigated was the influence of telephone communication channels on the reliability of phoneme recognition for different parametrization types and configurations of HTK-based speech recognition systems. The second phase involves using NTIMIT to test an algorithm for detecting voice in non-stationary noise. Her future work with STC-TIMIT 1.0 will include an experiment to develop an improved speech recognition algorithm, allowing for increased accuracy under noisy conditions.
  • Genevieve Sapijaszko - University of Central Florida (USA), PhD candidate, Electrical and Computer Engineering. Genevieve was awarded a copy of TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1) and YOHO Speaker Verification (LDC94S16) for her work in digital signal processing. Her experiment used vector quantization (VQ) and Euclidean distance to recognize a speaker's identity by extracting features of the speech signal with the following methods: RCC, MFCC, MFCC + ΔMFCC, LPC, LPCC, PLPCC and RASTA PLPCC. Based on the results, in a noise-free environment MFCC (at an average of 94%) is the best feature extraction method when used in conjunction with the VQ model. The addition of ΔMFCC showed no significant improvement in the recognition rate. When comparing three phrases of differing length, the two longer phrases had very similar recognition rates, but the shorter phrase, at 0.5 seconds, had a noticeably lower recognition rate across methods. When comparing recognition time, MFCC was also faster than the other methods. Genevieve and her research team concluded that MFCC in a noise-free environment was the best method in terms of both recognition rate and recognition time.
  • John Steinberg -  Temple University (USA), MS candidate, Electrical and Computer Engineering. John was awarded a copy of CALLHOME Mandarin Chinese Lexicon (LDC96L15) and CALLHOME Mandarin Chinese Transcripts (LDC96T16) for his work in speech recognition. John used the CALLHOME Mandarin Lexicon and Transcripts to investigate the integration of Bayesian nonparametric techniques into speech recognition systems. These techniques are able to detect the underlying structure of the data and theoretically generate better acoustic models than typical parametric approaches such as HMM. His work investigated using one such model, Dirichlet process mixtures, in conjunction with three variational Bayesian inference algorithms for acoustic modeling. The scope of his work was limited to a phoneme classification problem since John's goal was to determine the viability of these algorithms for acoustic modeling. 

    One goal of his research group is to develop a speech recognition system that is robust to variations in the acoustic channel. The group is also interested in building acoustic models that generalize well across languages. For these reasons, both CALLHOME English and CALLHOME Mandarin data were used to help determine if these new Bayesian nonparametric models were prone to any language-specific artifacts. These two languages, though phonetically very different, did not yield significantly different performances. Furthermore, one variational inference algorithm, accelerated variational Dirichlet process mixtures (AVDPM), was found to perform well on extremely large data sets.
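As an illustration of the VQ-and-Euclidean-distance speaker identification described in Genevieve's entry above, here is a minimal sketch using NumPy. The synthetic features, codebook size, and function names are assumptions for the example; a real experiment would use MFCC (or similar) features extracted from the corpora.

```python
import numpy as np

def train_codebook(features, k=8, iters=20, seed=0):
    """Build a VQ codebook (k centroids) from one speaker's feature
    vectors via basic k-means with Euclidean distance."""
    rng = np.random.default_rng(seed)
    codebook = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        # assign each frame to its nearest centroid
        d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                codebook[j] = features[labels == j].mean(axis=0)
    return codebook

def distortion(features, codebook):
    """Average distance from each frame to its nearest codeword."""
    d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()

def identify(test_features, codebooks):
    """Pick the enrolled speaker whose codebook yields the lowest distortion."""
    return min(codebooks, key=lambda spk: distortion(test_features, codebooks[spk]))
```

Training one codebook per enrolled speaker and choosing the lowest-distortion codebook at test time is the classic VQ recognition setup the entry describes.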

New publications

(1) GALE Phase 2 Chinese Broadcast Conversation Speech (LDC2013S04) was developed by LDC and is comprised of approximately 120 hours of Chinese broadcast conversation speech collected in 2006 and 2007 by LDC and Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. Corresponding transcripts are released as GALE Phase 2 Chinese Broadcast Conversation Transcripts (LDC2013T08).

Broadcast audio for the GALE program was collected at the Philadelphia, PA USA facilities of LDC and at three remote collection sites: HKUST (Chinese); MediaNet, Tunis, Tunisia (Arabic); and MTC, Rabat, Morocco (Arabic). The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources, for a total of over 30,000 hours of collected broadcast audio over the life of the program.

The broadcast conversation recordings in this release feature interviews, call-in programs and roundtable discussions focusing principally on current events from the following sources: Anhui TV, a regional television station in Mainland China, Anhui Province; China Central TV (CCTV), a national and international broadcaster in Mainland China; Hubei TV, a regional broadcaster in Mainland China, Hubei Province; and Phoenix TV, a Hong Kong-based satellite television station. A table showing the number of programs and hours recorded from each source is contained in the readme file. 

This release contains 202 audio files presented in Waveform Audio File format (.wav), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Chinese speaker following Audit Procedure Specification Version 2.0 which is included in this release. The broadcast auditing process served three principal goals: as a check on the operation of the broadcast collection system equipment by identifying failed, incomplete or faulty recordings; as an indicator of broadcast schedule changes by identifying instances when the incorrect program was recorded; and as a guide for data selection by retaining information about the genre, data type and topic of a program. 
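As a minimal sketch (not part of the release), the stated audio format can be verified with Python's standard wave module; the function name and file path are hypothetical:

```python
import wave

def check_wav(path, rate=16000, channels=1, sampwidth=2):
    """Verify a recording matches the release format:
    16000 Hz, single-channel, 16-bit (2-byte) PCM."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == rate
                and w.getnchannels() == channels
                and w.getsampwidth() == sampwidth)
```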

GALE Phase 2 Chinese Broadcast Conversation Speech is distributed on 4 DVD-ROMs. 2013 Subscription Members will automatically receive two copies of this data. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*

(2) GALE Phase 2 Chinese Broadcast Conversation Transcripts (LDC2013T08) was developed by LDC and contains transcriptions of approximately 120 hours of Chinese broadcast conversation speech collected in 2006 and 2007 by LDC and Hong Kong University of Science and Technology (HKUST), Hong Kong, during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. Corresponding audio data is released as GALE Phase 2 Chinese Broadcast Conversation Speech (LDC2013S04).

The source broadcast conversation recordings feature interviews, call-in programs and round table discussions focusing principally on current events from the following sources: Anhui TV, a regional television station in Mainland China, Anhui Province; China Central TV (CCTV), a national and international broadcaster in Mainland China; Hubei TV, a regional broadcaster in Mainland China, Hubei Province; and Phoenix TV, a Hong Kong-based satellite television station.
The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 1,523,373 tokens. The transcripts were created with the LDC-developed transcription tool, XTrans, a multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings. 

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC’s quick transcription guidelines (QTR) and quick rich transcription specification (QRTR) both of which are included in the documentation with this release. QTR transcription consists of quick (near-)verbatim, time-aligned transcripts plus speaker identification with minimal additional mark-up. It does not include sentence unit annotation. QRTR annotation adds structural information such as topic boundaries and manual sentence unit annotation to the core components of a quick transcript. Files with QTR as part of the filename were developed using QTR transcription. Files with QRTR in the filename indicate QRTR transcription.
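The filename convention above lends itself to a trivial programmatic check. The following is an illustrative sketch (the example filenames are hypothetical); note that "QTR" is a substring of "QRTR", so the richer label must be tested first:

```python
def transcript_style(filename):
    """Infer the transcription style from a GALE transcript filename:
    'QRTR' in the name indicates quick rich transcription, 'QTR'
    indicates quick transcription. QRTR is checked first because
    'QTR' is a substring of 'QRTR'."""
    name = filename.upper()
    if "QRTR" in name:
        return "QRTR"
    if "QTR" in name:
        return "QTR"
    return "unknown"
```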

GALE Phase 2 Chinese Broadcast Conversation Transcripts is distributed via web download. 2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(3) NIST 2008-2012 Open Machine Translation (OpenMT) Progress Test Sets (LDC2013T07) was developed by NIST Multimodal Information Group. This release contains the evaluation sets (source data and human reference translations), DTD, scoring software, and evaluation plans for the Arabic-to-English and Chinese-to-English progress test sets for the NIST OpenMT 2008, 2009, and 2012 evaluations. The test data remained unseen between evaluations and was reused unchanged each time. The package was compiled, and scoring software was developed, at NIST, making use of Chinese and Arabic newswire and web data and reference translations collected and developed by LDC. 

The objective of the OpenMT evaluation series is to support research in, and help advance the state of the art of, machine translation (MT) technologies -- technologies that translate text between human languages. Input may include all forms of text. The goal is for the output to be an adequate and fluent translation of the original. 

The MT evaluation series started in 2001 as part of the DARPA TIDES (Translingual Information Detection, Extraction and Summarization) program. Beginning with the 2006 evaluation, the evaluations have been driven and coordinated by NIST as NIST OpenMT. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities in MT. The OpenMT evaluations are intended to be of interest to all researchers working on the general problem of automatic translation between human languages. To this end, they are designed to be simple, to focus on core technology issues and to be fully supported. For more general information about the NIST OpenMT evaluations, please refer to the NIST OpenMT website.

This evaluation kit includes a single Perl script (mteval-v13a.pl) that may be used to produce a translation quality score for one (or more) MT systems. The script works by comparing the system output translation with a set of (expert) reference translations of the same source text. Comparison is based on finding sequences of words in the reference translations that match word sequences in the system output translation.
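The word-sequence matching the script performs is the basis of BLEU-style scoring. As a rough illustration (this is not the actual mteval-v13a.pl implementation), clipped n-gram precision over a set of references can be computed like this:

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram counts only up
    to the maximum number of times it occurs in any single reference."""
    cand = ngrams(candidate, n)
    max_ref = Counter()
    for ref in references:
        for g, c in ngrams(ref, n).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0
```

The clipping step is what prevents a degenerate output that repeats a common reference word from scoring well.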

This release contains 2,748 documents with corresponding source and reference files, the latter of which contains four independent human reference translations of the source data. The source data is comprised of Arabic and Chinese newswire and web data collected by LDC in 2007. The table below displays statistics by source, genre, documents, segments and source tokens.

Source    Genre      Documents   Segments   Source Tokens
Arabic    Newswire        84         784        20039
Arabic    Web Data        51         594        14793
Chinese   Newswire        82         688        26923
Chinese   Web Data        40         682        19112

NIST 2008-2012 Open Machine Translation (OpenMT) Progress Test Sets is distributed via web download. 2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Friday, March 15, 2013

LDC March 2013 Newsletter

LDC’s 20th Anniversary: Concluding a Year of Celebration

New publications:
1993-2007 United Nations Parallel Text
GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web



LDC’s 20th Anniversary: Concluding a Year of Celebration

We’ve enjoyed celebrating our 20th Anniversary this last year (April 2012 - March 2013) and would like to review some highlights before its close.

Our 2012 User Survey, circulated early in 2012, included a special Anniversary section in which respondents were asked to reflect on their opinions of, and dealings with, LDC over the years. We were humbled by the response. Multiple users mentioned that they would not be able to conduct their research without LDC and its data. For a full list of survey testimonials, please click here.

LDC also developed its first-ever timeline (initially published in the April 2012 Newsletter) marking significant milestones in the consortium’s founding and growth.

In September, we hosted a 20th Anniversary Workshop that brought together many friends and collaborators to discuss the present and future of language resources.

Throughout the year, we conducted several interviews of long-time LDC staff members to document their unique recollections of LDC history and to solicit their opinions on the future of the Consortium. These interviews are available as podcasts on the LDC Blog.

As our Anniversary year draws to a close, one task remains – to thank all of LDC’s past, present and future members and other friends of the Consortium for their loyalty and for their contributions to the community. LDC would not exist if not for its supporters.  The variety of relationships that LDC has built over the years is a direct reflection of the vitality, strength and diversity of the community. We thank you all and hope that we continue to serve your needs in our third decade and beyond.


For a last treat, please visit LDC’s newly-launched YouTube channel to enjoy this video montage of the LDC staff interviews featured in the podcast series.

Thank you again for your continued support!

New publications

(1) 1993-2007 United Nations Parallel Text was developed by Google Research. It consists of United Nations (UN) parliamentary documents from 1993 through 2007 in the official languages of the UN: Arabic, Chinese, English, French, Russian, and Spanish. 

UN parliamentary documents are available from the UN Official Document System (UN ODS). UN ODS, in its main UNDOC database, contains the full text of all types of UN parliamentary documents. It has complete coverage dating from 1993 and variable coverage before that. Documents exist in one or more of the official languages of the UN: Arabic, Chinese, English, French, Russian, and Spanish. UN ODS also contains a large number of German documents, marked with the language “other”, but these are not included in this dataset.

LDC has released parallel UN parliamentary documents in English, French and Spanish spanning the period 1988-1993, UN Parallel Text (Complete) (LDC94T4A).

The data is presented as raw text and word-aligned text. There are 673,670 raw text documents and 520,283 word-aligned documents. The raw text is very close to what was extracted from the original word processing documents in UN ODS (e.g., Word, WordPerfect, PDF), converted to UTF-8 encoding. The word-aligned text was normalized, tokenized, aligned at the sentence level, further broken into sub-sentential chunk-pairs, and then aligned at the word level. The sentence, chunk, and word alignment operations were performed separately for each individual language pair.
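As an illustration of the sentence-level alignment step, here is a much-simplified, length-based aligner in the spirit of the classic Gale-Church approach. The cost model, skip penalty, and function names are assumptions for the sketch, not the pipeline Google Research actually used:

```python
def align_sentences(src, tgt, skip_cost=10.0):
    """Minimal length-based sentence aligner: dynamic programming over
    1-1, 1-0 and 0-1 moves, scoring a 1-1 pair by the absolute
    difference in sentence length (in characters)."""
    n, m = len(src), len(tgt)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:  # 1-1: pair src[i] with tgt[j]
                c = cost[i][j] + abs(len(src[i]) - len(tgt[j]))
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1], back[i + 1][j + 1] = c, (i, j, "1-1")
            if i < n:            # 1-0: leave src[i] unpaired
                c = cost[i][j] + skip_cost
                if c < cost[i + 1][j]:
                    cost[i + 1][j], back[i + 1][j] = c, (i, j, "1-0")
            if j < m:            # 0-1: leave tgt[j] unpaired
                c = cost[i][j] + skip_cost
                if c < cost[i][j + 1]:
                    cost[i][j + 1], back[i][j + 1] = c, (i, j, "0-1")
    # backtrack from (n, m) to recover the 1-1 pairs
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj, move = back[i][j]
        if move == "1-1":
            pairs.append((pi, pj))
        i, j = pi, pj
    return pairs[::-1]
```

A production aligner would model merges and splits (2-1, 1-2 moves) and use a probabilistic length ratio, but the dynamic-programming skeleton is the same.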

1993-2007 United Nations Parallel Text is distributed on 3 DVD-ROMs. 2013 Subscription Members will automatically receive two copies of this data provided they have completed the UN Parallel Text Corpus User Agreement. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web was developed by LDC and contains 158,387 tokens of word aligned Chinese and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program. 

Some approaches to statistical machine translation include the incorporation of linguistic knowledge in word aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. A set of word tags and alignment link tags are designed in the tagging scheme to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation. 

This release consists of Chinese source web data (newsgroup, weblog) collected by LDC between 2005 and 2010. The distribution by words, character tokens and segments appears below:

Language   Files   Words     CharTokens   Segments
Chinese    1,224   105,591   158,387      4,836

Note that all token counts are based on the Chinese data only. One token is equivalent to one character and one word is equivalent to 1.5 characters.
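As a quick arithmetic check, the table's word and character-token counts are consistent with the stated ratio of 1.5 characters per word:

```python
words, char_tokens = 105_591, 158_387
ratio = char_tokens / words
# 158,387 / 105,591 ≈ 1.50004, matching the stated 1.5 characters per word
assert abs(ratio - 1.5) < 0.001
```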

The Chinese word alignment tasks consisted of the following components: 

  • Identifying, aligning, and tagging 8 different types of links
  • Identifying, attaching, and tagging local-level unmatched words
  • Identifying and tagging sentence/discourse-level unmatched words
  • Identifying and tagging all instances of Chinese (DE) except when they were a part of a semantic link.

GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web is distributed via web download. 2013 Subscription Members will automatically receive two copies of this data on disc.  2013 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

Thursday, March 7, 2013

LDC Timeline: 1992 - 2012

LDC Timeline – Two Decades of Milestones
 
April 15, 2012 marked the “official” 20th anniversary of LDC’s founding. As our Anniversary year draws to a close, LDC would like to share with the blogging community a brief timeline of some significant milestones.

  • 1992: The University of Pennsylvania is chosen as the host site for LDC in response to a call for proposals issued by DARPA; the mission of the new consortium is to operate as a specialized data publisher and archive guaranteeing widespread, long-term availability of language resources. DARPA provides seed money with the stipulation that LDC become self-sustaining within five years. Mark Liberman assumes duties as LDC’s Director with a staff that grows to four, including Jack Godfrey, the Consortium’s first Executive Director.
  • 1993: LDC’s catalog debuts. Early releases include benchmark data sets such as TIMIT, TIPSTER, CSR and Switchboard, shortly followed by the Penn Treebank. 
  • 1994: LDC and NIST (the National Institute of Standards and Technology) enter into a Cooperative R&D Agreement that provides the framework for the continued collaboration between the two organizations.
  • 1995: Collection of conversational telephone speech and broadcast programming and transcription commences. LDC begins its long and continued support for NIST common task evaluations by providing custom data sets for participants. Membership and data license fees prove sufficient to support LDC operations, satisfying the requirement that the Consortium be self-sustaining.
  • 1996: The Lexicon Development Project, under the direction of Dr. Cynthia McLemore, begins releasing pronouncing lexicons in Mandarin, German, Egyptian Colloquial Arabic, Spanish, Japanese and American English. By 1997, all are published.
  • 1997: LDC announces LDC Online, a searchable index of newswire and speech data with associated tools to compute n-gram models, mutual information and other analyses.
  • 1998: LDC adds annotation to its task portfolio. Christopher Cieri joins LDC as Executive Director and develops the annotation operation.
  • 1999: Steven Bird joins LDC; the organization begins to develop tools and best practices for general use. The Annotation Graph Toolkit results from this effort.
  • 2000: LDC expands its support of common task evaluations from providing corpora to coordinating language resources across the program. Early examples include the DARPA TIDES, EARS and GALE programs.
  • 2001: The Arabic treebank project begins.
  • 2002: LDC moves to its current facilities at 3600 Market Street, Philadelphia with a full-time staff of approximately 40 persons.
  • 2004: LDC introduces the Standard and Subscription membership options, allowing members to choose whether to receive all or a subset of the data sets released in a membership year.
  • 2005: LDC makes task specifications and guidelines available through its projects web pages.
  • 2008: LDC introduces programs that provide discounts for continuing members and those who renew early in the year.
  • 2010: LDC inaugurates the Data Scholarship program for students with a demonstrable need for data.
  • 2012: LDC’s full-time staff of 50 and 196 part-time staff support ongoing projects and operations which include collecting, developing and archiving data, data annotation, tool development, sponsored-project support and multiple collaborations with various partners. The general catalog contains over 500 holdings in more than 50 languages. Over 85,000 copies of more than 1300 titles have been distributed to over 3200 organizations in 70 countries. 

2012 User Survey Testimonials

As LDC's 20th Anniversary Year draws to a close, we would like to take this opportunity to share a few more Anniversary year activities with you. 

In early 2012, LDC circulated a user survey to recent members and data licensees. Part of this survey focused on our then forthcoming Anniversary year and asked if respondents would provide anonymous testimonials supporting LDC. We are happy to report that many respondents took part and you may browse a selection of their comments below. Many humored LDC by playing along with the suggestion to describe the Consortium in one word or to compare LDC to a color, fruit or animal. LDC was humbled by the outpouring of support and would like to again thank all of our members and the entire community for continuously supporting the Consortium's existence.

2012 LDC User Survey Testimonials 


·    If LDC did not exist, it would have to be invented. It provides critical resources for the speech technology community.
·    I wish that I were more ambitious and could use all of the datasets the LDC provides!
·    I recently started as a new Assistant Professor in an undergraduate college with little access to research funds.  The LDC staff bent over backward to allow me access to the materials I needed without the budget of a research university.
·    Thanks for the good work.
·    Keep on publishing.
·    Timely and competent follow-up from LDC staff regarding any queries or problems
·    I like LDC because they are very professional, very responsive, charge reasonable fees and have very friendly and helpful personnel
·    Researchers in public institutions need organizations like LDC.
·    Happy birthday LDC. Keep up the good work!
·    Congratulations for your hard work, and for sharing tools with the world
·    LDC is a great speech corpus provider for worldwide languages.
·    LDC is best of breed in providers of high quality curated textual data, including some very large data sets.
·    LDC is a great resource for researchers - keeping up with the times with new databases each year.
·    (Organization name withheld) would like to extend sincere greetings to the LDC and to its great team, and a sincere "THANK YOU" for the wonderful service you have provided. May we celebrate your 100th anniversary!
·    I like LDC because they provide good service at a reasonable price for academic institutions.
·    There's no data like more data, and LDC is where it's at.
·    I like LDC because it relieves us from troublesome negotiations with each provider of language resources.
·    LDC is great. If it were a color, it would be teal (very hip).
·    Blue as the sea because it helps researcher irrigate their research ideas.
·    I would like to consider LDC as watermelon for its skin is green which is the symbol of flourishing life, the pulp is red which is the symbol of hope and success and the black seed is the essence of cohesion. In all, for researchers, LDC is very essential.
·    Fruit: pomegranate, single body, many multiple frutties
·    Description of LDC in 7 words: many corpora of a very high quality
·    Describe LDC in one word: Astronomical

Monday, February 18, 2013

LDC February 2013 Newsletter


New publications:
GALE Phase 2 Arabic Broadcast Conversation Speech Part 1
GALE Phase 2 Arabic Broadcast Conversation Transcripts - Part 1

Spring 2013 LDC Data Scholarship Recipients! 

LDC is pleased to announce the student recipients of the Spring 2013 LDC Data Scholarship program! This program provides university students with access to LDC data at no cost. Students were asked to complete an application which consisted of a proposal describing their intended use of the data, as well as a letter of support from their thesis adviser. We received many solid applications and have chosen three proposals to support. The following students will receive no-cost copies of LDC data:
  • Salima Harrat - Ecole Supérieure d’informatique (ESI) (Algeria). Salima has been awarded a copy of Arabic Treebank: Part 3 for her work in diacritization restoration.
  • Maulik C. Madhavi - Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT), Gandhinagar (India). Maulik has been awarded a copy of Switchboard Cellular Part 1 Transcribed Audio and Transcripts and 1997 HUB4 English Evaluation Speech and Transcripts for his work in spoken term detection.
  • Shereen M. Oraby - Arab Academy for Science, Technology, and Maritime Transport (Egypt). Shereen has been awarded a copy of Arabic Treebank: Part 1 for her work in subjectivity and sentiment analysis.
Please join us in congratulating our student recipients! The next LDC Data Scholarship program is scheduled for the Fall 2013 semester.

Membership Fee Savings and Publications Pipeline 

Time is quickly running out to save on membership fees for Membership Year 2013 (MY2013)! Any organization which joins or renews membership for 2013 through Friday, March 1, 2013, is entitled to a 5% discount on membership fees.  Organizations which held membership for MY2012 can receive a 10% discount on fees provided they renew prior to March 1, 2013.

Many publications for MY2013 are still in development. The planned publications for the upcoming months include:
GALE data ~ continuing releases of all languages (Arabic, Chinese, English), genres (Broadcast News, Broadcast Conversation, Newswire and Web Data) and tasks (Parallel Text, Word Alignment, Parallel Aligned Treebanks, Parallel Sentences, Audio and Transcripts).
Hispanic Accented English Database ~ 30 hours of conversational speech data from non-native speakers of English, with approximately 24 hours (80% of the data) closely transcribed. The speech in this release was collected from 22 non-native, Hispanic speakers of English and consists of spontaneous speech and read utterances. The read speech is divided equally between English and Spanish.
NIST 2012 Open Machine Translation Progress Tests ~ contains the evaluation sets (source data and human reference translations), DTD, scoring software, and evaluation plan for the OpenMT12 test for Arabic, Chinese, Dari, Farsi, and Korean to English on a parallel data set. This set is based on a subset of the Arabic-to-English and Chinese-to-English Progress tests from the NIST Open Machine Translation 2008, 2009, and 2012 evaluations, with new source data created based on the English human reference translations. The original data consists of newswire and web data.
NIST Open Machine Translation 2008 to 2012 Progress Test Sets ~ contains the evaluation sets (source data and human reference translations), DTD, scoring software, and evaluation plans for the Arabic-to-English and Chinese-to-English Progress tests of the NIST Open Machine Translation 2008, 2009, and 2012 Evaluations.  The test sets consist of newswire and web data.
OntoNotes 5.0 ~ multiple genres of English, Chinese, and Arabic text annotated for syntax, predicate argument structure and shallow semantics.
UN Parallel Text ~ contains the text of United Nations parliamentary documents in Arabic, Chinese, English, French, Russian, and Spanish from 1993 through 2007. The data is provided in two formats:  (1) raw text: the raw text is very close to what was extracted from the word processing documents, converted to UTF-8 encoding,  and (2) word-aligned text: the word-aligned text has been normalized, tokenized, aligned at the sentence-level, further broken into sub-sentential "chunk-pairs", and then aligned at the word-level.
2013 Subscription Members are automatically sent all MY2013 data as it is released.  2013 Standard Members are entitled to request 16 corpora for free from MY2013. Non-members may license most data for research use. Visit our Announcements page for information on pricing.

New LDC Podcast, LDC Executive Director, Christopher Cieri

The LDC blog has a new podcast in LDC’s 20th Anniversary series. This edition features LDC’s Executive Director, Christopher Cieri. In this podcast, Chris reflects on the road that took him to LDC, some of his early responsibilities and recent consortium activities.

Click here for Chris’ podcast. Other podcasts will be published via the LDC blog, so stay tuned to that space.

New publications

(1) GALE Phase 2 Arabic Broadcast Conversation Speech Part 1 was developed by LDC and is comprised of approximately 123 hours of Arabic broadcast conversation speech collected in 2006 and 2007 by LDC as part of the DARPA GALE (Global Autonomous Language Exploitation) Program. Broadcast audio for the DARPA GALE program was collected at LDC’s Philadelphia, PA USA facilities and at three remote collection sites. 

The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources for a total of over 30,000 hours of collected broadcast audio over the life of the program.

LDC's local broadcast collection system is highly automated, easily extensible and robust, and capable of collecting, processing and evaluating hundreds of hours of content from several dozen sources per day. The broadcast material is served to the system by a set of free-to-air (FTA) satellite receivers, commercial direct satellite systems (DSS) such as DirecTV, direct broadcast satellite (DBS) receivers, and cable television (CATV) feeds. The mapping between receivers and recorders is dynamic and modular; all signal routing is performed under computer control, using a 256x64 A/V matrix switch. Programs are recorded in a high bandwidth A/V format and are then processed to extract audio, to generate keyframes and compressed audio/video, to produce time-synchronized closed captions (in the case of North American English) and to generate automatic speech recognition (ASR) output. 

The broadcast conversation recordings in this release feature interviews, call-in programs and round table discussions focusing principally on current events from several sources. This release contains 143 audio files presented in .wav format as 16000 Hz, single-channel, 16-bit PCM. Each file was audited by a native Arabic speaker following Audit Procedure Specification Version 2.0, which is included in this release. The broadcast auditing process served three principal goals: as a check on the operation of LDC's broadcast collection system equipment by identifying failed, incomplete or faulty recordings; as an indicator of broadcast schedule changes by identifying instances when the incorrect program was recorded; and as a guide for data selection by retaining information about a program's genre, data type and topic.
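
The stated audio format (16000 Hz, single-channel, 16-bit PCM) can be verified programmatically. Below is a minimal sketch using Python's standard wave module; the helper name check_recording is hypothetical and not part of the corpus tooling:

```python
import wave

def check_recording(path):
    """Return True if a .wav file matches the release's stated format:
    single-channel, 16-bit PCM (2-byte samples), 16000 Hz.
    Hypothetical helper for illustration only."""
    with wave.open(path, "rb") as w:
        return (w.getnchannels() == 1
                and w.getsampwidth() == 2   # 16-bit = 2 bytes per sample
                and w.getframerate() == 16000)
```

A loop over the release's 143 audio files with this check is one quick way to confirm nothing was corrupted in transfer.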

GALE Phase 2 Arabic Broadcast Conversation Speech Part 1 is distributed on 4 DVDs.
2013 Subscription Members will automatically receive two copies of this data. 2013 Standard Members may request a copy as part of their 16 free membership corpora.

*

(2) GALE Phase 2 Arabic Broadcast Conversation Transcripts - Part 1 was developed by LDC and contains transcriptions of approximately 123 hours of Arabic broadcast conversation speech collected in 2006 and 2007 by LDC, MediaNet (Tunis, Tunisia) and MTC (Rabat, Morocco) during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) program. The source broadcast conversation recordings feature interviews, call-in programs and round table discussions focusing principally on current events from several sources.

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 752,747 tokens. The transcripts were created with the LDC-developed transcription tool, XTrans, a multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings. 
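
Since TDF files are plain UTF-8 text with one tab-separated record per segment, they can be read with a few lines of Python. The sketch below is illustrative only: the six-column layout (file, channel, start, end, speaker, text) and the ";;" comment convention are assumptions, so consult the release documentation for the actual XTrans field order.

```python
import csv

def read_tdf(path):
    """Read a tab-delimited (TDF) transcript file into a list of
    segment dicts. Column layout is an assumed simplification of
    the XTrans TDF format, for illustration only."""
    segments = []
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            # skip blank lines and (assumed) ";;"-prefixed comments
            if not row or row[0].startswith(";;"):
                continue
            fname, channel, start, end, speaker, text = row[:6]
            segments.append({"file": fname, "channel": int(channel),
                             "start": float(start), "end": float(end),
                             "speaker": speaker, "text": text})
    return segments
```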

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and quick rich transcription specification (QRTR), both of which are included in the documentation with this release. QTR transcription consists of quick (near-)verbatim, time-aligned transcripts plus speaker identification with minimal additional mark-up; it does not include sentence unit annotation. QRTR annotation adds structural information, such as topic boundaries and manual sentence unit annotation, to the core components of a quick transcript. Files with QTR in the filename were developed using QTR transcription; files with QRTR in the filename indicate QRTR transcription.

GALE Phase 2 Arabic Broadcast Conversation Transcripts - Part 1 is distributed via web download. 2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora.
*

(3) NIST 2012 Open Machine Translation (OpenMT) Evaluation was developed by the NIST Multimodal Information Group. This release contains source data, reference translations and scoring software used in the NIST 2012 OpenMT evaluation, specifically for the Chinese-to-English language pair track. The package was compiled and the scoring software developed at NIST, making use of Chinese newswire and web data and reference translations collected and developed by LDC. The objective of the OpenMT evaluation series is to support research in, and help advance the state of the art of, machine translation (MT) technologies -- technologies that translate text between human languages. Input may include all forms of text. The goal is for the output to be an adequate and fluent translation of the original.

The 2012 task was to evaluate five language pairs: Arabic-to-English, Chinese-to-English, Dari-to-English, Farsi-to-English and Korean-to-English. This release consists of the material used in the Chinese-to-English language pair track. For more general information about the NIST OpenMT evaluations, please refer to the NIST OpenMT website.

This evaluation kit includes a single Perl script (mteval-v13a.pl) that may be used to produce a translation quality score for one (or more) MT systems. The script works by comparing the system output translation with a set of (expert) reference translations of the same source text. Comparison is based on finding sequences of words in the reference translations that match word sequences in the system output translation.
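
The word-sequence matching described above is essentially clipped n-gram precision. The following is a simplified sketch of that core computation; the actual mteval-v13a.pl script combines several n-gram orders with a brevity penalty to produce a BLEU-style score, which this sketch does not attempt:

```python
from collections import Counter

def ngram_precision(hypothesis, references, n=2):
    """Clipped n-gram precision of a system translation against one or
    more reference translations. Each hypothesis n-gram is credited at
    most as many times as it appears in the most generous reference."""
    hyp = hypothesis.split()
    hyp_ngrams = Counter(tuple(hyp[i:i+n]) for i in range(len(hyp) - n + 1))
    # For each n-gram, take the maximum count over all references (clipping).
    max_ref = Counter()
    for ref in references:
        toks = ref.split()
        ref_ngrams = Counter(tuple(toks[i:i+n]) for i in range(len(toks) - n + 1))
        for g, c in ref_ngrams.items():
            max_ref[g] = max(max_ref[g], c)
    matched = sum(min(c, max_ref[g]) for g, c in hyp_ngrams.items())
    total = sum(hyp_ngrams.values())
    return matched / total if total else 0.0
```

The clipping step is what keeps a system from scoring well by repeating a common reference word many times.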

This release contains 222 documents with corresponding source and reference files, the latter of which contains four independent human reference translations of the source data. The source data is comprised of Chinese newswire and web data collected by LDC in 2011. A portion of the web data concerned the topic of food and was treated as a restricted domain. The table below displays statistics by source, genre, documents, segments and source tokens.

Source                      Genre      Documents   Segments   Source Tokens
Chinese General             Newswire          45        400          18,184
Chinese General             Web Data          28        420          15,181
Chinese Restricted Domain   Web Data         149      2,184          48,422

The token counts for the Chinese data are "character" counts, obtained by counting matches of the Unicode-aware regular expression "\w". The Python "re" module was used to obtain those counts.
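
That counting scheme can be reproduced in a couple of lines with Python's re module. Because each match of "\w" is a single word character, every character counts as one token, which is why the Chinese totals above are effectively character counts:

```python
import re

def count_tokens(text):
    """Count word characters using the Unicode-aware \\w class.
    Each matching character (Latin letter, digit, or CJK character)
    counts as one token."""
    return len(re.findall(r"\w", text))
```

Note that this counts individual characters for Latin-script text as well, so it is a character count rather than a whitespace-delimited word count in every script.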

NIST 2012 Open Machine Translation (OpenMT) Evaluation is distributed via web download. 2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora.