Linguistic Data Consortium

Friday, March 15, 2013

LDC March 2013 Newsletter

LDC’s 20th Anniversary: Concluding a Year of Celebration

New publications:

1993-2007 United Nations Parallel Text

GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web

LDC’s 20th Anniversary: Concluding a Year of Celebration

We’ve enjoyed celebrating our 20th Anniversary this last year (April 2012 - March 2013) and would like to review some highlights before its close.

Our 2012 User Survey, circulated early in 2012, included a special Anniversary section in which respondents were asked to reflect on their opinions of, and dealings with, LDC over the years. We were humbled by the response. Multiple users mentioned that they would not be able to conduct their research without LDC and its data. For a full list of survey testimonials, please click here.

LDC also developed its first-ever timeline (initially published in the April 2012 Newsletter) marking significant milestones in the consortium’s founding and growth.

In September, we hosted a 20th Anniversary Workshop that brought together many friends and collaborators to discuss the present and future of language resources.

Throughout the year, we conducted several interviews of long-time LDC staff members to document their unique recollections of LDC history and to solicit their opinions on the future of the Consortium. These interviews are available as podcasts on the LDC Blog.

As our Anniversary year draws to a close, one task remains – to thank all of LDC’s past, present and future members and other friends of the Consortium for their loyalty and for their contributions to the community. LDC would not exist if not for its supporters. The variety of relationships that LDC has built over the years is a direct reflection of the vitality, strength and diversity of the community. We thank you all and hope that we continue to serve your needs in our third decade and beyond.

For a last treat, please visit LDC’s newly-launched YouTube channel to enjoy this video montage of the LDC staff interviews featured in the podcast series.

Thank you again for your continued support!

New publications

(1) 1993-2007 United Nations Parallel Text was developed by Google Research. It consists of United Nations (UN) parliamentary documents from 1993 through 2007 in the official languages of the UN: Arabic, Chinese, English, French, Russian, and Spanish.

UN parliamentary documents are available from the UN Official Document System (UN ODS). UN ODS, in its main UNDOC database, contains the full text of all types of UN parliamentary documents. It has complete coverage dating from 1993 and variable coverage before that. Documents exist in one or more of the official languages of the UN: Arabic, Chinese, English, French, Russian, and Spanish. UN ODS also contains a large number of German documents, marked with the language other, but these are not included in this dataset.

LDC has released parallel UN parliamentary documents in English, French and Spanish spanning the period 1988-1993, UN Parallel Text (Complete) (LDC94T4A).

The data is presented as raw text and word-aligned text. There are 673,670 raw text documents and 520,283 word-aligned documents. The raw text is very close to what was extracted from the original word processing documents in UN ODS (e.g., Word, WordPerfect, PDF), converted to UTF-8 encoding. The word-aligned text was normalized, tokenized, aligned at the sentence level, further broken into sub-sentential chunk-pairs, and then aligned at the word level. The sentence, chunk, and word alignment operations were performed separately for each individual language pair.

1993-2007 United Nations Parallel Text is distributed on 3 DVD-ROMs. 2013 Subscription Members will automatically receive two copies of this data provided they have completed the UN Parallel Text Corpus User Agreement. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

(2) GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web was developed by LDC and contains 158,387 tokens of word aligned Chinese and English parallel text enriched with linguistic tags. This material was used as training data in the DARPA GALE (Global Autonomous Language Exploitation) program.

Some approaches to statistical machine translation include the incorporation of linguistic knowledge in word aligned text as a means to improve automatic word alignment and machine translation quality. This is accomplished with two annotation schemes: alignment and tagging. Alignment identifies minimum translation units and translation relations by using minimum-match and attachment annotation approaches. A set of word tags and alignment link tags are designed in the tagging scheme to describe these translation units and relations. Tagging adds contextual, syntactic and language-specific features to the alignment annotation.

This release consists of Chinese source web data (newsgroup, weblog) collected by LDC between 2005 and 2010. The distribution by words, character tokens and segments appears below:

Language	Files	Words	CharTokens	Segments
Chinese	1,224	105,591	158,387	4,836

Note that all token counts are based on the Chinese data only. One token is equivalent to one character and one word is equivalent to 1.5 characters.
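
The stated ratio can be checked against the table with a little arithmetic. The following is a minimal sketch; the function name is illustrative and is not part of the release:

```python
def estimate_words(char_tokens: int, chars_per_word: float = 1.5) -> int:
    """Estimate a word count from a character-token count.

    Assumes the ratio given in the note above: one word is
    equivalent to 1.5 characters.
    """
    return round(char_tokens / chars_per_word)


# 158,387 character tokens at 1.5 characters per word
print(estimate_words(158_387))  # prints 105591, matching the Words column
```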

The Chinese word alignment tasks consisted of the following components:

Identifying, aligning, and tagging 8 different types of links

Identifying, attaching, and tagging local-level unmatched words

Identifying and tagging sentence/discourse-level unmatched words

Identifying and tagging all instances of Chinese 的(DE) except when they were a part of a semantic link.

GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web is distributed via web download. 2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Thursday, March 7, 2013

LDC Timeline: 1992 - 2012

LDC Timeline – Two Decades of Milestones

April 15, 2012 marked the “official” 20th anniversary of LDC’s founding. As our Anniversary year draws to a close, LDC would like to share with the blogging community a brief timeline of some significant milestones.

1992: The University of Pennsylvania is chosen as the host site for LDC in response to a call for proposals issued by DARPA; the mission of the new consortium is to operate as a specialized data publisher and archive guaranteeing widespread, long-term availability of language resources. DARPA provides seed money with the stipulation that LDC become self-sustaining within five years. Mark Liberman assumes duties as LDC’s Director with a staff that grows to four, including Jack Godfrey, the Consortium’s first Executive Director.
1993: LDC’s catalog debuts. Early releases include benchmark data sets such as TIMIT, TIPSTER, CSR and Switchboard, shortly followed by the Penn Treebank.
1994: LDC and NIST (the National Institute of Standards and Technology) enter into a Cooperative R&D Agreement that provides the framework for the continued collaboration between the two organizations.
1995: Collection of conversational telephone speech and broadcast programming and transcription commences. LDC begins its long and continued support for NIST common task evaluations by providing custom data sets for participants. Membership and data license fees prove sufficient to support LDC operations, satisfying the requirement that the Consortium be self-sustaining.
1996: The Lexicon Development Project, under the direction of Dr. Cynthia McLemore, begins releasing pronouncing lexicons in Mandarin, German, Egyptian Colloquial Arabic, Spanish, Japanese and American English. By 1997, all are published.
1997: LDC announces LDC Online, a searchable index of newswire and speech data with associated tools to compute n-gram models, mutual information and other analyses.
1998: LDC adds annotation to its task portfolio. Christopher Cieri joins LDC as Executive Director and develops the annotation operation.
1999: Steven Bird joins LDC; the organization begins to develop tools and best practices for general use. The Annotation Graph Toolkit results from this effort.
2000: LDC expands its support of common task evaluations from providing corpora to coordinating language resources across the program. Early examples include the DARPA TIDES, EARS and GALE programs.
2001: The Arabic treebank project begins.
2002: LDC moves to its current facilities at 3600 Market Street, Philadelphia with a full-time staff of approximately 40 persons.
2004: LDC introduces the Standard and Subscription membership options, allowing members to choose whether to receive all or a subset of the data sets released in a membership year.
2005: LDC makes task specifications and guidelines available through its projects web pages.
2008: LDC introduces programs that provide discounts for continuing members and those who renew early in the year.
2010: LDC inaugurates the Data Scholarship program for students with a demonstrable need for data.
2012: LDC’s full-time staff of 50 and 196 part-time staff support ongoing projects and operations which include collecting, developing and archiving data, data annotation, tool development, sponsored-project support and multiple collaborations with various partners. The general catalog contains over 500 holdings in more than 50 languages. Over 85,000 copies of more than 1300 titles have been distributed to over 3200 organizations in 70 countries.

2012 User Survey Testimonials

As LDC's 20th Anniversary Year draws to a close, we would like to take this opportunity to share a few more Anniversary year activities with you.

In early 2012, LDC circulated a user survey to recent members and data licensees. Part of this survey focused on our then forthcoming Anniversary year and asked if respondents would provide anonymous testimonials supporting LDC. We are happy to report that many respondents took part and you may browse a selection of their comments below. Many humored LDC by playing along with the suggestion to describe the Consortium in one word or to compare LDC to a color, fruit or animal. LDC was humbled by the outpouring of support and would like to again thank all of our members and the entire community for continuously supporting the Consortium's existence.

2012 LDC User Survey Testimonials

· If LDC did not exist, it would have to be invented. It provides critical resources for the speech technology community.

· I wish that I were more ambitious and could use all of the datasets the LDC provides!

· I recently started as a new Assistant Professor in an undergraduate college with little access to research funds. The LDC staff bent over backward to allow me access to the materials I needed without the budget of a research university.

· Thanks for the good work.

· Keep on publishing.

· Timely and competent follow-up from LDC staff regarding any queries or problems

· I like LDC because they are very professional, very responsive, charge reasonable fees and have very friendly and helpful personnel

· Researchers in public institutions need organizations like LDC.

· Happy birthday LDC. Keep up the good work!

· Congratulations for your hard work, and for sharing tools with the world

· LDC is a great speech corpus provider for worldwide languages.

· LDC is best of breed in providers of high quality curated textual data, including some very large data sets.

· LDC is a great resource for researchers - keeping up with the times with new databases each year.

· (Organization name withheld) would like to extend sincere greetings to the LDC and to its great team, and a sincere "THANK YOU" for the wonderful service you have provided. May we celebrate your 100th anniversary!

· I like LDC because they provide good service at a reasonable price for academic institutions.

· There's no data like more data, and LDC is where it's at.

· I like LDC because it relieves us from troublesome negotiations with each provider of language resources.

· LDC is great. If it were a color, it would be teal (very hip).

· Blue as the sea because it helps researcher irrigate their research ideas.

· I would like to consider LDC as watermelon for its skin is green which is the symbol of flourishing life, the pulp is red which is the symbol of hope and success and the black seed is the essence of cohesion. In all, for researchers, LDC is very essential.

· Fruit: pomegranate, single body, many multiple frutties

· Description of LDC in 7 words: many corpora of a very high quality

· Describe LDC in one word: Astronomical

Monday, February 18, 2013

LDC February 2013 Newsletter

Spring 2013 LDC Data Scholarship Recipients!

Membership Fee Savings and Publications Pipeline

New LDC Podcast, LDC Executive Director, Christopher Cieri

New publications:

GALE Phase 2 Arabic Broadcast Conversation Speech Part 1

GALE Phase 2 Arabic Broadcast Conversation Transcripts - Part 1

NIST 2012 Open Machine Translation (OpenMT) Evaluation

Spring 2013 LDC Data Scholarship Recipients!

LDC is pleased to announce the student recipients of the Spring 2013 LDC Data Scholarship program! This program provides university students with access to LDC data at no cost. Students were asked to complete an application which consisted of a proposal describing their intended use of the data, as well as a letter of support from their thesis adviser. We received many solid applications and have chosen three proposals to support. The following students will receive no-cost copies of LDC data:

Salima Harrat - Ecole Supérieure d’informatique (ESI) (Algeria). Salima has been awarded a copy of Arabic Treebank: Part 3 for her work in diacritization restoration.

Maulik C. Madhavi - Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT), Gandhinagar (India). Maulik has been awarded a copy of Switchboard Cellular Part 1 Transcribed Audio and Transcripts and 1997 HUB4 English Evaluation Speech and Transcripts for his work in spoken term detection.

Shereen M. Oraby - Arab Academy for Science, Technology, and Maritime Transport (Egypt). Shereen has been awarded a copy of Arabic Treebank: Part 1 for her work in subjectivity and sentiment analysis.

Please join us in congratulating our student recipients! The next LDC Data Scholarship program is scheduled for the Fall 2013 semester.

Membership Fee Savings and Publications Pipeline

Time is quickly running out to save on membership fees for Membership Year 2013 (MY2013)! Any organization which joins or renews membership for 2013 through Friday, March 1, 2013, is entitled to a 5% discount on membership fees. Organizations which held membership for MY2012 can receive a 10% discount on fees provided they renew prior to March 1, 2013.

Many publications for MY2013 are still in development. The planned publications for the upcoming months include:

GALE data ~ continuing releases of all languages (Arabic, Chinese, English), genres (Broadcast News, Broadcast Conversation, Newswire and Web Data) and tasks (Parallel Text, Word Alignment, Parallel Aligned Treebanks, Parallel Sentences, Audio and Transcripts).

Hispanic Accented English Database ~ 30 hours of conversational speech data from non-native speakers of English with approximately 24 hours or 80% of the data closely transcribed. The speech in this release was collected from 22 non-native, Hispanic speakers of English and consists of spontaneous speech and read utterances. The read speech is divided equally between English and Spanish.

NIST 2012 Open Machine Translation Progress Tests ~ contains the evaluation sets (source data and human reference translations), DTD, scoring software, and evaluation plan for the OpenMT12 test for Arabic, Chinese, Dari, Farsi, and Korean to English on a parallel data set. This set is based on a subset of the Arabic-to-English and Chinese-to-English Progress tests from the NIST Open Machine Translation 2008, 2009, and 2012 evaluations with new source data created based on the English human reference translation. The original data consists of newswire and web data.

NIST Open Machine Translation 2008 to 2012 Progress Test Sets ~ contains the evaluation sets (source data and human reference translations), DTD, scoring software, and evaluation plans for the Arabic-to-English and Chinese-to-English Progress tests of the NIST Open Machine Translation 2008, 2009, and 2012 Evaluations. The test sets consist of newswire and web data.

OntoNotes 5.0 ~ multiple genres of English, Chinese, and Arabic text annotated for syntax, predicate argument structure and shallow semantics.

UN Parallel Text ~ contains the text of United Nations parliamentary documents in Arabic, Chinese, English, French, Russian, and Spanish from 1993 through 2007. The data is provided in two formats: (1) raw text: the raw text is very close to what was extracted from the word processing documents, converted to UTF-8 encoding, and (2) word-aligned text: the word-aligned text has been normalized, tokenized, aligned at the sentence-level, further broken into sub-sentential "chunk-pairs", and then aligned at the word-level.

2013 Subscription Members are automatically sent all MY2013 data as it is released. 2013 Standard Members are entitled to request 16 corpora for free from MY2013. Non-members may license most data for research use. Visit our Announcements page for information on pricing.

New LDC Podcast, LDC Executive Director, Christopher Cieri

The LDC blog has a new podcast in LDC’s 20th Anniversary series. This edition features LDC’s Executive Director, Christopher Cieri. In this podcast, Chris reflects on the road that took him to LDC, some of his early responsibilities and recent consortium activities.

Click here for Chris’ podcast. Other podcasts will be published via the LDC blog, so stay tuned to that space.

New publications

(1) GALE Phase 2 Arabic Broadcast Conversation Speech Part 1 was developed by LDC and is comprised of approximately 123 hours of Arabic broadcast conversation speech collected in 2006 and 2007 by LDC as part of the DARPA GALE (Global Autonomous Language Exploitation) Program. Broadcast audio for the DARPA GALE program was collected at LDC’s Philadelphia, PA USA facilities and at three remote collection sites.

The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources for a total of over 30,000 hours of collected broadcast audio over the life of the program.

LDC's local broadcast collection system is highly automated, easily extensible and robust and capable of collecting, processing and evaluating hundreds of hours of content from several dozen sources per day. The broadcast material is served to the system by a set of free-to-air (FTA) satellite receivers, commercial direct satellite systems (DSS) such as DirecTV, direct broadcast satellite (DBS) receivers, and cable television (CATV) feeds. The mapping between receivers and recorders is dynamic and modular; all signal routing is performed under computer control, using a 256x64 A/V matrix switch. Programs are recorded in a high bandwidth A/V format and are then processed to extract audio, to generate keyframes and compressed audio/video, to produce time-synchronized closed captions (in the case of North American English) and to generate automatic speech recognition (ASR) output.

The broadcast conversation recordings in this release feature interviews, call-in programs and round table discussions focusing principally on current events from several sources. This release contains 143 audio files presented in .wav, 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Arabic speaker following Audit Procedure Specification Version 2.0 which is included in this release. The broadcast auditing process served three principal goals: as a check on the operation of LDC's broadcast collection system equipment by identifying failed, incomplete or faulty recordings; as an indicator of broadcast schedule changes by identifying instances when the incorrect program was recorded; and as a guide for data selection by retaining information about a program's genre, data type and topic.
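
Users who want to confirm that a file matches the stated audio format could do so with Python's standard `wave` module. This is a minimal sketch, not a tool shipped with the corpus; the function name is hypothetical:

```python
import wave


def matches_release_format(path: str) -> bool:
    """Return True if a WAV file is 16000 Hz, single-channel, 16-bit PCM.

    The three expected values come from the format description above.
    The wave module only opens uncompressed PCM files, so a successful
    open already implies PCM encoding.
    """
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == 16000   # sample rate in Hz
            and wav.getnchannels() == 1   # single channel (mono)
            and wav.getsampwidth() == 2   # 2 bytes per sample = 16 bits
        )
```

Calling `matches_release_format` on each audio file path gives a quick sanity check before further processing.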

GALE Phase 2 Arabic Broadcast Conversation Speech Part 1 is distributed on 4 DVDs. 2013 Subscription Members will automatically receive two copies of this data. 2013 Standard Members may request a copy as part of their 16 free membership corpora.

(2) GALE Phase 2 Arabic Broadcast Conversation Transcripts - Part 1 was developed by LDC and contains transcriptions of approximately 123 hours of Arabic broadcast conversation speech collected in 2006 and 2007 by LDC, MediaNet, Tunis, Tunisia and MTC, Rabat, Morocco during Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) program. The source broadcast conversation recordings feature interviews, call-in programs and round table discussions focusing principally on current events from several sources.

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 752,747 tokens. The transcripts were created with the LDC-developed transcription tool, XTrans, a multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings.

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and quick rich transcription specification (QRTR), both of which are included in the documentation with this release. QTR transcription consists of quick (near-)verbatim, time-aligned transcripts plus speaker identification with minimal additional mark-up. It does not include sentence unit annotation. QRTR annotation adds structural information such as topic boundaries and manual sentence unit annotation to the core components of a quick transcript. Files with QTR as part of the filename were developed using QTR transcription. Files with QRTR in the filename indicate QRTR transcription.

GALE Phase 2 Arabic Broadcast Conversation Transcripts - Part 1 is distributed via web download. 2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora.

(3) NIST 2012 Open Machine Translation (OpenMT) Evaluation was developed by NIST Multimodal Information Group. This release contains source data, reference translations and scoring software used in the NIST 2012 OpenMT evaluation, specifically, for the Chinese-to-English language pair track. The package was compiled and scoring software was developed at NIST, making use of Chinese newswire and web data and reference translations collected and developed by LDC. The objective of the OpenMT evaluation series is to support research in, and help advance the state of the art of, machine translation (MT) technologies -- technologies that translate text between human languages. Input may include all forms of text. The goal is for the output to be an adequate and fluent translation of the original.

The 2012 task was to evaluate five language pairs: Arabic-to-English, Chinese-to-English, Dari-to-English, Farsi-to-English and Korean-to-English. This release consists of the material used in the Chinese-to-English language pair track. For more general information about the NIST OpenMT evaluations, please refer to the NIST OpenMT website.

This evaluation kit includes a single Perl script (mteval-v13a.pl) that may be used to produce a translation quality score for one (or more) MT systems. The script works by comparing the system output translation with a set of (expert) reference translations of the same source text. Comparison is based on finding sequences of words in the reference translations that match word sequences in the system output translation.
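
To illustrate the general idea of matching word sequences between system output and reference translations, here is a small sketch of clipped n-gram matching of the kind used in BLEU-style metrics. It is an illustration only: it does not reproduce the tokenization or scoring formulas of mteval-v13a.pl, and the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-word sequence in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_matches(hypothesis, references, n):
    """Return (matched, total) n-gram counts for one system output.

    Each hypothesis n-gram is credited at most as many times as it
    appears in the single reference that contains it most often.
    Tokenization here is plain whitespace splitting (an assumption).
    """
    hyp_counts = ngrams(hypothesis.split(), n)
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return matched, total


# The repeated "the" is credited only once because the reference has one.
print(clipped_matches("the the cat", ["the cat sat"], 1))  # prints (2, 3)
```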

This release contains 222 documents with corresponding source and reference files, the latter of which contains four independent human reference translations of the source data. The source data is comprised of Chinese newswire and web data collected by LDC in 2011. A portion of the web data concerned the topic of food and was treated as a restricted domain. The table below displays statistics by source, genre, documents, segments and source tokens.

Source	Genre	Documents	Segments	Source Tokens
Chinese General	Newswire	45	400	18184
Chinese General	Web Data	28	420	15181
Chinese Restricted Domain	Web Data	149	2184	48422

The token counts for Chinese data are "character" counts, which were obtained by counting tokens matching the UNICODE-based regular expression "\w". The Python “re” module was used to obtain those counts.
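
A count of this kind might look like the sketch below, assuming the expression meant is the Unicode-aware word-character class `\w`. The function name is illustrative, and the exact script used to produce the published figures is not part of this description.

```python
import re

# In Python 3, str patterns are Unicode-aware by default, so \w matches
# Chinese characters as well as Latin letters, digits and underscores.
WORD_CHAR = re.compile(r"\w")


def count_character_tokens(text: str) -> int:
    """Count characters matching \\w; punctuation and spaces are skipped."""
    return len(WORD_CHAR.findall(text))


# Four Chinese characters plus a full stop, which is not counted.
print(count_character_tokens("我爱北京。"))  # prints 4
```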

NIST 2012 Open Machine Translation (OpenMT) Evaluation is distributed via web download. 2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora.

Tuesday, January 29, 2013

LDC 20th Anniversary Podcast: Christopher Cieri

LDC is moving towards the end of its Anniversary year, but that does not mean that we don’t have a few more treats for you. This month’s podcast features LDC’s Executive Director, Christopher Cieri.

Chris is involved with every aspect of the Consortium, including planning, development, operations, sponsored projects, external relations and financial performance. In this podcast, Chris reflects on the road that took him to LDC, some of his early responsibilities and recent consortium activities.

Click here to listen to Chris' podcast.

Tuesday, January 15, 2013

LDC January 2013 Newsletter

2013 LDC Podcast Available from LDC Blog

Membership Discounts for MY 2013 Still Available

Penn Discourse Treebank Version 2.0 Update - RTE data

New publications:

Chinese-English Biology and Chemistry Abstract Parallel Text

GALE Phase 2 Arabic Web Parallel Text

2013 LDC Podcast Available from LDC Blog

Kicking off the new year is the fourth podcast in our 20th Anniversary series featuring LDC Senior Researcher, Mohamed Maamouri.

Mohamed directs the Arabic Treebank group and spearheads the development of Arabic resources and projects. The latter includes the leading role in LDC’s collaboration with Georgetown University Press to develop updated versions of three dialectal Arabic dictionaries (Iraqi, Moroccan, Syrian). In this podcast, he reflects on his personal and professional experiences and comments on Arabic resource development at LDC.

Click here for Mohamed’s podcast.

Other podcasts will be published via the LDC Blog, so stay tuned to that space.

Membership Discounts for MY 2013 Still Available

If you are considering joining for Membership Year 2013 (MY2013), there is still time to save on membership fees. Any organization which joins or renews membership for 2013 through Friday, March 1, 2013, is entitled to a 5% discount on membership fees. Organizations which held membership for MY2012 can receive a 10% discount on fees provided they renew prior to March 1, 2013. For further information on pricing, please consult our Announcements page or contact LDC.

Penn Discourse Treebank Version 2.0 Update - RTE data

A Recognizing Textual Entailment (RTE) update is now available for Penn Discourse Treebank Version 2.0 LDC2008T05 (PDTB). This data has been used to run the textual entailment experiments described in: Sara Tonelli and Elena Cabrio, "Hunting for Entailing Pairs in the Penn Discourse Treebank", in Proceedings of Coling 2012, Mumbai, India. The files contain Text - Hypothesis pairs in the standard RTE xml format (for more details, see RTE Challenge at TAC 2011), which have been manually annotated as entailing or not entailing. All sentence pairs have been extracted from the Penn Discourse Treebank and are therefore connected by a discourse relation label.

The data are not included in the general release of Penn Discourse Treebank Version 2.0, but are freely available for download from the catalog page.

New Publications

(1) Chinese-English Biology and Chemistry Abstract Parallel Text was developed by The MITRE Corporation. It consists of parallel sentences from a collection of chemistry and biology-related scientific article abstracts published in Mandarin and translated into English by translators with particular expertise in the technical area. Translators were instructed to err on the side of literal translation if required, but to maintain the technical writing style of the source and make the resulting English as natural as possible. The translators were given specific guidelines for translation, and those are included in this distribution.

This release contains 2,239 lines of parallel Mandarin and English, with a total of 156,445 characters of Mandarin and 75,515 words of English, presented in a separate UTF-8 plain text file for each language. The sentences were translated in sequential order and presented in scrambled order, such that parallel sentences at identical line numbers are translations. For example, the 31st line of the English file is a translation of the 31st line of the Mandarin file. The original line sequence is not provided.
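
Because the two files are aligned by line number, sentence pairs can be recovered by reading them in parallel. The following is a minimal sketch; the function name and any file names passed to it are hypothetical and do not reflect the actual file names in the release:

```python
def read_parallel_pairs(mandarin_path, english_path):
    """Pair up lines at identical line numbers from two UTF-8 text files.

    Raises ValueError if the files differ in length, since that would
    mean the line-number alignment cannot be trusted.
    """
    with open(mandarin_path, encoding="utf-8") as zh_file:
        zh_lines = [line.rstrip("\n") for line in zh_file]
    with open(english_path, encoding="utf-8") as en_file:
        en_lines = [line.rstrip("\n") for line in en_file]
    if len(zh_lines) != len(en_lines):
        raise ValueError(
            f"line counts differ: {len(zh_lines)} vs {len(en_lines)}"
        )
    return list(zip(zh_lines, en_lines))
```

With the returned list, the pair at index 30 corresponds to the 31st line of each file, as in the example above.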

Chinese-English Biology and Chemistry Abstract Parallel Text is distributed via web download. 2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora.

(2) GALE Phase 2 Arabic Web Parallel Text was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 2 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Modern Standard Arabic source text and corresponding English translations selected from web data collected in 2007 by LDC and transcribed by LDC or under its direction. GALE Phase 2 Arabic Web Parallel Text includes 60 source-translation document pairs, comprising 42,089 words of Arabic source text and its English translation. Data was drawn from various Arabic weblog and newsgroup sources.

The files in this release were transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with the Quick Rich Transcription guidelines developed by LDC. Transcribers indicated sentence boundaries in addition to transcribing the text. Data was manually selected for translation according to several criteria, including linguistic features, transcription features and topic features. The transcribed and segmented files were then reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Arabic to English translation guidelines.

Bilingual LDC staff performed quality control procedures on the completed translations. Source data and translations are distributed in TDF format. TDF files are tab-delimited files containing one segment of text along with meta information about that segment.

GALE Phase 2 Arabic Web Parallel Text is distributed via web download. 2013 Subscription Members will automatically receive two copies of this data on disc. 2013 Standard Members may request a copy as part of their 16 free membership corpora.

Friday, January 11, 2013

LDC 20th Anniversary Podcast: Mohamed Maamouri

Happy New Year and welcome back to the LDC Blog. For our first post of the year, we present the fourth podcast in our anniversary series featuring LDC Senior Researcher, Mohamed Maamouri.

Mohamed directs the Arabic Treebank group and spearheads the development of Arabic resources and projects. The latter includes the leading role in LDC’s collaboration with Georgetown University Press to develop updated versions of three dialectal Arabic dictionaries (Iraqi, Moroccan, Syrian). Mohamed specializes in Arabic linguistics, reading, language development, corpus linguistics and sociolinguistics. In this podcast, he reflects on his personal and professional experiences and comments on Arabic resource development at LDC.

Click here for Mohamed’s podcast.