Thursday, September 15, 2016

LDC September 2016 Newsletter

New publications:
New Corpora

(1) ARL Arabic Dependency Treebank was developed by the US Army Research Laboratory (ARL) and was derived from four LDC resources: Arabic Treebank (ATB) Part 1 v4.1 (LDC2010T13), Part 2 v3.1 (LDC2011T09), Part 3 v3.2 (LDC2010T08) and Broadcast News v1.0 (LDC2012T07).

LDC's ATB series follows the constituency or phrase structure approach to treebank development in which clauses are divided into noun phrases and verb phrases and in each sentence, one or more nodes may correspond to one element. Dependency grammar, on the other hand, is based on the idea that the verb is the center of the clause structure and that other units in the sentence are connected to the verb as directed links or dependencies. This is a one-to-one correspondence: for every element in the sentence there is one node in the sentence structure that corresponds to that element. ARL Arabic Dependency Treebank was generated using constituency-to-dependency software written at ARL.

The source data in this release consists of Arabic newswire and broadcast programming collected by LDC from various news and broadcast providers.

The files are in an 11-column tab-separated format with one or more blank lines between sentences. All files are UTF-8 encoded.
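
A minimal sketch of reading this layout, assuming only what is stated above: 11 tab-separated columns per token line, one or more blank lines between sentences, and UTF-8 encoding. The function name `read_sentences` is hypothetical, and because the meaning of each column is not given here, rows are kept as plain lists of strings.

```python
from typing import Iterable, Iterator, List

EXPECTED_COLUMNS = 11  # column count stated in the corpus description


def read_sentences(lines: Iterable[str]) -> Iterator[List[List[str]]]:
    """Yield one sentence at a time as a list of 11-field token rows."""
    sentence: List[List[str]] = []
    for line_number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            # One or more blank lines close the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        if len(fields) != EXPECTED_COLUMNS:
            raise ValueError(
                f"line {line_number}: expected {EXPECTED_COLUMNS} columns, "
                f"got {len(fields)}"
            )
        sentence.append(fields)
    # Emit the last sentence if the input does not end with a blank line.
    if sentence:
        yield sentence
```

A file would typically be opened with `open(path, encoding="utf-8")` and the handle passed directly to `read_sentences`.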

ARL Arabic Dependency Treebank is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee. 

*

(2) BOLT Chinese-English Word Alignment and Tagging -- Discussion Forum Training was developed by LDC and consists of 448,094 words of Chinese and English parallel text enhanced with linguistic tags to indicate word relations.

The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference. The material in this release consists of Chinese source discussion forum threads harvested from the Internet by LDC using a combination of manual and automatic processes. The source data is released as BOLT Chinese Discussion Forums (LDC2016T05).

BOLT Chinese-English Word Alignment and Tagging -- Discussion Forum Training is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(3) IARPA Babel Pashto Language Pack IARPA-babel104b-v0.4bY was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 214 hours of Pashto conversational and scripted telephone speech collected in 2011 and 2012 along with corresponding transcripts.

The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

The Pashto speech in this release represents that spoken in four dialect regions of Afghanistan and Pakistan. The gender distribution among speakers is approximately 30% female, 70% male; speakers' ages range from 17 years to 70 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

Transcripts are available in two versions: an extended Arabic script and a modified Buckwalter transliteration scheme, both encoded in UTF-8.

IARPA Babel Pashto Language Pack IARPA-babel104b-v0.4bY is distributed via web download.

2016 Subscription Members will receive two copies of this corpus provided they have submitted a completed copy of the special license agreement. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(4) GALE Phase 4 Arabic Broadcast News Parallel Sentences was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Modern Standard Arabic source sentences and corresponding English translations selected from broadcast news data collected by LDC in 2007 and 2008 and transcribed and translated by LDC or under its direction.

GALE Phase 4 Arabic Broadcast News Parallel Sentences includes 106 source-translation document pairs, comprising 114,251 words (Arabic source) of translated data. Data is drawn from 24 distinct Arabic programs featuring news broadcasts.

GALE Phase 4 Arabic Broadcast News Parallel Sentences is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.



Monday, August 15, 2016

LDC August 2016 Newsletter


Fall 2016 Data Scholarship Program

LDC at Interspeech 2016

New Publications:
_______________________________________________________________________
Fall 2016 LDC Data Scholarship program - September 15 deadline approaching
Student applications for the Fall 2016 LDC Data Scholarship program are being accepted now through Thursday, September 15, 2016, 11:59PM EST.  The LDC Data Scholarship program provides university students with access to LDC data at no cost. Students must complete an application which consists of a data use proposal and letter of support from their advisor.

For more information on application requirements and program rules, please visit the LDC Data Scholarship page.

Applicants can email their materials to the LDC Data Scholarship program.

LDC at Interspeech 2016

LDC will once again be exhibiting at Interspeech, held this year September 9-12 in San Francisco, California. Stop by booth 17 to learn more about recent developments at the Consortium and new publications.

Also, be on the lookout for the following presentations featuring LDC work:

Automatic Analysis of Phonetic Speech Style Dimensions: Neville Ryant and Mark Liberman (both LDC)
Friday 9 September, Oral Session, Bayview A, 11:00am

The Rhythmic Constraint on Prosodic Boundaries in Mandarin Chinese Based on Corpora of Silent Reading and Speech Perception: Wei Lai (UPenn), Jiahong Yuan (LDC), Ya Li (Chinese Academy of Sciences), Xiaoying Xu (Beijing Normal University) and Mark Liberman (LDC)
Friday 9 September, Oral Session, Bayview A, 11:00am

Pitch-range Perception: the Dynamic Interaction Between Voice Quality and Fundamental Frequency: Jianjing Kuang (UPenn) and Mark Liberman (LDC)
Saturday 10 September, Poster Session A, 10:00am

Phoneme, Phone Boundary, and Tone in Automatic Scoring of Mandarin Proficiency: Jiahong Yuan and Mark Liberman (both LDC)
Sunday 11 September, Poster Session A, 10:00am

LDC will post conference updates via our Twitter feed and Facebook page. We hope to see you there!   


New Publications

(1) IARPA Babel Bengali Language Pack IARPA-babel103b-v0.4b was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 215 hours of Bengali conversational and scripted telephone speech collected in 2011 and 2012 along with corresponding transcripts.

The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

The Bengali speech in this release represents that spoken in India by native speakers of Bengali born in India. The gender distribution among speakers is approximately even; speakers' ages range from 16 years to 65 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments.

All audio data is presented as 8kHz 8-bit a-law encoded audio in sphere format. Transcripts are available in two versions: the Bengali script and a romanization scheme developed by Appen Butler Hill, both encoded in UTF-8.
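
As a rough illustration of what "8kHz 8-bit a-law" implies, the sketch below expands A-law codes to 16-bit linear samples using the standard G.711 expansion and derives duration from the byte count (one byte per sample at 8,000 samples per second). It assumes the raw sample bytes have already been separated from the SPHERE header, which is not handled here. The function names are hypothetical.

```python
SAMPLE_RATE_HZ = 8000  # 8 kHz, as stated for this language pack


def alaw_to_linear(code: int) -> int:
    """Expand one 8-bit A-law code to a signed 16-bit linear sample (G.711)."""
    code ^= 0x55                      # undo the alternate-bit inversion
    magnitude = (code & 0x0F) << 4    # 4-bit mantissa
    segment = (code & 0x70) >> 4      # 3-bit segment number
    if segment == 0:
        magnitude += 0x008
    else:
        magnitude += 0x108
        magnitude <<= segment - 1
    # After the XOR, a set top bit marks a positive sample.
    return magnitude if code & 0x80 else -magnitude


def decode_alaw(payload: bytes) -> list:
    """Decode a run of A-law bytes into a list of linear samples."""
    return [alaw_to_linear(byte) for byte in payload]


def duration_seconds(payload: bytes) -> float:
    """One byte per sample at 8 kHz, so duration is byte count / 8000."""
    return len(payload) / SAMPLE_RATE_HZ
```

Under these assumptions, one hour of single-channel audio occupies 8,000 × 3,600 = 28,800,000 bytes before any header or compression.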

2016 Subscription Members will receive two copies of this corpus provided they have submitted a completed copy of the IARPA User Agreement for Not-for-Profit Members or the IARPA User Agreement for For-Profit Members. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee under a research license.

*
(2) IARPA Babel Assamese Language Pack IARPA-babel102b-v0.5a was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 205 hours of Assamese conversational and scripted telephone speech collected in 2012 and 2013 along with corresponding transcripts.

The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

The speech in this release represents three dialects spoken in Assam, a state in northeastern India. The gender distribution among speakers is approximately even; speakers' ages range from 16 years to 66 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments.

All audio data is presented as 8kHz 8-bit a-law encoded audio in sphere format. Transcripts are available in two versions: Assamese script and a romanization scheme developed by Appen Butler Hill, both encoded in UTF-8.

2016 Subscription Members will receive two copies of this corpus provided they have submitted a completed copy of the IARPA User Agreement for Not-for-Profit Members or the IARPA User Agreement for For-Profit Members. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee under a research license.

*
(3) GALE Phase 3 Arabic Broadcast News Speech Part 1 was developed by LDC and is comprised of approximately 132 hours of Arabic broadcast news speech collected in 2007 by LDC, MediaNet, Tunis, Tunisia and MTC, Rabat, Morocco during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) program.

Corresponding transcripts are released as GALE Phase 3 Arabic Broadcast News Transcripts Part 1 (LDC2016T17).
  
The broadcast news recordings in this corpus feature news broadcasts focusing principally on current events from various broadcast programmers including Abu Dhabi TV, Al Alam News Channel, Al Arabiya, Al Iraqiyah, Aljazeera, Al Ordiniyah, Dubai TV, Kuwait TV, Lebanese Broadcast Corporation, Nile TV, Saudi TV and Syria TV.

This release contains 175 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Arabic speaker.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*
(4) GALE Phase 3 Arabic Broadcast News Transcripts Part 1 was developed by LDC and contains transcriptions of approximately 132 hours of Arabic broadcast news speech collected in 2007 by LDC, MediaNet, Tunis, Tunisia and MTC, Rabat, Morocco during Phase 3 of the DARPA GALE (Global Autonomous Language Exploitation) program.

Corresponding audio data is released as GALE Phase 3 Arabic Broadcast News Speech Part 1 (LDC2016S07).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 741,689 tokens. The transcripts were created with the LDC tool, XTrans, which supports manual transcription and annotation of audio recordings. XTrans is available from the following link, https://www.ldc.upenn.edu/language-resources/tools/xtrans.
  
The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC's quick transcription guidelines (QTR) and quick rich transcription specification (QRTR), both of which are included in the documentation with this release.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Tuesday, July 19, 2016

LDC July 2016 Newsletter

Fall 2016 Data Scholarship Program

2015 User Survey Results

New Publications:
______________________________________________________________________

Fall 2016 Data Scholarship Program

Applications are now being accepted through Thursday, September 15, 2016 for the Fall 2016 LDC Data Scholarship program. The LDC Data Scholarship program provides university students with access to LDC data at no cost.

This program is open to students pursuing both undergraduate and graduate studies in an accredited college or university. LDC Data Scholarships are not restricted to any particular field of study; however, students must demonstrate a well-developed research agenda and a bona fide inability to pay. The selection process is highly competitive.

The application consists of two parts:
(1) Data Use Proposal. Applicants must submit a two-page proposal describing their intended use of the data. The proposal should state which data the student plans to use, how the data will benefit their research project, the proposed methodology or algorithm which will be used and how success will be measured.

Applicants should consult the Catalog for a complete list of data distributed by LDC. Due to certain restrictions, a handful of LDC corpora are restricted to members of the Consortium. Applicants are advised to select a maximum of one to two databases.

(2) Letter of Support. Applicants must submit one letter of support from their thesis adviser or department chair. The letter must be signed and printed on letterhead, describe the student and the research, evaluate the probability of success and confirm that the department or university lacks the funding to pay the full non-member fee for the data. 

For further information on application materials and program rules, please visit the LDC Data Scholarship page.


2015 User Survey Results
LDC conducted its fourth user survey in December 2015. This survey built on the previous surveys conducted in 2006, 2007 and 2012 to assess user sentiment and also asked for the evaluation of key LDC-related topics including:
· Opinions on the new website and usability of the Catalog
· Use and satisfaction with the enhanced user services and e-commerce system
· LDC’s Data Management Plan capabilities
· Suggestions for future publications and preferred data delivery methods
· Use of web services for data access and processing

Overall, survey respondents were satisfied with LDC’s data, membership options, website, Catalog and enhanced user services. Participants cited the top five most useful corpora received between 2012 and 2015 as OntoNotes Release 5.0, TIMIT, TAC KBP Reference Knowledge Base, Penn Discourse Treebank V 2.0, and Multi-Channel WSJ Audio. Three fourths of respondents prefer digital delivery of data and the top three languages for current research demands were identified as English, Chinese and Spanish.

We thank everyone who participated in this survey. Responses will benefit the future of the Consortium and will help LDC to better meet the needs of our members and data licensees.


New Corpora

(1) English Speed Networking Conversational Transcripts was developed at the University of the West of England and contains 388 transcripts of English face-to-face and instant messaging conversations about business ideas collected in 2014 and 2015 from participants (undergraduate students) playing different power roles.

This corpus was created to examine communication accommodation, specifically, the ways in which an individual's linguistic style is affected by social power and personality. The data was collected in two studies. In the first study, 40 participants had a series of paired five-minute face-to-face conversations playing either a high, low or neutral power role. The same procedure was followed in the second study except that participants discussed business ideas via instant messaging.

The face-to-face conversations were audio-recorded and transcribed verbatim.

All transcripts are presented as UTF-8 plain text files.

English Speed Networking Conversational Transcripts is distributed via web download.
2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US $400.00.

*

(2) Digital Archive of Southern Speech - NLP Version (DASS-NLP) was developed by LDC as an alternate version of Digital Archive of Southern Speech (DASS) (LDC2012S03) suitable for natural language processing and human language technology applications. Specifically, the original audio files have been converted to 16kHz 16-bit FLAC-compressed WAV and file names have been normalized to facilitate automatic processing.

DASS was developed by the University of Georgia. It is a subset of the Linguistic Atlas of the Gulf States (LAGS), which is in turn part of the Linguistic Atlas Project (LAP). DASS-NLP contains approximately 366 hours of English speech data from 30 female speakers and 34 male speakers, along with associated metadata about the speakers, the recordings and maps in .jpeg format relating to the recording locations.

LAP consists of a set of survey research projects about the words and pronunciation of everyday American English, the largest project of its kind in the United States. Interviews with thousands of native speakers across the country have been carried out since 1929. LAGS surveyed the everyday speech of Georgia, Tennessee, Florida, Alabama, Mississippi, Arkansas, Louisiana, and Texas in a series of 914 audio-taped interviews conducted from 1968-1983.

The speakers' average age is 61 years; there are 30 women and 34 men from the Gulf States region represented in this release. The interviews cover common topics such as family, the weather, household articles and activities, agriculture and social conditions.   

Digital Archive of Southern Speech - NLP Version is distributed via web download.

2016 Not-for-Profit Subscription Members will automatically receive two copies of this corpus. 2016 For-Profit Subscription Members will receive two copies provided they have submitted a completed copy of the For-Profit Member User License Agreement for Digital Archive of Southern Speech – NLP Version (LDC2016S05). 2016 Standard Members may request a copy as part of their 16 free membership corpora. This data is being made available at no cost to non-member organizations under a research license.

 *

(3) GALE Phase 3 and 4 Chinese Broadcast News Parallel Text was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phases 3 and 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Chinese source text and corresponding English translations selected from broadcast news data collected by LDC between 2006 and 2008 and transcribed and translated by LDC or under its direction.

GALE Phase 3 and 4 Chinese Broadcast News Parallel Text includes 76 source-translation document pairs, comprising 614,608 tokens of Chinese source text and its English translation. Data is drawn from 16 distinct Chinese programs broadcast between 2006 and 2008 by China Central TV, a national and international broadcaster in Mainland China, and Phoenix TV, a Hong Kong-based satellite television station. The programs in this release are news programs on current events topics.

The files in this release were transcribed by LDC staff and/or transcription vendors under contract to LDC in accordance with the Quick Rich Transcription guidelines developed by LDC.

Source data and translations are distributed in TDF format. All data are encoded in UTF-8.

GALE Phase 3 and 4 Chinese Broadcast News Parallel Text is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US $1750.00.

 *

(4) IARPA Babel Cantonese Language Pack IARPA-babel101b-v0.4c was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 215 hours of Cantonese conversational and scripted telephone speech collected in 2011 along with corresponding transcripts.

The Babel program focuses on underserved languages and seeks to develop speech recognition technology that can be rapidly applied to any human language to support keyword search performance over large amounts of recorded speech.

The Cantonese speech in this release represents that spoken in the Chinese provinces of Guangdong and Guangxi, and within those provinces, among five dialect groups. The gender distribution among speakers is approximately even; speakers' ages range from 16 years to 67 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle.

All audio data is presented as 8kHz 8-bit a-law encoded audio in sphere format. Transcripts are available in two versions: simplified Chinese characters and a romanization scheme based on the Yale system, both encoded in UTF-8.

IARPA Babel Cantonese Language Pack IARPA-babel101b-v0.4c is distributed via web download.

2016 Subscription Members will receive two copies of this corpus provided they have submitted a completed copy of the IARPA User Agreement for Not-for-Profit Members or the IARPA User Agreement for For-Profit Members. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US $25.00 under a research license.


Thursday, June 16, 2016

LDC June 2016 Newsletter

Commercial use and LDC data

New publications:
_______________________________________________________________
Commercial use and LDC data
For-profit organizations are reminded that an LDC membership is a prerequisite for obtaining a commercial license to almost all LDC databases. Non-member organizations, including non-member for-profit organizations, cannot use LDC data to develop or test products for commercialization, nor can they use LDC data in any commercial product or for any commercial purpose. LDC data users should consult corpus-specific license agreements for limitations on the use of certain corpora. Visit our Licensing page for more information.

New Corpora

(1) Chinese Treebank 9.0 consists of approximately two million words of annotated and parsed text from Chinese newswire, government documents, magazine articles, various broadcast news and broadcast conversation programs, web newsgroups, weblogs, discussion forums, chat messages and transcribed conversational telephone speech. This new data set in the Chinese Treebank series adds more annotated web data and two new genres – chat messages and transcribed telephone speech.

There are 3,726 text files in this release, containing 132,076 sentences, 2,084,387 words, 3,247,331 characters (hanzi or foreign). The data is provided in the UTF-8 encoding, and the annotation has Penn Treebank-style labeled brackets. The data is provided in four different formats: raw text, word segmented, POS-tagged, and syntactically bracketed formats. All files were automatically verified and manually checked.
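
To illustrate what "Penn Treebank-style labeled brackets" look like to a program, here is a minimal sketch of a parser for a single bracketed tree. It assumes every opening bracket is immediately followed by a label and that leaf tokens contain no parentheses; any file-level wrappers or unlabeled outer brackets in the actual release would need separate handling. The names `parse_bracketed` and `leaves`, and the sample tree in the usage note, are illustrative rather than taken from the corpus.

```python
from typing import List, Tuple, Union

# A tree is (label, children); each child is a subtree or a leaf string.
Tree = Tuple[str, List[Union["Tree", str]]]


def parse_bracketed(text: str) -> Tree:
    """Parse one labeled-bracket tree such as (NP (NN word))."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    position = 0

    def parse_node() -> Tree:
        nonlocal position
        if tokens[position] != "(":
            raise ValueError("expected '('")
        position += 1
        label = tokens[position]
        position += 1
        children: List[Union[Tree, str]] = []
        while tokens[position] != ")":
            if tokens[position] == "(":
                children.append(parse_node())
            else:
                children.append(tokens[position])
                position += 1
        position += 1  # consume the closing ')'
        return (label, children)

    tree = parse_node()
    if position != len(tokens):
        raise ValueError("trailing material after tree")
    return tree


def leaves(tree: Tree) -> List[str]:
    """Return the terminal strings of a tree, left to right."""
    _, children = tree
    words: List[str] = []
    for child in children:
        words.extend([child] if isinstance(child, str) else leaves(child))
    return words
```

For example, `leaves(parse_bracketed("(IP (NP (NN 中国)) (VP (VV 发展)))"))` recovers the word-segmented tokens of that made-up tree.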

Chinese Treebank 9.0 is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) CHM150 (Corpus Hecho en México 150) was developed by the Speech Processing Laboratory of the Faculty of Engineering at the National Autonomous University of Mexico (UNAM) and consists of approximately 1.63 hours of Mexican Spanish speech, associated transcripts, and speaker metadata. The goal of this work was to support spoken term detection and forensic speaker identification.

This corpus is comprised of Mexican Spanish microphone speech from 75 male speakers and 75 female speakers in a quiet office environment. Speakers could answer pre-selected open questions or describe a particular painting shown to them on a computer monitor. Speaker metadata in this release includes age, gender, place of birth, place of residence and parents' nationalities.

CHM150 is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. This data is being made available at no cost to non-member organizations under a research license.

*

(3) GALE Phase 4 Arabic Weblog Parallel Sentences was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Modern Standard Arabic source text and corresponding English translations, selected from newsgroup and weblog data collected by LDC and translated by LDC or under its direction.

The data includes 1,067 source-translation document pairs, comprising 68,346 words (Arabic source) of translated data. 

Sentences were selected for translation in two steps. First, files were chosen using sentence selection scripts provided by GALE program participants SRI International and IBM. The output was then manually reviewed by LDC staff to eliminate problematic sentences. Selected files were reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Arabic to English translation guidelines and were provided with the full source documents containing the target sentences for their reference. Bilingual LDC staff performed quality control procedures on the completed translations.

GALE Phase 4 Arabic Weblog Parallel Sentences is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Monday, May 16, 2016

LDC May 2016 Newsletter

LDC at LREC 2016

New publications:
GALE Phase 4 Chinese Broadcast Conversation Speech
GALE Phase 4 Chinese Broadcast Conversation Transcripts 
_______________________________________________________________

LDC at LREC 2016

LDC will attend the 10th Language Resources and Evaluation Conference (LREC2016), hosted by ELRA, the European Language Resources Association. The conference will be held in Portorož, Slovenia from May 23-28 and features a broad range of sessions on language resources and human language technologies research. Seven LDC staff members will be presenting current work on topics including trends in HLT research, building language resources for autism spectrum disorders, data management plans, rapid development of morphological analyzers for typologically diverse languages, selection criteria for low resource language programs, multi-language speech collection for NIST LRE, novel incentives for collecting data and annotation from people, and more.

Following the conference, LDC’s presented papers and posters will be available on LDC’s Papers Page.


New Corpora

(1) SDP 2014 & 2015: Broad Coverage Semantic Dependency Parsing consists of data, tools, system results, and publications associated with the 2014 and 2015 tasks on Broad-Coverage Semantic Dependency Parsing (SDP) conducted in conjunction with the International Workshop on Semantic Evaluation (SemEval) and was developed by the SDP task organizers.

SemEval is an ongoing series of evaluations of computational semantic analysis systems intended to explore the nature of meaning in language. It evolved from the Senseval word sense disambiguation series to include semantic analysis tasks outside of word sense disambiguation.

This release is based on English, Chinese and Czech data from the following resources: Treebank-2 LDC95T17, Proposition Bank I LDC2004T14, NomBank v 1.0 LDC2008T23 and CCGBank LDC2005T13 (English); Chinese Treebank (e.g., Chinese Treebank 8.0 LDC2013T21) (Chinese); and Prague Dependency Treebank (e.g., Prague Dependency Treebank 2.0, LDC2006T01) (Czech).

The results are presented as graphs in three target representations: MRS-Derived Semantic Dependencies (DM), Enju Predicate–Argument Structures (PAS), and Prague Semantic Dependencies (PSD). As a fourth, additional target representation, CCGbank was converted to semantic dependency graphs (in the subdirectory ‘ccd’).


SDP 2014 & 2015: Broad Coverage Semantic Dependency Parsing is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus.  2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(2) GALE Phase 4 Chinese Broadcast Conversation Speech was developed by LDC and is comprised of approximately 172 hours of Mandarin Chinese broadcast conversation speech collected in 2008 by LDC and Hong Kong University of Science and Technology during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

Corresponding transcripts are released as GALE Phase 4 Chinese Broadcast Conversation Transcripts (LDC2016T12).

The broadcast conversation recordings in this release feature interviews, call-in programs and roundtable discussions focusing principally on current events and are contained in 236 audio files presented in FLAC-compressed Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a native Chinese speaker following Audit Procedure Specification Version 2.0 which is included in this release.


GALE Phase 4 Chinese Broadcast Conversation Speech is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(3) GALE Phase 4 Chinese Broadcast Conversation Transcripts was developed by LDC and contains transcriptions of approximately 172 hours of Chinese broadcast conversation speech collected in 2008 by LDC and Hong Kong University of Science and Technology during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program.

Corresponding audio data is released as GALE Phase 4 Chinese Broadcast Conversation Speech (LDC2016S03).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8 encoding, and the transcribed data totals 2,259,952 tokens.

The files in this corpus were transcribed by LDC staff and/or by transcription vendors under contract to LDC. Transcribers followed LDC’s quick transcription guidelines (QTR) and quick rich transcription specification (QRTR). QTR transcription consists of quick (near-) verbatim, time-aligned transcripts plus speaker identification with minimal additional mark-up. QRTR adds additional structural information such as topic boundaries and manual sentence unit annotation.


GALE Phase 4 Chinese Broadcast Conversation Transcripts is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus.  2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.


Monday, April 18, 2016

LDC April 2016 Newsletter

New publications:

_________________________________________________________________________

New Corpora

(1) H1 Children's Writing was developed by the Cooperative State University Baden-Württemberg and the University of Education. It consists of 996 texts written over three months by 88 German school children aged seven through eleven.

Texts were written within regular class settings. The students were presented with a picture and were asked to write a story, to describe the picture or if unable to write a text, to list what they saw in the picture. The pictures were designed to enhance the output with respect to important spelling error categories, namely, the marking of short vowels with a silent consonant letter and the correct spelling of the long vowel. The children were allowed at least 15 minutes to write the texts. This exercise was repeated weekly for 12 weeks.

Most of the participants were multilingual. The metadata in this release includes: school week of collection; school type (always elementary school); age; gender; grade/classroom; language spoken at home; and school materials used for German (Jojo).

In all, 996 texts representing 62,764 tokens were collected. The texts were digitized in two forms: (1) the original text, including all errors (achieved), and (2) the intended (target) text, where all spelling errors were removed. Annotations were added to both the achieved text and the target text to distinguish words that should not be analyzed for spelling errors, such as names or foreign words. For sentence-level analysis, syntax errors were annotated by marking substitutions, deletions and insertions at the word level. In such cases, the used word was analyzed for spelling, and the correct word was used for sentence structure analysis.

Original handwriting is presented as PDF documents and the converted text as UTF-8 plain text in CSV documents.

H1 Children's Writing is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(2) GALE Phase 4 Arabic Broadcast Conversation Parallel Sentences was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Modern Standard Arabic source sentences and corresponding English translations selected from broadcast conversation data collected by LDC in 2007 and 2008 and transcribed and translated by LDC or under its direction.

The data includes 170 source-translation document pairs, comprising 44,064 words (Arabic source) of translated data. Data is drawn from 45 distinct Arabic broadcast conversation sources.

GALE Phase 4 Arabic Broadcast Conversation Parallel Sentences is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(3) HAVIC Pilot Transcription was developed by LDC and is comprised of approximately 72 hours of user-generated videos with transcripts based on the English speech audio extracted from the videos. This data set was created in collaboration with NIST (the National Institute of Standards and Technology) as part of the HAVIC (the Heterogeneous Audio Visual Internet Collection) project, the goal of which is to advance multimodal event detection and related technologies.

LDC has developed a large, heterogeneous, annotated multimodal corpus for HAVIC that has been used in the NIST-sponsored MED (Multimedia Event Detection) task for several years. HAVIC Pilot Transcription supported an experiment to produce a verbatim transcript (quick and rich transcription) based on audio extracted from user-generated videos. It contains the pilot transcripts for selected MED 2011 video files as well as the associated videos.

Annotators generated the transcripts using XTrans, which supports manual transcription across multiple channels, languages and platforms. HAVIC transcription guidelines are included in the documentation for this release. All transcription files are in .tdf format, a plain-text, flat-table format with 13 tab-delimited fields. All video files are in .mp4 format (h264), with varying bit-rates and levels of audio fidelity and video resolution.
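For readers who want to load the transcripts programmatically, the 13-field layout described above can be read with a few lines of Python. This is a minimal sketch and not part of the release: the function name is hypothetical, and the treatment of lines beginning with `;;` as comments is an assumption that should be checked against the release documentation. The fields are left unnamed because the description above gives only their count.

```python
import io

EXPECTED_FIELDS = 13  # field count stated in the release description


def read_tdf_rows(stream):
    """Yield each row of a .tdf stream as a list of 13 strings."""
    for line_number, line in enumerate(stream, start=1):
        line = line.rstrip("\r\n")
        # Assumption: blank lines and lines starting with ";;" carry no transcript data.
        if not line or line.startswith(";;"):
            continue
        fields = line.split("\t")
        if len(fields) != EXPECTED_FIELDS:
            raise ValueError(
                f"line {line_number}: expected {EXPECTED_FIELDS} fields, got {len(fields)}"
            )
        yield fields


# Usage with synthetic in-memory data; real files would be opened with encoding="utf-8".
demo = io.StringIO(";;a comment line\n" + "\t".join(f"f{i}" for i in range(13)) + "\n")
for row in read_tdf_rows(demo):
    print(len(row), row[0])
```

A header row, if the files contain one, also has 13 fields and is yielded like any other row, so callers may need to skip it.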

HAVIC Pilot Transcription is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee. 

Tuesday, March 15, 2016

LDC March 2016 Newsletter

New publications:

GALE Phase 3 and 4 Arabic Web Parallel Text 
GALE Phase 3 and 4 Chinese Broadcast Conversation Parallel Text
_____________________________________________________________________

New Corpora

(1) DEFT Narrative Text was developed by LDC and contains proxy reports and their source newswire used to support DARPA's Deep Exploration and Filtering of Text (DEFT) program. One of the goals of the DEFT program was to develop technologies that can perform various NLP tasks on data in a variety of genres, both formal and informal.

LDC provided source data and annotations for DEFT system development. DEFT Narrative Text consists of "proxy reports" (and "multi-proxy reports") in English. (Multi-)proxy reports are intended to mimic the format and other features of some types of government analyst reports using content from newswire articles. The corresponding English newswire source documents are also included in the release.

LDC staff manually selected the source newswire from English Gigaword Fifth Edition (LDC2011T07).

The newswire source documents are XML files following the Gigaword corpus format. The proxy reports are in plain text format.

DEFT Narrative Text is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

*

(2) GALE Phase 3 and 4 Arabic Web Parallel Text was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phases 3 and 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Modern Standard Arabic source text and corresponding English translations selected from weblog and newsgroup data collected by LDC and translated by LDC or under its direction.

The data includes 124 source-translation document pairs, comprising 61,662 tokens of Arabic source text and its English translation. Data is drawn from four distinct Arabic weblog and newsgroup sources.

GALE Phase 3 and 4 Arabic Web Parallel Text is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.

*

(3) GALE Phase 3 and 4 Chinese Broadcast Conversation Parallel Text was developed by LDC. Along with other corpora, the parallel text in this release comprised training data for Phases 3 and 4 of the DARPA GALE (Global Autonomous Language Exploitation) Program. This corpus contains Chinese source text and corresponding English translations selected from broadcast conversation data collected by LDC between 2006 and 2008 and transcribed and translated by LDC or under its direction.

This data includes 63 source-translation document pairs, comprising 487,466 tokens of Chinese source text and its English translation. Data is drawn from 19 distinct Chinese programs broadcast between 2006 and 2008.

Data was manually selected for translation according to several criteria, including linguistic features, transcription features and topic features. The transcribed and segmented files were then reformatted into a human-readable translation format and assigned to translation vendors. Translators followed LDC's Chinese to English translation guidelines. Bilingual LDC staff performed quality control procedures on the completed translations.

GALE Phase 3 and 4 Chinese Broadcast Conversation Parallel Text is distributed via web download.

2016 Subscription Members will automatically receive two copies of this corpus. 2016 Standard Members may request a copy as part of their 16 free membership corpora.  Non-members may license this data for a fee.