LDC at ACL: June 20-22, 2011
          
LDC is now on your favorite Social Networks

New Publications:

LDC2011S02 - 2006 NIST Spoken Term Detection Development Set
LDC2011T08 - Datasets for Generic Relation Extraction (reACE)
LDC2011T07 - English Gigaword Fifth Edition
                  
LDC at ACL: June 20-22, 2011

ACL has returned to North America and LDC is taking this opportunity to interact with top HLT researchers in beautiful Portland, OR. LDC's exhibition table will feature information on new developments at the consortium and will also be the go-to point for exciting new, green giveaways.
     
LDC’s Seth Kulick will be presenting research on ‘Using Derivation Trees for Treebank Error Detection’ (S-66) during Monday’s evening poster session (20 June, 6.00 – 8.30 pm). The abstract for this paper, coauthored by LDCers Ann Bies and Justin Mott, is as follows:
       
This work introduces a new approach to checking treebank consistency. Derivation trees based on a variant of Tree Adjoining Grammar are used to compare the annotation of word sequences based on their structural similarity. This overcomes the problems of earlier approaches based on using strings of words rather than tree structure to identify the appropriate contexts for comparison. We report on the result of applying this approach to the Penn Arabic Treebank and how this approach leads to high precision of error detection.
           
We hope to see you there.
       
            
LDC is now on your favorite Social Networks

Over the past few months, LDC has responded to requests from the community to increase our online presence. We are happy to announce that LDC now has its very own Facebook page, LinkedIn profile (independent of the University of Pennsylvania) and Blog, which provides an RSS feed for LDC newsletters. Please visit LDC on our various profiles and let us know what you think!
New Publications
(1) 2006 NIST Spoken Term Detection Development Set was compiled by researchers at NIST (National Institute of Standards and Technology) and contains eighteen hours of Arabic, Chinese and English broadcast news, English conversational telephone speech and English meeting room speech used in NIST's 2006 Spoken Term Detection (STD) evaluation. The STD initiative is designed to facilitate research and development of technology for retrieving information from archives of speech data with the goals of exploring promising new ideas in spoken term detection, developing advanced technology incorporating these ideas, measuring the performance of this technology and establishing a community for the exchange of research results and technical insights.
The 2006 STD task was to find all of the occurrences of a specified term (a sequence of one or more words) in a given corpus of speech data. The evaluation was intended to develop technology for rapidly searching very large quantities of audio data. Although the evaluation used modest amounts of data, it was structured to simulate the very large data situation and to make it possible to extrapolate the speed measurements to much larger data sets. Therefore, systems were implemented in two phases: indexing and searching. In the indexing phase, the system processes the speech data without knowledge of the terms. In the searching phase, the system uses the terms, the index, and optionally the audio to detect term occurrences.
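To picture the two-phase structure, here is a minimal Python sketch (not part of the NIST evaluation tools). It assumes transcripts are already available as lists of time-marked words, which is an assumption made only for this example; the indexing phase builds an inverted index without seeing any terms, and the searching phase resolves multi-word terms against that index.

```python
from collections import defaultdict

# Indexing phase: build an inverted index from time-marked transcripts.
# Each transcript is assumed to be a list of (word, start_time) tuples;
# this format is an assumption for the sketch, not the NIST data format.
def build_index(transcripts):
    index = defaultdict(list)                    # word -> [(file_id, position)]
    tokens = {}                                  # file_id -> token list
    for file_id, transcript in transcripts.items():
        tokens[file_id] = transcript
        for pos, (word, _start) in enumerate(transcript):
            index[word.lower()].append((file_id, pos))
    return index, tokens

# Searching phase: find occurrences of a term (one or more words)
# using only the index and the stored token sequences.
def search(term, index, tokens):
    words = term.lower().split()
    hits = []
    for file_id, pos in index.get(words[0], []):
        window = tokens[file_id][pos:pos + len(words)]
        if [w.lower() for w, _ in window] == words:
            hits.append((file_id, window[0][1]))  # (file, start time of the match)
    return hits

if __name__ == "__main__":
    transcripts = {"bn_demo": [("spoken", 1.2), ("term", 1.6), ("detection", 1.9)]}
    index, tokens = build_index(transcripts)
    print(search("spoken term", index, tokens))   # [('bn_demo', 1.2)]
```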
       
The development corpus consists of three data genres: broadcast news (BN), conversational telephone speech (CTS) and conference room meetings (CONFMTG). The broadcast news material was collected in 2001 by LDC's broadcast collection system from the following sources: ABC (English), China Broadcasting System (Chinese), China Central TV (Chinese), China National Radio (Chinese), China Television System (Chinese), CNN (English), MSNBC/NBC (English), Nile TV (Arabic), Public Radio International (English) and Voice of America (Arabic, Chinese, English). The CTS data was taken from the Switchboard data sets (e.g., Switchboard-2 Phase 1 LDC98S75, Switchboard-2 Phase 2 LDC99S79) and the Fisher corpora (e.g., Fisher English Training Sppech Part 1 LDC2004S13), also collected by LDC. The conference room meeting material consists of goal-oriented, small group round table meetings and was collected in 2001, 2004 and 2005 by NIST, the International Computer Science Institute (Berkeley, California), Carnegie Mellon University (Pittsburgh, PA) and Virginia Polytechnic Institute and State University (Blacksburg, VA) as part of the AMI corpus project.
       
Each BN recording is a 1-channel, PCM-encoded, 16 kHz, SPHERE-formatted file. CTS recordings are 2-channel, u-law encoded, 8 kHz, SPHERE-formatted files. The CONFMTG files contain a single recorded channel.
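In the NIST SPHERE format, a plain-text header (beginning with the magic string NIST_1A, followed by the header size in bytes) precedes the raw samples. The following sketch parses that header and reports fields such as sample_rate and channel_count; it assumes a well-formed header, does not decode the audio itself, and the file name in the usage comment is hypothetical.

```python
def read_sphere_header(path):
    """Parse the ASCII header of a NIST SPHERE (.sph) file into a dict."""
    with open(path, "rb") as f:
        if f.readline().strip() != b"NIST_1A":
            raise ValueError("not a SPHERE file")
        header_size = int(f.readline().strip())   # header size in bytes (often 1024)
        f.seek(0)
        header = f.read(header_size).decode("ascii", errors="replace")

    fields = {}
    for line in header.splitlines()[2:]:          # skip the two lines checked above
        if line.strip() == "end_head":
            break
        parts = line.split(maxsplit=2)            # e.g. "sample_rate -i 16000"
        if len(parts) == 3:
            name, ftype, value = parts
            fields[name] = int(value) if ftype == "-i" else value
    return fields

# Example (hypothetical path): a BN file should report a 16000 Hz sample rate.
# info = read_sphere_header("bn_example.sph")
# print(info.get("sample_rate"), info.get("channel_count"))
```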
       
2006 NIST Spoken Term Detection Development Set is distributed on 1 DVD-ROM. 2011 Subscription Members will automatically receive two copies of this corpus. 2011 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$800.
       
              
*
(2) Datasets for Generic Relation Extraction (reACE) was developed at the University of Edinburgh, Edinburgh, Scotland. It consists of English broadcast news and newswire data originally annotated for the ACE (Automatic Content Extraction) program to which the Edinburgh Regularized ACE (reACE) mark-up has been applied.
       
The Edinburgh relation extraction (RE) task aims to identify useful information in text (e.g., PersonW works for OrganisationX, GeneY encodes ProteinZ) and to recode it in a format such as a relational database or RDF triple store (a database for the storage and retrieval of Resource Description Framework (RDF) metadata) that can be more effectively used for querying and automated reasoning. A number of resources have been developed for training and evaluation of automatic systems for RE in different domains. However, comparative evaluation is impeded by the fact that these corpora use different markup formats and different notions of what constitutes a relation.
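For instance, a relation such as "PersonW works for OrganisationX" can be recoded as an RDF triple. The sketch below uses the rdflib library with a made-up example namespace; it illustrates the target representation only and is not tooling distributed with reACE.

```python
from rdflib import Graph, Namespace

# A made-up namespace for the example; reACE itself does not define these URIs.
EX = Namespace("http://example.org/")

g = Graph()
g.add((EX.PersonW, EX.worksFor, EX.OrganisationX))
g.add((EX.GeneY, EX.encodes, EX.ProteinZ))

# The triple store can be queried directly ...
for subj, _, obj in g.triples((None, EX.worksFor, None)):
    print(subj, "works for", obj)

# ... or serialized, e.g. as Turtle, for storage and automated reasoning.
print(g.serialize(format="turtle"))
```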
       
reACE solves this problem by converting data to a common document type using token standoff and including detailed linguistic markup while maintaining all information in the original annotation. The subsequent re-annotation process normalizes the two data sets so that they comply with a notion of relation that is intuitive, simple and informed by the semantic web.
       
The data in this corpus consists of newswire and broadcast news material from ACE 2004 Multilingual Training Corpus (LDC2005T09) and ACE 2005 Multilingual Training Corpus (LDC2006T06). This material has been standardized for evaluation of multi-type RE across domains.
       
Annotation includes (1) a refactored version of the original data to a common XML document type; (2) linguistic information from LT-TTT (a system for tokenizing text and adding markup) and MINIPAR (an English parser); and (3) a normalized version of the original RE markup that complies with a shared notion of what constitutes a relation across domains.
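To make the idea of token standoff concrete, the toy example below shows an entirely hypothetical document in that style: tokens carry IDs, and entity and relation markup refers to those IDs rather than wrapping the text inline. The element and attribute names are invented for illustration and are not the actual reACE document type.

```python
import xml.etree.ElementTree as ET

# A toy standoff-style document (invented markup, not the reACE schema):
# tokens are annotated once, and entities/relations point at token IDs.
DOC = """
<doc>
  <tokens>
    <t id="t1">PersonW</t><t id="t2">works</t><t id="t3">for</t><t id="t4">OrganisationX</t>
  </tokens>
  <entities>
    <entity id="e1" tokens="t1" type="person"/>
    <entity id="e2" tokens="t4" type="organisation"/>
  </entities>
  <relations>
    <relation type="employment" arg1="e1" arg2="e2"/>
  </relations>
</doc>
"""

root = ET.fromstring(DOC)
tokens = {t.get("id"): t.text for t in root.iter("t")}
entities = {e.get("id"): tokens[e.get("tokens")] for e in root.iter("entity")}
for rel in root.iter("relation"):
    print(entities[rel.get("arg1")], f"--{rel.get('type')}-->", entities[rel.get("arg2")])
# Output: PersonW --employment--> OrganisationX
```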
       
The data sources represented in the corpus were collected by LDC in 2000 and 2003 and consist of the following: ABC, Agence France Presse, Associated Press, Cable News Network, MSNBC/NBC, New York Times, Public Radio International, Voice of America and Xinhua News Agency.
       
Datasets for Generic Relation Extraction (reACE) is distributed via web download. 2011 Subscription Members will automatically receive two copies of this corpus on disc. 2011 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$800.
       
          
*
(3) English Gigaword Fifth Edition is a comprehensive archive of newswire text data that has been acquired over several years by LDC at the University of Pennsylvania. The fifth edition includes all of the contents of English Gigaword Fourth Edition (LDC2009T13) plus new data covering the 24-month period of January 2009 through December 2010.
       
The seven distinct international sources of English newswire included in this edition are the following:
- Agence France-Presse, English Service (afp_eng)
- Associated Press Worldstream, English Service (apw_eng)
- Central News Agency of Taiwan, English Service (cna_eng)
- Los Angeles Times/Washington Post Newswire Service (ltw_eng)
- Washington Post/Bloomberg Newswire Service (wpb_eng)
- New York Times Newswire Service (nyt_eng)
- Xinhua News Agency, English Service (xin_eng)
The seven-letter codes in parentheses above consist of the three-character source name abbreviation and the three-character language code ("eng") separated by an underscore ("_"). The three-letter language code conforms to LDC's internal convention based on the ISO 639-3 standard.
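As a small illustration of the convention (the mapping from abbreviations to full source names is simply copied from the list above), such a code can be split into its source and language parts:

```python
SOURCES = {
    "afp": "Agence France-Presse",
    "apw": "Associated Press Worldstream",
    "cna": "Central News Agency of Taiwan",
    "ltw": "Los Angeles Times/Washington Post Newswire Service",
    "nyt": "New York Times Newswire Service",
    "wpb": "Washington Post/Bloomberg Newswire Service",
    "xin": "Xinhua News Agency",
}

def describe(code):
    """Split a seven-letter code such as 'afp_eng' into source and language."""
    source, lang = code.split("_")        # three-letter source, ISO 639-3 language
    return SOURCES[source], lang

print(describe("afp_eng"))                # ('Agence France-Presse', 'eng')
```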
       
Data
             
The following table sets forth the overall totals for each source. Note that "Totl-MB" refers to the quantity of data when uncompressed (approximately 26 gigabytes), "Gzip-MB" refers to the compressed file sizes as stored on the DVD-ROMs, and "K-wrds" refers to the number of whitespace-separated tokens (of all types, in thousands) after all SGML tags are eliminated:
| Source  | #Files | Gzip-MB | Totl-MB | K-wrds  | #DOCs   |
|---------|--------|---------|---------|---------|---------|
| afp_eng | 146    | 1732    | 4937    | 738322  | 2479624 |
| apw_eng | 193    | 2700    | 7889    | 1186955 | 3107777 |
| cna_eng | 144    | 86      | 261     | 38491   | 145317  |
| ltw_eng | 127    | 651     | 1694    | 268088  | 411032  |
| nyt_eng | 197    | 3280    | 8938    | 1422670 | 1962178 |
| wpb_eng | 12     | 42      | 111     | 17462   | 26143   |
| xin_eng | 191    | 834     | 2518    | 360714  | 1744025 |
| TOTAL   | 1010   | 9325    | 26348   | 4032686 | 9876086 |
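As a rough illustration of how the K-wrds and #DOCs figures are defined, one can strip SGML tags and count whitespace-separated tokens and document markers in a data file. The sketch below does this for a gzip-compressed file; the file name in the comment is hypothetical, and the simple regex treats anything between angle brackets as markup.

```python
import gzip
import re

TAG = re.compile(r"<[^>]+>")                       # crude SGML tag matcher

def count_tokens_and_docs(path):
    """Count whitespace-separated tokens (after tag removal) and <DOC ...> elements."""
    tokens = docs = 0
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            docs += line.count("<DOC ")            # document elements in the markup
            tokens += len(TAG.sub(" ", line).split())
    return tokens, docs

# Hypothetical file name following the source_lang naming convention:
# print(count_tokens_and_docs("afp_eng_201001.gz"))
```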
 