New publications:
LDC2012T11
- American English Nickname Collection -
LDC2012T07
- Arabic Treebank - Broadcast News v1.0 -
LDC2012T10
- Catalan TimeBank 1.0 -
LDC announces its 20th Anniversary Workshop on Language Resources, to be held
in Philadelphia on September 6-7, 2012. The event will commemorate our
anniversary, reflect on the beginning of language data centers and address
the future of language resources.
Workshop themes will include: the developments
in human
language technologies and associated resources that have brought
us to our
current state; the language resources required by the technical
approaches
taken and the impact of these resources on HLT progress; the
applications of
HLT and resources to other disciplines including law, medicine,
economics, the
political sciences and psychology; the impact of HLTs and related
technologies
on linguistic analysis and novel approaches in fields as
widespread as
phonetics, semantics, language documentation, sociolinguistics and
dialect
geography; and finally, the impact of any of these developments on
the ways in
which language resources are created, shared and exploited and on
the specific
resources required.
Stay tuned for further details.
New publications
(1) American English Nickname Collection was developed by Intelius, Inc. and is a
compilation of American English nickname-to-given-name mappings based on
information in US government records, public web profiles and financial and
property reports. This corpus is intended as a tool for the quantitative study
of nickname usage in the United States, for example in demographic and
sociological studies.
The American English Nickname Collection
contains 331,237
distinct mappings encompassing millions of names. The data was
collected and
processed through a record linkage pipeline. The steps in the
pipeline were (1)
data cleaning, (2) blocking, (3) pair-wise linkage and (4)
clustering. In the
cleaning step, material was categorized, processed to remove junk
and spam
records and normalized to an approximately common representation.
The blocking
process utilized an algorithm to group records by shared
properties for
determining which record pairs should be examined by the pairwise
linker as
potential duplicates. The linkage step assigned a score to record
pairs using a
supervised pairwise-based machine learning model. The clustering
step combined
record pairs into connected components and further partitioned
each connected
component to remove inconsistent pairwise links. The result is
that input
records were partitioned into disjoint sets called profiles, where
each profile
corresponded to a single person.
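The four pipeline steps can be illustrated with a small sketch. This is not Intelius's implementation: the toy records, the blocking key (last name plus ZIP code), the hand-written `score` rule standing in for the supervised model, and the 0.5 threshold are all assumptions made for illustration. The final repartitioning of each connected component is not covered.

```python
from collections import defaultdict
from itertools import combinations

# Toy records, invented for illustration (not taken from the corpus).
RECORDS = [
    {"id": 0, "first": " Bill ", "last": "SMITH", "zip": "19104"},
    {"id": 1, "first": "William", "last": "Smith", "zip": "19104"},
    {"id": 2, "first": "Bill", "last": "Smith", "zip": "19104"},
    {"id": 3, "first": "Sue", "last": "Jones", "zip": "10001"},
    {"id": 4, "first": "", "last": "", "zip": ""},  # junk record
]

def clean(records):
    """Step 1: drop junk records and normalize case and whitespace."""
    out = []
    for r in records:
        first, last = r["first"].strip().lower(), r["last"].strip().lower()
        if first and last:
            out.append({"id": r["id"], "first": first, "last": last,
                        "zip": r["zip"].strip()})
    return out

def block(records):
    """Step 2: group records by a shared property (assumed key: last name + ZIP)."""
    blocks = defaultdict(list)
    for r in records:
        blocks[(r["last"], r["zip"])].append(r)
    return blocks

def score(a, b):
    """Step 3: hypothetical stand-in for the supervised pairwise model."""
    # Records in the same block already share last name and ZIP.
    return 1.0 if a["first"] == b["first"] else 0.6

def cluster(records, links):
    """Step 4: merge linked pairs into connected components (union-find)."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)
    groups = defaultdict(set)
    for r in records:
        groups[find(r["id"])].add(r["id"])
    return sorted(sorted(g) for g in groups.values())

def link_records(records, threshold=0.5):
    cleaned = clean(records)
    links = []
    # Only pairs inside the same block are scored.
    for members in block(cleaned).values():
        for a, b in combinations(members, 2):
            if score(a, b) >= threshold:
                links.append((a["id"], b["id"]))
    return cluster(cleaned, links)

profiles = link_records(RECORDS)
print(profiles)
```

Each inner list is one profile, that is, a set of record ids taken to refer to a single person. Blocking keeps the number of scored pairs small, since records are only compared within their own block.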
The material is presented in the form of a
comma delimited
text file. Each line contains a first name, a nickname or alias,
its
conditional probability and its frequency. The conditional
probability for each nickname is derived from the base data using an algorithm
that calculates both the probability that an alias refers to a given name and
a threshold below which the mapping is most likely an error. This threshold
eliminates typographic errors and other noise from the data.
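A file in this layout could be read along the following lines. The sample rows are invented, and the sketch assumes there is no header row, that the columns appear in the order described above (given name, nickname, conditional probability, frequency), and that the two numeric fields parse as a float and an integer. The actual file should be checked against its documentation.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows (not from the corpus); assumed column order:
# given name, nickname, conditional probability, frequency.
SAMPLE = """\
william,bill,0.42,1200
william,will,0.31,900
robert,bob,0.55,2000
"""

def load_mappings(handle):
    """Group (nickname, probability, frequency) tuples by given name."""
    by_given = defaultdict(list)
    for given, nick, prob, freq in csv.reader(handle):
        by_given[given].append((nick, float(prob), int(freq)))
    return by_given

mappings = load_mappings(io.StringIO(SAMPLE))

# Most probable nickname for one given name in the sample.
best = max(mappings["william"], key=lambda row: row[1])
print(best)
```

For the real data, `io.StringIO(SAMPLE)` would be replaced by an open file handle.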
American English Nickname Collection is
distributed via web
download. 2012 Subscription Members will receive two
copies of this
data on disc provided that they have submitted a completed copy of
the User
License
Agreement for American English Nickname Collection (LDC2012T11).
2012 Standard Members may request a copy as part of their 16 free
membership
corpora. Non-members may license this data by completing the User
License
Agreement for American English Nickname Collection (LDC2012T11). The
agreement can be faxed to +1 215 573 2175 or scanned and emailed
to ldc @ ldc . upenn . edu. The collection is being made available at no charge.
*
(2) Arabic Treebank - Broadcast News v1.0 was developed at LDC. It consists of 120
transcribed Arabic broadcast news stories with part-of-speech,
morphology,
gloss and syntactic tree annotation in accordance with the Penn Arabic
Treebank
(PATB) Morphological and Syntactic Annotation Guidelines.
The ongoing PATB
project supports research in Arabic-language natural language
processing and
human language technology development.
This release contains 432,976 source tokens
before clitics
were split, and 517,080 tree tokens after clitics were separated
for treebank
annotation. The source materials are Arabic broadcast news stories
collected by
LDC during the period 2005-2008 from the following sources: Abu
Dhabi TV, Al
Alam News Channel, Al Arabiya, Al Baghdadya TV, Al Fayha, Alhurra,
Al Iraqiyah,
Aljazeera, Al Ordiniyah, Al Sharqiyah, Dubai TV, Kuwait TV,
Lebanese
Broadcasting Corp., Oman TV, Radio Sawa, Saudi TV and Syria TV.
The transcripts
were produced by LDC.
Arabic Treebank - Broadcast News v1.0 is
distributed via web
download. 2012 Subscription Members will receive two
copies of this
data on disc. 2012 Standard Members may request a copy as part of
their 16 free
membership corpora.
*
(3) Catalan TimeBank 1.0 was developed by researchers at Barcelona Media and
consists of Catalan texts in the AnCora
corpus annotated with temporal and event information
according to the TimeML
specification language.
TimeML is a schema for annotating eventualities
and time
expressions in natural language as well as the temporal relations
among them,
thus facilitating the task of extraction, representation and
exchange of
temporal information. Catalan TimeBank 1.0 is annotated in three
levels,
marking events, time expressions and event metadata. The TimeML
annotation
scheme was tailored for the specifics of the Catalan language.
Temporal
relations in Catalan present distinctions of verbal mood (e.g.,
indicative,
subjunctive, conditional) and grammatical aspect (e.g.,
imperfective)
which are absent in English.
Catalan TimeBank 1.0 contains stand-off
annotations for 210
documents with over 75,800 tokens (including punctuation marks)
and 68,000
tokens (excluding punctuation). The source documents are from the
EFE
news
agency, the ACN
Catalan news agency and the Catalan version of the El Periódico
newspaper, and span the
period from January to December 2000.
The AnCora corpus is the largest multilayer
annotated corpus
of Spanish and Catalan. AnCora contains 400,000 words in Spanish
and 275,000
words in Catalan. The AnCora documents are annotated on many
linguistic levels
including structure, syntax, dependencies, semantics and
pragmatics. That
information is not included in this release, but it can be mapped
to the
present annotations. The corpus is freely available from the Centre de Llenguatge i
Computació (CLiC).
Catalan TimeBank 1.0 is distributed by web
download. 2012 Subscription Members will receive two
copies of this
data on disc. 2012 Standard Members may request a copy as part of
their
16 free membership corpora. Non-members may license this data by
completing the LDC User Agreement for Non-members. The agreement can be faxed
to +1 215 573 2175 or scanned and emailed to ldc @ ldc . upenn . edu. The
collection is being made available at no charge.