Generic named entity extraction
This thesis proposes and evaluates different ways of performing generic named entity
recognition, that is, the construction of a system capable of recognising names in free
text without being specific to any particular domain or task.
The starting point is an implementation of a well-known baseline system based
on maximum entropy models that use lexically-oriented features to recognise names
in text. Although this system achieves good levels of performance, both maximum
entropy models and lexically-oriented features have their limitations. Three alternative
ways in which this system can be extended to overcome these limitations are then
explored:
- more linguistically-oriented features are extracted from a generic lexical source,
  namely WordNet®, and then added to the pool of features of the maximum entropy
  model;
- the maximum entropy model is biased towards training samples that are similar to
  the piece of text being analysed;
- a bootstrapping procedure is introduced to allow maximum entropy models to
  collect new, valuable information from unlabelled text.
Results in this thesis indicate that the maximum entropy model is a very strong approach
whose levels of performance are very hard to improve on. However, these results also
suggest that these extensions of the baseline system could yield improvements,
though some difficulties must be addressed and more research is needed to
reach firmer conclusions.
This thesis has nonetheless provided important contributions: a novel approach to
estimating the complexity of a named entity extraction task, a method for selecting the
features to be used by the maximum entropy model from a large pool of features, and a
novel procedure for bootstrapping maximum entropy models.