Generic named entity extraction
This thesis proposes and evaluates different ways of performing generic named entity recognition, that is, the construction of a system that recognises names in free text and is not specific to any particular domain or task. The starting point is an implementation of a well-known baseline system based on maximum entropy models that use lexically-oriented features to recognise names in text. Although this system achieves good levels of performance, both maximum entropy models and lexically-oriented features have their limitations. Three alternative ways of extending this system to overcome these limitations are then studied:

- more linguistically-oriented features are extracted from a generic lexical source, namely WordNet®, and added to the pool of features of the maximum entropy model;
- the maximum entropy model is biased towards training samples that are similar to the piece of text being analysed;
- a bootstrapping procedure is introduced to allow maximum entropy models to collect new, valuable information from unlabelled text.

The results in this thesis indicate that the maximum entropy model is a very strong approach whose levels of performance are very hard to improve on. However, the results also suggest that these extensions of the baseline system could yield improvements, though some difficulties must be addressed and more research is needed to reach firmer conclusions. The thesis nonetheless makes three contributions: a novel approach to estimating the complexity of a named entity extraction task, a method for selecting the features to be used by the maximum entropy model from a large pool of features, and a novel procedure for bootstrapping maximum entropy models.