Class-based statistical models for lexical knowledge acquisition.
This thesis is about the automatic acquisition of a particular kind of lexical knowledge, namely
the knowledge of which noun senses can fill the argument slots of predicates. The knowledge
is represented using probabilities, which agrees with the intuition that the constraints on the
arguments of predicates are not absolute, but are satisfied to a certain degree. The problem of
knowledge acquisition thus becomes the problem of probability estimation from
corpus data. The problem with defining a probability model in terms of senses is that this involves
a huge number of parameters, which results in a sparse data problem. The proposal here is to
define a probability model over senses in a semantic hierarchy, and exploit the fact that senses can
be grouped into classes consisting of semantically similar senses.
A novel class-based estimation technique is developed, together with a procedure that determines
a suitable class for a sense (given a predicate and argument position). The problem of
determining a suitable class can be thought of as finding a suitable level of generalisation in the
hierarchy. The generalisation procedure uses a statistical test to locate areas consisting of semantically
similar senses. As well as being used for probability estimation, it is employed as
part of a re-estimation algorithm for estimating sense frequencies from incomplete data.
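
The idea of finding a suitable level of generalisation can be illustrated with a minimal sketch. It assumes that the statistical test is a Pearson chi-squared test applied to the children of each candidate class; the abstract does not name the test, so this choice, the toy hierarchy, the counts and all function names are hypothetical. Starting from a sense, the sketch moves up the hierarchy for as long as the children of the next class up do not differ significantly in how often they occur with the predicate.

```python
# Hypothetical toy hierarchy: child -> parent.
PARENT = {
    "apple": "fruit", "pear": "fruit",
    "fruit": "food", "bread": "food",
    "food": "entity", "idea": "entity",
}

# Hypothetical counts for each class in the object slot:
# (occurrences with the predicate "eat", occurrences with other predicates).
COUNTS = {
    "apple": (30, 70), "pear": (28, 72),
    "fruit": (58, 142), "bread": (35, 65),
    "food": (93, 207), "idea": (1, 299),
}

# Chi-squared critical values at the 0.05 level, indexed by degrees of freedom.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815}


def children(node):
    """Return the immediate children of a class in the toy hierarchy."""
    return [child for child, parent in PARENT.items() if parent == node]


def chi_squared(rows):
    """Pearson chi-squared statistic for an r x 2 contingency table."""
    col_totals = [sum(row[j] for row in rows) for j in range(2)]
    total = sum(col_totals)
    stat = 0.0
    for row in rows:
        row_total = sum(row)
        for j in range(2):
            expected = row_total * col_totals[j] / total
            if expected > 0:
                stat += (row[j] - expected) ** 2 / expected
    return stat


def generalise(sense):
    """Move up from a sense while the siblings under the next class look alike."""
    node = sense
    while node in PARENT:
        parent = PARENT[node]
        rows = [COUNTS[child] for child in children(parent)]
        dof = len(rows) - 1
        # Stop when the children of the parent differ significantly.
        if dof >= 1 and chi_squared(rows) > CRITICAL_05[dof]:
            break
        node = parent
    return node


print(generalise("apple"))
print(generalise("idea"))
```

With these invented counts, a sense such as "apple" is generalised to a class whose members behave alike as objects of the predicate, whereas a sense that behaves very differently from its siblings is left ungeneralised. A real implementation would take its hierarchy and counts from a semantic resource and corpus data rather than from hand-written tables.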
The rest of the thesis considers how the lexical knowledge can be used to resolve structural
ambiguities, and provides empirical evaluations. The estimation techniques are first integrated into
a parse selection system, using a probabilistic dependency model to rank the alternative parses for
a sentence. A PP-attachment task then provides an evaluation that is more focussed on
the class-based estimation technique. Finally, a pseudo-disambiguation task is used to compare
the estimation technique with alternative approaches.
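
As an illustration of the last evaluation, the sketch below shows one common form of a pseudo-disambiguation task. It assumes that each test item pairs a noun with a verb it was attested with and a confounder verb, and that a model is credited when it scores the attested verb higher. The abstract does not describe the exact setup used in the thesis, so the data format, the scores, the treatment of ties and the function names are all hypothetical.

```python
def pseudo_disambiguation_accuracy(triples, score):
    """Fraction of items where the attested verb outscores the confounder.

    triples: iterable of (noun, attested_verb, confounder_verb).
    score:   function (verb, noun) -> float, supplied by the model under test.
    """
    triples = list(triples)
    correct = 0.0
    for noun, attested, confounder in triples:
        s_attested = score(attested, noun)
        s_confounder = score(confounder, noun)
        if s_attested > s_confounder:
            correct += 1.0
        elif s_attested == s_confounder:
            correct += 0.5  # ties count as half; a choice made for this sketch
    return correct / len(triples)


# Hypothetical scores standing in for an estimated probability model.
TOY_SCORES = {
    ("eat", "apple"): 0.30, ("drive", "apple"): 0.01,
    ("eat", "car"): 0.02, ("drive", "car"): 0.40,
}


def toy_score(verb, noun):
    """Look up a toy score, defaulting to zero for unseen pairs."""
    return TOY_SCORES.get((verb, noun), 0.0)


test_items = [
    ("apple", "eat", "drive"),
    ("car", "drive", "eat"),
    ("idea", "eat", "drive"),  # unseen noun: both scores are zero, so a tie
]
accuracy = pseudo_disambiguation_accuracy(test_items, toy_score)
print(accuracy)
```

Because every estimation technique is reduced to a scoring function over (verb, noun) pairs, different techniques can be compared on the same test items simply by swapping the function passed in.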