Title: Unsupervised categorization of word meanings using statistical and neural network methods
Author: Huckle, Christopher Cedric
Awarding Body: University of Edinburgh
Current Institution: University of Edinburgh
Date of Award: 1996
Availability of Full Text: Full text unavailable from EThOS.
A statistical technique is introduced for representing the contexts in which words occur. Each word is represented by a 'statistical context vector', and the vectors are subjected to hierarchical cluster analysis to produce a structure in which words which have similar contexts are placed closer together than those which do not. Analyses of this type are carried out on a 10,000,000 word corpus, using a variety of different parameters, and the appropriateness of the resulting structures is assessed using Roget's Thesaurus as a benchmark. A still more attractive approach is one which deals with polysemy, and which develops its representations for word meanings continuously from the outset, with no need for a separate stage of statistical analysis. To take these considerations into account, an unsupervised neural network is presented, in which different senses of a word token are assigned to different output clusters as the contexts of their occurrence dictate. After initial testing using Elman's (1988) artificial corpus, the network's performance is assessed on the 10,000,000 word corpus by comparing the ways in which different word tokens are distributed over the output units. Further analyses are carried out in which a crude measure of this distribution is assessed using Jones' (1985) 'Ease of Predication' measure. Ease of Predication is found to account for a significant amount of the variance in the distribution measure. Word frequency is also found to play a significant role, and word frequency effects are reassessed in the light of this. The psychological implications of the results obtained from the network are discussed. It is concluded that there is a great deal of information inherent in the structure of language which could potentially play an important part in developing a conceptual structure for word meanings.
Whilst extralinguistic information is undoubtedly likely to be of importance as well, it is striking that the use of very simple statistical measures can permit the development of such rich structures.
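Illustrative sketch (not taken from the thesis): the abstract does not specify how the 'statistical context vectors' are built or which clustering criterion is used, so the following Python example assumes raw co-occurrence counts within a fixed window, cosine distance, and average-linkage agglomerative clustering. The function names, the window size and the toy corpus are all hypothetical choices made only to show the general idea of grouping words by the similarity of their contexts.

```python
from collections import Counter
from math import sqrt

def context_vectors(tokens, window=1):
    """Assumed representation: counts of words seen within `window` positions."""
    vectors = {}
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        neighbours = tokens[lo:i] + tokens[i + 1:hi]
        vectors.setdefault(word, Counter()).update(neighbours)
    return vectors

def cosine_distance(a, b):
    """1 - cosine similarity of two sparse count vectors (clamped at 0)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return max(0.0, 1.0 - dot / norm) if norm else 1.0

def cluster(vectors, words):
    """Average-linkage agglomerative clustering; returns the merge history."""
    clusters = [(w,) for w in words]
    merges = []

    def dist(c1, c2):
        total = sum(cosine_distance(vectors[x], vectors[y]) for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        d, i, j = min(
            (dist(a, b), i, j)
            for i, a in enumerate(clusters)
            for j, b in enumerate(clusters)
            if i < j
        )
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Toy corpus (hypothetical): nouns and verbs occur in distinct contexts.
tokens = "the cat sleeps . the dog sleeps . a cat eats . a dog eats .".split()
vectors = context_vectors(tokens, window=1)
merges = cluster(vectors, ["cat", "dog", "sleeps", "eats"])
for left, right, d in merges:
    print(left, right, round(d, 3))
```

In this toy corpus "cat" and "dog" share the same neighbours, as do "sleeps" and "eats", so each pair is merged before the two groups are joined. A corpus of the size described in the abstract would call for a more efficient clustering routine than this quadratic pairwise search.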
Supervisor: Not available
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available