Use this URL to cite or link to this record in EThOS: http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.653783
Title: Statistical models for case ambiguity resolution in Korean
Author: Lee, K.
Awarding Body: University of Edinburgh
Current Institution: University of Edinburgh
Date of Award: 2005
Availability of Full Text:
Full text unavailable from EThOS.
Please contact the current institution’s library for further details.
Abstract:
This thesis deals with the resolution of case ambiguity in Korean. Even though Korean is a case-marked language, in which phonetically recognisable case markers (case particles in Korean) mark cases explicitly, nominal words without any accompanying case particles are used frequently in naturally occurring texts and speech. When the case particles are not present, inferring the grammatical function of the nominal words is essentially a matter of conjecture. The position of a nominal word offers little help, since Korean has relatively free word order. The case ambiguity problem has been highly controversial in Korean linguistics and has been regarded as an unavoidable obstacle for automatic processing of the Korean language. To tackle the case ambiguity problem, we adopt knowledge-lean statistical methods. Our challenge is to construct effective statistical methods for case ambiguity resolution in Korean without using fully annotated training material, which does not currently exist. In other words, we attempt to train our statistical models on training data collected from unambiguous case-marking instances in a corpus. As a first step, after briefly surveying the concept of case in general and the linguistic properties of Korean, we investigate the case ambiguity problem in Korean in detail while examining some relevant theoretical linguistic work. By doing this, we precisely identify and define our case ambiguity resolution task. Second, we present our data collection method, which does not depend on any high-level language processing tools other than a standard part-of-speech tagger and simple heuristic rules reflecting the structural characteristics of the Korean language. The acquired data set is inevitably imperfect. It contains a considerable amount of noise, and there is no guarantee that the data set will provide appropriate statistical information for the case ambiguity resolution task, since it is derived only from the set of unambiguous instances.
We show that these limitations can be overcome by using a fairly large corpus of Korean consisting of 60,000,000 words. Third, we propose two statistical case ambiguity resolution methods: an independent case decision method and a sequential case decision method.
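The general idea of training on unambiguous instances can be illustrated with a minimal sketch. This is not the method of the thesis, whose details are not given in this record; it is a generic naive Bayes classifier, assuming that each training instance is a hypothetical (noun, verb, case) triple harvested from a nominal that carries an explicit case particle. The label set `CASES`, the function names `train` and `decide_case`, the add-alpha smoothing, and the romanised example words are all assumptions made for illustration.

```python
from collections import Counter

CASES = ("NOM", "ACC")  # assumed label set, for illustration only


def train(instances):
    """Count co-occurrences in (noun, verb, case) triples.

    The triples are assumed to come from unambiguous instances,
    i.e. nominals that appear with an explicit case particle.
    """
    noun_case, verb_case, case_total = Counter(), Counter(), Counter()
    nouns, verbs = set(), set()
    for noun, verb, case in instances:
        noun_case[(noun, case)] += 1
        verb_case[(verb, case)] += 1
        case_total[case] += 1
        nouns.add(noun)
        verbs.add(verb)
    return {
        "noun_case": noun_case,
        "verb_case": verb_case,
        "case_total": case_total,
        # +1 reserves probability mass for unseen words
        "n_nouns": len(nouns) + 1,
        "n_verbs": len(verbs) + 1,
    }


def decide_case(noun, verb, model, alpha=1.0):
    """Pick the most probable case for a bare nominal, on its own.

    Naive Bayes with add-alpha smoothing:
    score(c) = P(c) * P(noun | c) * P(verb | c)
    """
    total = sum(model["case_total"].values())

    def score(case):
        n_c = model["case_total"][case]
        prior = (n_c + alpha) / (total + alpha * len(CASES))
        p_noun = (model["noun_case"][(noun, case)] + alpha) / (
            n_c + alpha * model["n_nouns"]
        )
        p_verb = (model["verb_case"][(verb, case)] + alpha) / (
            n_c + alpha * model["n_verbs"]
        )
        return prior * p_noun * p_verb

    return max(CASES, key=score)


# Hypothetical toy data: romanised words standing in for tagged corpus output.
marked = (
    [("bap", "meokda", "ACC")] * 3
    + [("ai", "meokda", "NOM")] * 3
    + [("ai", "jada", "NOM")] * 2
)
model = train(marked)
```

Calling `decide_case("bap", "meokda", model)` then assigns a case to a nominal that appears without a particle. Each nominal is decided in isolation here; a sequential variant would presumably condition on decisions already made for other nominals in the same clause, but this record does not describe how the thesis does so.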
Supervisor: Not available
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID: uk.bl.ethos.653783
DOI: Not available