Use this URL to cite or link to this record in EThOS: http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.580661
Title: An evolutionary approach to automatic Chinese text segmentation
Author: Zhang, Dong
Awarding Body: University of London
Current Institution: University of London
Date of Award: 2012
Abstract:
Textual information written in Chinese now represents a huge knowledge repository. The first step in managing and processing written Chinese text is segmentation. This thesis investigates three main issues in Chinese text segmentation: word frequency estimation, ambiguity resolution, and unknown word identification. The latter two issues are addressed in the same segmentation process. Defining what constitutes a Chinese word is very difficult, which makes estimating correct word frequencies challenging. A main source of word frequencies is a constructed Chinese corpus. Many manually segmented Chinese corpora have been produced by different organisations and institutes. Because these corpora follow different segmentation standards, however, the word frequencies obtained from them are not easy to integrate. This thesis proposes a method that uses multiple corpora to obtain better estimates of word frequencies. The proposed method eliminates the 'human factor' in corpus construction, thus saving significant human labour while producing text sources for defining Chinese words. The results indicate that a more balanced word list can be produced by utilising corpora of different types. The thesis also outlines a new method for automatic Chinese text segmentation that uses evolutionary algorithms and Web search statistical data. This method treats Web text as a de facto corpus that updates automatically, thus eliminating the need for statistics training. It treats segmentation as a process of finding the most probable way in which individual characters are combined into sentences, paragraphs, and articles, thus producing segmentation results that are tailored to the text in question and are independent of segmentation standards.
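The abstract does not give implementation details, so the following is only a minimal sketch of how an evolutionary search over segmentations might look, not the thesis's actual method. It assumes a segmentation is encoded as a tuple of 0/1 boundary flags between adjacent characters, and it scores candidates with a small hypothetical count table (`TOY_COUNTS`) standing in for Web search statistics. The function names, the unseen-word penalty, and the genetic operators (tournament selection, one-point crossover, bit-flip mutation, elitism) are all illustrative assumptions.

```python
import math
import random

# Hypothetical counts standing in for Web search hit counts (illustrative only).
TOY_COUNTS = {
    "研究": 900, "研究生": 300, "生命": 800, "起源": 500,
    "研": 50, "究": 40, "生": 200, "命": 120, "起": 150, "源": 90,
}
TOTAL = sum(TOY_COUNTS.values())


def decode(text, cuts):
    """Turn a tuple of 0/1 boundary flags into a list of words."""
    words, start = [], 0
    for i, cut in enumerate(cuts, start=1):
        if cut:  # a boundary falls between text[i - 1] and text[i]
            words.append(text[start:i])
            start = i
    words.append(text[start:])
    return words


def fitness(text, cuts):
    """Sum of log relative frequencies; unseen strings get a length-scaled penalty."""
    score = 0.0
    for word in decode(text, cuts):
        count = TOY_COUNTS.get(word, 0)
        if count:
            score += math.log(count / TOTAL)
        else:
            score += len(word) * math.log(0.5 / TOTAL)  # assumed smoothing
    return score


def segment(text, pop_size=30, generations=40, mutation_rate=0.1, seed=0):
    """Evolve boundary vectors and return the best segmentation found."""
    rng = random.Random(seed)
    n = len(text) - 1
    if n <= 0:
        return [text] if text else []
    # Random individuals plus the all-single-characters baseline.
    population = [tuple(rng.randint(0, 1) for _ in range(n))
                  for _ in range(pop_size - 1)]
    population.append((1,) * n)

    def score(cuts):
        return fitness(text, cuts)

    for _ in range(generations):
        children = [max(population, key=score)]  # elitism: keep the current best
        while len(children) < pop_size:
            # Tournament selection of two parents.
            a = max(rng.sample(population, 2), key=score)
            b = max(rng.sample(population, 2), key=score)
            point = rng.randint(0, n)  # one-point crossover
            child = a[:point] + b[point:]
            # Bit-flip mutation toggles individual boundaries.
            child = tuple(bit ^ 1 if rng.random() < mutation_rate else bit
                          for bit in child)
            children.append(child)
        population = children
    return decode(text, max(population, key=score))


if __name__ == "__main__":
    print(segment("研究生命起源"))
```

Because the best individual is always carried into the next generation, the returned segmentation never scores worse than the single-character baseline under this toy fitness. In a real system the count table would be replaced by whatever frequency source is available.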
Supervisor: Not available
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID: uk.bl.ethos.580661
DOI: Not available