Use this URL to cite or link to this record in EThOS:
Title: Automatic detection of English inclusions in mixed-lingual data with an application to parsing
Author: Alex, Beatrice
Awarding Body: University of Edinburgh
Current Institution: University of Edinburgh
Date of Award: 2008
Availability of Full Text:
Access from EThOS:
Full text unavailable from EThOS. Please try the link below.
Access from Institution:
The expansion of the Internet, coupled with an increased availability of electronic documents in various languages, has resulted in greater attention being paid to multi-lingual and language-independent applications. However, the automatic identification of foreign expressions, be they words or named entities, is beyond the capability of existing language identification techniques. This failure has inspired a recent growth in the development of new techniques capable of processing mixed-lingual text. This thesis presents an annotation-free classifier designed to identify English inclusions in other languages. The classifier consists of four sequential modules: pre-processing, lexical lookup, search engine classification and post-processing. These modules collectively identify English inclusions and are robust enough to work across different languages, as is demonstrated with German and French. The classifier's major advantage, however, is that it is annotation-free: it does not need any training, a step that normally requires an annotated corpus of examples. The English inclusion classifier presented here is the first of its type to be evaluated using real-world data. It has been shown to perform well on unseen data across different languages and domains. Comparisons are drawn between this system and the two leading alternative classification techniques. This system compares favourably with the recently developed alternative technique of combined dictionary and n-gram based classification and is shown to have significant advantages over a trained machine learner. This thesis demonstrates why English inclusion classification is beneficial through a series of real-world examples from different fields. It quantifies in detail the difficulty that existing parsers have in dealing with English expressions occurring in foreign language text.
This is underlined by a series of experiments using both treebank-induced and hand-crafted grammar-based German parsers. It will be shown that optimisation of these parsers, in combination with the annotation-free classifier presented here, results in a significant improvement in performance. For these reasons, English inclusion detection is a valuable pre-processing step with applications in a number of fields, the most significant of which are parsing, text-to-speech synthesis, machine translation, and linguistics and lexicography.
Supervisor: Not available
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available