Title: Designing a general framework for text alignment : case studies with two South Asian languages
Author: Aswani, Niraj
ISNI: 0000 0004 2724 1576
Awarding Body: University of Sheffield
Current Institution: University of Sheffield
Date of Award: 2012
Building machine translation systems for many South Asian languages, such as Hindi and Gujarati, using statistical methods is problematic. The primary reason is that there is insufficient parallel data from which to learn accurate word alignments. In addition, these languages are morphologically rich and have free word order. When insufficient data makes it difficult to rely purely on statistical methods, research shows that better performance can be obtained by building hybrid systems that combine statistical methods with language-specific resources, such as morphological analysers or dictionaries. However, such language-specific resources are hard to find for many South Asian languages. Because Hindi, Gujarati, Urdu, Bengali, Punjabi and Marathi are all very similar in structure, with the main differences lying in script and vocabulary, we hypothesise that it is possible to develop resources for one of these languages and generalise the approach to allow rapid bootstrapping of similar resources for the other closely related languages, with minimal effort and similar accuracy. To verify this, we develop several resources for Hindi, including a sentence alignment algorithm, a morphological analyser and a transliteration similarity component, and generalise the approach to bootstrap similar resources for Gujarati. We show that the approach works for both Hindi and Gujarati and achieves results comparable to similar state-of-the-art (SOA) resources available for these languages.
We also hypothesise that it is possible to develop a high-performance hybrid word alignment algorithm that relies on such language-specific resources. To verify this, we design, implement and evaluate a novel English-Hindi hybrid word alignment system that uses the Hindi-specific resources we developed. We show not only that our word alignment system outperforms other SOA English-Hindi word alignment systems, but also how simple it is to adapt it to the English-Gujarati language pair.
Supervisor: Gaizauskas, Robert
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available