Title: Domain and genre dependency in Statistical Machine Translation
Author: Brunello, Marco
ISNI: 0000 0004 5354 743X
Awarding Body: University of Leeds
Current Institution: University of Leeds
Date of Award: 2014
Statistical Machine Translation (SMT) is currently the most promising and widely studied paradigm in the broader field of Machine Translation. It is continuously explored in order to improve its performance and to address its current shortcomings, in particular the scarcity of large bilingual corpora, across a variety of domains and genres, that can be used as training data. One of the main trends is still to rely as much as possible on large collections of data that are already available, even when their content is not closely related to the translation task at hand. By contrast, the possibility of using smaller but appropriately selected training sets - chosen case by case according to the textual variety of the documents that need to be translated - has not been extensively explored so far. The goal of this research is to investigate whether the lack of large quantities of assorted data can be addressed by applying strategies commonly used in genre and domain classification (including unsupervised topic modeling and document dissimilarity techniques), in particular by performing subsampling experiments on bilingual corpora in order to obtain a good fit between the training data and the texts that need to be translated with SMT. For the purposes of this study, existing freely available large corpora were found to be unsuitable for the selection of domain- or document-specific subsamples, so two new parallel corpora - English-Italian and English-German - were compiled by applying the "web as corpus" approach to websites containing translated content. Tests were then carried out on documents belonging to different varieties, which were translated with SMT systems built from subsamples of the training data, selected with document dissimilarity measures so as to pick the most suitable training documents.
This method has shown that the choice of subsampling strategy depends heavily on the text variety of each document considered. It has also shown that better translation results can be obtained from small samples of the training data than from all the available data, with the added benefits of quicker training and lower use of computational resources.
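The subsampling idea described in the abstract can be illustrated with a minimal sketch. It is not the method used in the thesis: the abstract does not name the dissimilarity measures, so cosine dissimilarity over whitespace-tokenised term frequencies is assumed here, and the names term_frequencies, cosine_dissimilarity and select_subsample are hypothetical.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Map each lowercased whitespace token to its count (assumed tokenisation)."""
    return Counter(text.lower().split())


def cosine_dissimilarity(a, b):
    """Return 1 - cosine similarity between two term-frequency Counters."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty document shares nothing with any other document.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def select_subsample(test_document, training_pairs, k):
    """Return the k (source, target) pairs whose source side is closest
    to test_document under the assumed dissimilarity measure."""
    test_tf = term_frequencies(test_document)
    ranked = sorted(
        training_pairs,
        key=lambda pair: cosine_dissimilarity(test_tf, term_frequencies(pair[0])),
    )
    return ranked[:k]
```

The selected pairs would then serve as the training set for an SMT system built for that one document. Ranking on the source side only is a simplification made for this sketch.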
Supervisor: Sharoff, Serge ; Babych, Bogdan ; Thomas, Martin
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available