Corpus-based machine translation evaluation via automated error detection in output texts
Since the emergence of the first fully automatic machine translation (MT) systems over fifty years ago, the use of MT has increased dramatically. Consequently, the evaluation of MT systems is crucial for all stakeholders. However, the human evaluation of MT output is expensive and time-consuming, often relying on subjective quality judgements and requiring human 'reference translations' against which the output is compared. As a result, interest has turned in recent years towards automated evaluation methods, which aim to produce scores that reflect human quality judgements. As the majority of published automated evaluation methods still require human 'reference translations' for comparison, the goal of this research is to investigate the potential of a method that requires access only to the translation. Based on detailed corpus analyses, the primary aim is to devise methods for the automated detection of particular error types in French-English MT output from competing systems and to explore correlations between automated error counts and human judgements of a translation as a whole. First, a French-English corpus designed specifically for MT evaluation was compiled. A sample of MT output from the corpus was then evaluated by humans to provide judgements against which automated scores would ultimately be compared. A data-driven fluency error classification scheme was subsequently developed to enable the consistent manual annotation of errors found in the English MT output, without access to the original French text. These annotations were then used to guide the selection of error categories for automated error detection, and to facilitate the analysis of particular error types in context so that appropriate methods could be devised. Manual annotations were further used to evaluate the accuracy of each automated approach. Finally, error detection algorithms were tested on English MT output from German, Italian and Spanish to determine the extent to which methods would need to be adapted for use with other language pairs.
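
The comparison of automated error counts with human judgements described above can be illustrated with a rank correlation. The following is only a minimal sketch under stated assumptions: the abstract does not name the correlation coefficient used, so Spearman's rho is assumed here, and the function names and the toy values are hypothetical and are not data from the study.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of items tied with the item at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical toy values, invented for illustration only:
# one automated error count and one human fluency score per translation.
error_counts = [2, 7, 4, 9, 1]
human_scores = [4.5, 2.0, 3.5, 1.5, 5.0]

rho = spearman(error_counts, human_scores)
print(f"Spearman rho = {rho:.3f}")
```

In this invented example the translations with more detected errors always receive lower human scores, so rho comes out at, or within rounding of, -1. A strongly negative coefficient of this kind is the pattern that would indicate that error counts track human judgements of overall quality.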