Use this URL to cite or link to this record in EThOS: http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.680909
Title: Improving multilingual sentiment analysis using linguistic knowledge
Author: Di Bari, Marilena
ISNI: 0000 0004 5917 6668
Awarding Body: University of Leeds
Current Institution: University of Leeds
Date of Award: 2015
Abstract:
The need for the automatic analysis of opinions in written texts, which has been growing in recent years in several domains, has made Sentiment Analysis a very popular field (Liu 2012). In this area, systems have traditionally classified sentences as positive or negative solely according to the sentiment that words most frequently assume (e.g. “angry” negative, “beautiful” positive). Such strategies present two main limitations: 1. Multiple opinions often appear in the same sentence, each expressing an opposing sentiment on a different subject (e.g. a positive opinion is expressed on the plot of a film, but a negative one on the actors' performance). 2. The most frequent sentiment, as collected in sentiment dictionaries, does not take into account the fact that context often alters the orientation. Sentiment dictionaries have also been shown to have limited coverage (Di Bari, Sharoff et al. 2013, Di Bari 2015). As a consequence, I propose an automatic system based on deep linguistic knowledge, provided in particular by dependency parsing relations (Nivre 2005) and by attributes taken from the Appraisal framework (Martin and White 2005), a theory concerned with the language of evaluation, attitude and emotion within Systemic Functional Linguistics (Halliday 1978). As a basis for the creation of the automatic system, I tailored an annotation scheme called SentiML, inspired by previous work (Whitelaw, Garg et al. 2005, Bloom, Garg et al. 2007, Bloom and Argamon 2009), and carried out the annotation task in three languages (English, Italian and Russian) using MAE (Stubbs 2011). The resulting corpora consist of around 500 sentences and 9000 tokens for each language. The corpora contain both original texts and translations of different types: news, political speeches and TED talks (Cettolo, Girardi et al. 2012).
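The dictionary-based strategy criticised above can be illustrated with a minimal sketch. It is not the thesis's system: the function name `classify_by_prior` and the two-entry dictionary are hypothetical, and only the entries “angry” and “beautiful” come from the abstract.

```python
# Toy sentiment dictionary (hypothetical); only these two entries are taken
# from the abstract's examples.
PRIOR_ORIENTATION = {"angry": "negative", "beautiful": "positive"}


def classify_by_prior(sentence: str) -> str:
    """Label a sentence by summing out-of-context word orientations."""
    score = 0
    for token in sentence.lower().split():
        word = token.strip(".,;:!?\"'")  # drop surrounding punctuation
        orientation = PRIOR_ORIENTATION.get(word)
        if orientation == "positive":
            score += 1
        elif orientation == "negative":
            score -= 1
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Limitation 2: context (here, negation) is ignored, so the label is wrong.
print(classify_by_prior("The plot is not beautiful."))
# Limitation 1: two opposing opinions collapse into one sentence-level label.
print(classify_by_prior("A beautiful plot, but angry actors."))
```

Under these assumptions, the negated sentence is still labelled positive, and the sentence with two opposing opinions is reduced to a single neutral label, which is the behaviour the two limitations describe.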
The foundation of SentiML lies in the fact that an opinion can be captured in a pair usually consisting of two words with different functions: a target, the expression the sentiment refers to, and a modifier, the expression conveying the sentiment. The pair consisting of the target and the modifier together is called an appraisal group. Along with these main categories, the annotation includes their attributes, among which the most important are the appraisal type according to the Appraisal framework (‘affect’, ‘appreciation’, ‘judgement’) and the orientation (‘positive’ or ‘negative’, both out-of-context and contextual). A detailed manual analysis of the translation strategies (Baker 2002) and the appraisal types across the corpora, supported by insights from Corpus Linguistics, has been carried out. The most interesting expressions found during this analysis were subsequently analysed automatically, as a further evaluation of the system. Nonetheless, the main evaluation consists of a comparison with a rule-based system that makes use of existing tools such as a part-of-speech (POS) tagger and a sentiment dictionary. The main objective of this work is to demonstrate that the Appraisal framework and Sentiment Analysis can successfully support each other. The fact that this has been done not only for English, but in parallel for Italian and Russian (as one of the first applications of the Appraisal framework in these languages) and for different text types, makes the research unique. Moreover, because the methodology used to compare a variety of linguistic features (morphological, grammatical, lexical, syntactic) at work in sentiment analysis has been applied to three languages belonging to different families (Germanic, Romance and Slavonic), it is expected to be generalizable to other languages.
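The appraisal group described above can be pictured as a small data structure. The sketch below is an illustration only, not the SentiML schema itself: the class and field names are hypothetical, and attaching all three attributes to the group (rather than to the target or modifier individually) is an assumption, since the abstract does not say where each attribute is stored.

```python
from dataclasses import dataclass

# Attribute values named in the abstract.
APPRAISAL_TYPES = {"affect", "appreciation", "judgement"}
ORIENTATIONS = {"positive", "negative"}


@dataclass(frozen=True)
class AppraisalGroup:
    """A target/modifier pair with its main attributes (hypothetical layout)."""

    target: str                      # expression the sentiment refers to
    modifier: str                    # expression conveying the sentiment
    appraisal_type: str              # 'affect', 'appreciation' or 'judgement'
    out_of_context_orientation: str  # orientation of the words in isolation
    contextual_orientation: str      # orientation once context is considered

    def __post_init__(self) -> None:
        if self.appraisal_type not in APPRAISAL_TYPES:
            raise ValueError(f"unknown appraisal type: {self.appraisal_type!r}")
        for value in (self.out_of_context_orientation, self.contextual_orientation):
            if value not in ORIENTATIONS:
                raise ValueError(f"unknown orientation: {value!r}")


# Invented example pair, not taken from the annotated corpora.
group = AppraisalGroup(
    target="plot",
    modifier="beautiful",
    appraisal_type="appreciation",
    out_of_context_orientation="positive",
    contextual_orientation="positive",
)
```

Keeping the out-of-context and contextual orientations as separate fields mirrors the distinction drawn in the abstract: the two values can differ whenever context alters the orientation of an expression.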
As far as practical applications are concerned, the automatic system could be used in any field in which written opinions need to be analysed. In the meantime, the new individual resources, such as the annotated corpora and the MaltParser models for Italian and Russian, have been made publicly available.
Supervisor: Sharoff, Serge ; Thomas, Martin
Sponsor: University of Leeds
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID: uk.bl.ethos.680909
DOI: Not available