Title: Audio-visual expressed emotion classification
Author: Haq, Sana-ul
ISNI:       0000 0004 2709 2638
Awarding Body: University of Surrey
Current Institution: University of Surrey
Date of Award: 2011
Recent advances in human-computer interaction technology go beyond the successful transfer of data between human and machine, seeking to improve the naturalness and friendliness of user interactions. Expressed emotion plays an important role, allowing people to express themselves beyond the verbal domain. In the field of emotion recognition, most research has been based on unimodal approaches, and less progress has been made with multimodal approaches. This thesis aims to achieve better emotion classification by adopting an audio-visual approach. For this purpose, the Surrey Audio-Visual Expressed Emotion (SAVEE) database was recorded from four English male speakers. The database consists of 480 British English utterances in seven emotions (Ekman's six basic emotions plus neutral). The sentences were chosen from the TIMIT corpus and were phonetically balanced for each emotion. The data were processed and labelled. The quality of the recordings was evaluated in terms of expressed emotion by 20 subjects (10 male, 10 female). Average subjective classification accuracy of 67% was achieved with audio, 88% with visual, and 92% with audio-visual data for the seven emotions. The results indicated good agreement with the actors' intended emotions across the database. As a first step, speaker-dependent emotion classification was performed to develop a baseline method for audio-visual emotion classification and to investigate different ways of audio-visual fusion. The method consisted of feature extraction (audio and visual), feature selection by the Plus l-Take Away r algorithm, feature reduction by Principal Component Analysis (PCA) or Linear Discriminant Analysis (LDA), and classification by a Gaussian classifier. Audio-visual fusion at the decision level performed better than fusion at the feature level or after feature selection.
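The decision-level fusion described above can be sketched as combining per-class scores from the two unimodal classifiers. This is a minimal illustration, assuming each Gaussian classifier outputs per-class log-likelihoods; the fusion weights and toy scores below are illustrative, not taken from the thesis.

```python
# Minimal sketch of decision-level audio-visual fusion, assuming each
# unimodal Gaussian classifier outputs per-class log-likelihoods.
# Emotion labels and scores are illustrative, not from the thesis.

EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral"]

def fuse_decisions(audio_loglik, visual_loglik, w_audio=0.5, w_visual=0.5):
    """Weighted sum of per-class log-likelihoods (the product rule in the
    probability domain); returns the emotion with the highest fused score."""
    fused = {e: w_audio * audio_loglik[e] + w_visual * visual_loglik[e]
             for e in EMOTIONS}
    return max(fused, key=fused.get)

# Toy scores: both modalities favour "happiness".
audio = {e: -10.0 for e in EMOTIONS}
audio["happiness"] = -2.0
visual = {e: -10.0 for e in EMOTIONS}
visual["happiness"] = -1.5
print(fuse_decisions(audio, visual))  # prints "happiness"
```

Fusing at the decision level in this way lets each modality keep its own feature space and classifier, which is one reason it can outperform concatenating features before classification.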
Audio features comprised pitch, duration, energy and mel-frequency cepstral coefficients (MFCCs), and visual features were based on the 2D marker coordinates. Features were selected using three different criteria: Bhattacharyya distance, Mahalanobis distance and KL-divergence. The Mahalanobis distance performed better than the other two criteria. In general, LDA performed better than PCA, and average classification accuracies of 61% with audio, 99% with visual, and 99% with audio-visual data (decision-level fusion) were achieved for seven emotions on the SAVEE database. These results were achieved with the features selected by Mahalanobis distance. Speaker-independent experiments were performed on two databases: Berlin and SAVEE. The Berlin database has audio recordings in seven emotions and has been widely used for audio emotion analysis. Additional audio features were extracted, related to intensity, loudness, probability of voicing, line spectral frequencies and zero-crossing rate, along with visual features including marker angles and PCA features. The extracted features were speaker-normalised, and classification was performed with two methods: a Gaussian classifier and an SVM. For the Berlin database, the best performance for seven emotions was 86% with the Gaussian classifier and 87% with the SVM classifier. For the SAVEE database, the SVM classifier performed much better than the Gaussian classifier, and the polynomial kernel performed better than the RBF kernel. The features selected by Mahalanobis distance performed better than those selected by Bhattacharyya distance. For seven emotions, the average classification accuracy achieved with the SVM classifier was 67% for audio, 68% for visual, and 87% for audio-visual data (feature-level fusion). The subjective, speaker-dependent and speaker-independent experiments indicate that the SAVEE database contains good-quality recordings of expressed emotions.
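The Mahalanobis-distance selection criterion can be sketched for the two-class case as follows. This is a minimal illustration assuming a pooled, diagonal covariance (a simplification; the thesis does not specify the covariance form used); a larger distance indicates better class separability for the candidate feature set.

```python
# Minimal sketch of a Mahalanobis-distance class-separability criterion for
# two classes, assuming a pooled DIAGONAL covariance (an assumption made here
# for simplicity, not stated in the thesis).

def mahalanobis_separability(class_a, class_b):
    """class_a, class_b: lists of equal-length feature vectors (lists of
    floats). Returns the Mahalanobis distance between the class means under
    a pooled diagonal covariance."""
    dim = len(class_a[0])

    def mean(samples, j):
        return sum(x[j] for x in samples) / len(samples)

    def var(samples, j, mu):
        return sum((x[j] - mu) ** 2 for x in samples) / len(samples)

    d2 = 0.0
    for j in range(dim):
        mu_a, mu_b = mean(class_a, j), mean(class_b, j)
        # Pooled per-dimension variance; guard against zero variance.
        pooled = 0.5 * (var(class_a, j, mu_a) + var(class_b, j, mu_b)) or 1e-9
        d2 += (mu_a - mu_b) ** 2 / pooled
    return d2 ** 0.5
```

Inside a Plus l-Take Away r search, candidate feature subsets would be ranked by such a separability score, adding the l best features and then discarding the r least useful ones at each step.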
The results indicate that both the audio and visual modalities play an important role in conveying emotion, and that better emotion classification is achieved with the bimodal approach. Keywords: multimodal emotion analysis, data recording, facial expressions, feature selection, audio-visual fusion, SVM.
Supervisor: Not available Sponsor: Not available
Qualification Name: Thesis (Ph.D.) Qualification Level: Doctoral
EThOS ID:  DOI: Not available