Title: Analysis, modelling and animation of emotional speech in 3D
Author: Nadtoka, Nataliya
ISNI: 0000 0004 2716 7003
Awarding Body: University of Surrey
Current Institution: University of Surrey
Date of Award: 2011
This thesis investigates the problem of producing perceptually realistic facial animation of expressions and speech. It spans several areas of work, from the capture and representation of facial dynamics through to the analysis and synthesis of expressive 3D animation sequences. For this purpose, a database of 3D facial scans was collected from 16 subjects, each performing 7 expressions: Ekman’s set of 6 cross-culturally recognised emotions and a neutral expression. Several representations of facial expressions are compared, including the morphable model and its extension to tensor space. A multilinear tensor-based morphable model is a powerful tool, as it permits independent control of identity and expression. However, its high computational cost and non-intuitive set of parameters motivated us to opt for a standard 3D morphable modelling approach. We propose a novel algorithm for mapping between motion capture data, projected to a spatially low-resolution (19 markers) 3D model space, and a spatially high-resolution (3300 vertices and colour texture) 3D morphable model space. This radial basis function based mapping preserves the temporal characteristics of the motion capture data and the level of detail of the high-resolution 3D scans. The single-subject model is extended to animate other subjects based on a single 3D scan or a photograph. An additional model is needed to represent the variation between individual expression styles.
The relation between audio and visual features is analysed based on a 4D dataset of expressive speech. The dataset consists of 3D scans of a single subject, recorded at 60 Hz, and synchronised audio recorded at 44.1 kHz. The speech corpus contains 235 phonetically balanced expressive English sentences, recorded in 6 emotions and neutral. Audio features consist of fundamental frequency (F0), duration, energy and Mel-frequency cepstral coefficients. The face was separated into overlapping facial regions, and the visual signal was then used to compute temporal visual features for each region. We concentrate on the upper face region because of its high expressive content and lesser contamination by articulation. Phoneme, word and sentence level audio-visual analysis is performed within each emotional category and across all emotional categories. Although initial results show a promising connection between the dynamics of audio and visual features for some emotions, significant intra-class variation exists for the others. Results demonstrate that the dynamics and intensity of expressive content within and across sentences are highly influenced by their linguistic content. This work shows that the effect of temporal variation of expressive content is statistically significant and should be taken into account in visual speech synthesis. Further investigation with a more controlled setup is necessary. This thesis provides the foundation for further research towards understanding the connection between expressive content and visual dynamics during speech, and towards achieving perceptually realistic animation of a talking head.
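The radial basis function based mapping mentioned in the abstract can be illustrated with a small sketch. The record does not give the thesis's actual formulation, so everything below is an assumption made for illustration: a Gaussian kernel, a handful of key frames for which both the low-resolution marker vector and the high-resolution model parameters are known, and the hypothetical names `fit_rbf_mapping` and `apply_rbf_mapping`. The toy vectors are two-dimensional; in the setting described above a marker frame would hold the coordinates of 19 markers.

```python
import math

def gaussian_rbf(r, sigma):
    # Assumed kernel; the thesis may use a different basis function.
    return math.exp(-(r / sigma) ** 2)

def distance(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]

def fit_rbf_mapping(marker_frames, model_params, sigma=1.0):
    """Find weights so that each key marker frame maps to its model parameters."""
    phi = [[gaussian_rbf(distance(a, b), sigma) for b in marker_frames]
           for a in marker_frames]
    return solve(phi, model_params)

def apply_rbf_mapping(frame, marker_frames, weights, sigma=1.0):
    """Map one low-resolution marker frame to high-resolution model parameters."""
    k = [gaussian_rbf(distance(frame, c), sigma) for c in marker_frames]
    n_out = len(weights[0])
    return [sum(k[i] * weights[i][j] for i in range(len(k)))
            for j in range(n_out)]

# Toy key frames (hypothetical data): marker vectors and matching model parameters.
marker_frames = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
model_params = [[0.0, 0.0], [1.0, 0.5], [-0.5, 1.0]]
weights = fit_rbf_mapping(marker_frames, model_params)

# A frame between key frames gets smoothly interpolated parameters.
in_between = apply_rbf_mapping([0.5, 0.0], marker_frames, weights)
```

Because the mapping in this sketch is applied independently to each captured frame, the timing of the motion capture sequence carries over unchanged to the high-resolution output, which is one plausible reading of how temporal characteristics are preserved.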
Supervisor: Not available
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available