Use this URL to cite or link to this record in EThOS:
Title: Audio-visual speaker separation
Author: Khan, Faheem
ISNI:       0000 0004 5923 2164
Awarding Body: University of East Anglia
Current Institution: University of East Anglia
Date of Award: 2016
Availability of Full Text:
Access from EThOS:
Access from Institution:
Communication using speech is often an audio-visual experience. Listeners hear what is being uttered by speakers and also see the corresponding facial movements and other gestures. This thesis is an attempt to exploit this bimodal (audio-visual) nature of speech for speaker separation. In addition to the audio speech features, visual speech features are used to achieve the task of speaker separation. An analysis of the correlation between audio and visual speech features is carried out first. This correlation between audio and visual features is then used in the estimation of clean audio features from visual features using Gaussian MixtureModels (GMMs) andMaximum a Posteriori (MAP) estimation. For speaker separation three methods are proposed that use the estimated clean audio features. Firstly, the estimated clean audio features are used to construct aWiener filter to separate the mixed speech at various signal-to-noise ratios (SNRs) into target and competing speakers. TheWiener filter gains are modified in several ways in search for improvements in quality and intelligibility of the extracted speech. Secondly, the estimated clean audio features are used in developing visually-derived binary masking method for speaker separation. The estimated audio features are used to compute time-frequency binary masks that identify the regions where the target speaker dominates. These regions are retained and formthe estimate of the target speaker’s speech. Experimental results compare the visually-derived binary masks with ideal binary masks which shows a useful level of accuracy. The effectiveness of the visually-derived binary mask for speaker separation is then evaluated through estimates of speech quality and speech intelligibility and shows substantial gains over the original mixture. Thirdly, the estimated clean audio features and the visually-derivedWiener filtering are used to modify the operation of an effective audio-only method of speaker separation, namely the soft mask method, to allow visual speech information to improve the separation task. Experimental results are presented that compare the proposed audio-visual speaker separation with the audio-only method using both speech quality and intelligibility metrics. Finally, a detailed comparison is made of the proposed and existing methods of speaker separation using objective and subjective measures.
Supervisor: Not available Sponsor: Not available
Qualification Name: Thesis (Ph.D.) Qualification Level: Doctoral
EThOS ID:  DOI: Not available