Title: Modelling of glimpses for speech recognition in noisy environments
Author: Laidler, Jonathan
ISNI: 0000 0004 2742 5084
Awarding Body: University of Sheffield
Current Institution: University of Sheffield
Date of Award: 2012
Noisy environments pose significant problems for automatic speech recognition (ASR) systems. A common scenario is the cocktail party problem, where there are competing speakers. Human listeners perform well in these situations even though the target and the noise share similar characteristics, whereas traditional ASR systems struggle to deal with non-stationary noise. The glimpsing theory of speech perception states that human listeners are able to focus their attention on spectro-temporal glimpses where the target speech is not masked by noise. Glimpses of clean speech are highly available when the noise source is another speaker, due to the sparse nature of spectro-temporal representations of speech. ASR systems which aim to model the behaviour of human listeners should also take advantage of glimpses. Existing studies have detected glimpses based on features such as pitch, which is known to be valuable for separating competing talkers. This thesis takes the opposite approach, using no prior knowledge of speech features but rather learning the features of glimpses from samples of clean speech. This is considered to be a model-driven approach, in contrast to previous source-driven approaches. The thesis draws inspiration from computational vision, where the analogous problem is that of partial object recognition. The proposed glimpse detection system identifies spectro-temporal interest points, which are small patches of speech, then forms glimpses from connected regions of interest points. In addition to a detailed description of the novel ASR framework, the thesis presents three new investigations. The first determines what size of spectro-temporal speech patch can be recognised by human listeners. The second investigates what kind of encoding should be applied to patches in order to best capture the features of clean speech. The third takes grouping algorithms that are popular in vision research and compares their success in creating glimpses from speech interest points. Finally, the full end-to-end ASR system is evaluated on a speech separation task.
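The idea of forming glimpses from connected regions of interest points can be illustrated with a minimal sketch. It assumes the interest points have already been detected and are given as a binary frequency-by-time grid, and it groups them with plain 4-connected component labelling. This is only one simple grouping rule chosen for illustration, not necessarily any of the algorithms compared in the thesis; the function name `group_glimpses`, the `min_size` parameter, and the toy grid are all hypothetical.

```python
from collections import deque


def group_glimpses(mask, min_size=1):
    """Group True cells of a 2-D grid (frequency x time) into 4-connected regions.

    Each returned region is a sorted list of (frequency, time) index pairs.
    Regions with fewer than `min_size` cells are discarded.
    """
    n_freq = len(mask)
    n_time = len(mask[0]) if mask else 0
    seen = [[False] * n_time for _ in range(n_freq)]
    glimpses = []

    for f in range(n_freq):
        for t in range(n_time):
            if not mask[f][t] or seen[f][t]:
                continue
            # Breadth-first search over neighbouring interest points.
            region = []
            queue = deque([(f, t)])
            seen[f][t] = True
            while queue:
                cf, ct = queue.popleft()
                region.append((cf, ct))
                for nf, nt in ((cf - 1, ct), (cf + 1, ct), (cf, ct - 1), (cf, ct + 1)):
                    inside = 0 <= nf < n_freq and 0 <= nt < n_time
                    if inside and mask[nf][nt] and not seen[nf][nt]:
                        seen[nf][nt] = True
                        queue.append((nf, nt))
            if len(region) >= min_size:
                glimpses.append(sorted(region))
    return glimpses


# Toy grid (assumed data): rows are frequency channels, columns are time frames,
# and 1 marks a cell flagged as an interest point.
interest_points = [
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],
    [1, 0, 0, 0, 0],
]
mask = [[bool(v) for v in row] for row in interest_points]

glimpses = group_glimpses(mask, min_size=2)
for region in glimpses:
    print(len(region), region)
```

With `min_size=2` the isolated point in the bottom-left corner is dropped, leaving two candidate glimpses of three cells each. A minimum region size is one simple way to suppress spurious detections; a real system would choose the connectivity and size threshold empirically.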
Supervisor: Not available
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available