Computational auditory scene analysis : a representational approach
This thesis addresses the problem of how a listener groups together acoustic components which have arisen from the same environmental event, a phenomenon known as auditory scene analysis. A computational model of auditory scene analysis is presented, which is able to separate speech from a variety of interfering noises. The model consists of four processing stages. Firstly, the auditory periphery is simulated by a bank of bandpass filters and a model of inner hair cell function. In the second stage, physiologically-inspired models of higher auditory organization - auditory maps - are used to provide a rich representational basis for scene analysis. Periodicities in the acoustic input are coded by an autocorrelation map and a cross-correlation map. Information about spectral continuity is extracted by a frequency transition map. The times at which acoustic components start and stop are identified by an onset map and an offset map. In the third stage of processing, information from the periodicity and frequency transition maps is used to characterize the auditory scene as a collection of symbolic auditory objects. Finally, a search strategy identifies objects that have similar properties and groups them together. Specifically, objects are likely to form a group if they have a similar periodicity, onset time or offset time. The model has been evaluated in two ways, using the task of segregating voiced speech from a number of interfering sounds such as random noise, "cocktail party" noise and other speech. Firstly, a waveform can be resynthesized for each group in the auditory scene, so that segregation performance can be assessed by informal listening tests. The resynthesized speech is highly intelligible and fairly natural. Secondly, the linear nature of the resynthesis process allows the signal-to-noise ratio (SNR) to be compared before and after segregation. An improvement in SNR is obtained after segregation for each type of interfering noise.
Additionally, the performance of the model is significantly better than that of a conventional frame-based autocorrelation segregation strategy.
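
The SNR comparison described above depends on the resynthesis being linear: if the same resynthesis is applied to the clean speech and to the interfering noise separately, the two outputs sum to the resynthesized mixture, so the SNR after segregation can be measured directly. The following is a minimal sketch of that idea, not the thesis's implementation. The per-sample gain mask, the toy waveforms, and the names `snr_db`, `resynthesize` and `gains` are all assumptions made for illustration; the actual model resynthesizes from groups of auditory objects.

```python
import math


def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels, from separate signal and noise waveforms."""
    signal_energy = sum(x * x for x in signal)
    noise_energy = sum(x * x for x in noise)
    return 10.0 * math.log10(signal_energy / noise_energy)


def resynthesize(waveform, gains):
    """Hypothetical linear resynthesis: scale each sample by a fixed gain.

    Stands in for whatever linear operation the segregated group defines.
    """
    return [g * x for g, x in zip(gains, waveform)]


# Toy waveforms (assumed data): speech is active only in the first half.
speech = [1.0, -1.0, 1.0, -1.0, 0.0, 0.0, 0.0, 0.0]
noise = [0.5, 0.5, -0.5, -0.5, 0.5, 0.5, -0.5, -0.5]
mixture = [s + n for s, n in zip(speech, noise)]

# Assumed mask: keep the region assigned to the speech group, discard the rest.
gains = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]

# SNR of the original mixture.
snr_before = snr_db(speech, noise)

# Because resynthesis is linear, speech and noise can be passed through it
# separately; their outputs sum to the resynthesized mixture.
speech_out = resynthesize(speech, gains)
noise_out = resynthesize(noise, gains)
mixture_out = resynthesize(mixture, gains)

snr_after = snr_db(speech_out, noise_out)
improvement_db = snr_after - snr_before
```

With these toy values the mask removes the noise-only region, so `snr_after` exceeds `snr_before`. The size of the improvement here reflects only the made-up data and says nothing about the results reported in the thesis.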