Title: Automatic speaker verification based on waveform perturbation analysis
Author: Sutherland, Andrew Mackinnon
Awarding Body: University of Edinburgh
Current Institution: University of Edinburgh
Date of Award: 1989
Availability of Full Text: Full text unavailable from EThOS.
This thesis describes the research carried out to assess the applicability of speech waveform perturbation analysis to the problem of automatic speaker verification. It also describes the development of the technique into an operational system. The techniques of waveform perturbation analysis have been studied in the past with a view to their use as potential indicators of vocal fold dysfunction. In essence, they quantify the natural cycle-to-cycle fluctuations in the periods of vocal fold vibration, known as pitch periods, and are thus related to the perceived hoarseness and roughness of the voice. A major aim of this thesis is to determine whether such features offer a viable dimension in speaker space along which discrimination can take place. The field of speaker verification is reviewed, and a number of previously exploited techniques are described. The likely mechanisms for the production of speech waveform perturbation, separated into 'jitter' of period durations and 'shimmer' of peak amplitude values, are examined, and the existing techniques of quantification are reviewed. In order to ensure accurate cycle-to-cycle measurement of period durations in real time, a new pitch determination algorithm, based on multi-feature investigation of the waveform peaks, has been developed. Its accuracy was assessed using both previously used pitch determination algorithms and a laryngograph device. It was found to offer high-accuracy, pitch-synchronous period estimates. The quantification of perturbation was carried out using the technique of median smoothing, which approximately removes the more gradual changes in pitch period due to intonation. Measures of residual irregularity are then combined with additional long-term intonational measures to form a 10-dimensional profile of the speaker. Using an all-male population of 72 speakers, a verification accuracy of 87% was achieved.
The system was trained on approximately 60 seconds of continuous speech from each speaker and tested on 10-second utterances. A number of approaches to allow accurate classification of talkers are discussed, and their relative merits are investigated experimentally. The effects of separating the training and testing data in time are studied. Also examined is the effectiveness of extending the speaker profile to include long-term spectrum combinations of features. The techniques of feature selection, used to limit the effects of finite training data, are also explored and extended. Results are presented for a study which employed professionally trained mimics in order to assess the effectiveness of the system under the most stringent conditions. The distribution of error rates within a population (i.e. the existence of particularly inconsistent speakers, or 'goats') is also studied, with a view to minimising their detrimental effects on system efficacy. Finally, this thesis describes the translation of the above system into an operational near real-time system, employing a digital signal processing microprocessor device.
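The median-smoothing step mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the window length of 5 periods, the truncated window at the sequence edges, and the particular residual measure (mean absolute residual divided by mean period) are all assumptions chosen for illustration, and the record does not say which measures make up the 10-dimensional profile. The idea shown is only the general one: a running median tracks the slow intonational trend in the pitch periods, and what remains after subtracting it is treated as cycle-to-cycle perturbation (jitter).

```python
from statistics import median


def median_smooth(values, window=5):
    """Running median over an odd-length window.

    Near the ends of the sequence the window is truncated to the
    samples that exist (an assumption; padding would also work).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    return [
        median(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]


def relative_jitter(periods, window=5):
    """Mean absolute residual after median smoothing, relative to the mean period.

    `periods` is a sequence of pitch period durations (for example in
    milliseconds). The smoothed track stands in for the intonation
    contour; the residual is the perturbation. This measure is a
    hypothetical example, not one taken from the thesis.
    """
    if not periods:
        raise ValueError("periods must not be empty")
    trend = median_smooth(periods, window)
    residuals = [p - t for p, t in zip(periods, trend)]
    mean_abs_residual = sum(abs(r) for r in residuals) / len(residuals)
    mean_period = sum(periods) / len(periods)
    return mean_abs_residual / mean_period
```

The same function could be applied to a sequence of per-cycle peak amplitudes to obtain a corresponding shimmer-style measure, again under the assumptions stated above.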
Supervisor: Not available
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available