Modelling asynchrony in the articulation of speech for automatic speech recognition
Current automatic speech recognition systems assume that all the articulators in
the vocal tract move in synchrony with one another to produce speech. This thesis describes
the development of a more realistic model that allows some asynchrony between the
articulators, with the aim of improving speech recognition accuracy.
Experiments on the TIMIT database demonstrate that higher phone recognition accuracy is
obtained by splitting the speech spectrum into high- and low-frequency bands and modelling
the voiced and voiceless components of speech separately.
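As a rough sketch of the band-splitting idea (not the thesis's actual front end), the code below frames a waveform and computes separate log energies in the low- and high-frequency bands; the 1 kHz split point and the frame and window parameters are illustrative assumptions.

```python
import numpy as np

def band_split_features(signal, sr=16000, frame_len=400, hop=160, split_hz=1000):
    """Frame the signal and compute log energies in low and high
    frequency bands (illustrative split point, not the thesis's)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
    low = freqs < split_hz  # band carrying most of the voiced energy
    feats = []
    for i in range(n_frames):
        frame = signal[i * hop:i * hop + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        # Keep the two band energies separate so each stream can be
        # modelled independently downstream
        feats.append([np.log(power[low].sum() + 1e-10),
                      np.log(power[~low].sum() + 1e-10)])
    return np.array(feats)

# Example: two-band features for one second of noise at 16 kHz
x = np.random.randn(16000)
print(band_split_features(x).shape)  # (98, 2)
```

Since voicing energy is concentrated at low frequencies, separating the streams in this way lets a recogniser treat the largely voiced low band and the largely voiceless high band independently.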
Modelling further articulator asynchrony in speech production requires a representation of
speech that is closer to the actual production process. Formant frequency parameters are
integrated into a typical Mel-frequency cepstral coefficient representation and their effect on
recognition accuracy is observed. Formant frequencies can be estimated accurately only
when the formants are visible in the spectrum, so a technique is developed to ignore
estimates generated when the formants are not visible. The formant data also allow a
unique method of vocal tract normalization, which improves recognition accuracy.
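As a concrete but necessarily speculative illustration of these steps, the sketch below appends formant estimates to MFCC frames while ignoring estimates flagged as unreliable, and computes a simple formant-ratio warping factor for vocal tract normalization. The per-frame visibility flag, the interpolation across discarded frames, and the use of an F3 average against a 2500 Hz reference are all assumptions for illustration, not the thesis's actual methods.

```python
import numpy as np

def append_formants(mfcc, formants, visible):
    """Append formant-frequency features to MFCC frames, ignoring
    estimates made when the formants were not visible.

    mfcc:     (T, D) array of cepstral features
    formants: (T, K) array of formant frequency estimates in Hz
    visible:  (T,) boolean array -- a hypothetical per-frame visibility
              flag; the thesis's actual visibility test is not known here
    """
    masked = formants.astype(float)
    masked[~visible] = np.nan  # discard estimates from invisible formants
    # One way to "ignore" bad frames: interpolate across them from
    # neighbouring visible frames (an assumption, not the thesis's method)
    for k in range(masked.shape[1]):
        col = masked[:, k]
        bad = np.isnan(col)
        if bad.any() and not bad.all():
            col[bad] = np.interp(np.flatnonzero(bad),
                                 np.flatnonzero(~bad), col[~bad])
    return np.hstack([mfcc, masked])

def vtln_warp_factor(speaker_f3_track, reference_f3=2500.0):
    """A simple formant-based vocal tract normalization factor: the ratio
    of a reference third-formant average to the speaker's own F3 average.
    Both the choice of F3 and the 2500 Hz reference are illustrative."""
    return reference_f3 / np.nanmean(speaker_f3_track)

# Example with random data: 100 frames, 13 MFCCs, 3 formants
T = 100
mfcc = np.random.randn(T, 13)
formants = 500.0 + np.abs(np.cumsum(np.random.randn(T, 3), axis=0))
visible = np.random.rand(T) > 0.2
print(append_formants(mfcc, formants, visible).shape)  # (100, 16)
print(vtln_warp_factor(formants[:, 2]))
```

Appending the masked formant tracks keeps the feature dimensionality fixed, which is what a standard recognition front end expects; frames where a formant is never visible across the whole utterance would remain undefined in this sketch and would need a fallback in a real system.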
Finally, a classification experiment examines the potential improvement in speech recognition
accuracy from modelling asynchrony between the articulators, by allowing asynchrony between
all the formants.