Acoustic level speech recognition
A number of techniques have been developed over the last forty years which attempt
to solve the problem of recognizing human speech by machine. Although the general
problem of unconstrained, speaker-independent connected speech recognition is still
not solved, some of the methods have demonstrated varying degrees of success on a
number of constrained speech recognition tasks.
Human speech communication is considered to take place on a number of levels
from the acoustic signal through to higher linguistic and semantic levels. At the acoustic
level, the recognition process can be divided into time-alignment (the removal of
global and local timing differences between the unknown input speech and the stored
reference templates) and reference template matching. Little attention seems to have been
given to the effective use of acoustic level contextual information to improve the performance
of these tasks.
In this thesis, a new template matching scheme is developed which addresses
this issue and successfully allows the utilization of acoustic level context. The method,
based on Bayesian decision theory, is a dynamic time warping approach which incorporates
statistical dependencies in matching errors between frames along the entire length
of the reference template. In addition, the method includes a speaker compensation
technique operating simultaneously.
Implementation is carried out using the highly efficient branch and bound algorithm.
Speech model storage requirements are quite small as a result of an elegant
feature of the recursive matching criterion. Furthermore, a novel method for inferring
the special speech models is introduced.
The new method is tested on data drawn from nearly 8000 utterances of the 26
letters of the British English alphabet spoken by 104 speakers, split almost equally
between male and female speakers. Experiments show that the new approach is a
powerful acoustic level speech recognizer achieving up to 34% better recognition performance
when compared with a conventional method based on the dynamic programming