Speaker independent isolated word recognition
The work presented in this thesis concerns the recognition of isolated words using a pattern matching approach. In such a system, an unknown speech utterance, which is to be identified, is transformed into a pattern of characteristic features. These features are then compared with a set of pre-stored reference patterns that were generated from the vocabulary words. The unknown word is identified as the vocabulary word whose reference pattern gives the best match. One of the major difficulties in the pattern comparison process is that speech patterns obtained from the same word exhibit non-linear temporal fluctuations and thus a high degree of redundancy. The initial part of this thesis considers various dynamic time warping techniques used for normalizing the temporal differences between speech patterns. Redundancy removal methods are also considered, and their effect on the recognition accuracy is assessed. Although the use of dynamic time warping algorithms provides considerable improvement in the accuracy of isolated word recognition schemes, the performance is ultimately limited by their poor ability to discriminate between acoustically similar words. Methods for enhancing the identification rate among acoustically similar words, by using common pattern features for similar sounding regions, are investigated. Speaker independent systems based on pattern matching can operate with a high recognition rate only by using multiple reference patterns for each of the words included in the vocabulary. These patterns are obtained from the utterances of a group of speakers. The use of multiple reference patterns leads not only to a large increase in the memory requirements of the recognizer, but also to an increase in the computational load.
A recognition system is proposed in this thesis that overcomes these difficulties by (i) employing vector quantization techniques to reduce the storage of reference patterns, and (ii) eliminating the need for dynamic time warping, which reduces the computational complexity of the system. Finally, a method is proposed for identifying the acoustic structure of an utterance in terms of voiced, unvoiced, and silence segments by using fuzzy set theory. The acoustic structure is then employed to enhance the recognition accuracy of a conventional isolated word recognizer.
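
As an illustration of the conventional pattern matching approach described above, the following minimal sketch compares an unknown feature pattern against multiple reference patterns per word using a basic dynamic time warping recursion. It is not one of the specific algorithms studied in the thesis: the function names, the Euclidean frame distance, the symmetric step pattern, and the normalization by the sum of the pattern lengths are all assumptions made for the example.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature vectors (assumed local distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(test, ref):
    """Length-normalized DTW distance between two sequences of feature vectors.

    Uses the basic symmetric step pattern (horizontal, vertical, diagonal)
    with no slope constraints or search window; this is an assumption.
    """
    n, m = len(test), len(ref)
    inf = float("inf")
    # cost[i][j] holds the best accumulated distance aligning
    # the first i test frames with the first j reference frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(test[i - 1], ref[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # test frame repeated on ref
                                 cost[i][j - 1],      # ref frame repeated on test
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m] / (n + m)


def recognize(test, references):
    """Return the vocabulary word whose reference pattern gives the best match.

    `references` maps each word to a list of reference patterns, for example
    one pattern per training speaker (hypothetical data layout).
    """
    best_word, best_score = None, float("inf")
    for word, patterns in references.items():
        for pattern in patterns:
            score = dtw_distance(test, pattern)
            if score < best_score:
                best_word, best_score = word, score
    return best_word


# Toy one-dimensional "features", purely illustrative.
references = {
    "yes": [[[0.0], [1.0], [2.0]]],
    "no": [[[5.0], [5.0], [4.0]]],
}
# A time-stretched version of the "yes" pattern.
unknown = [[0.0], [0.0], [1.0], [2.0], [2.0]]
result = recognize(unknown, references)
```

In this sketch every reference pattern must be stored and warped against the unknown utterance, so both memory and computation grow with the number of reference patterns per word, which is the cost that motivates the proposed system.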