Speech recognition by computer : algorithms and architectures
This work is concerned with the investigation of algorithms and architectures for computer recognition of human speech. Three speech recognition algorithms have been implemented, using (a) Walsh Analysis, (b) Fourier Analysis and (c) Linear Predictive Coding. The Fourier Analysis algorithm made use of the Prime-number Fourier Transform technique. The Linear Predictive Coding algorithm made use of LeRoux and Gueguen's method for calculating the coefficients. The system was organised so that the speech samples could be input to a PC/XT microcomputer in a typical office environment. The PC/XT was linked via Ethernet to a Sun 2/180s computer system which allowed the data to be stored on a Winchester disk so that the data used for testing each algorithm was identical. The recognition algorithms were implemented entirely in Pascal, to allow evaluation to take place on several different machines. The effectiveness of the algorithms was tested with a group of five naive speakers, results being in the form of recognition scores. The results showed the superiority of the Linear Predictive Coding algorithm, which achieved a mean recognition score of 93.3%. The software was implemented on three different computer systems. These were an 8-bit microprocessor, a sixteen-bit microcomputer based on the IBM PC/XT, and a Motorola 68020 based Sun Workstation. The effectiveness of the implementations was measured in terms of speed of execution of the recognition software. By limiting the vocabulary to ten words, it has been shown that it would be possible to achieve recognition of isolated utterances in real time using a single 68020 microprocessor. The definition of real time in this context is understood to mean that the recognition task will on average, be completed within the duration of the utterance, for all the utterances in the recogniser's vocabulary. A speech recogniser architecture is proposed which would achieve real time speech recognition without any limitation being placed upon (a) the order of the transform, and (b) the size of the recogniser's vocabulary. This is achieved by utilising a pipeline of four processors, with the pattern matching process performed in parallel on groups of words in the vocabulary.