Speech features and their significance in speaker recognition
This thesis addresses the significance of speech features within the task of speaker
recognition. Motivated by the perception of simple attributes such as `loud', `smooth' and `fast',
more than 70 new speech features are developed. A set of basic speech features, including pitch,
loudness and speech speed, is combined with these new features into a single feature set
per utterance. A neural network classifier is used to evaluate the significance of these
features by creating a speaker recognition system and analysing the behaviour of
successfully trained single-speaker networks. An in-depth analysis of network weights
allows a rating of significance and feature contribution. A subjective listening experiment
validates and confirms the results of the neural network analysis.
The work starts with an extended sentence analysis; ten sentences are uttered by 630 speakers.
The extraction of 100 speech features is outlined and a 100-element feature vector for
each utterance is derived. Some features themselves and the methods of analysing them have
been used elsewhere, for example pitch, sound pressure level, spectral envelope, loudness,
speech speed and glottal-to-noise excitation. However, more than 70 of the 100 features are
derivatives of these basic features and have not been described or used before in
speaker recognition research, especially not within a rating of feature significance. These derivatives
include histograms, 3rd and 4th moments, function approximations, as well as other
statistical analyses applied to the basic features.
The first approach to assessing the significance of features and their possible use in a
recognition system is based on a probability analysis. The analysis is established on the
assumption that, within a speaker's ten utterances, single feature values have a small
deviation and cluster around the mean value of that speaker. The presented features indeed
cluster into groups and show significant differences between speakers, thus enabling a clear
separation of voices when applied to a small database of fewer than 20 speakers. The recognition and
assessment of individual feature contribution becomes impossible when the database is
extended to 200 speakers. To ensure continuous validation of feature contribution it is
necessary to consider a different type of classifier.
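The clustering assumption described above can be sketched as a nearest-mean classifier: each speaker is represented by the mean of their utterance feature vectors, and a new utterance is assigned to the closest mean. This is a simplification for illustration; the thesis's actual probability analysis may differ in its distance measure and decision rule.

```python
import numpy as np

def train_speaker_means(features_by_speaker):
    # features_by_speaker: {speaker_id: array of shape (n_utterances, n_features)}
    # Each speaker's feature vectors are assumed to cluster around their mean.
    return {spk: feats.mean(axis=0) for spk, feats in features_by_speaker.items()}

def recognise(utterance_vector, speaker_means):
    # Assign the utterance to the speaker whose mean feature vector is closest
    # in Euclidean distance.
    return min(speaker_means,
               key=lambda spk: np.linalg.norm(utterance_vector - speaker_means[spk]))

# Toy example: two speakers, three features each
data = {
    "A": np.array([[1.0, 0.0, 0.5], [1.1, 0.1, 0.4]]),
    "B": np.array([[5.0, 2.0, 3.0], [4.9, 2.1, 3.1]]),
}
means = train_speaker_means(data)
print(recognise(np.array([1.05, 0.05, 0.45]), means))  # -> A
```

With well-separated clusters and few speakers this rule works, which matches the reported result for databases of fewer than 20 speakers; as speakers are added, cluster overlap makes the decision unreliable.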
These limitations are overcome with the introduction of neural network classifiers. A
separate network is assigned to each speaker, resulting in the creation of 630 networks. All
networks are of standard feed-forward backpropagation type and have a 100-input, 20-
hidden-nodes, one-output architecture. The 6300 available feature vectors are split into a
training, validation and test set in a 5:3:2 ratio. The networks are initially trained with
the same 100-feature input database. Successful training was achieved within 30 to 100
epochs per network. The speaker related to the network with the highest output is declared as
the speaker represented by the input. The achieved recognition rate for 630 speakers is
approximately 49%. A subsequent exclusion of features with minor significance raises the recognition
rate to 57%.
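The one-network-per-speaker architecture and the highest-output decision rule can be sketched as follows. The 100-20-1 feed-forward shape is taken from the text; the random weights, activation functions and class name are stand-ins for illustration, since the thesis trains each network with backpropagation.

```python
import numpy as np

class SpeakerNet:
    # Minimal feed-forward net with the 100-input, 20-hidden, 1-output
    # architecture described above. Weights here are random placeholders.
    def __init__(self, n_in=100, n_hidden=20, rng=None):
        rng = rng or np.random.default_rng(0)
        self.W1 = rng.normal(scale=0.1, size=(n_hidden, n_in))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(scale=0.1, size=n_hidden)
        self.b2 = 0.0

    def forward(self, x):
        h = np.tanh(self.W1 @ x + self.b1)                  # hidden layer
        return 1 / (1 + np.exp(-(self.W2 @ h + self.b2)))   # sigmoid output

def recognise(feature_vector, nets_by_speaker):
    # The speaker whose network produces the highest output is declared
    # the speaker represented by the input.
    return max(nets_by_speaker,
               key=lambda spk: nets_by_speaker[spk].forward(feature_vector))
```

In the thesis's setting there would be 630 such networks, each trained to output high values only for its own speaker's feature vectors.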
The analysis of the network weight behaviour reveals two major points. First, a definite ranking
order of significance exists between the 100 features. Many of the newly
introduced derivatives of pitch, brightness, spectral voice patterns and speech speed
contribute strongly to recognition, whereas feature groups related to glottal-to-noise excitation
ratio and sound pressure level play a less important role. The significance of
features is rated by the training, testing and validation behaviour of the networks under data
sets with reduced information content, the post-trained weight distribution and the standard
deviation of weight distribution within networks. The findings match with results of a
subjective listening experiment.
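One simple weight-based relevance measure consistent with the description above is to rate each input feature by the mean absolute input weight attached to it, averaged over hidden nodes and over all trained networks, with the across-network standard deviation as a spread measure. This is an illustrative sketch; the thesis's exact rating additionally uses training behaviour under reduced data sets.

```python
import numpy as np

def feature_significance(input_weight_matrices):
    # input_weight_matrices: list of (n_hidden, n_features) arrays,
    # one post-training input-weight matrix per speaker network.
    per_net = np.stack([np.abs(W).mean(axis=0) for W in input_weight_matrices])
    mean_relevance = per_net.mean(axis=0)       # average contribution per feature
    spread = per_net.std(axis=0)                # speaker-to-speaker variation
    ranking = np.argsort(mean_relevance)[::-1]  # most significant feature first
    return ranking, mean_relevance, spread
```

Features whose weights are consistently large across networks rank high; features whose weights shrink toward zero during training rank low and are candidates for exclusion.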
As a second major result, the analysis shows that there are large differences between speakers
in the significance of features, i.e. not all speakers use the same feature set to the same
extent. The speaker-related networks exhibit key features by which they are uniquely
identifiable, and these key features vary from speaker to speaker. Some features like pitch are
used by all networks; other features like sound pressure level and glottal-to-noise excitation
ratio are used by only a few distinct classifiers. Again, the findings correspond with results
of a subjective listening experiment.
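The per-speaker key-feature finding can be sketched by taking, for each network, the features with the largest mean absolute input weights, then separating features that are key for all speakers from those that are key for only a few. The top-k threshold is an assumption made here for illustration.

```python
import numpy as np

def key_features(W, top_k=3):
    # W: (n_hidden, n_features) input-weight matrix of one speaker's network.
    # A feature counts as "key" for this speaker if its mean absolute
    # weight is among the top_k largest (threshold choice is an assumption).
    relevance = np.abs(W).mean(axis=0)
    return set(np.argsort(relevance)[::-1][:top_k])

def shared_and_rare(weight_matrices, top_k=3):
    # Split key features into those common to every network (e.g. pitch-like
    # features) and those used by only some networks.
    sets = [key_features(W, top_k) for W in weight_matrices]
    shared = set.intersection(*sets)
    rare = set.union(*sets) - shared
    return shared, rare
```

Under this sketch, a feature such as pitch would appear in every network's key set, whereas sound pressure level or glottal-to-noise excitation ratio would surface only in the key sets of a few classifiers.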
This thesis presents more than 70 new features which have never been used before in speaker
recognition. A quantitative ranking order of 100 speech features is introduced. Such a
ranking order has not been documented elsewhere and is comparatively new to the area of
speaker recognition. This ranking order is further extended to describe the extent to
which a classifier uses or omits single features, depending solely on the characteristics of the
voice sample. Such a separation has not yet been documented and is a novel contribution.
The close correspondence between the subjective listening experiment and the findings of the network
classifiers shows that it is plausible to model the behaviour of human speaker recognition
with an artificial neural network. Again, such a validation is original in the area of speaker
recognition.