Extracting room acoustic parameters from received speech signals using artificial neural networks
Over the course of a century, quantitative room acoustics has accumulated a knowledge base
centred around objective acoustic parameters. Realistic and accurate measurements
are essential in room acoustics. Occupied measurements are difficult to undertake
with current technology, yet it is well established that occupancy changes acoustics.
For this reason, new measurement techniques are sought. This thesis concerns a new,
machine learning based approach for measuring room acoustic parameters, which is
particularly useful for occupied in-situ measurements.
A set of artificial neural networks, associated pre-processors and machine learning
regimes are developed to extract Reverberation Time (RT), Early Decay Time (EDT)
and Speech Transmission Index (STI) from received speech signals. Utilising
naturalistic sounds - speech - as excitations, the developed methods circumvent the
use of unpleasant noisy test signals and therefore measurements can be made in
occupied spaces in a non-invasive fashion. Given the non-invasive nature and
achievable accuracy, the new methods can facilitate occupied measurements,
providing an alternative to traditional methods to better quantify acoustics of spaces
where speech communication is important.
Much of the development work of the neural network methods focuses on the pre-processors
which produce data-reduced and pre-conditioned signals for the neural
networks. Two different speech scenarios, separate utterances and continuous running
speech, are considered, leading to the development of four major neural network methods:
1. Time domain method to extract RT/EDT from separate utterances.
2. Straightforward FFT method to extract STI from short-time speech.
3. Frequency domain method to extract STI from long-time running speech.
4. Frequency domain method to extract RT/EDT from long-time running speech.
These methods are all based on supervised learning. Unsupervised models,
representing another important class of neural networks, are also investigated in the
context of this study and are found useful as pre-processors.
The model development and validations are carried out through computer simulations.
Results show that resolutions better than 0.1 s in reverberation time and 0.02 in STI
extraction are achievable with a "one-net-one-speech" machine learning regime:
a neural network trains on a particular anechoic speech to extract a designated
objective parameter under the excitation of that speech. Neural network systems
extracting acoustic parameters from received arbitrary speech signals without using
prior knowledge of the speech stimuli, termed source independent measurements, are
explored. Although the achieved accuracy does not yet match that of the standard
methods or of the neural network methods developed on the one-net-one-speech basis,
source independent extraction is potentially more useful in practical systems.
Improving the accuracy of the source independent measurements and extending the
developed methods to music signals appear to be the most significant directions for
further work arising from this study.
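
For orientation, the RT and EDT values that the networks are trained to extract are conventionally obtained from a room impulse response by Schroeder backward integration followed by a line fit to the decay curve. The following is a minimal sketch of that conventional calculation, not the simulation used in this thesis. It assumes a synthetic impulse response (exponentially decaying Gaussian noise), and the function names, sampling rate and fit ranges (-5 to -35 dB for RT, 0 to -10 dB for EDT) are illustrative choices.

```python
import numpy as np


def schroeder_decay_db(ir):
    """Backward-integrated energy decay curve in dB, 0 dB at t = 0."""
    energy = np.cumsum(ir[::-1] ** 2)[::-1]
    return 10.0 * np.log10(energy / energy[0])


def decay_time(edc_db, fs, start_db, stop_db):
    """Fit a line to the decay curve between start_db and stop_db
    and extrapolate the fitted slope to a 60 dB decay (seconds)."""
    idx = np.where((edc_db <= start_db) & (edc_db >= stop_db))[0]
    slope, _ = np.polyfit(idx / fs, edc_db[idx], 1)  # dB per second, negative
    return -60.0 / slope


# Hypothetical test case: noise whose energy envelope falls 60 dB in true_rt seconds.
fs = 16000          # sampling rate in Hz (assumed)
true_rt = 0.8       # reverberation time of the synthetic response in seconds
rng = np.random.default_rng(0)
t = np.arange(int(2.0 * fs)) / fs
ir = rng.standard_normal(t.size) * np.exp(-3.0 * np.log(10.0) * t / true_rt)

edc = schroeder_decay_db(ir)
rt = decay_time(edc, fs, -5.0, -35.0)   # reverberation time from the -5 to -35 dB range
edt = decay_time(edc, fs, 0.0, -10.0)   # early decay time from the first 10 dB
print(f"RT = {rt:.2f} s, EDT = {edt:.2f} s")
```

Because the synthetic decay is purely exponential, both estimates should lie close to the chosen 0.8 s; in real rooms EDT and RT generally differ, which is why they are reported as separate parameters.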