Use this URL to cite or link to this record in EThOS: https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.741629
Title: Improving automatic speech recognition transcription through signal processing
Author: Shah, Afnan Arafat
ISNI: 0000 0004 7224 9147
Awarding Body: University of Southampton
Current Institution: University of Southampton
Date of Award: 2017
Availability of Full Text:
Access from EThOS:
Full text unavailable from EThOS. Please try the link below.
Access from Institution:
Abstract:
Automatic speech recognition (ASR) in the educational environment could give access to the spoken words of a lecture to the many students who find lectures hard to understand, such as those whose mother tongue is not English or who have a hearing impairment. In such an environment, it is difficult for ASR to provide transcripts with Word Error Rates (WER) below 25% across the wide range of speakers. Reducing the WER reduces the time, and therefore the cost, of correcting errors in the transcripts. To deal with the variation of acoustic features between speakers, ASR systems implement automatic vocal tract normalisation (VTN), which warps the formants (resonant frequencies) of the speaker to better match the formants of the speakers in the training set. The ASR also implements automatic dynamic time warping (DTW) to deal with variation in the speaker’s rate of speaking, by aligning the time series of the new spoken words with the time series of the matching spoken words in the training set. This research investigates whether the ASR’s automatic estimation of VTN and DTW can be enhanced by pre-processing the recordings: the formants and speaking rate are warped manually using sound processing libraries (Rubber Band and SoundTouch) before the pre-processed recordings are transcribed using ASR. An initial experiment, performed with the recordings of two male and two female speakers, showed that pre-processing the recording could improve the WER by an average of 39.5% for male speakers and 36.2% for female speakers. However, the selection of the best warp factors was achieved through an iterative ‘trial and error’ approach that involved many hours calculating the word error rate for each warp factor setting. A more efficient approach for selecting the warp factors for pre-processing was then investigated.
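The word error rate calculated for each warp factor setting is conventionally the word-level edit distance between the reference transcript and the ASR output, divided by the number of reference words. The following is a minimal sketch of that conventional definition, not the scoring tool used in the thesis. The function names are hypothetical, text normalisation is reduced to lower-casing and whitespace splitting, and the relative-reduction helper shows only one possible reading of the improvement figures, since the abstract does not say whether they are relative or absolute.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, processed_wer: float) -> float:
    """Fraction by which the WER fell relative to the baseline (assumed reading)."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - processed_wer) / baseline_wer
```

Under this definition the WER can exceed 1.0 when the hypothesis contains many inserted words, which is why it is normalised by the reference length rather than the hypothesis length.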
The second experiment developed a modification function, trained on the best warp factors found by the ‘trial and error’ approach, to estimate the modification percentage required to improve the WER of a recording. The function found improved the WER by an average of 16% for female speakers and 7% for male speakers.
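The dynamic time warping mentioned in the abstract aligns two sequences that differ in speaking rate by finding the lowest-cost monotonic path between them. The sketch below illustrates the textbook algorithm only and is not the ASR system's internal implementation. It assumes one scalar feature per frame and an absolute-difference frame cost, whereas real recognisers compare multi-dimensional acoustic feature vectors. The function name is hypothetical.

```python
def dtw_distance(a: list[float], b: list[float]) -> float:
    """Minimum cumulative alignment cost between two scalar sequences."""
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")

    inf = float("inf")
    # cost[i][j] is the best cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            frame_cost = abs(a[i - 1] - b[j - 1])
            cost[i][j] = frame_cost + min(
                cost[i - 1][j],      # a advances while b holds (b is slower here)
                cost[i][j - 1],      # b advances while a holds (a is slower here)
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]
```

For example, a sequence and a copy of it spoken at half the rate (every frame repeated) align with zero cost, which is the property that makes the technique tolerant of speaking-rate variation.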
Supervisor: Wald, Michael
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID: uk.bl.ethos.741629
DOI: Not available