Early Stage Researcher: Deepak Baby
Main host institution: Katholieke Universiteit Leuven
Main Host Supervisor: Hugo Van hamme
Second host institution: Tampere University of Technology
Second Host Supervisor: Tuomas Virtanen
Industry partner: Nuance Communications International bvba
Automatic speech recognition (ASR) enables a computer or other electronic device to identify spoken words and to perform the corresponding action. With the popularity of electronic gadgets these days, it is quite common for a device to have a built-in ASR system that can automate actions such as search, texting and navigation simply by talking to it. ASR also has applications in the military, in assistive technology for people with disabilities, and in many other areas.
For an ASR system to work properly, it has to correctly recognize what the user is saying. Even after decades of research, however, the performance of ASR systems is still far inferior to that of humans, especially in the presence of background noise. Most algorithms work reasonably well under controlled laboratory conditions, which differ greatly from recordings made in realistic environments. The goal of this project is to incorporate more sophisticated knowledge of human hearing into the ASR framework, so that the system behaves more like a human listener and copes better with realistic conditions.
When recognizing speech or images, the brain tries to match what it hears or sees to known patterns. For a machine to perform ASR, it likewise needs to extract characteristic properties, or features, from the incoming speech and then match them to learned patterns that eventually lead the system to the underlying words. In a noisy environment, the performance of an ASR system degrades because these feature patterns are corrupted by the added noise, which leads to recognition errors. Cleaning or enhancing the corrupted features, i.e., removing the artefacts introduced by the noise, has therefore been found to improve performance.
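As a minimal illustration of this corruption, the Python sketch below computes simple framewise power-spectrum features (a crude stand-in for a real ASR front-end) for a toy clean signal, a toy noise signal and their mixture. All signal parameters are illustrative assumptions, not the project's actual settings; the point is only that noise added to the waveform shows up, approximately additively, in the features.

# Minimal sketch (not the project's actual front-end): how additive noise
# in the waveform corrupts the spectral features an ASR system relies on.
import numpy as np

rng = np.random.default_rng(0)
fs = 8000                                   # sample rate in Hz (assumed)
t = np.arange(fs) / fs
clean = 0.5 * np.sin(2 * np.pi * 300 * t)   # toy "speech": a 300 Hz tone
noise = 0.2 * rng.standard_normal(fs)       # toy background noise
noisy = clean + noise                       # noise adds in the waveform

def power_spectrogram(x, frame=256, hop=128):
    """Framewise power spectrum: a crude stand-in for ASR features."""
    frames = [x[i:i + frame] * np.hanning(frame)
              for i in range(0, len(x) - frame, hop)]
    return np.abs(np.fft.rfft(np.asarray(frames), axis=1)) ** 2

S_clean, S_noise, S_noisy = map(power_spectrogram, (clean, noise, noisy))

# In the power domain the noisy features are roughly the sum of the clean
# and noise features -- exactly the corruption that enhancement tries to undo.
err = np.mean(np.abs(S_noisy - (S_clean + S_noise))) / np.mean(S_noisy)
print(f"relative mismatch of the additivity assumption: {err:.2%}")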
To enhance the features, the system first has to be trained to differentiate between speech and noise. To this end, speech and noise feature patterns are extracted and stored as "exemplars", and the feature patterns obtained from noisy data are then represented as a weighted sum of these speech and noise exemplars. The part corresponding to speech is separated out and fed to the ASR system. The performance of the algorithm therefore depends on how well the speech and noise features can be differentiated. The goal of this work is to exploit knowledge about human auditory processing in the feature extraction so that a better separation of speech from noise can be obtained. The proposed methods will be evaluated on available benchmarks used for comparing ASR systems.
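The sketch below shows the decomposition idea in its simplest form, assuming a non-negative factorisation solved with multiplicative updates, which is a common choice for exemplar-based enhancement but not necessarily the exact algorithm used in this project. The dictionaries here are random placeholders; in practice their columns would be feature patterns sampled from real speech and noise recordings, and the cost function often includes a sparsity term.

# Minimal sketch of exemplar-based decomposition (assumed formulation):
# represent a noisy feature vector as a non-negative weighted sum of speech
# and noise exemplars, then keep the speech part.
import numpy as np

rng = np.random.default_rng(1)
n_features, n_speech_ex, n_noise_ex = 40, 100, 50

A_speech = rng.random((n_features, n_speech_ex))   # speech exemplars (columns)
A_noise = rng.random((n_features, n_noise_ex))     # noise exemplars (columns)
A = np.hstack([A_speech, A_noise])                 # combined dictionary

y = rng.random(n_features)                         # one noisy feature vector

# Find non-negative weights x such that y ~ A @ x, using the standard
# multiplicative update that minimises the squared error with x >= 0.
x = np.full(A.shape[1], 0.1)
for _ in range(500):
    x *= (A.T @ y) / (A.T @ (A @ x) + 1e-12)

# Split the reconstruction into speech and noise parts; the speech part is
# what would be passed on to the recogniser (or used to build a filter).
speech_part = A_speech @ x[:n_speech_ex]
noise_part = A_noise @ x[n_speech_ex:]
print("reconstruction error:", np.linalg.norm(y - (speech_part + noise_part)))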
The ultimate outcome of this research is thus to improve the performance of current ASR systems in noisy environments by drawing on knowledge about human auditory processing. As more of these ideas are incorporated, the framework may also grow into an integrated computational model of human hearing.