Spoken language is one of the first forms of communication our species developed, and although no one knows exactly how or when it emerged, it has remained a defining characteristic of our species. Since the advent of the computer age, scientists have tried to program computers to understand and parse speech and to model human speech perception, but success has been limited: computers are still far from accurately modelling human perception of speech, especially in the noisy environments that are everywhere in real life.
Speech intelligibility is a measure of how comprehensible speech is. It is affected by many factors, such as the quality of the speech, the strength of the speech signal, the acoustics of the environment, and background noise. A famous phenomenon in hearing science called “the cocktail party effect” gives a great example of how powerful our brains are: at a cocktail party, with many people talking around you and music playing in the background, you can ‘tune in’ your attention to just one of the speakers and ‘tune out’ all the other speakers and noise. Experiments have shown that people with only one functioning ear have much more difficulty filtering out competing sound sources than normal-hearing people, suggesting that having two ears (being binaural) helps greatly in understanding speech in realistic environments with many sources of noise.
Being able to model the way humans process sound has many benefits, both technological (think of building and tuning concert halls) and medical. The current generation of hearing aids is mostly monaural: they simply amplify the sound they receive, without performing any of the binaural processing that normal-hearing people carry out in ‘cocktail party’ situations. Another application is cochlear implants, which rely on an external audio processor that tries to separate speech from non-speech sounds before passing the signal to the implant; this does not work nearly as well as the human auditory system.
Current computational models of speech intelligibility are mostly single-channel and perform poorly at predicting human speech intelligibility, especially in noisy environments. Another problem is that current models are ‘macroscopic’: they output a single intelligibility percentage for a whole speech sequence. This is clearly not how listening works in real life, where the sources of noise are highly variable over time. A proposed solution is microscopic modelling: breaking speech down into small tokens and assessing intelligibility at that level.
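The core idea behind one family of microscopic models can be sketched very simply: treat the time-frequency cells where the local signal-to-noise ratio exceeds some threshold as usable ‘glimpses’ of the speech, and use the proportion of such cells as a crude local intelligibility indicator. The snippet below is a toy illustration of that idea, not any particular published model; the 3 dB threshold and the tiny power matrices are illustrative assumptions.

```python
import numpy as np

def glimpse_proportion(speech_power, noise_power, threshold_db=3.0):
    """Fraction of time-frequency cells where speech dominates the noise.

    speech_power, noise_power: 2-D arrays (frequency x time) of local power.
    threshold_db is an assumed local-SNR criterion; in practice it is a
    tunable parameter of the model.
    """
    eps = 1e-12  # guard against log(0)
    local_snr_db = 10.0 * np.log10((speech_power + eps) / (noise_power + eps))
    glimpses = local_snr_db >= threshold_db
    return glimpses.mean()

# Toy example: speech dominates the noise in half of the cells.
speech = np.array([[10.0, 0.1],
                   [10.0, 0.1]])
noise = np.ones((2, 2))
print(glimpse_proportion(speech, noise))  # → 0.5
```

A full microscopic model would, of course, feed the glimpsed regions into a recogniser rather than just counting them, but the local (rather than whole-utterance) view is the point.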
My research topic is the development of a binaural microscopic model of speech intelligibility, with the aim of seeing how much better it models human intelligibility than traditional monaural models. My current approach is to take existing models and extend them into the binaural domain. More specifically, I have started using a microscopic model named ‘the glimpsing model’ as a backend that performs basic speech recognition on each channel separately, with a binaural Equalisation-Cancellation model as a frontend that performs the binaural processing, the ‘tuning in’ part of human hearing. I am currently training and testing these models on speech data that has already been recorded and is available.
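To give a feel for the equalise-then-cancel principle that this kind of frontend relies on: an interfering noise from the side reaches the two ears with a small interaural delay, so delaying one ear's signal to align the noise and then subtracting the two ears cancels the noise while a filtered version of the (frontal, zero-delay) target survives. The sketch below is a deliberately idealised illustration, not the actual model; the 8-sample delay, the white-noise signals, and the perfect cancellation are all assumptions, whereas the real model operates in frequency bands with internal noise and only partial cancellation.

```python
import numpy as np

rng = np.random.default_rng(0)
fs, d = 16000, 8                # sample rate; hypothetical noise ITD in samples
n = fs                          # one second of signal
target = rng.standard_normal(n)            # stand-in for speech, from the front (zero ITD)
noise = 5.0 * rng.standard_normal(n + d)   # strong lateral interferer

left = target + noise[d:]       # the left ear hears the noise d samples early
right = target + noise[:n]

# Equalisation: delay the left ear so the noise components line up.
# Cancellation: subtract the ears, so the aligned noise cancels out.
out = left[:n - d] - right[d:]  # = target[t] - target[t + d], noise removed

snr_in = 10 * np.log10(np.mean(target**2) / np.mean(noise[d:]**2))
print(f"input SNR per ear: {snr_in:.1f} dB")  # strongly negative here
```

In this idealised setting the noise cancels exactly and what remains is a comb-filtered copy of the target, which is what a recognition backend would then process; with realistic head filtering and internal noise the cancellation is only partial, which is what the Equalisation-Cancellation model captures.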
Main host institution: University of Sheffield
Second host institution: Technical University of Denmark
Industry partner: Philips Research Laboratories Eindhoven