A. de Cheveigné
Time-domain comb filtering for speech separation
Abstract: The auditory system uses differences in the harmonic structure of concurrent sounds, such as speech, to separate them. This is one aspect of what is known as the "cocktail-party effect".
Several models have been proposed to explain how this is done (see de Cheveigné 1993 for a review). They usually assume that the signals to be separated are purely harmonic, and the psychoacoustic and physiological experiments designed to test these models likewise employ purely harmonic stimuli. However, real speech is often very imperfectly harmonic, and it is not clear how well the models will work in that case.
To determine how well a model can perform its task on "real" speech, I implemented its basic processing scheme as a front-end to a speech recognition system and measured its effect on recognition rates. To the extent that this processing reflects that of the perception model, and that the task is typical of the perception of speech in "real" situations, the results should give some indication of the plausibility of the model.
I stress that the aim is not to develop a speech separation system, although the results might be of some use in designing one. Nor do I wish to reproduce quantitatively the recognition rates obtained in psychoacoustic experiments. To do so would require postulating many details of the physiological implementation, which would obscure the essential features of the model. Instead, I wish to find out whether its processing principle, implemented in some form, can be effective in tasks typical of the "real world".
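As a rough illustration of the time-domain comb filtering named in the title, the following sketch applies a generic cancellation filter, y[n] = x[n] - x[n - T], to synthetic signals. It is not taken from the paper: the function names, the sampling rate, the fundamental frequencies, and the mistuned partial are all assumptions chosen for illustration. It shows that a perfectly harmonic interferer of period T is removed, while an imperfectly harmonic one leaves a residue, which is the concern raised above.

```python
import numpy as np

def comb_cancel(x, period):
    """Cancellation comb filter: y[n] = x[n] - x[n - period].

    Samples before `period` are passed through unchanged
    (equivalent to assuming silence before the signal starts).
    """
    x = np.asarray(x, dtype=float)
    y = x.copy()
    y[period:] -= x[:-period]
    return y

def partials(freqs_hz, fs, n_samples):
    """Sum of unit-amplitude sinusoids at the given frequencies."""
    t = np.arange(n_samples) / fs
    return sum(np.sin(2 * np.pi * f * t) for f in freqs_hz)

# Hypothetical parameters, chosen so that the period is a whole number of samples.
fs = 8000            # sampling rate in Hz (assumption)
n = 4000             # 0.5 s of signal
period = 80          # samples, i.e. a 100 Hz fundamental at fs = 8000

# "Interferer": perfectly harmonic at 100 Hz.
interferer = partials([100, 200, 300, 400, 500], fs, n)
# "Target": harmonic at 125 Hz, a different fundamental.
target = partials([125, 250, 375], fs, n)
# Imperfectly harmonic interferer: third partial mistuned from 300 to 310 Hz.
mistuned = partials([100, 200, 310, 400, 500], fs, n)

# Residual level of each signal after the filter tuned to the 100 Hz period.
res_interferer = np.max(np.abs(comb_cancel(interferer, period)[period:]))
res_target = np.max(np.abs(comb_cancel(target, period)[period:]))
res_mistuned = np.max(np.abs(comb_cancel(mistuned, period)[period:]))

print("harmonic interferer residual:", res_interferer)
print("target residual:             ", res_target)
print("mistuned interferer residual:", res_mistuned)
```

The filter is linear, so applying it to the mixture `interferer + target` gives the sum of the individual outputs: the harmonic interferer vanishes and a spectrally modified version of the target remains. With the mistuned partial, part of the interferer survives the filter.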