Human evolution has shaped and optimized the auditory system to analyse natural sounds, and especially speech. Studies of the early stages of auditory processing have highlighted integration mechanisms that explain the impressive human ability to perform tasks such as Source Separation (the famous ‘cocktail-party effect’) or Speech Enhancement (the removal of interfering noise and reverberant components).

For hearing-impaired listeners, these tasks are much harder to perform, which is why it is of great interest for research to understand, model, and reproduce the operations found in normal-hearing listeners. Recently, Deep Neural Networks have been introduced in the field of Speech Processing and have shown promising results on tasks such as Speech Separation and Enhancement; most of these solutions simply train on speech data, without further assumptions about speech structure. One question naturally arises: do Neural Networks optimize their speech analysis in the same way that human hearing has been optimized through evolution?

A recent line of work [1, 2] has shown that, when given the opportunity to learn a filterbank for speech analysis, a Neural Network converged to a solution strikingly similar to the best available model of the human cochlea, namely the Gammatone filterbank [3]. One practical consequence of this discovery is that the filters in the Neural Network encoder can then be frozen, reducing the number of free parameters and making training easier without sacrificing the quality of the solution.
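As a minimal sketch of this idea (our illustration only, with assumed sampling rate, filter count, and kernel length; not the exact setup of [1, 2]), a fixed gammatone filterbank can be loaded into a PyTorch Conv1d encoder whose weights are then frozen:

```python
# Sketch: a fixed gammatone filterbank used as a frozen Conv1d encoder.
import numpy as np
import torch
import torch.nn as nn

def gammatone_ir(fc, fs, length, order=4):
    """Impulse response of a gammatone filter centred at fc (Hz)."""
    t = np.arange(length) / fs
    erb = 24.7 + fc / 9.265            # Glasberg & Moore ERB scale
    b = 1.019 * erb                    # bandwidth parameter
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return g / np.max(np.abs(g))       # simple peak normalization

fs, n_filters, length = 8000, 64, 128  # assumed values for this sketch
# Centre frequencies spaced uniformly on the ERB-number scale.
fmin, fmax = 100.0, fs / 2
erb_pts = np.linspace(21.4 * np.log10(4.37e-3 * fmin + 1),
                      21.4 * np.log10(4.37e-3 * fmax + 1), n_filters)
fcs = (10 ** (erb_pts / 21.4) - 1) / 4.37e-3

bank = np.stack([gammatone_ir(fc, fs, length) for fc in fcs])

encoder = nn.Conv1d(1, n_filters, kernel_size=length, stride=length // 2, bias=False)
encoder.weight.data = torch.tensor(bank, dtype=torch.float32).unsqueeze(1)
encoder.weight.requires_grad = False   # freeze: no free parameters in the encoder
```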

Motivated by this result, we investigate another mechanism of the auditory system: phase-locking, the ability of some auditory neurons to synchronize their firing activity to the frequency of the stimulus and thereby deliver a time-coded representation of it. We propose an encoder model that allows phase-locking to be learnt by a Neural Network, and examine whether the resulting network uses this model to learn a behaviour similar to phase-locking, or derives another optimal strategy.
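To make the notion of a time code concrete, here is a toy numerical illustration (our simplification, not the proposed encoder model): half-wave rectification followed by a low-pass filter, a common idealization of inner-hair-cell transduction. Below the cutoff the output still follows each stimulus cycle, i.e. it carries temporal fine structure; above it, only the envelope survives:

```python
# Toy illustration of the phase-locking limit via rectify-then-low-pass.
import numpy as np
from scipy.signal import butter, lfilter

fs = 16000
t = np.arange(0, 0.05, 1 / fs)
b, a = butter(4, 3000 / (fs / 2))      # ~3 kHz cutoff as a rough phase-locking limit

for f0 in (500, 6000):                 # one tone below, one above the limit
    tone = np.cos(2 * np.pi * f0 * t)
    coded = lfilter(b, a, np.maximum(tone, 0.0))   # rectify, then low-pass
    # Correlation with the stimulus: high at 500 Hz, near zero at 6 kHz,
    # because above the cutoff only a near-constant envelope remains.
    r = np.corrcoef(coded, tone)[0, 1]
    print(f"{f0} Hz tone: correlation with stimulus fine structure = {r:.2f}")
```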

As a further step towards designing optimal encoders for Neural Networks operating on speech data, we add phase-shift parameters to the analysis filterbank and train the network to learn them. We show that the network converges to a solution of maximal discriminative power, producing the most diverse representations out of redundant information, which justifies the intuition behind [2].
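The following sketch shows one way such phase-shift parameters could be exposed to training (a hypothetical parameterization for illustration; the centre-frequency grid and filter length are assumptions, and this is not the code of [2]): the gammatone envelope and carrier frequency stay fixed while a per-filter phase phi is learnt by backpropagation:

```python
# Sketch: gammatone-shaped analysis filters with trainable phase shifts.
import numpy as np
import torch
import torch.nn as nn

class PhaseLearnedGammatone(nn.Module):
    def __init__(self, fcs, fs, length, order=4):
        super().__init__()
        t = torch.arange(length, dtype=torch.float32) / fs
        fcs = torch.tensor(fcs, dtype=torch.float32)
        b = 1.019 * (24.7 + fcs / 9.265)            # fixed ERB bandwidths
        # Fixed envelope and carrier argument; only the phase is trainable.
        self.register_buffer("env", t ** (order - 1)
                             * torch.exp(-2 * np.pi * b[:, None] * t))
        self.register_buffer("arg", 2 * np.pi * fcs[:, None] * t)
        self.phi = nn.Parameter(torch.zeros(len(fcs)))  # learnable phase shifts
        self.stride = length // 2

    def forward(self, x):                            # x: (batch, 1, samples)
        filters = self.env * torch.cos(self.arg + self.phi[:, None])
        return torch.conv1d(x, filters.unsqueeze(1), stride=self.stride)

fcs = np.geomspace(100, 3800, 32)                    # assumed centre frequencies
enc = PhaseLearnedGammatone(fcs, fs=8000, length=128)
feats = enc(torch.randn(2, 1, 8000))                 # (batch, 32, frames)
```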

  • [1] Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, pp. 1256–1266, 2019.
  • [2] D. Ditter and T. Gerkmann, “A multi-phase gammatone filterbank for speech separation via TasNet,” in Proc. IEEE ICASSP, 2020, pp. 36–40.
  • [3] R. Patterson, I. Nimmo-Smith, J. Holdsworth, and P. Rice, “An efficient auditory filterbank based on the gammatone function,” APU Report 2341, MRC Applied Psychology Unit, Cambridge, 1988.