This paper proposes to unify two deep-learning methods, CountNet and Deep Clustering, designed for speaker count estimation and source separation respectively, in order to perform speaker-count-agnostic speech separation. Two approaches are compared, in which the speaker count estimation and separation subnetworks are trained either separately or jointly. Training and evaluation are conducted on a tailored dataset, WSJ0-Kmix, which extends the WSJ0-2mix and WSJ0-3mix datasets to an arbitrary number of speakers. Results show that both systems can separate up to four sources without prior information on the number of speakers. Furthermore, the jointly trained system performs on par with its separately trained counterpart while using fewer parameters and simplifying the overall architecture.