This article addresses the problem of multichannel music separation. We propose a framework in which the source spectra are estimated using deep neural networks (DNNs) and combined with spatial covariance matrices that encode the spatial characteristics of the sources. The parameters are estimated iteratively in an expectation-maximization (EM) fashion and used to derive a multichannel Wiener filter. We evaluate the proposed framework on the task of music separation using a large dataset. Experimental results show that the proposed method performs consistently well in separating singing voice and other instruments from realistic musical mixtures.
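The estimation loop summarized above can be illustrated with a minimal NumPy sketch. It assumes the usual local Gaussian model, in which the covariance of source j at frequency f and frame t is the product of a power spectrum v_j(f,t) and a spatial covariance matrix R_j(f). The DNN is not modelled here: the spectra `v` are taken as given, and random placeholders stand in for them in the demo. The function name `wiener_em`, the identity initialization, and the regularizer `eps` are assumptions made for illustration and are not taken from the article.

```python
import numpy as np

def wiener_em(X, v, n_iter=2, eps=1e-10):
    """Illustrative EM loop with a multichannel Wiener filter.

    X: (F, T, I) complex mixture STFT (F bins, T frames, I channels).
    v: (J, F, T) nonnegative source power spectra (assumed given,
       e.g. by a DNN; this sketch does not include the DNN itself).
    Returns source images C (J, F, T, I) and covariances R (J, F, I, I).
    """
    F, T, I = X.shape
    J = v.shape[0]
    # Assumption: start from identity spatial covariance matrices.
    R = np.tile(np.eye(I, dtype=complex), (J, F, 1, 1))
    for it in range(n_iter + 1):
        # Source covariances v_j(f,t) * R_j(f), shape (J, F, T, I, I).
        Sigma_c = v[..., None, None] * R[:, :, None, :, :]
        # Mixture covariance, lightly regularized before inversion.
        Sigma_x = Sigma_c.sum(axis=0) + eps * np.eye(I)
        # Multichannel Wiener gains W_j = Sigma_c_j Sigma_x^{-1}.
        W = Sigma_c @ np.linalg.inv(Sigma_x)[None]
        # E-step: posterior mean of each source image.
        C = np.einsum('jftab,ftb->jfta', W, X)
        if it == n_iter:
            break  # final pass only applies the filter
        # E-step: posterior second-order moment of each source image.
        Rc = np.einsum('jfta,jftb->jftab', C, C.conj()) \
            + (np.eye(I) - W) @ Sigma_c
        # M-step: average the moments over time, normalized by v.
        R = (Rc / (v[..., None, None] + eps)).mean(axis=2)
    return C, R

# Toy usage with random data in place of a real mixture and DNN outputs.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 20, 2)) + 1j * rng.standard_normal((5, 20, 2))
v = rng.random((2, 5, 20)) + 0.1
C, R = wiener_em(X, v)
```

Because the Wiener gains of all sources sum to (nearly) the identity, the estimated source images add back up to the mixture, which is a convenient sanity check for any implementation of this kind.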
Download the DSD100 database here (16 GB).
Multitrack HTML player by binarymind. It is free to download and use here.
antoine (dot) liutkus (at) inria (dot) fr