Towards end-to-end F0 voice conversion
based on Dual-GAN with convolutional wavelet kernels

IRCAM, CNRS, Sorbonne Université
STMS Lab, Paris, France

Clément LE MOINE, Nicolas OBIN and Axel ROEBEL

Dual-GAN upgraded with PreNet-based convolutional wavelet kernels

The proposed system is an end-to-end framework for F0 transformation in the context of expressive voice conversion. A single neural network is proposed, in which a first module learns an F0 representation over different temporal scales and a second, adversarial module learns the transformation from one attitude to another. The first module is composed of a convolution layer with wavelet kernels, so that the various temporal scales of F0 variations can be efficiently encoded. Combining decomposition and transformation in a single network makes it possible to learn, end to end and directly from the raw F0 signal, the F0 decomposition that is optimal with respect to the transformation.
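The idea of encoding an F0 contour over several temporal scales with a convolution layer can be illustrated with a minimal sketch. It is not the network described in the paper: it assumes fixed Mexican-hat (Ricker) kernels, hand-picked scales and a synthetic contour, whereas the proposed system learns its decomposition jointly with the transformation. The names `ricker` and `wavelet_conv_layer`, the kernel length and the scale values are all hypothetical choices made for illustration.

```python
import numpy as np


def ricker(length, scale):
    """Mexican-hat (Ricker) wavelet sampled on `length` points, centred on the kernel."""
    t = np.arange(length) - (length - 1) / 2.0
    x = t / scale
    norm = 2.0 / (np.sqrt(3.0 * scale) * np.pi ** 0.25)
    return norm * (1.0 - x ** 2) * np.exp(-0.5 * x ** 2)


def wavelet_conv_layer(f0, scales, kernel_length=101):
    """Convolve a 1-D F0 contour with one wavelet kernel per scale.

    Returns an array of shape (len(scales), len(f0)): one channel per temporal scale.
    Fixed kernels are an assumption of this sketch, not the paper's method.
    """
    f0 = np.asarray(f0, dtype=float)
    if f0.shape[0] < kernel_length:
        # np.convolve(mode="same") returns max(len(f0), kernel_length) samples,
        # so a shorter contour would not keep its own length.
        raise ValueError("F0 contour must be at least as long as the kernel")
    return np.stack(
        [np.convolve(f0, ricker(kernel_length, s), mode="same") for s in scales]
    )


# Synthetic contour (assumption): a slow phrase-level movement plus a fast
# syllable-level movement, standing in for a raw F0 signal.
frames = np.arange(300)
f0 = 5.0 * np.sin(2 * np.pi * frames / 150) + 1.0 * np.sin(2 * np.pi * frames / 15)

features = wavelet_conv_layer(f0, scales=[2, 4, 8, 16, 32])
print(features.shape)  # (5, 300)
```

Small scales respond mostly to the fast component of the contour and large scales to the slow one, which is the kind of multi-scale representation the first module is meant to provide to the adversarial module.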

Paper

Clément LE MOINE, Nicolas OBIN, Axel ROEBEL, "Towards end-to-end F0 voice conversion based on Dual-GAN with convolutional wavelet kernels", ResearchGate

Voice attitude conversion examples

1. Friendly - Distant

Source Speech: friendly	Target Speech: distant	Converted Speech: friendly to distant

Source Speech: distant	Target Speech: friendly	Converted Speech: distant to friendly

2. Friendly - Seductive

Source Speech: friendly	Target Speech: seductive	Converted Speech: friendly to seductive

Source Speech: seductive	Target Speech: friendly	Converted Speech: seductive to friendly

3. Distant - Seductive

Source Speech: distant	Target Speech: seductive	Converted Speech: distant to seductive

Source Speech: seductive	Target Speech: distant	Converted Speech: seductive to distant

4. Distant - Dominant

Source Speech: distant	Target Speech: dominant	Converted Speech: distant to dominant

Source Speech: dominant	Target Speech: distant	Converted Speech: dominant to distant

5. Dominant - Seductive

Source Speech: dominant	Target Speech: seductive	Converted Speech: dominant to seductive

Source Speech: seductive	Target Speech: dominant	Converted Speech: seductive to dominant

6. Friendly - Dominant

Source Speech: friendly	Target Speech: dominant	Converted Speech: friendly to dominant

Source Speech: dominant	Target Speech: friendly	Converted Speech: dominant to friendly
