Multi-Band Excited WaveNet -- demo

Axel Roebel, Frederik Bous -- Analysis synthesis team, UMR 9912 STMS, IRCAM, CNRS, Sorbonne University

This page contains audio examples of the Multi-Band Excited WaveNet (MBExWN), a neural vocoder for singing and speech re-synthesis from Mel spectrograms.
Note that, in contrast to most existing research, the objective of the MBExWN vocoder is not only high speech quality but also transparent reproduction of the original speech signal from the Mel spectrogram.
Note: This page has been tested with Chrome and Firefox. In case of problems with the audio, please try one of those browsers.
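
For readers unfamiliar with the vocoder input, the following is a minimal NumPy sketch of how a log-Mel spectrogram can be computed from a waveform. It is not the authors' analysis code: the sample rate, FFT size, hop size, number of Mel bands, the HTK-style Mel formula and all function names are assumptions chosen for illustration only.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style Mel scale (an assumption; other Mel variants exist)
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels, fmin=0.0, fmax=None):
    """Triangular filters with band edges equally spaced on the Mel scale."""
    fmax = sr / 2.0 if fmax is None else fmax
    edges_hz = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2))
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, ctr, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (fft_freqs - lo) / (ctr - lo)
        falling = (hi - fft_freqs) / (hi - ctr)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(x, sr=24000, n_fft=1024, hop=256, n_mels=80):
    """Return an array of shape (n_frames, n_mels); requires len(x) >= n_fft."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))       # magnitude STFT
    mel = mag @ mel_filterbank(sr, n_fft, n_mels).T  # project onto Mel bands
    return np.log(np.maximum(mel, 1e-5))             # floor avoids log(0)

# Usage: one second of a 440 Hz sine at the assumed sample rate
sr = 24000
tone = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)
S = log_mel_spectrogram(tone, sr=sr)
```

A neural vocoder such as MBExWN takes a representation of this kind as input and has to regenerate the waveform, including the phase information that the Mel projection discards.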

Speech

The MBExWN multi-speaker model MWU_SP demonstrated below has been trained using a pulse-forming WaveNet with 360 feature channels. The training database is an aggregation of the VCTK, PTDB and Att-HACK databases. It contains about 45h of speech from 150 speakers. Of these, only 14h are covered by 21 French speakers; the rest are English speakers.


Seen speakers

In this section we provide demo examples of the multi-speaker model using phrases from the training database that were not seen during training:

Sample | Original | MWU_SP (prop.)
VCTK-p246-male_021 | [audio] | [audio]
VCTK-p266-female_043 | [audio] | [audio]
ptdb-F10-female_si2162 | [audio] | [audio]
ptdb-M05-male_si1222 | [audio] | [audio]

Unseen speaker / single speaker baselines

In this section we compare our proposed model MWU_SP against the following neural vocoder baselines:

MWS_LJ
An MBExWN with 240 feature channels in the pulse-forming WaveNet that is trained exclusively on the LJSpeech dataset. The experiment is performed using the validation set, so this network represents the optimal performance of the model structure for single-speaker cases.
MMG
A Multi-band MelGAN according to this paper, also trained exclusively on the LJSpeech dataset. This model represents an example of the state of the art. Note that the Multi-band MelGAN has about 4 times fewer parameters and is about 7 times faster than MWS_LJ. As the current study does not aim at single-speaker models, we did not optimize the model size here.
MWU_SP_v
The multi speaker MBExWN trained exactly like MWU_SP but without the dedicated component for generating the vocal tract transfer function (VTF) in the model. This experiment demonstrates the performance gain achieved by means of adding the dedicated VTF component.
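
Sample | Original | MWU_SP (prop.) | MWS_LJ | MMG | MWU_SP_v
LJ001-0022 | [audio] | [audio] | [audio] | [audio] | [audio]
LJ019-0101 | [audio] | [audio] | [audio] | [audio] | [audio]
LJ026-0021 | [audio] | [audio] | [audio] | [audio] | [audio]
LJ036-0141 | [audio] | [audio] | [audio] | [audio] | [audio]
LJ049-0189 | [audio] | [audio] | [audio] | [audio] | [audio]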
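
The comparisons on this page are meant to be judged by listening. For readers who download the files and want a rough numerical complement, the sketch below computes a log-spectral distance between an original and a resynthesized signal. This metric is not stated on this page to be the authors' evaluation method; the function names, frame parameters and magnitude floor are assumptions, and the two signals are assumed to share a sample rate and to be time-aligned.

```python
import numpy as np

def stft_mag(x, n_fft=1024, hop=256):
    """Magnitude STFT with a Hann window; requires len(x) >= n_fft."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def log_spectral_distance(ref, syn, n_fft=1024, hop=256, floor=1e-5):
    """Mean over frames of the RMS difference (in dB) between log-magnitude spectra."""
    n = min(len(ref), len(syn))  # compare the common length only
    ref_db = 20.0 * np.log10(np.maximum(stft_mag(ref[:n], n_fft, hop), floor))
    syn_db = 20.0 * np.log10(np.maximum(stft_mag(syn[:n], n_fft, hop), floor))
    per_frame = np.sqrt(np.mean((ref_db - syn_db) ** 2, axis=1))
    return float(np.mean(per_frame))

# Usage with synthetic data: a copy attenuated by a factor of 2 differs by
# 20*log10(2) dB in every frequency bin.
rng = np.random.default_rng(0)
original = rng.standard_normal(8000)
attenuated = 0.5 * original
d_same = log_spectral_distance(original, original)
d_half = log_spectral_distance(original, attenuated)
```

Because the distance uses magnitudes only, it ignores phase differences, which is reasonable for vocoders that do not aim to reproduce the original phase. It does not replace a perceptual test.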

Unseen, out of domain speakers / multi speaker baseline

With the kind permission of Won Jang, in this section we compare our proposed model against the Universal MelGAN. We have selected a few original audio sources from the unseen-domain examples on the demonstration page of the Universal MelGAN and compare our resynthesis with the one obtained by the Universal MelGAN model, which is also taken from that demonstration page. The notation is as follows:

UMG
The Universal MelGAN trained on the LJSpeech and the LibriTTS train-clean-360 dataset.

Unseen speakers

Sample | Original | MWU_SP (prop.) | Universal MelGAN
Male | [audio] | [audio] | [audio]
Female | [audio] | [audio] | [audio]

Unseen speaker and expressivity

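Sample | Original | MWU_SP (prop.) | Universal MelGAN
Expressive 1 | [audio] | [audio] | [audio]
Expressive 2 | [audio] | [audio] | [audio]
Expressive 3 | [audio] | [audio] | [audio]
Expressive 4 | [audio] | [audio] | [audio]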

Unseen language

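Sample | Original | MWU_SP (prop.) | Universal MelGAN
Spanish | [audio] | [audio] | [audio]
Chinese | [audio] | [audio] | [audio]
Japanese | [audio] | [audio] | [audio]
German | [audio] | [audio] | [audio]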

Singing

The MBExWN multi-singer model MWU_SI uses the same parameterisation as MWU_SP. It has been trained on an aggregation of the NUS-48E (singing signals only), SVDB, PJS, JVS, VocalSet and Tohoku Kiritan datasets. The training database further includes two internal databases: the ChaNTeR dataset and a Baroque singing database. In total there are 27 hours of singing by 136 singers, but more than 100 of these 136 singers are Japanese. The database contains 4h of French singing, 4h of Italian singing, 12h of English singing and 5h of Japanese singing. The styles included are Pop and Classical singing.
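
The language balance of the singing database can be made explicit with a little arithmetic. The snippet below only restates the figures given in the paragraph above; the variable names are illustrative. The four per-language figures add up to 25h, so about 2h of the stated 27h total is not attributed to a language in the text (the figures may simply be rounded).

```python
# Hours of singing per language, as stated in the text above
hours_by_language = {"French": 4, "Italian": 4, "English": 12, "Japanese": 5}
total_hours = 27  # stated total for the singing database

listed_hours = sum(hours_by_language.values())
unattributed_hours = total_hours - listed_hours

# Share of the stated total covered by each language
shares = {lang: h / total_hours for lang, h in hours_by_language.items()}

for lang, share in shares.items():
    print(f"{lang}: {hours_by_language[lang]}h ({share:.0%} of the stated total)")
print(f"Not attributed to a language in the text: {unattributed_hours}h")
```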


Seen singers

In this section we provide demo examples of the multi-singer model using phrases from the training database that were not seen during training. Note that, because of the extreme pitch range, the speaker model does not work on these examples and is therefore not included in the comparison.

Sample | Original | MWU_SI (prop.)
RT-ChaNTeR | [audio] | [audio]
Countertenor | [audio] | [audio]
Soprano | [audio] | [audio]
Child | [audio] | [audio]

Unseen, out of domain singer / single singer baselines, multi speaker baseline

In this section we compare the proposed model trained on the multi-singer dataset (which does not include Byzantine singing) against different single-singer neural vocoder baselines trained on the DIMITRIS Byzantine singing database. This database contains 2h50min of Byzantine singing by a single singer. The different baseline models are denoted as follows:

MWS_DI
An MBExWN with 240 feature channels in the pulse-forming WaveNet, trained exclusively on the DIMITRIS dataset. The experiment is performed using the validation set of the DIMITRIS dataset and represents the optimal performance of the model structure for single-singer cases.
MMG
A Multi-band MelGAN according to this paper, also trained exclusively on the DIMITRIS dataset. This model represents an example of the state of the art. Note that the Multi-band MelGAN has about 4 times fewer parameters and is about 7 times faster than MWS_DI. As the current study does not aim at single-speaker models, we did not optimize the model size here.
MWU_SP
The multi-speaker MBExWN trained on the multi-speaker (VCTK/PTDB/Att-HACK) dataset. This case shows the potential differences between the multi-singer and multi-speaker models for Byzantine singing, which can be considered out of domain for both MBExWN models.
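
Sample | Original | MWU_SI (prop.) | MWS_DI | MMG | MWU_SP
1-17 | [audio] | [audio] | [audio] | [audio] | [audio]
1-18 | [audio] | [audio] | [audio] | [audio] | [audio]
5-79 | [audio] | [audio] | [audio] | [audio] | [audio]
5-87 | [audio] | [audio] | [audio] | [audio] | [audio]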

Unseen, out of domain singers / multi speaker baseline

In this section we compare the MBExWN models trained on the multi-singer and multi-speaker databases when applied to other unseen singers. The pop singing tracks are extracted from the CCMixter dataset; the metal singing extracts are cut from YouTube recordings.

Pop singing

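Sample | Original | MWU_SI (prop.) | MWU_SP
BKS - Too Much | [audio] | [audio] | [audio]
Javolenus - King Henry | [audio] | [audio] | [audio]
Louis Cressy Band - GoodTime | [audio] | [audio] | [audio]
Mu - Too Bright | [audio] | [audio] | [audio]
Signe Jakobsen - What Have You Done To Me | [audio] | [audio] | [audio]
geertveneklaas - Blue Boy | [audio] | [audio] | [audio]
stellarartwars - On A Silent Night | [audio] | [audio] | [audio]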

Metal singing

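Sample | Original | MWU_SI (prop.) | MWU_SP
Metal B | [audio] | [audio] | [audio]
Metal F | [audio] | [audio] | [audio]
Metal M | [audio] | [audio] | [audio]
Metal N | [audio] | [audio] | [audio]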

Some of the sounds used in this test are under Copyright © 2021 Ircam, Institut de recherche et coordination acoustique/musique.