Multi-Band Excited WaveNet -- demo

Axel Roebel, Frederik Bous -- Analysis synthesis team, UMR 9912 STMS, IRCAM, CNRS, Sorbonne University

This page contains audio examples of the Multi-Band Excited WaveNet (MBExWN), a neural vocoder for singing and speech re-synthesis from Mel spectrograms.
Note that, in contrast to most existing research, the objective of the MBExWN vocoder is not only high speech quality but also transparent reproduction of the original speech signal from the Mel spectrogram.
Note: This page has been tested with Chrome and Firefox. If you experience problems with the audio, please try one of these browsers.

Table of contents:

Speech
Singing

Speech

The MBExWN speaker model MWU_SP demonstrated below has been trained using a pulse forming WaveNet with 360 feature channels. The training database is an aggregation of the VCTK, PTDB and Att-HACK databases. It contains about 45h of speech from 150 speakers. Of these, only 14h are covered by 21 French speakers; the rest are English speakers.

Seen speakers

In this section we provide demo examples of the multi speaker model using unseen phrases from the speakers of the training database:

Sample | Original | MWU_SP (prop.)
VCTK-p246-male_021
VCTK-p266-female_043
ptdb-F10-female_si2162
ptdb-M05-male_si1222

Unseen speaker / single speaker baselines

In this section we compare our proposed model MWU_SP against different single speaker neural vocoder baselines, denoted as follows:

MWS_LJ

An MBExWN with 240 feature channels in the pulse forming WaveNet that is trained exclusively on the LJSpeech dataset. The experiment is performed using the validation set of the LJSpeech dataset; this network represents the optimal performance of the model structure for single speaker cases.

MMG

A Multi-Band MelGAN according to this paper, also trained exclusively on the LJSpeech dataset. This model represents an example of the state of the art. Note that the Multi-Band MelGAN has about 4 times fewer parameters and is about 7 times faster than MWS_LJ. As the current study does not aim at single speaker models, we did not optimize the model size here.

MWU_SP_v

The multi speaker MBExWN trained exactly like MWU_SP but without the dedicated component for generating the vocal tract transfer function (VTF) in the model. This experiment demonstrates the performance gain achieved by means of adding the dedicated VTF component.

Sample | Original | MWU_SP (prop.) | MWS_LJ | MMG | MWU_SP_v
LJ001-0022
LJ019-0101
LJ026-0021
LJ036-0141
LJ049-0189

Unseen, out of domain speakers / multi speaker baseline

With the kind permission of Won Jang, in this section we compare our proposed model against the Universal MelGAN. We selected a few original audio sources from the unseen-domain examples on the demonstration page of the Universal MelGAN and compare our resynthesis with the one obtained by the Universal MelGAN model, which is also taken from that demonstration page. The notation is as follows:

UMG

The Universal MelGAN trained on the LJSpeech and the LibriTTS train-clean-360 datasets.

Unseen speakers

Sample | Original | MWU_SP (prop.) | Universal MelGAN
Male
Female

Unseen speaker and expressivity

Sample | Original | MWU_SP (prop.) | Universal MelGAN
Expressive 1
Expressive 2
Expressive 3
Expressive 4

Unseen language

Sample | Original | MWU_SP (prop.) | Universal MelGAN
Spanish
Chinese
Japanese
German

Singing

The MBExWN multi singer model MWU_SI uses the same parameterisation as MWU_SP. It has been trained on an aggregation of the NUS-48E (singing signals only), SVDB, PJS, JVS, VocalSet and Tohoku Kiritan datasets. The training database further included two internal databases: the ChaNTeR dataset and a Baroque singing database. In total there are 27 hours of singing by 136 singers, but more than 100 of these 136 singers are Japanese. The database contains 4h of French singing, 4h of Italian singing, 12h of English singing and 5h of Japanese singing. The styles included are Pop and Classical singing.

Seen singers

In this section we provide demo examples of the multi singer model using unseen phrases from the singers of the training database. Note that, due to the extreme pitch range, the speaker model does not work here and is therefore not included in the comparison.

Sample | Original | MWU_SI (prop.)
RT-ChaNTeR
Countertenor
Soprano
Child

Unseen, out of domain singer / single singer baselines, multi speaker baseline

In this section we compare the proposed model trained on the multi singer dataset (not including Byzantine singing) against different single singer neural vocoder baselines trained on the DIMITRIS Byzantine singing database. This database contains 2h50min of Byzantine singing by a single singer. The different baseline models are denoted as follows:

MWS_DI

An MBExWN with 240 feature channels in the pulse forming WaveNet trained exclusively on the DIMITRIS dataset. The experiment is performed using the validation set of the DIMITRIS dataset and represents the optimal performance of the model structure for single singer cases.

MMG

A Multi-Band MelGAN according to this paper, also trained exclusively on the DIMITRIS dataset. This model represents an example of the state of the art. Note that the Multi-Band MelGAN has about 4 times fewer parameters and is about 7 times faster than MWS_DI. As the current study does not aim at single speaker models, we did not optimize the model size here.

MWU_SP

The multi speaker MBExWN trained on the multi speaker (VCTK/PTDB/Att-HACK) dataset. This case shows the potential differences between the multi singer and multi speaker models for Byzantine singing, which can be considered out of domain for both MBExWN models.

Sample | Original | MWU_SI (prop.) | MWS_DI | MMG | MWU_SP
1-17
1-18
5-79
5-87

Unseen, out of domain singers / multi speaker baseline

In this section we compare the MBExWN models trained on the multi singer and multi speaker databases when applied to other unseen singers. The pop singing tracks are extracted from the CCMixter dataset; the metal singing extracts are cut from YouTube recordings.
Note that:

a few of these examples have quite strong reverberation, which is a problem for algorithms that were not trained with reverberated examples;
very rough voices and screaming are not part of the training database and are clearly a challenge for the model.