This page contains audio examples of the Multi-Band Excited WaveNet (MBExWN), a Neural Vocoder for singing and speech
re-synthesis from Mel spectrograms.
Note that, in contrast to most existing research, the objective of the MBExWN vocoder is not only high speech quality
but transparent reproduction of the original speech signal
from the Mel spectrogram.
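For readers unfamiliar with the input representation: a Mel spectrogram is obtained by projecting STFT magnitudes onto a bank of triangular mel-scale filters. The sketch below is illustrative only; the sample rate, FFT size, hop size, and number of mel bands are placeholder values and do not reflect the actual MBExWN analysis configuration.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters, evenly spaced on the mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            if c > l:
                fb[i, k] = (k - l) / (c - l)   # rising edge
        for k in range(c, r):
            if r > c:
                fb[i, k] = (r - k) / (r - c)   # falling edge
    return fb

def mel_spectrogram(x, sr=24000, n_fft=1024, hop=256, n_mels=80):
    # Frame the signal, window, FFT, then project onto the mel filters
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))        # (n_frames, n_fft//2+1)
    mel = mag @ mel_filterbank(sr, n_fft, n_mels).T  # (n_frames, n_mels)
    return np.log(np.maximum(mel, 1e-5))             # log-mel, clipped for stability
```

Given such a log-mel matrix, the vocoder's task is to reconstruct a waveform whose own Mel spectrogram matches it as closely as possible.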
Note: This page has been tested with Chrome and Firefox. In case of problems with the audio, please try one of these browsers.
The MBExWN speaker model MWU_SP demonstrated below has been trained using a pulse-forming WaveNet with 360 feature channels. The training database is an aggregation of the VCTK, PTDB, and Att-HACK databases, containing about 45h of speech from 150 speakers. Of these, only 14h come from 21 French speakers; the rest are English speakers.
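As background on the architecture family, a WaveNet is built from stacks of dilated causal convolutions. The numpy sketch below shows a single such layer; the kernel size, dilation, and channel counts are illustrative, and the pulse-forming mechanism and 360-channel configuration of MBExWN are not reproduced here.

```python
import numpy as np

def dilated_causal_conv(x, w, dilation):
    """One dilated causal convolution layer.

    x: input,  shape (channels_in, time)
    w: kernel, shape (channels_out, channels_in, kernel_size)
    Causal: the output at time t depends only on inputs at
    t, t - dilation, t - 2*dilation, ... (never the future).
    """
    c_out, c_in, k = w.shape
    T = x.shape[1]
    pad = dilation * (k - 1)
    xp = np.pad(x, ((0, 0), (pad, 0)))  # left-pad so no future samples leak in
    y = np.zeros((c_out, T))
    for tap in range(k):
        # Each kernel tap sees the input shifted by a multiple of the dilation
        y += w[:, :, tap] @ xp[:, tap * dilation: tap * dilation + T]
    return y
```

Stacking such layers with exponentially growing dilations (1, 2, 4, ...) gives the large receptive field that WaveNet-style vocoders rely on.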
In this section we provide demo examples of the multi-speaker model using phrases from the training database that were held out during training:
In this section we compare our proposed model MWU_SP against different single-speaker neural vocoder baselines, denoted as follows:
The MBExWN multi-singer model MWU_SI uses the same parameterisation as MWU_SP. It
has been trained on an aggregation of the NUS-48E
(singing signals only),
SVDB,
PJS,
JVS,
VocalSet,
and Tohoku Kiritan datasets.
The training database further included two internal databases: the
ChaNTeR dataset
and a Baroque singing database. In total there are 27
hours of singing by 136 singers, of whom more than 100
are Japanese. The database contains
4h of French, 4h of Italian, 12h of English, and 5h of Japanese
singing. The styles covered are Pop and
Classical singing.
In this section we provide demo examples of the multi-singer model using phrases from the training database that were held out during training. Note that, due to the extreme pitch range, the speaker model does not work here and therefore cannot be compared.
In this section we compare the proposed model trained on the multi-singer dataset (not including Byzantine singing)
against different single-singer neural vocoder baselines trained on
the DIMITRIS Byzantine singing database. This database contains 2h50min of Byzantine
singing by a single singer. The different baseline models are denoted as follows:
In this section we compare the MBExWN models trained on the multi-singer and multi-speaker databases
when applied to unseen singers. The pop singing tracks are
extracted from the CCMixter dataset;
the metal singing extracts are cut from YouTube recordings.
Note that some of the sounds used in this test are under Copyright © 2021 Ircam, Institut de recherche et coordination acoustique/musique.