Training means adapting the various probabilities and probability distributions
governing the HMM to one or more example performances so as to optimise the
quality of the follower. Two different things can be trained: the transition
probabilities between the states of the Markov chain, and the probability
density functions (PDFs) of the observation likelihoods. The former applies to
both audio and MIDI, but needs a large amount of example data, in particular
performances containing errors. The latter can be done for audio by a
statistical analysis of the features to derive the PDFs, which essentially map
a feature value to a probability of attack, sustain, or rest.
5.1 State of Affairs
The model uses thresholded exponential PDFs to calculate the observation
likelihoods, because they are better adapted to the acoustic features than the
Gaussian distributions usually used in HMMs.
The PDF parameters were determined in Matlab by analysing different sounds
from the Studio On Line sound database, in particular trumpet, clarinet,
saxophone, and flute. However, because of the FFT window bug described in
section 4, these parameters did not work and had to be adjusted by hand.
The hand-chosen parameters worked very well, but as a consequence it was
difficult to extend the model with new features.
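As an illustration, a thresholded exponential PDF could be sketched as follows. The exact functional form is not given in this section, so the sketch assumes a shifted exponential that is zero below the threshold and decays with scale µ above it. The function name and the Python formulation are hypothetical and do not reproduce the follower's code.

```python
import math

def thresholded_exp_pdf(x, threshold, mu):
    """Assumed form: zero below `threshold`, exponential decay with scale `mu` above it."""
    if mu <= 0:
        raise ValueError("mu must be positive")
    if x < threshold:
        return 0.0
    # Shifted exponential density; it integrates to 1 over [threshold, inf).
    return math.exp(-(x - threshold) / mu) / mu
```

With this form, only two numbers per feature and state class have to be chosen, which is what made adjustment by hand feasible.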
5.2 What has been done
Reference alignments for the audio files were prepared by Vincent Goudard. For
the 1998 CD recording of En Echo, synchronised MIDI scores (from the
Audio-to-Midi function in an old version of StudioVision) were
available, and the note positions were verified manually. For the new
recordings, a pitch tracker in jMax was used to determine the notes, and the
note lengths were edited manually.
Example patches with all recordings for one piece and their reference
alignments were prepared. (It should be possible to mix recordings of
orchestra noise into the pieces at different levels. The patches would best be
organised as a matrix: rows = sections, columns = different recordings, takes,
singers.)
The "training" of the observation probabilities is done by off-line analysis
in Matlab of the feature data written to SDIF files by suiviaudiofeat,
according to a text representation of the model dumped by
suivimakeref (see section 7 for a closer description of these
objects). It proceeds as follows:
1. Automatic identification (from the score) of the state classes each note
   belongs to (see figure 1 for the class hierarchy in tree form):
   - Attack states: A-new (after a rest), A-legato (after a note), A-same
     (after the same note)
   - Sustain states: S-normal (normal), S-forte (note is f...fff), S-piano
     (p...ppp, but see below), S-fricative (fricative for voice)
   - Rest states: the rests should also distinguish the pause length, giving
     R-same-note, R-legato, and R-detached. R-detached is a superclass that
     can be decomposed further into R-after-forte (last note was f...fff),
     R-detached-normal, and R-after-piano (last note was p...ppp), provided
     that the note velocity carries reliable dynamics information (the
     dynamics-dependent states are not yet used).
2. The histogram of values for each class and each feature is computed over
   one or more sound files. Figure 2 shows the histograms and statistical
   parameters (average, standard deviation, threshold, and µ) of the first
   10 seconds of section Riviere of En Echo for the features log-energy
   (len), delta log-energy (dle), peak structure match (psm), delta peak
   structure match (dpsm), cepstral difference (cepd), and zero crossing
   rate (zcr), and for the state classes attack (a), sustain (s), release
   (r), note, and rest.
3. The threshold and mean of the exponential PDF are determined by looking
   for the maximum frequency in the histogram. For delta features, the
   threshold is based on the standard deviation.
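The histogram-based estimation can be sketched in Python as follows (the actual analysis was done in Matlab). Several details are assumptions made for illustration only: the threshold is taken as the lower edge of the most populated histogram bin, µ as the mean excess of the values at or above that bin, and the delta-feature threshold as a hypothetical factor k times the standard deviation. None of the function names come from the original code.

```python
import statistics

def bin_index(value, lo, width, bins):
    """Index of the histogram bin containing `value` (the last bin is closed)."""
    return min(int((value - lo) / width), bins - 1)

def estimate_exp_params(values, bins=50):
    """Return (threshold, mu) for one feature and one state class.

    Assumption: threshold = lower edge of the most populated bin,
    mu = mean excess over the threshold of all values in that bin or above.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero bin width for constant data
    counts = [0] * bins
    for v in values:
        counts[bin_index(v, lo, width, bins)] += 1
    peak = counts.index(max(counts))  # bin with the maximum frequency
    threshold = lo + peak * width
    excess = [max(v - threshold, 0.0) for v in values
              if bin_index(v, lo, width, bins) >= peak]
    return threshold, statistics.fmean(excess)

def delta_threshold(values, k=1.0):
    """Threshold for a delta feature as k standard deviations (k is hypothetical)."""
    return k * statistics.pstdev(values)
```

Such a function would be called once per feature and state class, on the feature values of all frames that the reference alignment assigns to that class.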
Figure 1: State tree
Figure 2: Feature histograms of the beginning of section Riviere
of En Echo
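The score-based identification of the attack and sustain classes of the state tree can be sketched as follows. The score representation (a list of pitch and dynamic-marking pairs, with pitch None for a rest) and all function names are assumptions made for this example. Rest classes and S-fricative are left out because they need information on pause length and phonemes that this simplified score does not carry.

```python
FORTE = {"f", "ff", "fff"}
PIANO = {"p", "pp", "ppp"}

def attack_class(prev_pitch, pitch):
    """Attack class of a note, given the pitch of the previous event (None = rest or start)."""
    if prev_pitch is None:
        return "A-new"      # after a rest
    if prev_pitch == pitch:
        return "A-same"     # after the same note
    return "A-legato"       # after a different note

def sustain_class(dynamic):
    """Sustain class from a dynamic marking such as "pp" or "f" (None if unknown)."""
    if dynamic in FORTE:
        return "S-forte"
    if dynamic in PIANO:
        return "S-piano"
    return "S-normal"

def label_notes(score):
    """Return one (attack class, sustain class) pair per note; rests are skipped."""
    labels = []
    prev_pitch = None
    for pitch, dynamic in score:
        if pitch is not None:
            labels.append((attack_class(prev_pitch, pitch), sustain_class(dynamic)))
        prev_pitch = pitch
    return labels
```

For example, `label_notes([(60, "mf"), (60, "f"), (None, None), (62, "pp")])` labels the second note as A-same and S-forte, and the note after the rest as A-new and S-piano.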
5.3 What is to be done
The automatic determination of PDF parameters is still experimental. As far as
we know, there is no literature on it, because HMMs usually use Gaussian
distributions. We must robustly capture the way a human determines the
parameters by injecting some prior knowledge about the features.
The analysis and statistics could be implemented in a jMax or Max/MSP object,
so that the parameters can be adapted to specific room acoustics or to a new
instrument without needing to run Matlab.
Finally, a real iterative training of the transition and observation
probabilities is necessary to increase the robustness of the follower further.
It can be supervised, by providing a reference alignment, or unsupervised,
starting from the already good alignment obtained to date. This training could
also adapt to the "style" of a certain singer or musician.
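As a minimal sketch of the supervised case, transition probabilities can be re-estimated by counting the state transitions in reference alignments. This shows only the counting step, not a full iterative training. It assumes that each alignment is available as a sequence of state indices, one per analysis frame; the function name and the smoothing constant are hypothetical.

```python
def estimate_transitions(state_sequences, n_states, smoothing=1e-3):
    """Estimate a row-stochastic transition matrix from aligned state sequences.

    `smoothing` is added to every count so that transitions never seen in the
    examples keep a small non-zero probability.
    """
    counts = [[smoothing] * n_states for _ in range(n_states)]
    for seq in state_sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        if total > 0:
            matrix.append([c / total for c in row])
        else:
            # State never visited and no smoothing: fall back to a uniform row.
            matrix.append([1.0 / n_states] * n_states)
    return matrix
```

In an unsupervised variant, the reference alignment would be replaced by the follower's own alignment, and estimation and re-alignment would alternate until the parameters stabilise.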