
5   Training

Training means adapting the probabilities and probability distributions that govern the HMM to one or more example performances, so as to optimise the quality of the follower. Two things can be trained: the transition probabilities between the states of the Markov chain, and the probability density functions (PDFs) of the observation likelihoods. The former applies to both audio and Midi, but needs a large amount of example data, especially performances containing errors. The latter can be done for audio by a statistical analysis of the features to derive the PDFs, which essentially map a feature value to a probability of attack, sustain, or rest.

5.1   State of Affairs

The model uses thresholded exponential PDFs to calculate the observation likelihoods, because they are better adapted to the acoustic features than the Gaussian distributions usually used in HMMs.
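As an illustration only, one plausible reading of a thresholded exponential likelihood can be sketched in Python. The functional form (flat up to a threshold, exponential decay above it), the parameter names `threshold` and `scale`, and the fact that the result is an unnormalised likelihood rather than a normalised density are all assumptions of this sketch; the follower's actual formula is not given in this section.

```python
import math

def thresholded_exp_likelihood(x, threshold, scale):
    """Unnormalised likelihood of feature value x (assumed form).

    Returns 1.0 for x at or below `threshold` and decays as
    exp(-(x - threshold) / scale) above it. Both parameters are
    hypothetical names chosen for this sketch.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    excess = max(0.0, x - threshold)  # distance above the threshold, if any
    return math.exp(-excess / scale)

# Example: a feature value below the threshold gets the maximal likelihood,
# a value two scale units above it is strongly penalised.
low = thresholded_exp_likelihood(0.5, threshold=1.0, scale=2.0)
high = thresholded_exp_likelihood(5.0, threshold=1.0, scale=2.0)
```

For a feature that should be large rather than small in a given state, the same function could be applied to the negated feature value and threshold; this mirroring is again an assumption, not a description of the model.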

The PDF parameters were determined in Matlab by analysing different sounds from the Studio On Line sound database, in particular trumpet, clarinet, saxophone, and flute. However, because of the FFT window bug described in section 4, these parameters did not work and had to be adjusted by hand. The hand-chosen parameters worked very well, but the need for manual tuning made it difficult to extend the model with new features.

5.2   What has been done

Reference alignments for the audio files were prepared by Vincent Goudard. For the 1998 CD recording of En Echo, synchronised Midi scores (from the Audio-to-Midi function in an old version of StudioVision) were available, and the note positions were verified manually. For the new recordings, a pitch tracker in jMax was used to determine the notes, but the note lengths were edited manually.

Example patches with all recordings for one piece and their reference alignments were prepared. (It should be possible to mix a recording of orchestra noise into the pieces at different levels. The patches would be best in matrix form: rows = sections, columns = different recordings, takes, singers.)

The "training" of observation probabilities is done by off-line analysis in Matlab of feature data written to SDIF files by suiviaudiofeat, according to a text representation of the model dumped by suivimakeref (see section 7 for a closer description of these objects):

Figure 1: State tree

Figure 2: Feature histograms of the beginning of section Riviere of En Echo
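
The kind of per-state statistics behind such feature histograms can be sketched as follows. This is a minimal Python illustration, not the Matlab analysis itself: it assumes the feature values and the state label of each analysis frame (taken from a reference alignment) have already been loaded into two parallel lists, and the function name, the state labels, and the binning parameters are hypothetical.

```python
from collections import defaultdict

def histograms_by_state(values, states, lo, hi, n_bins):
    """Histogram one feature separately for each state label.

    values: feature value per frame; states: state label per frame.
    Bins are n_bins equal-width intervals on [lo, hi); values outside
    that range are ignored.
    """
    if len(values) != len(states):
        raise ValueError("values and states must have the same length")
    if n_bins <= 0 or hi <= lo:
        raise ValueError("need n_bins > 0 and hi > lo")
    width = (hi - lo) / n_bins
    hists = defaultdict(lambda: [0] * n_bins)
    for v, s in zip(values, states):
        if lo <= v < hi:
            # min() guards against rounding pushing the index to n_bins
            index = min(n_bins - 1, int((v - lo) / width))
            hists[s][index] += 1
    return dict(hists)

# Made-up example data: six frames with hypothetical state labels.
values = [0.05, 0.1, 0.8, 0.9, 0.45, 0.5]
states = ["rest", "rest", "attack", "attack", "sustain", "sustain"]
hists = histograms_by_state(values, states, lo=0.0, hi=1.0, n_bins=4)
```

Comparing the histograms of one feature across states shows how well that feature separates attack, sustain, and rest, which is the information needed to choose PDF parameters.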


5.3   What is to be done

The automatic determination of the PDF parameters is still experimental. As far as we know, there is no literature on it, because HMMs usually use Gaussian distributions. The way a human determines the parameters must be captured robustly by injecting some prior knowledge about the features.
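
One conceivable starting point for such an automatic estimation is sketched below in Python. It assumes, purely for illustration, that the feature values observed in one state follow a shifted exponential distribution: the threshold is taken as a low quantile of the samples, and the scale as the mean excess over that threshold, which is the maximum-likelihood estimate of an exponential scale when the location is fixed. The prior knowledge is represented by simple bounds on the scale. The function name, the quantile, and the bounds are hypothetical and do not describe the procedure actually used.

```python
def fit_shifted_exponential(samples, quantile=0.05, scale_bounds=(1e-3, 1e3)):
    """Estimate (threshold, scale) of an assumed shifted exponential.

    threshold: the sample at the given lower quantile.
    scale: mean excess of the samples at or above the threshold,
           clamped to scale_bounds (a crude stand-in for prior knowledge).
    """
    if not samples:
        raise ValueError("need at least one sample")
    if not 0.0 <= quantile < 1.0:
        raise ValueError("quantile must be in [0, 1)")
    ordered = sorted(samples)
    threshold = ordered[int(quantile * len(ordered))]
    # The threshold itself is a sample, so this list is never empty.
    excesses = [x - threshold for x in ordered if x >= threshold]
    scale = sum(excesses) / len(excesses)
    lower, upper = scale_bounds
    scale = min(upper, max(lower, scale))
    return threshold, scale

# Made-up example: four feature values observed in one state.
threshold, scale = fit_shifted_exponential([1.0, 2.0, 3.0, 4.0], quantile=0.0)
```

Clamping the scale keeps a sparse or atypical training excerpt from producing degenerate parameters, which is one simple way to encode what a human would rule out when tuning by hand.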

The analysis and statistics can be programmed in a jMax or Max/MSP object, which could then be used to adapt the parameters to specific room acoustics or to a new instrument without needing to run Matlab.

Then, of course, a real iterative training of the transition and observation probabilities (either supervised, by providing a reference alignment, or unsupervised, starting from the already good alignment obtained to date) is necessary to increase the robustness of the follower even further. This training could also adapt to the "style" of a certain singer or musician.
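
The supervised case for the transition probabilities can be illustrated with a minimal Python sketch: given state sequences read off reference alignments, transitions are counted and normalised, with additive smoothing so that transitions never seen in the examples keep a small non-zero probability. The function name, the state labels, and the smoothing constant are assumptions of this sketch; a real training would iterate and would also re-estimate the observation PDFs.

```python
from collections import Counter, defaultdict

def estimate_transitions(sequences, smoothing=1.0):
    """Estimate transition probabilities from labelled state sequences.

    sequences: list of state-label lists, one per aligned performance.
    smoothing: additive constant (> 0) added to every transition count.
    Returns a dict: probs[from_state][to_state] -> probability.
    """
    if smoothing <= 0:
        raise ValueError("smoothing must be positive")
    states = sorted({s for seq in sequences for s in seq})
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    probs = {}
    for a in states:
        total = sum(counts[a].values()) + smoothing * len(states)
        probs[a] = {b: (counts[a][b] + smoothing) / total for b in states}
    return probs

# Made-up example: one short aligned performance with hypothetical labels.
probs = estimate_transitions([["rest", "attack", "sustain", "sustain", "rest"]])
```

Because rare events such as performer errors appear in few examples, the estimated probabilities for the corresponding transitions depend heavily on the smoothing constant, which illustrates why this kind of training needs a large amount of example data.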

