4 Audio
4.1 State of Affairs
The audio score following object suiviaudio is based on a Hidden Markov Model
(HMM), as described in [OD01]. It uses the audio features
log-energy and delta log-energy to distinguish rests from notes, and the energy
in harmonic bands according to the note pitch, and its delta, as described in
[OS01], to match the played notes to the expected notes. The energy in
harmonic bands is also called PSM for peak structure match.
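
As an illustration of the rest/note features named above, the following is a minimal sketch of a per-frame log-energy and its delta. The function names, the energy floor, and the choice of a first-order frame difference for the delta are assumptions made for this example and are not taken from suiviaudio; the PSM feature is not shown, since its definition is given in [OS01].

```python
import math

def log_energy(frame, floor=1e-10):
    """Log of the mean squared amplitude of one frame of samples.

    The floor (an assumed value) avoids log(0) on digital silence.
    """
    energy = sum(x * x for x in frame) / len(frame)
    return math.log(max(energy, floor))

def with_deltas(values):
    """Pair each value with its difference from the previous frame.

    The first frame has no predecessor, so its delta is set to 0.
    """
    deltas = [0.0] + [b - a for a, b in zip(values, values[1:])]
    return list(zip(values, deltas))

# Three toy frames: silence, then two identical frames of a loud signal.
frames = [[0.0] * 4, [0.5, -0.5, 0.5, -0.5], [0.5, -0.5, 0.5, -0.5]]
features = with_deltas([log_energy(f) for f in frames])
```

In this toy input the delta is large and positive at the onset (second frame) and zero during the sustained part, which is the kind of behaviour that lets the delta feature mark the transition from a rest to a note.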
As described in section 2.1, the following works very well, but only
monophonic scores could be parsed, and no ghost states were implemented because
not enough examples were available to determine the transition probabilities
(i.e. all performances were presumed to be without errors).
4.2 What has been done
- Unify score parsing with suivimidi (section 6), which implies
that ghost states for notes and rests are now also added for audio. However,
these stay unused because not enough error data is available.
- Add a tuning parameter that gives the frequency for A, default is 440 Hz.
- Add a "silence level" control (which is not used yet) for the calculation
of the rest probability, to adapt to different room acoustics.
- Add an integer input to force the follower k steps forwards (or
backwards if negative) to allow human intervention in case of a failure (also
available for Midi).
- Add nozigzag mode to audio to prevent the follower from skipping
backwards (as for Midi, this is no longer useful because following is quite
robust now).
- The trill and general special event representation has been changed to
use a special Midi channel (given as a parameter to the follower) and to
represent the trill as a chord, instead of using the velocity to encode the
distance to the second note.
- Output rests. Define a score representation for important rests, for
instance when a cue is to be output at the onset of the rest: note 127 on the
special channel (trill channel). Same as for Midi.
- Set PDF parameters (threshold and mean). Either the PDF for one feature
and one state class (attack / sustain / release) can be set individually, or
all parameters can be changed from a matrix (jMax's fmat data
type, which can easily be imported from an ASCII file). This way, parameters
determined from off-line training can be loaded (see section 5).
- Debug FFT: The window calculation was wrong, which prevented the use of
the cepstral difference (see section 4.2.1) but did not perturb the other
features, and had therefore remained undetected.
- Add cepstral difference to detect vowel changes (see section 4.2.1).
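
To illustrate the tuning parameter from the list above, here is a minimal sketch of how a reference frequency for A can shift the expected pitch of every score note. It assumes equal temperament and that the reference A is Midi note 69; both are assumptions of this example, and the function name midi_to_hz is hypothetical.

```python
def midi_to_hz(note, a_hz=440.0):
    """Frequency in Hz of a Midi note number.

    Assumes equal temperament with the reference A at Midi note 69
    (an assumption of this sketch); a_hz is the tuning parameter.
    """
    return a_hz * 2.0 ** ((note - 69) / 12.0)

# With an orchestra tuned to 442 Hz, every expected pitch moves up slightly.
a_default = midi_to_hz(69)          # reference A at the default tuning
a_octave_442 = midi_to_hz(81, 442)  # one octave above A, tuned to 442 Hz
```

Since the harmonic bands used for matching are placed according to the note pitch, a mismatch between the assumed and the actual tuning would move the bands away from the played partials, which is the motivation for exposing this value as a parameter.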
4.2.1 Cepstral Difference
The cepstral difference, or cepstral flux, cpd is defined as
where R is the order of the cepstrum, here 12, c and c' are the vectors
of the current and the previous cepstral coefficients, respectively, calculated
from a window of the signal S by
$$ c = \mathrm{IFFT}\left( \log \left| \mathrm{FFT}(S) \right| \right) \tag{2} $$
The term |FFT(S)| is already calculated for the PSM, so only the
inverse FFT calculation is added.
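
The computation in equation (2) can be sketched with NumPy as follows. Several details are assumptions of this example and not statements about suiviaudio: the log floor, the choice to keep coefficients 1 to R (dropping coefficient 0), and above all the distance used in cepstral_difference, which is taken here to be the Euclidean distance between the two coefficient vectors as one plausible reading of a "difference".

```python
import numpy as np

def cepstrum(frame, order=12, floor=1e-12):
    """Real cepstrum of one windowed frame: IFFT(log|FFT(S)|).

    Returns coefficients 1..order. Dropping coefficient 0 and the
    log floor are assumptions of this sketch.
    """
    spectrum = np.abs(np.fft.fft(frame))
    c = np.fft.ifft(np.log(np.maximum(spectrum, floor))).real
    return c[1:order + 1]

def cepstral_difference(c, c_prev):
    """Assumed distance: Euclidean norm of the coefficient difference."""
    return float(np.sqrt(np.sum((c - c_prev) ** 2)))

# Two toy frames with the same pitch but different spectral envelopes.
t = np.arange(256)
window = np.hanning(256)
sine = window * np.sin(2 * np.pi * 8 * t / 256)
square = window * np.sign(np.sin(2 * np.pi * 8 * t / 256))
cpd = cepstral_difference(cepstrum(square), cepstrum(sine))
```

Because the low-order cepstral coefficients describe the spectral envelope rather than the pitch, a value like this stays small while a vowel is held and rises when the envelope changes, which is why it is a candidate for detecting vowel changes.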
4.3 What is to be done
- Add fricative detection, e.g. with zero crossing rate (zcr), or by
spectral envelope. Define PDF for fricative feature. Define score
representation for fricative (e.g. Midi note 120 on special channel = former
trill channel, or in Midi text events).
- Add a calibration mode (a toggle) to measure what is considered
"silence" in a certain acoustic environment (to adapt to room acoustics in a
concert hall).
- Change the Hidden Markov Model so that it can loop on the attack state of
each note, to better incorporate slow attacks and give more weight to the delta
features which are only taken into account in the attack state.
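
For the fricative detection proposed in the first item of this list, a zero crossing rate could be computed per frame roughly as follows. This is only a sketch of the general technique: the function name and the normalisation by the number of adjacent sample pairs are assumptions, and no threshold or PDF is suggested here.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
    )
    return crossings / (len(frame) - 1)

# A rapidly alternating (noise-like) frame versus a slowly varying one.
noisy = zero_crossing_rate([1, -1, 1, -1])
slow = zero_crossing_rate([1, 1, -1, -1])
```

Fricatives are noise-like and therefore tend to give a high rate, while sung vowels are dominated by a few low partials and give a low one, so the feature is cheap to add next to the existing energy features.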