
## Is the uncertainty principle a barrier?

The time-frequency resolution tradeoff is well known:

$$\Delta f \, \Delta t = K \tag{1}$$

where $\Delta t$ is the width of the time window, $\Delta f$ is a measure of frequency resolution, and $K$ is a constant that depends on the window shape. The tradeoff implies that it is impossible to attain arbitrarily fine time and frequency resolution at the same time. This is often seen as a barrier, on the assumption that better resolution would be desirable were it possible to attain. For this reason much interest has been generated by the "hyperacuity" of the ear, for example the very sharp slopes of tuning curves, or pitch perception of short tones [Hartmann (1995) notes that the ear "beats the limits of the uncertainty principle" by a factor of 5]. On this view, the auditory system might have some magic trick that allows it to go beyond the barrier of the uncertainty principle: find the trick and make a breakthrough in speech recognition, etc. This perspective is wrong.
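To get a feel for the numbers in equation (1), here is a small sketch that converts a window duration into the corresponding frequency resolution. The function name is made up for illustration, and the value of the constant (1.0, roughly the DFT bin spacing of a rectangular window) is an assumption; the true value depends on the window shape and on how resolution is defined.

```python
def frequency_resolution_hz(window_s, k=1.0):
    """Frequency resolution implied by equation (1) for a window of window_s seconds.

    k = 1.0 is an assumed value (roughly the DFT bin spacing of a rectangular
    window); the actual constant depends on the window shape.
    """
    return k / window_s


for window_ms in (5, 10, 25, 100):
    df = frequency_resolution_hz(window_ms / 1000.0)
    print(f"{window_ms:4d} ms window -> about {df:6.1f} Hz")
```

Under this assumption a window of a few tens of milliseconds already resolves a few tens of Hz.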

In actual practice, the barrier should be seen from the other side. An FFT gives too much resolution rather than too little. Make the window too short and the spectrum fluctuates too much from one frame to the next. Make it too long and the spectrum has too much detail along the frequency axis. How to break out of this dilemma? Easy. Simply realize that (a) there's plenty of resolution: the lower limit imposed by the time-frequency tradeoff is not a problem in practice, (b) the FFT window size is the wrong place to try to determine resolution, and (c) the right place is in a smoothing stage, in the frequency and/or time domain, after the FFT.
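The dilemma can be illustrated on a toy signal. The sketch below is only an illustration under stated assumptions: a 16 kHz sampling rate, a 100 Hz pulse train standing in for voiced speech, and two hypothetical measures (`fluctuation_over_time`, `detail_over_frequency`) that I made up to put a number on "fluctuates too much" and "too much detail".

```python
import numpy as np

fs = 16000              # sampling rate in Hz (assumed)
period = 160            # 10 ms between pulses, i.e. a 100 Hz "pitch" (assumed)
x = np.zeros(fs)        # one second of signal
x[40::period] = 1.0     # toy stand-in for voiced excitation


def magnitude_frames(signal, win_len):
    """Magnitude spectra of Hann-windowed frames with 50% overlap, one row per frame."""
    window = np.hanning(win_len)
    starts = range(0, len(signal) - win_len + 1, win_len // 2)
    frames = np.stack([signal[s:s + win_len] * window for s in starts])
    return np.abs(np.fft.rfft(frames, axis=1))


def fluctuation_over_time(mags):
    """Relative frame-to-frame variation of spectral energy (0 = steady)."""
    energy = (mags ** 2).sum(axis=1)
    return float(energy.std() / energy.mean())


def detail_over_frequency(mags):
    """Relative bin-to-bin variation of the time-averaged spectrum (0 = flat)."""
    mean_spectrum = mags.mean(axis=0)
    return float(mean_spectrum.std() / mean_spectrum.mean())


short = magnitude_frames(x, 80)     # 5 ms window, shorter than the pitch period
long_ = magnitude_frames(x, 800)    # 50 ms window, five pitch periods

for name, mags in (("5 ms", short), ("50 ms", long_)):
    print(name, fluctuation_over_time(mags), detail_over_frequency(mags))
```

With the short window, a frame either catches a pulse or misses it, so the energy jumps from frame to frame while each spectrum is flat. With the long window the energy is steady, but the spectrum resolves every pitch harmonic.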

Smoothing is of course a common practice. Feature extraction typically starts with an FFT with a shaped window of 20-25 ms. This size (if I remember correctly) is the result of trial and error, and a compromise between adequate time resolution of useful transients, and adequate time-domain smoothing of pitch-related fluctuations (Nadeu et al., 1997). Frequency resolution considerations don't enter the picture. Indeed, with a 40 Hz resolution (the reciprocal of a 25 ms window) the spectrum is much too detailed. The data rate is too large, and the pitch-related details that we got rid of in the time domain now appear in the frequency domain. This excess resolution is eliminated by smoothing in the frequency domain, for example by averaging over neighboring bins, or indirectly by keeping only the low-order coefficients of the cepstrum.
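Both kinds of frequency-domain smoothing can be sketched in a few lines. This is a rough illustration, not a description of any particular front end: the sampling rate, the 200 Hz pitch, the 5-bin averaging width, the 13 cepstral coefficients, and the `roughness_db` measure are all assumptions chosen for the example.

```python
import numpy as np

fs = 16000                       # sampling rate in Hz (assumed)
n = int(round(0.025 * fs))       # 25 ms window: 400 samples, 40 Hz bin spacing
t = np.arange(n) / fs
f0 = 200                         # assumed pitch in Hz, one harmonic every 5 bins

# Toy "voiced" frame: harmonics of f0 up to just below Nyquist, amplitude 1/h.
frame = sum(np.cos(2 * np.pi * f0 * h * t) / h for h in range(1, 40))
spectrum = np.abs(np.fft.rfft(frame * np.hanning(n)))


def smooth_over_bins(mag, width):
    """Moving average over `width` neighboring frequency bins."""
    return np.convolve(mag, np.ones(width) / width, mode="same")


def lifter_low_order(mag, n_coeffs):
    """Keep only the first n_coeffs cepstral coefficients, then go back to a spectrum."""
    cepstrum = np.fft.irfft(np.log(mag + 1e-12))
    # Zero the high quefrencies; the real cepstrum is symmetric, so keep both ends.
    cepstrum[n_coeffs:len(cepstrum) - n_coeffs + 1] = 0.0
    return np.exp(np.fft.rfft(cepstrum).real)


def roughness_db(mag, lo=10, hi=150):
    """Mean absolute bin-to-bin change of the log spectrum, in dB."""
    log_mag = 20 * np.log10(mag[lo:hi] + 1e-12)
    return float(np.mean(np.abs(np.diff(log_mag))))


averaged = smooth_over_bins(spectrum, 5)    # 5 bins * 40 Hz = one pitch spacing
liftered = lifter_low_order(spectrum, 13)   # 13 coefficients: an assumed order

print("raw     ", roughness_db(spectrum))
print("averaged", roughness_db(averaged))
print("liftered", roughness_db(liftered))
```

In both cases the pitch harmonics that the 25 ms window resolves are smoothed away, leaving something closer to the spectral envelope.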

Another way of doing it is to start with a short FFT window, short enough to avoid excess resolution in the frequency domain. This is followed by smoothing in the time domain, for example by averaging consecutive spectra. The FFT itself involves a time average, so there are two consecutive time averages. Is this not equivalent to a single average with a larger window (in the FFT)? Not if the first average is followed by a non-linear operation such as taking the magnitude or computing the cepstrum. Temporal smoothing of magnitude spectra corresponds to the Welch method of spectral estimation. Smoothing of cepstra calculated with short windows has recently been proposed by Aikawa (reference?).
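Here is a minimal sketch of why the two are not equivalent, using white noise whose true spectrum is flat. It averages power (squared magnitude) spectra, as in Welch's method; the sampling rate, window sizes, and helper names are assumptions made for the example.

```python
import numpy as np


def frame_power_spectra(x, win_len, hop):
    """Power spectra of Hann-windowed frames, one row per frame."""
    window = np.hanning(win_len)
    starts = range(0, len(x) - win_len + 1, hop)
    frames = np.stack([x[s:s + win_len] * window for s in starts])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2


def relative_fluctuation(power):
    """Standard deviation across bins divided by the mean (0 = perfectly flat)."""
    return float(np.std(power) / np.mean(power))


rng = np.random.default_rng(0)
fs = 16000                        # sampling rate in Hz (assumed)
x = rng.standard_normal(fs)       # one second of white noise

# (a) A single long window covering the whole signal.
long_power = frame_power_spectra(x, len(x), len(x))[0]

# (b) Short 5 ms windows with 50% overlap, then an average over time.
short_power = frame_power_spectra(x, 80, 40)
welch_power = short_power.mean(axis=0)

# Skip the DC and Nyquist bins, which behave slightly differently.
fluct_long = relative_fluctuation(long_power[1:-1])
fluct_welch = relative_fluctuation(welch_power[1:-1])
print(fluct_long, fluct_welch)
```

The long window spends all the data on frequency resolution, and each bin remains as erratic as before. Averaging after the squared magnitude trades that resolution for a stable estimate. For real work, SciPy's `scipy.signal.welch` offers a complete implementation.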

If time and frequency resolution are controlled by the post-FFT smoothing process, there is freedom to choose the resolution (window size) of the Fourier transform within a wide range. Small window or large window, both are OK as long as they are followed by smoothing to eliminate excess resolution and lower the data rate. Nothing prevents choosing different resolutions at different frequencies, for example with wavelet analysis or an "auditory" filter bank. Different choices are not equivalent, and there is room for experimentation. For such experimentation to be meaningful, spectral analysis should always be followed by the appropriate amount of smoothing in the spectral and/or time domain.

Alain de Cheveigne
1998-02-16