The bottom line

So, how best to apply these ideas to SR? The PP-delta feature calculation method is the first thing to try. It is easy to implement and should give the most "bang for the buck", especially when $\Delta^{2}$ features are needed, and on databases containing low-pitched (male) voices. It can be applied to any existing method for extracting primary features, and should give an immediate improvement by reducing the noise of their $\Delta$s. A difficulty is that the $\Delta$s can no longer be derived from the low-frame-rate sequence of primary features: primary and $\Delta$ features must be calculated together, and the calculations are somewhat more expensive (FFTs must be repeated after a $1/F_0$ time shift).
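As a rough illustration of the extra cost, here is a minimal sketch, not the implementation used in this work. It assumes the delta is taken as the difference between primary features computed one pitch period apart, which is how the $1/F_0$ shift is read here. The names `log_power_spectrum` and `pp_delta`, the rectangular window, the plain DFT, and the choice of log power as the primary feature are all illustrative assumptions. The period is assumed known and given as a whole number of samples.

```python
import cmath
import math

def log_power_spectrum(x, start, n):
    """Log power spectrum (bins 0..n//2) of the n samples of x beginning at
    `start`, computed with a plain DFT and a rectangular window."""
    frame = x[start:start + n]
    out = []
    for k in range(n // 2 + 1):
        s = sum(frame[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))
        out.append(math.log(abs(s) ** 2 + 1e-12))  # small floor avoids log(0)
    return out

def pp_delta(x, start, n, period):
    """Hypothetical PP-delta: difference between the features of two frames
    taken `period` samples (one pitch period) apart, per sample of shift."""
    a = log_power_spectrum(x, start, n)           # primary feature at time t
    b = log_power_spectrum(x, start + period, n)  # repeated after the shift
    return [(bk - ak) / period for ak, bk in zip(a, b)]
```

For a perfectly periodic signal, the two frames are identical and the delta vanishes whatever the position of the short window within the period. The same difference taken over some other shift reflects the position of the window within the period rather than any change in the spectral envelope.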

The second thing to try, given that PP-delta takes care of voice-related fluctuation, is to experiment with less smoothing of the primary features and with sharper differentiation formulas. By relaxing one term of the compromise, the other may be brought closer to its optimum.

The third thing to try is PP-smoothing of the primary features, in particular the PP version of the Welch or Aikawa methods. The aim is not so much to get smoother spectra as to gain freedom to experiment. For example, it might be interesting to redo a study such as that of Nadeu et al. (1997) on time sequences of spectral parameters (TSSP), using PP-smoothed parameters. PP-smoothing allows extra flexibility to experiment with various window sizes and shapes (and with the type of non-linearity inserted between the FFT and the smoothing: magnitude, cepstrum, etc.), without risking excess fluctuation. Non-uniform frequency resolution (wavelets, auditory filterbank) is another thing to try, with the understanding that, whatever the frequency resolution, temporal resolution will be adjusted by PP-smoothing. PP-smoothing can also be applied to more sophisticated auditory models such as AIM.
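The smoothing step itself can be sketched as follows, reading PP-smoothing as an average over exactly one pitch period. This is only a sketch under assumptions: the name `pp_smooth` is hypothetical, the features are assumed to be already computed at one frame per sample (after whatever non-linearity is chosen), the smoothing window is rectangular, and the period is a known whole number of frames. It does not reproduce the Welch or Aikawa methods, only the averaging stage.

```python
def pp_smooth(frames, period):
    """Average each feature dimension over `period` consecutive frames.

    frames: list of feature vectors computed at a high frame rate
            (assumed here: one frame per sample).
    period: pitch period, in frames (assumed known and integer).
    """
    n_out = len(frames) - period + 1
    dim = len(frames[0])
    out = []
    for t in range(n_out):
        out.append([sum(frames[t + j][d] for j in range(period)) / period
                    for d in range(dim)])
    return out
```

Any component of the feature sequence that repeats at the pitch period averages to a constant, so the size and shape of the short analysis window can be varied freely. The smoothed sequence can then be subsampled to the usual frame rate.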

The computational expense may be relatively high: features must be calculated at a high frame rate before smoothing (after smoothing they may be sampled at standard rates such as 100 Hz). One can take advantage of the fact that the FFT windows are small to reduce this expense. It might even be possible to use a form of running DFT rather than an FFT, at least for square windows.
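To make the running-DFT idea concrete, here is a minimal sketch of the standard sliding-DFT recurrence for a square window advanced by one sample: $X_{t+1}[k] = (X_t[k] - x[t] + x[t+N])\,e^{2\pi i k/N}$. The function names `dft` and `sliding_dft` are illustrative, and whether this is actually cheaper than repeated FFTs depends on the window size and on how many bins are needed, which is not evaluated here.

```python
import cmath
import math

def dft(frame):
    """Plain DFT of one frame, used only to initialise the recurrence."""
    n = len(frame)
    return [sum(frame[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))
            for k in range(n)]

def sliding_dft(x, n):
    """DFT of every length-n square window of x, hop of one sample.

    Each update costs O(n) instead of recomputing a full transform:
    drop the oldest sample, add the newest, then rotate the phase.
    """
    spec = dft(x[:n])
    twiddle = [cmath.exp(2j * math.pi * k / n) for k in range(n)]
    out = [list(spec)]
    for t in range(len(x) - n):
        spec = [(spec[k] - x[t] + x[t + n]) * twiddle[k] for k in range(n)]
        out.append(list(spec))
    return out
```

Rounding errors accumulate in the recurrence, so over long signals the spectrum would need to be re-initialised from a direct transform now and then.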

A final thing to try is full-blown STRAIGHT analysis. This is computationally expensive, but given the usefulness of STRAIGHT someone may invest in a highly efficient implementation. Given the low spectral resolution required by SR, however, full STRAIGHT analysis is probably overkill.

Alain de Cheveigne
1998-02-16