STRAIGHT


The most sophisticated incarnation of the PP-smoothing idea is Kawahara's STRAIGHT method. STRAIGHT uses the F₀ estimate at three stages. First, it determines the size of a Gaussian window that is applied to the data before the initial FFT. This window is an adaptive analog of the narrow window of the PP-Welch method of Sect. . Making its size depend on the period optimizes resolution in both the time and frequency domains. Next, F₀ determines the size of the window that smooths the spectral coefficients in the time domain. Finally, it determines the size of the window that smooths the spectral coefficients in the spectral domain. This two-way temporal and spectral smoothing removes any evidence of voicing in either domain.
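
As a rough illustration of how a single F₀ estimate can set all three window sizes, the sketch below computes them in Python with NumPy. The function names, the gauss_scale parameter, and the choice of a smoothing width of one period in time and of F₀ in frequency are assumptions made for illustration; they are not taken from Kawahara's implementation.

```python
import numpy as np

def f0_adaptive_sizes(f0_hz, sample_rate_hz, fft_size, gauss_scale=1.0):
    """Return the three F0-dependent sizes (hypothetical parameterization)."""
    period_samples = sample_rate_hz / f0_hz   # one period, in samples
    bin_hz = sample_rate_hz / fft_size        # width of one FFT bin, in Hz
    return {
        # Stage 1: Gaussian analysis window, std proportional to the period
        # (gauss_scale is an assumed tuning factor).
        "gaussian_sigma_samples": gauss_scale * period_samples,
        # Stage 2: smoothing window along time, assumed one period long.
        "time_smoothing_samples": period_samples,
        # Stage 3: smoothing window along frequency, assumed F0 wide.
        "freq_smoothing_bins": f0_hz / bin_hz,
    }

def gaussian_window(sigma_samples, half_width_sigmas=3.0):
    """Gaussian window truncated at +/- half_width_sigmas standard deviations."""
    half = int(np.ceil(half_width_sigmas * sigma_samples))
    t = np.arange(-half, half + 1)
    return np.exp(-0.5 * (t / sigma_samples) ** 2)

# Example: a 100 Hz voice sampled at 16 kHz, analyzed with a 1024-point FFT.
sizes = f0_adaptive_sizes(100.0, 16000.0, 1024)
window = gaussian_window(sizes["gaussian_sigma_samples"])

# Halving F0 doubles the period, so every time-domain size doubles.
sizes_low = f0_adaptive_sizes(50.0, 16000.0, 1024)
```

With these assumptions a lower F₀ gives a longer analysis window and a longer temporal smoothing window, but a narrower smoothing window along the frequency axis.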

Standard STRAIGHT uses triangular windows (each the convolution of two square windows of length 1/F₀) to produce a very smooth spectrum (no discontinuities). It would also work with square windows, if staircase discontinuities in the spectral envelope are acceptable. The very smooth spectrum produced by triangular windows is probably desirable for high-quality resynthesis: the spectral envelope extracted at one F₀ is used to resynthesize at other, possibly time-varying, F₀s. Irregularities such as staircase discontinuities might produce audible artifacts. The cost of the extra smoothing is, of course, reduced resolution in both the time and frequency domains.
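
The difference between square and triangular smoothing can be seen on a toy line spectrum. The sketch below is an illustration only, not STRAIGHT itself: it assumes a spectrum sampled on a uniform grid, harmonics spaced exactly 100 bins apart, and arbitrary harmonic amplitudes. It smooths along the frequency axis with a square kernel one harmonic spacing wide, and with the triangular kernel obtained by convolving that square kernel with itself.

```python
import numpy as np

def square_kernel(n):
    """Square (boxcar) kernel of n bins, normalized to unit sum."""
    return np.ones(n) / n

def triangular_kernel(n):
    """Triangular kernel: convolution of two square kernels (2n - 1 bins)."""
    return np.convolve(square_kernel(n), square_kernel(n))

f0_bins = 100                      # assumed harmonic spacing, in bins
spectrum = np.zeros(1000)
# Arbitrary harmonic amplitudes, chosen only for the demonstration.
amplitudes = np.array([1.0, 4.0, 2.0, 8.0, 3.0, 6.0, 1.0, 5.0, 2.0, 7.0])
spectrum[::f0_bins] = amplitudes   # one line per harmonic

env_square = np.convolve(spectrum, square_kernel(f0_bins), mode="same")
env_triangular = np.convolve(spectrum, triangular_kernel(f0_bins), mode="same")

# Largest bin-to-bin jump in each envelope.
max_step_square = np.max(np.abs(np.diff(env_square)))
max_step_triangular = np.max(np.abs(np.diff(env_triangular)))
```

The square kernel yields a piecewise-constant envelope that jumps between harmonics (the staircase), whereas the triangular kernel interpolates linearly between them, so its largest bin-to-bin step is much smaller.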

STRAIGHT adds to the PP-Welch method of Sect. an optimal smoothing in the frequency domain. The PP-Welch method tacitly assumes that the FFT window is short relative to the period, and therefore that no smoothing is required in the spectral domain. As a result, both its spectral resolution and the quality of its smoothing in the spectral domain are suboptimal compared to STRAIGHT. This is probably not much of a problem in Speech Recognition, where most details of the spectral representation are thrown away to reduce the data rate.

The drawbacks of STRAIGHT are that it is computationally expensive, and probably overkill for Speech Recognition applications. It is also susceptible to subharmonic errors in the F₀-estimation process: a doubling or tripling of the estimated period would result in insufficient smoothing in the spectral domain, and the spectrum would turn out comb-shaped rather than smooth.
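
The effect of a subharmonic error can be reproduced on a toy example. The sketch below is an illustration under stated assumptions, not STRAIGHT itself: it assumes that the spectral smoothing kernel is triangular with a half-width equal to the estimated F₀, and that the true harmonics lie exactly 100 bins apart with equal amplitude. A period-doubling error halves the estimated F₀ and therefore halves the kernel width.

```python
import numpy as np

def triangular_kernel(n):
    """Triangular kernel built by convolving two unit-sum boxcars of n bins."""
    box = np.ones(n) / n
    return np.convolve(box, box)

def smooth(spectrum, width_bins):
    """Smooth along frequency with a triangular kernel of the given half-width."""
    return np.convolve(spectrum, triangular_kernel(width_bins), mode="same")

true_f0_bins = 100                 # assumed true harmonic spacing, in bins
spectrum = np.zeros(2000)
spectrum[::true_f0_bins] = 1.0     # equal-amplitude harmonics

# Correct F0 estimate: kernel half-width equals the harmonic spacing.
good = smooth(spectrum, true_f0_bins)
# Period-doubling error: estimated F0 is halved, so the kernel is too narrow.
bad = smooth(spectrum, true_f0_bins // 2)

# Measure the residual ripple away from the array edges.
interior = slice(300, 1700)
ripple_good = good[interior].max() - good[interior].min()
ripple_bad = bad[interior].max() - bad[interior].min()
```

With the correct estimate the smoothed spectrum is flat away from the edges. With the halved estimate the kernel no longer bridges adjacent harmonics, and the output keeps a comb-like ripple at the harmonic spacing.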

Alain de Cheveigne
1998-02-16