Optimal smoothing of short-window features

The short-window magnitude spectrum (Welch method) or cepstrum (Aikawa) can benefit from smoothing with a square window whose length equals one period or a multiple of the period (or with any convolution of such windows). In fact any estimate (LPC, etc.) can be smoothed in this way, but the benefit is greatest for estimates that use short windows.

Smoothing with a square window whose length equals one period is
like applying a low-pass filter with zeros at the fundamental and all
its multiples. All *F*_{0}-related fluctuations are smoothed out,
leaving only the DC component. [Note: spectral analysis is assumed
to be *running*, that is, repeated at one-sample intervals. Each
component of a spectral vector may then be seen as a time-varying
signal. It is these signals that are low-pass filtered by convolution
with a square window.]
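
As an illustration, here is a minimal Python sketch of period-long
square-window smoothing applied to one feature trajectory. The
sampling rate, the fundamental, the synthetic trajectory and the
helper name `period_smooth` are all assumptions made for the example;
they are not taken from the text above.

```python
import numpy as np

def period_smooth(track, period):
    """Running mean of `track` over a square window of `period` samples."""
    window = np.ones(period) / period
    # "valid" keeps only positions where the window fully overlaps the track
    return np.convolve(track, window, mode="valid")

fs = 16000                      # assumed sampling rate (Hz)
f0 = 100.0                      # assumed fundamental frequency (Hz)
period = int(round(fs / f0))    # window length in samples (160 here)

n = np.arange(4 * period)
# Hypothetical feature trajectory sampled at every signal sample:
# a DC level plus ripple at F0 and at 2*F0.
track = (1.0
         + 0.5 * np.cos(2 * np.pi * f0 * n / fs)
         + 0.2 * np.sin(2 * np.pi * 2 * f0 * n / fs))

smoothed = period_smooth(track, period)
ripple_before = track.max() - track.min()
ripple_after = smoothed.max() - smoothed.min()
print(ripple_before, ripple_after)
```

In this idealized case the period is a whole number of samples, so the
ripple at the fundamental and its multiples cancels to within rounding
error and only the DC level remains.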

Again, reliability of *F*_{0} estimation is not a big issue. If the
*F*_{0}-estimator chooses a period multiple instead of the period, the
zeros will be more closely spaced but they will still cancel out the
fundamental and its multiples. If the *F*_{0} estimate is completely
wrong, this means that the signal was not very periodic: little is
lost by using the "wrong" window size. One might be worried about
small estimate errors, for example due to the finite sampling rate.
These cause the zeros of the transfer function to be shifted more and
more with respect to the harmonics of *F*_{0} as frequency rises, so
that the harmonics are less well suppressed. This should not be a major
problem, as the energy of the fluctuation is likely to be small in the high
frequency region, and attenuated anyhow by the overall decrease of the
*sin*(*x*)/*x* transfer function (this decrease can be made faster, at the
expense of time resolution, by using a triangular window of length
2/*F*_{0}, or some higher-order convolution of a square window). Note
that we are talking about the frequency content of *spectral
coefficients as a function of time*: the high-frequency region of the
spectrum of the signal itself is represented as adequately as the
low-frequency region.
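
The effect of a small period error can be checked numerically. The
sketch below evaluates the gain of a square window at the harmonics
of *F*_{0} when the window length has been rounded to a whole number
of samples, and compares it with a triangular window (the square
window convolved with itself, whose gain is the square of the square
window's gain). The period of 160.4 samples, the number of harmonics
and the helper name `boxcar_gain` are assumptions chosen for
illustration.

```python
import numpy as np

def boxcar_gain(freq, length):
    """Magnitude response of a unit-sum square window of `length` samples
    at normalized frequency `freq` (cycles per sample, 0 < freq < 1)."""
    freq = np.asarray(freq, dtype=float)
    return np.abs(np.sin(np.pi * freq * length)
                  / (length * np.sin(np.pi * freq)))

true_period = 160.4    # assumed true period in samples (not an integer)
window_len = 160       # window length after rounding to whole samples
harmonics = np.arange(1, 21)
freqs = harmonics / true_period   # harmonic frequencies of the fluctuation

square = boxcar_gain(freqs, window_len)
triangular = square ** 2   # square window convolved with itself

for k, s, t in zip(harmonics, square, triangular):
    print(f"harmonic {k:2d}: square {s:.2e}  triangular {t:.2e}")
```

Under these assumed values the residual gain at every harmonic stays
well below one percent for the square window and is smaller still for
the triangular window, at the cost of a window twice as long.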

The advantage of PP-smoothing is that it allows perfect time-domain
smoothing of features from voiced speech. To do so it imposes the
smallest possible penalty on temporal resolution. Temporal resolution
is a combination of two factors: the window size of spectral analysis,
and the window size of the smoothing. The latter, equal to a period,
is as short as can be imagined. The former is generally even shorter
and its contribution negligible. Without PP-smoothing, to get stable
estimates we would have to choose a smoothing window at least as large
as the *longest* expected period (that of the lowest expected
*F*_{0}), and preferably larger. The improvement in temporal
resolution can be considerable (of course one is not obliged to use
it: there might be reasons to throw away some of this resolution).
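
To give a feel for the size of the improvement, the following sketch
compares a fixed smoothing window, sized for the lowest expected
*F*_{0}, with a window sized to the actual period. The two *F*_{0}
values and the helper name `smoothing_window_ms` are assumptions
picked only for illustration.

```python
def smoothing_window_ms(f0_hz):
    """Length in milliseconds of a square window spanning one period."""
    return 1000.0 / f0_hz

f0_min = 70.0       # assumed lowest expected F0 (Hz)
f0_actual = 250.0   # assumed F0 of the voice being analysed (Hz)

fixed_ms = smoothing_window_ms(f0_min)         # window sized for the worst case
adaptive_ms = smoothing_window_ms(f0_actual)   # window sized to the actual period
gain = fixed_ms / adaptive_ms                  # ratio of the two window lengths

print(f"fixed: {fixed_ms:.1f} ms, adaptive: {adaptive_ms:.1f} ms, "
      f"ratio: {gain:.2f}")
```

With these assumed figures the period-sized window is between three
and four times shorter than the fixed one.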