Having thus established smoothing as a step separate from the FFT, and
independent of its window size, let's look at how its parameters are
chosen. It was pointed out earlier that temporal smoothing must be a
compromise between tracking meaningful spectral transitions (for
example associated with consonants), and avoiding the effects of
spurious fluctuations due to voicing. Spectral smoothing involves a
similar compromise between detail of meaningful spectral features
(formants) and voice-related spectral ripple. A "compromise" between
two constraints suggests that both are satisfied suboptimally.
Because of the need to accommodate low-pitch voices, temporal smoothing
probably does not respect transitions the way it should. Yet because
of the need to respect them to some degree, voice-related
fluctuations are not perfectly removed. This is likely to be a
problem mainly for dynamic features (delta-cepstrum and
delta-delta-cepstrum) that estimate slopes. Similar remarks might
be made in the frequency domain with respect to formant shapes and
positions, particularly in the low-frequency region. (Slope-based
features are less common in the spectral domain, so insufficient
smoothing is perhaps less of a problem in the spectral domain than in
the time domain.)
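
The temporal side of this compromise can be illustrated with a small numerical sketch. Everything in it is an assumption made for illustration and is not taken from the text: the sampling rate, the 80 Hz pitch, the idealized envelope (a step transition plus a cosine ripple at the pitch frequency), the rectangular smoothing window, and the helper names `boxcar_smooth`, `residual_ripple` and `rise_time`. It compares a short and a long fixed smoothing window.

```python
import numpy as np

fs = 16000    # sampling rate in Hz (assumed for illustration)
f0 = 80.0     # a low-pitched voice (assumed)

t = np.arange(int(0.4 * fs)) / fs
step = (t >= 0.2).astype(float)             # idealized spectral transition
ripple = 0.5 * np.cos(2 * np.pi * f0 * t)   # idealized voicing fluctuation
envelope = 1.0 + step + ripple


def boxcar_smooth(x, width):
    """Moving average over `width` samples, output same length as input."""
    kernel = np.ones(width) / width
    return np.convolve(x, kernel, mode="same")


def residual_ripple(width):
    """Peak-to-peak ripple left in a steady stretch before the transition."""
    y = boxcar_smooth(envelope, width)
    steady = y[int(0.05 * fs):int(0.15 * fs)]
    return steady.max() - steady.min()


def rise_time(width):
    """Seconds taken by the smoothed step to go from 10% to 90%."""
    y = boxcar_smooth(step, width)
    return (np.argmax(y >= 0.9) - np.argmax(y >= 0.1)) / fs


short, long_ = int(0.005 * fs), int(0.030 * fs)   # 5 ms and 30 ms windows
for width in (short, long_):
    print(f"{1000 * width / fs:.0f} ms window: "
          f"ripple {residual_ripple(width):.3f}, "
          f"rise time {1000 * rise_time(width):.1f} ms")
```

In this toy setting the short window follows the transition quickly but leaves most of the ripple, while the long window attenuates the ripple at the cost of a slower transition. Neither setting satisfies both constraints.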
Voice pitch-related fluctuations are one term of the compromise. If they could be eliminated, then the compromise could be shifted to the benefit of the other term. This is the principle of pitch-period-smoothed feature extraction.
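
A minimal sketch of why matching the smoothing window to the pitch period helps: a moving average taken over exactly one period of a periodic fluctuation yields a constant. The sketch assumes a known, constant pitch, a rectangular window, and an idealized cosine ripple. These choices, the parameter values, and the helper name `smoothed_ripple` are illustrative assumptions and not a description of the actual procedure.

```python
import numpy as np

fs = 16000                      # sampling rate in Hz (assumed)
f0 = 80.0                       # pitch in Hz, assumed known and constant
period = int(round(fs / f0))    # pitch period in samples

t = np.arange(int(0.4 * fs)) / fs
# Steady-state envelope with an idealized voicing ripple at the pitch rate.
envelope = 1.0 + 0.5 * np.cos(2 * np.pi * f0 * t)


def smoothed_ripple(width):
    """Peak-to-peak ripple remaining after a rectangular moving average."""
    kernel = np.ones(width) / width
    y = np.convolve(envelope, kernel, mode="valid")   # no edge effects
    return y.max() - y.min()


for width in (period // 2, period, int(1.5 * period)):
    print(f"window = {width / period:.1f} periods: "
          f"ripple {smoothed_ripple(width):.2e}")
```

With the window equal to one period the ripple vanishes up to rounding error, whereas a window of one and a half periods, although longer, still lets some ripple through. A period-matched window can therefore be shorter than a fixed window chosen to handle the lowest expected pitch, which leaves more room to respect transitions.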