Having thus established smoothing as a step separate from the FFT, and independent of its window size, let's look at how its parameters are chosen. It was pointed out earlier that temporal smoothing must be a compromise between tracking meaningful spectral transitions (for example, those associated with consonants) and avoiding spurious fluctuations due to voicing. Spectral smoothing involves a similar compromise between preserving the detail of meaningful spectral features (formants) and removing voice-related spectral ripple. A "compromise" between two constraints suggests that both are satisfied suboptimally. Because of the need to accommodate low-pitched voices, temporal smoothing probably does not respect transitions the way it should. Yet because of the need to respect them to some degree, voice-related fluctuations are not perfectly removed. This is likely to be a problem mainly for dynamic features (Δ-cepstrum and ΔΔ-cepstrum) that estimate slopes. Similar remarks might be made in the frequency domain with respect to formant shapes and positions, particularly in the low-frequency region. (Slope-based features are less common in the spectral domain, so insufficient smoothing is perhaps less of a problem in the spectral domain than in the time domain.)
Voice pitch-related fluctuations are one term of the compromise. If they could be eliminated, the compromise could be shifted to the benefit of the other term. This is the principle of pitch-period-smoothed feature extraction.
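The idea can be illustrated with a small numerical sketch. It assumes, as the name suggests, that the smoothing window is matched to the pitch period; the waveform, the period of 80 samples, and the helper names (`periodic_signal`, `boxcar_smooth`, `ripple`) are hypothetical choices made for illustration only. For a strictly periodic signal, a running mean of the instantaneous power taken over exactly one period is constant, whereas a window of any other length (here 1.5 periods) leaves residual pitch-related ripple.

```python
import math


def periodic_signal(period, n_samples):
    """Toy 'voiced' waveform (an assumption): a decaying sinusoid
    that restarts every `period` samples."""
    out = []
    for n in range(n_samples):
        t = (n % period) / period          # phase within the period, 0..1
        out.append(math.exp(-5.0 * t) * math.sin(2.0 * math.pi * 4.0 * t))
    return out


def boxcar_smooth(x, width):
    """Running mean over `width` samples (fully overlapping positions only)."""
    acc = sum(x[:width])
    out = [acc / width]
    for i in range(width, len(x)):
        acc += x[i] - x[i - width]         # slide the window by one sample
        out.append(acc / width)
    return out


def ripple(x):
    """Peak-to-peak fluctuation of a sequence."""
    return max(x) - min(x)


period = 80                                # e.g. 100 Hz pitch at 8 kHz sampling
power = [s * s for s in periodic_signal(period, 10 * period)]

# Window equal to one pitch period: fluctuations cancel.
ripple_matched = ripple(boxcar_smooth(power, period))
# Window of 1.5 periods: pitch-related ripple remains.
ripple_mismatched = ripple(boxcar_smooth(power, period * 3 // 2))

print("ripple, window = 1 period:   ", ripple_matched)
print("ripple, window = 1.5 periods:", ripple_mismatched)
```

In this toy setting the matched window removes the voicing ripple without being any longer than one period, which is what would allow the compromise to shift toward tracking transitions. Real speech is only approximately periodic and its period must be estimated, so the cancellation would be less exact in practice.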