

A first example is filtering by the transmission channel (telephone, microphone, room characteristics). If the characteristics of the channel do not fluctuate too much and can be estimated, the speech can be re-emphasized to compensate. Portions for which attenuation was severe (or perhaps difficult to estimate) are given a small weight (possibly zero) to reflect their unreliability.
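As a rough illustration of the channel case, the sketch below assumes the features are log-spectral band levels in dB and that an estimate of the channel gain per band is available. The function name, the linear weighting rule, and the 30 dB threshold are all hypothetical choices made for the example, not part of the original scheme.

```python
def compensate_channel(observed_db, channel_gain_db, max_attenuation_db=30.0):
    """Undo an estimated channel response and attach a weight to each band.

    The names and the 30 dB threshold are illustrative assumptions.
    """
    compensated = []
    weights = []
    for observed, gain in zip(observed_db, channel_gain_db):
        # Filtering is additive in the log domain, so subtracting the
        # estimated gain re-emphasizes the band.
        compensated.append(observed - gain)
        attenuation = max(0.0, -gain)
        # Weight falls linearly from 1 (no attenuation) to 0
        # (attenuation at or beyond the threshold).
        weights.append(max(0.0, 1.0 - attenuation / max_attenuation_db))
    return compensated, weights


observed = [60.0, 52.0, 35.0, 20.0]   # band levels after the channel
channel = [0.0, -6.0, -15.0, -40.0]   # estimated channel gain per band
features, weights = compensate_channel(observed, channel)
```

In this example the last band was attenuated by more than the threshold, so it is re-emphasized but receives zero weight.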

A second example is predictable noise: 50/60 Hz hum, hiss, background reverberation, periodic helicopter noise, wideband noise bursts, etc. If the position of the noise in time and/or frequency can be determined, then the affected features can be given zero weight.
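The effect of zero weights on template matching can be sketched as follows. This is a minimal example, assuming a recognizer that compares a feature vector to a template with a weighted squared distance; the helper names and the normalization by total weight are assumptions made for illustration.

```python
def mask_bands(num_bands, noisy_bands):
    """Weight vector: 0 for bands known to contain noise, 1 elsewhere."""
    return [0.0 if band in noisy_bands else 1.0 for band in range(num_bands)]


def weighted_distance(features, template, weights):
    """Weighted squared distance, normalized by the total weight.

    Returns None when every feature has zero weight, since no
    reliable evidence is left to compare.
    """
    total_weight = sum(weights)
    if total_weight == 0:
        return None
    squared = sum(w * (f - t) ** 2
                  for f, t, w in zip(features, template, weights))
    return squared / total_weight


features = [1.0, 9.0, 3.0, 4.0]   # band 1 is corrupted, e.g. by hum
template = [1.0, 2.0, 3.0, 5.0]

unweighted = weighted_distance(features, template, [1.0] * 4)
masked = weighted_distance(features, template, mask_bands(4, {1}))
```

With the corrupted band masked, the distance is governed by the remaining bands only, so a single damaged feature no longer dominates the match.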

A third example is harmonic cancellation. Harmonic cancellation removes an interfering voice by applying a comb filter that suppresses all of the harmonic components of that voice. The filter may also suppress some of the target components, and this spectral distortion may interfere with recognition (de Cheveigné, 1993). Recognition can be improved by applying (a) an inverse filter to compensate for the distortion, and (b) a confidence measure to reduce the weight of features in the vicinity of the zeros of the comb filter. Alternatively, the same distortion can be applied to the templates. This scheme is effective, as described in an ATR technical report (de Cheveigné, 1993).
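The two measures (a) and (b) can be sketched for one simple comb filter. The example assumes the filter y(t) = x(t) - x(t - T), with T the period of the interfering voice, whose magnitude response is 2|sin(pi f T)| with zeros at multiples of the interferer's F0. The filter form, the cap on the inverse gain, and the confidence rule are assumptions chosen for illustration and are not taken from the cited report.

```python
import math


def comb_gain(freq_hz, interferer_f0_hz):
    """Magnitude response of y(t) = x(t) - x(t - T), with T = 1 / F0."""
    return 2.0 * abs(math.sin(math.pi * freq_hz / interferer_f0_hz))


def compensation_for_band(freq_hz, interferer_f0_hz, max_boost=10.0):
    """Return (inverse_gain, confidence) for a band centered at freq_hz.

    max_boost and the confidence rule are hypothetical choices.
    """
    gain = comb_gain(freq_hz, interferer_f0_hz)
    # (a) Inverse filter: undo the comb response, but cap the boost
    # because the gain tends to zero near the comb zeros.
    inverse_gain = min(max_boost, 1.0 / gain) if gain > 0 else max_boost
    # (b) Confidence: near 0 at the zeros, 1 wherever the comb
    # attenuates the band by less than a factor of 1.
    confidence = min(1.0, gain)
    return inverse_gain, confidence


# Interferer at 100 Hz: bands at 100, 150 and 200 Hz.
bands = [compensation_for_band(f, 100.0) for f in (100.0, 150.0, 200.0)]
```

Bands that coincide with a harmonic of the interferer get the capped boost and a confidence near zero, while bands midway between harmonics are attenuated back to their original level and keep full confidence.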

A fourth example, similar to the third, is blind separation, or binaural cancellation. The combination of the mixing matrix and the separation matrix constitutes a linear filter that necessarily introduces spectral distortion. However, this distortion can be estimated, and therefore compensated for at the recognition stage.

A fifth example is a hypothetical recognizer using prosodic information such as F0. F0 estimation is often unreliable, but it is usually possible to know when that is the case. A reliability measure produced by the F0 estimator (for example Kawahara's "fundamentalness", or the AMDF residual) can be used as a confidence measure.
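One way to turn the AMDF residual into a confidence value is sketched below. The AMDF (average magnitude difference function) itself is standard, but the normalization of the residual by the mean AMDF value, the lag range, and the function names are assumptions made for this example.

```python
import math
import random


def amdf(signal, lag):
    """Average magnitude difference between the signal and a shifted copy."""
    count = len(signal) - lag
    return sum(abs(signal[i] - signal[i + lag]) for i in range(count)) / count


def f0_with_confidence(signal, sample_rate, min_lag, max_lag):
    """Return (f0_hz, confidence) from the deepest AMDF dip.

    Confidence is 1 minus the residual at the dip divided by the mean
    AMDF over the lag range (a hypothetical normalization).
    """
    values = {lag: amdf(signal, lag) for lag in range(min_lag, max_lag + 1)}
    best_lag = min(values, key=values.get)
    mean_value = sum(values.values()) / len(values)
    if mean_value == 0:
        return sample_rate / best_lag, 0.0
    confidence = max(0.0, 1.0 - values[best_lag] / mean_value)
    return sample_rate / best_lag, confidence


# A periodic signal (period 80 samples, i.e. 100 Hz at 8 kHz) ...
periodic = [math.sin(2 * math.pi * i / 80) for i in range(400)]
f0, periodic_conf = f0_with_confidence(periodic, 8000, 40, 120)

# ... and an aperiodic one, for which the estimate should not be trusted.
rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(400)]
_, noise_conf = f0_with_confidence(noise, 8000, 40, 120)
```

The periodic signal yields a residual close to zero and hence a confidence close to one, whereas the noise yields a shallow dip and a low confidence, which a recognizer could use to down-weight the F0 feature.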

A sixth example is a hypothetical recognizer that uses both acoustic and visual (face) information. If the reliability of each channel (A and V) can be estimated, then the relative weights of the two sources can be adjusted. For example, if a speaker turns away from the camera, visual information should be ignored. If a jet plane goes overhead, acoustic information should be ignored, etc. Massaro has shown that this is how humans integrate information from multiple sensory dimensions.

This last example is of course hypothetical, but it suggests that missing feature theory might be a key to recognition based on multimodal cues.
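The audio-visual weighting described above can be sketched as a reliability-weighted combination of per-class scores. This is a minimal illustration, assuming each channel produces a log-likelihood per candidate label and a reliability between 0 and 1; the function name and the linear combination rule are assumptions, and the code is not intended as a model of Massaro's results.

```python
def fuse_scores(audio_scores, visual_scores, audio_reliability, visual_reliability):
    """Combine per-label log-likelihoods with reliability-based weights.

    Both score dictionaries are assumed to share the same labels.
    """
    total = audio_reliability + visual_reliability
    if total == 0:
        raise ValueError("no reliable channel available")
    audio_weight = audio_reliability / total
    visual_weight = visual_reliability / total
    return {
        label: audio_weight * audio_scores[label]
        + visual_weight * visual_scores[label]
        for label in audio_scores
    }


def best_label(scores):
    """Label with the highest fused score."""
    return max(scores, key=scores.get)


audio = {"ba": -1.0, "ga": -3.0}    # acoustic evidence favors "ba"
visual = {"ba": -4.0, "ga": -0.5}   # visual evidence favors "ga"

# Speaker turned away from the camera: visual channel ignored.
turned_away = best_label(fuse_scores(audio, visual, 1.0, 0.0))
# Jet plane overhead: acoustic channel ignored.
jet_overhead = best_label(fuse_scores(audio, visual, 0.0, 1.0))
```

Setting a reliability to zero removes that channel from the decision entirely, which is the multimodal counterpart of giving zero weight to a damaged feature.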

Alain de Cheveigne