The reason why speech recognition systems are so sensitive to noise is that they stupidly try to recognize it. They should know better and ignore it. The reason why they are sensitive to spectral distortion, etc., is that they have trouble matching distorted features to undistorted templates. The reason why humans perform much better is that they know how to handle missing or damaged features.
Missing or damaged features occur when there is a "dropout", when a channel is filtered, or when an "Auditory Scene Analysis" stage (or some other system) has recognized that part of the features (for example a noise burst) does not belong to the speech, and has removed it or flagged it as damaged.
Missing feature theory has been pioneered by the Sheffield group (Cooke et al. 1993, 1994a,b, 1996, ????, Barker et al., 1997), although some of the ideas have been explored elsewhere (de Cheveigné, 1993, 1996 & others, sorry no refs). Cooke et al. (1996) identified several possible ways of dealing with a feature (for example a certain spectro-temporal region) that is occluded by interference:
These are listed in order of "smartness". The first two schemes are pretty brain-dead, and the third is not much better. The fourth has been the object of much excitement, based on the notion of "phoneme restoration". I personally feel that it rests on a misinterpretation of Warren's experiments (see the second half of http://www.linguist.jussieu.fr/alain/sh/keele/t/thoughts.html). "Restoring" missing features by spectral and/or temporal interpolation or extrapolation sounds nice, but it is necessarily less optimal than just ignoring them, because no process can re-create information that is lost. By interpolating, the system makes a guess as to the missing value, and that guess may be wrong (imagine that a consonant is missing, and the system replaces it with a smooth transition between the surrounding vowels). The best strategy is instead simply to ignore what is missing: put a weight of zero on the missing features in all decisions.
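The "weight of zero" idea can be sketched as a masked distance between a feature vector and a stored template. This is an illustration, not any particular system's implementation; the function name, the normalization by the number of reliable dimensions, and the numbers are all invented for the example:

```python
import numpy as np

def masked_distance(features, template, reliable):
    """Distance between a feature vector and a template, counting only
    the dimensions flagged as reliable (weight zero everywhere else).
    Normalized by the number of reliable dimensions so that vectors
    with different amounts of missing data remain comparable."""
    reliable = np.asarray(reliable, dtype=bool)
    diff = (features - template)[reliable]
    return np.sum(diff ** 2) / max(reliable.sum(), 1)

# A damaged feature vector: dimension 2 is occluded by a noise burst.
features = np.array([1.0, 2.0, 99.0, 4.0])
template = np.array([1.0, 2.0, 3.0, 4.0])
mask = [True, True, False, True]   # False = missing/damaged

print(masked_distance(features, template, mask))  # → 0.0 (burst ignored)
print(np.mean((features - template) ** 2))        # → 2304.0 (naive distance, dominated by the burst)
```

The naive distance is swamped by the single corrupted channel, while the masked distance correctly reports a perfect match on the surviving evidence.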
The last scheme (6) is the smartest. The other schemes require the "missing portions" to be labelled as such by some prior "scene analysis" stage; the last does the labelling itself, by comparison with the templates. If the signal energy in some time/frequency zone is greater than that of a template, then the signal could correspond to that template (plus a bit of noise). If the signal energy is smaller, then the signal definitely does not match the template. The scheme is smart, but rather tricky because the data have to be normalized before they can be compared with the templates. Not much bang for the buck, IMHO.
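The counter-evidence test behind scheme 6 can be sketched in a few lines. This assumes additive noise energy and already-normalized channels (sidestepping the tricky normalization issue noted above); the function name and numbers are made up for the example:

```python
import numpy as np

def could_match(observed, template, tol=0.0):
    """Counter-evidence test: a template remains a candidate if the
    observed energy is at least the template energy in every channel
    (observed = template energy + possible noise energy). Any channel
    where the observed energy falls below the template rules it out."""
    return bool(np.all(observed >= template - tol))

observed = np.array([5.0, 9.0, 2.0])   # speech plus a noise burst in channel 1
print(could_match(observed, np.array([5.0, 3.0, 2.0])))  # True: noise could hide this template
print(could_match(observed, np.array([5.0, 3.0, 6.0])))  # False: channel 2 has too little energy
```

Note that the test never identifies which channels are noisy; it only prunes templates that the observed energies could not possibly contain, which is exactly why no prior scene-analysis labelling is needed.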
For schemes 1-4, as soon as the missing data have been given a value, the features can be passed on to the recognizer.
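As one illustration of such a fill-in step (the interpolation variant of scheme 4; the function name and data are invented), missing entries can be replaced by linear interpolation between the nearest reliable neighbours before the vector reaches the recognizer:

```python
import numpy as np

def impute_linear(features, reliable):
    """Fill missing entries by linear interpolation between the nearest
    reliable neighbours; afterwards the vector can be passed to a
    recognizer that knows nothing about missing data."""
    reliable = np.asarray(reliable, dtype=bool)
    x = np.arange(len(features))
    return np.interp(x, x[reliable], np.asarray(features, dtype=float)[reliable])

features = [1.0, 2.0, 0.0, 4.0]        # index 2 was a dropout
mask = [True, True, False, True]
print(impute_linear(features, mask))   # → [1. 2. 3. 4.]
```

The smooth fill looks plausible here, but as argued above it is only a guess: if the dropout had hidden a consonant rather than a vowel transition, the interpolated value would be confidently wrong.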
For scheme 5, the recognizer must be instructed to ignore parts of the features. It therefore needs two inputs: the features themselves, and a flag or map that specifies which parts to ignore. It must also, of course, be able to use that information. The Sheffield group has proposed techniques for this purpose within the framework of HMMs.
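In an HMM, "ignoring" a feature dimension amounts to marginalizing it out of a state's emission density; for a diagonal-covariance Gaussian the marginal is obtained simply by dropping that dimension's term. A minimal sketch of such a mask-aware state likelihood (this is not the Sheffield implementation; names and numbers are invented):

```python
import numpy as np

def masked_log_likelihood(x, mean, var, reliable):
    """Log-likelihood of frame x under a diagonal-covariance Gaussian
    HMM state, marginalizing out (i.e. skipping) the dimensions flagged
    as unreliable. With a diagonal covariance, marginalizing a dimension
    just removes its term from the sum."""
    reliable = np.asarray(reliable, dtype=bool)
    x, mean, var = x[reliable], mean[reliable], var[reliable]
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

x = np.array([1.0, 2.0, 50.0])         # third channel hit by noise
mean = np.array([1.0, 2.0, 3.0])
var = np.ones(3)

print(masked_log_likelihood(x, mean, var, [True, True, False]))  # high: bad channel marginalized out
print(masked_log_likelihood(x, mean, var, [True, True, True]))   # very low: wrecked by the bad channel
```

The same state scores well once the corrupted channel is marginalized out, which is the effect the mask input is meant to achieve during decoding.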