Auditory Scene Analysis and Perceptual Organisation: an update

C J Darwin

Experimental Psychology

University of Sussex

Introduction

Almost everyone would agree that we need to distinguish, logically, between

* explicit properties of the raw auditory signal, as reflected in the signals passing up the auditory nerve; and

* implicit properties that are specific to a single sound source.

Bob Carlyon and I (Darwin and Carlyon, 1995) have produced a recent extensive review of the simple cues that can be used to group together sounds for the purpose of making various perceptual judgements. The present talk presents some new material mainly from my own lab but also from others that has appeared since that review, and draws some conclusions.

So, for example, as Charles Bethell-Fox and I showed in 1977 (Darwin and Bethell-Fox, 1977), the silence that cues a stop consonant can emerge as an implicit property of explicitly continuous speech when it is perceptually segmented by discontinuities in the pitch contour. The pitch discontinuities lead to the percept of two voices, and each voice is implicitly silent when the other voice is speaking.

The interesting question for sounds in general is: what are the mechanisms that make the initially implicit properties explicit? In particular for speech, in the terms used by Bregman in his book:

* how much of the segregation of speech sounds can be explained by low-level, general auditory grouping principles, and

* how much the grouping of speech sounds needs recourse to more specific mechanisms that exploit the general schematic properties of speech or the specific properties of individual speech sounds or words.

Our work over the last 10 years has demonstrated that some simple grouping cues powerfully constrain the implicit features considered by speech-specific mechanisms, and also have broadly similar effects with complex but familiar non-speech sounds such as musical instruments.

Use of Fo differences in separating speech sounds

There have been two main paradigms used to establish the importance of pitch differences in separating simultaneous speech signals:

* Brokx and Nooteboom (1982) showed that identification of the content words of a semantically anomalous sentence, resynthesised by LPC on a monotone, increased from 40% to 60% as its pitch difference from the background speech increased from 0 to 3 semitones.

* Scheffers (1983) and later Assmann & Summerfield (1990) showed a much more rapid increase in the identification of simultaneous pairs of synthetic vowels ("double vowels") with change in Fo.

Greg Sandell and I have extended the generality of the last paradigm by asking listeners to identify not pairs of vowels, but pairs of musical instruments. These sounds were taken from recordings of real instruments (Opolko and Wapnick, 1989), had natural pitch and amplitude variations (apart from a gradual final decay to keep the durations constant at 500 ms), and were resynthesised using a phase vocoder on average pitches increasing from Bb3 (233.1 Hz). The pattern of results is strikingly similar to that obtained for double vowels, with most of the improvement in performance taking place over the first semitone of pitch difference (see later). So it is unlikely that radically different mechanisms are responsible for the improvement in the case of vowels and that shown here for musical instruments.

Identification rate for pairs of instruments as function of pitch differences

Such results with artificially constrained steady-state vowels or musical notes have a different pattern from those using continuous resynthesised natural speech, where performance continues to increase out to pitch differences of 3 semitones or more (Brokx et al., 1979; Brokx and Nooteboom, 1982).
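The semitone differences quoted in this section can be converted into frequency ratios, Fo values in Hz, and percentage Fo differences. The following is a minimal Python sketch, assuming equal-tempered semitones (a ratio of 2^(1/12) per semitone); the function names are illustrative and are not taken from any of the cited studies.

```python
def semitone_ratio(semitones):
    """Frequency ratio for an interval in equal-tempered semitones."""
    return 2.0 ** (semitones / 12.0)


def shifted_f0(base_f0_hz, semitones):
    """Fo (Hz) lying the given number of semitones above base_f0_hz."""
    return base_f0_hz * semitone_ratio(semitones)


def percent_f0_difference(semitones):
    """Fo difference as a percentage of the lower Fo."""
    return (semitone_ratio(semitones) - 1.0) * 100.0


if __name__ == "__main__":
    bb3_hz = 233.1  # lowest average pitch quoted for the instrument pairs
    for st in (0, 1, 2, 3, 8):
        print(st, round(shifted_f0(bb3_hz, st), 1),
              round(percent_f0_difference(st), 1))
```

On this assumption, a difference of 1 semitone is an Fo difference of about 5.9%, and 3 semitones is about 18.9%.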

John Bird and I recently attempted to bridge the gap between the above two paradigms by looking at how the word identification rate increased with an Fo difference for sentences masked by other sentences, but where all the sentences contained only voiced consonants, and very few of these consonants were stops. We anticipated that the effect of pitch differences between the target sentence and the masking sentence for such stimuli might be increased by avoiding exploitable periods of silence in the masking voice. We also expected that subjects would not be able to use the rather exotic cue of waveform interactions (Assmann and Summerfield, 1994; Culling and Darwin, 1994) that gives such a rapid increase with double vowels, when we used continuous speech.

Stimulus configuration and response regime for Bird & Darwin experiment.

Our results showed a much larger effect of pitch differences (from 20% to 80% correct word identification) than has been found in previous experiments, but one that did not show the initial very steep rise found for the double vowel experiments. In our experiment most of the improvement in identification occurred between 1 and 2 semitones, a range where harmonic spacing becomes comparable with auditory bandwidth, but there was some continuing improvement out to 8 semitones.

Word recognition rate plotted as a function of relative pitch difference.
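One possible reading of the remark about harmonic spacing and auditory bandwidth can be sketched numerically: compare the frequency separation between corresponding harmonics of the two voices with the width of an auditory filter at that frequency. The sketch below rests on assumptions that are not in this paper: it uses the widely used equivalent rectangular bandwidth (ERB) approximation of Glasberg and Moore, ERB = 24.7(4.37F + 1) Hz with F in kHz, and a hypothetical Fo of 100 Hz. It is meant only as an illustration, not as the analysis behind the results.

```python
def erb_hz(freq_hz):
    """Equivalent rectangular bandwidth (Hz) at freq_hz.

    Assumption: Glasberg and Moore's approximation, not cited in the paper.
    """
    return 24.7 * (4.37 * freq_hz / 1000.0 + 1.0)


def harmonic_separation_in_erbs(f0_hz, harmonic_number, semitones):
    """Separation between the nth harmonics of two voices, in ERB units.

    The lower voice has fundamental f0_hz; the other voice is `semitones`
    higher. The ERB is evaluated at the lower voice's harmonic.
    """
    lower = harmonic_number * f0_hz
    upper = lower * 2.0 ** (semitones / 12.0)
    return (upper - lower) / erb_hz(lower)


if __name__ == "__main__":
    f0 = 100.0  # hypothetical Fo; the paper does not state one here
    for st in (1, 2, 8):
        for n in (5, 10, 20):
            print(st, n, round(harmonic_separation_in_erbs(f0, n, st), 2))
```

Under these assumptions the separation at the tenth harmonic is a little under half an ERB at 1 semitone and approaches one ERB at 2 semitones, which is at least consistent with the range in which most of the improvement was found.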

The next figure crudely compares our results with those obtained in a typical double vowel experiment, and with those obtained by Brokx & Nooteboom for semantically anomalous sentences.

These crude differences in the results could be due to:

* for double vowels the initial rapid increase in identification being due to waveform interactions (Culling and Darwin, 1994), which are no use for continuous speech;

* silences in one speaker allowing easy identification of the pitch of the other speaker, and

* pitch differences helping to keep track of a particular voice across time.

Word recognition rate plotted as a function of absolute % Fo difference.

Localisation: Effectiveness of ITD cues

There have been a number of demonstrations recently that ITD cues do not provide a strong basis for grouping the components of speech sounds. For example, Culling & Summerfield (1995) presented listeners with two simultaneous "whispered" vowels each of whose first two formants were represented by a pair of noise bands. They found that listeners were no better at identifying the vowel they heard on the left when the noise bands of the target vowel had a different interaural time difference (ITD) (+390 us) from those of the other vowel (-390 us) than when all four noise bands had the same ITD with the left ear leading (+390 us). This result argues against listeners being able to use a common ITD across different frequencies to perceptually segregate simultaneous complex sounds.

Rob Hukin and I (Hukin and Darwin, 1995b) partially confirmed Culling & Summerfield's conclusion using a paradigm which uses shifts in the phoneme boundary along an /I/-/e/ F1-continuum to estimate the extent to which a single harmonic in the first formant region is being incorporated into the percept of the vowel. We found, as predicted by Culling & Summerfield's conclusions, that ITDs alone were rather ineffective at segregating a single harmonic from a vowel. However, we also found that they became rather more effective when the to-be-segregated tone was part of a brief sequence of tones with the same ITD. The tone sequence itself tended to pull the harmonic from the vowel, but in addition that effect was larger when those tones all had a different ITD from the rest of the vowel than when they had the same ITD as the rest of the vowel. We also found that mixing conditions which had the tone sequence with conditions that did not increased the segregation due to ITD for the conditions without the sequence, indicating a possible cueing of the direction of a potential additional sound source.

More recently Rob Hukin and I have demonstrated the dominance of grouping by frequency continuity over a common ITD (as suggested by Deutsch's (1975) scale illusion) in vowel perception. We use the same paradigm as previously. Vowels (56 ms) from an /I/-/e/ F1-continuum, as illustrated in the following figure, are preceded by a pure tone (56 ms, ISI = 500 ms) at the fourth harmonic's frequency (500 Hz).

The preceding tone, the 500-Hz harmonic, and the remainder of the vowel are given ITDs of either +666us or -666us. The /I/-/e/ phoneme boundaries for the various conditions are shown in the next figures.

They show that:

* extending our previous result (Hukin and Darwin, 1995b), in the context of trials with a preceding tone, an ITD is quite effective at segregating a harmonic from a vowel (L vs R);

* a single preceding tone at 500 Hz will remove the 500-Hz harmonic from the vowel more effectively when it is located on the side opposite the vowel than when it is on the same side as the vowel;

* moreover, it does this regardless of the side of the harmonic.

What seems to be happening is that the preceding tone forms a perceptual group with the 500-Hz harmonic regardless of ITDs. This group is then less likely to be incorporated into the vowel when the preceding tone originated from the opposite side to the vowel than when it originated from the same side. ITD is still furnishing a cue that helps segregation of the harmonic from the vowel, but is one whose effectiveness is modulated by grouping according to frequency continuity.
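For orientation, an ITD can be expressed as an interaural phase difference at a given frequency, or as a delay in samples at a given sampling rate. The following minimal Python sketch does both and builds a two-channel pure tone carrying an ongoing ITD. The sampling rate of 44.1 kHz, the function names, and the sign convention (positive ITD means the left ear leads, as in the Culling & Summerfield example above) are assumptions made for illustration, not details of the experiments described here.

```python
import math


def itd_phase_degrees(itd_us, freq_hz):
    """Interaural phase difference (degrees) of an ITD at freq_hz."""
    return 360.0 * freq_hz * itd_us * 1e-6


def itd_in_samples(itd_us, sample_rate_hz):
    """ITD expressed as a (possibly fractional) number of samples."""
    return itd_us * 1e-6 * sample_rate_hz


def lateralised_tone(freq_hz, duration_ms, itd_us, sample_rate_hz=44100):
    """Return (left, right) sample lists for a pure tone with an ongoing ITD.

    Positive itd_us delays the right channel, so the left ear leads.
    Only the ongoing phase is delayed; both channels start and stop
    together, and no onset or offset ramps are applied.
    """
    n = int(round(duration_ms * 1e-3 * sample_rate_hz))
    itd_s = itd_us * 1e-6
    left, right = [], []
    for i in range(n):
        t = i / sample_rate_hz
        left.append(math.sin(2.0 * math.pi * freq_hz * t))
        right.append(math.sin(2.0 * math.pi * freq_hz * (t - itd_s)))
    return left, right


if __name__ == "__main__":
    print(itd_phase_degrees(666, 500))   # phase difference at 500 Hz
    print(itd_in_samples(666, 44100))    # delay in samples at 44.1 kHz
    left, right = lateralised_tone(500.0, 56.0, 666.0)
    print(len(left), len(right))
```

On these assumptions an ITD of 666 microseconds corresponds to roughly 120 degrees of interaural phase at 500 Hz, and to a delay of about 29 samples at 44.1 kHz.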

Ventriloquism effects

It is tempting for us, working in hearing and speech, to seek to explain the perceptual segregation of sounds in terms that are purely auditory. But a recent experiment by Jon Driver (1996) shows dramatically that interactions between vision and hearing can have a substantial influence over effects which we have previously tried to explain in purely acoustic or auditory terms.

Driver played to subjects two simultaneous triplets of di-syllabic words from a single loudspeaker. A TV monitor showed a face speaking the target triplet. Subjects were instructed to recall only the target triplet. When the monitor was above the loudspeaker, subjects recalled 59% of the target words, but when the monitor was above a dummy loudspeaker 27 degrees away, subjects recalled 78% of the target words. Illusory relocation of a sound source thus appears to be giving a substantial amount of "release from masking" with no acoustic or auditory separation. It is an intriguing question whether a binaural detection task could reveal a similar advantage for illusory spatial separation. But it is perhaps more likely that the effect arises at the level of trying to track a particular acoustic source over time.

Generality, and the nature of grouped representations

Finally, I would like to address the question of how general grouping mechanisms are, and what we can deduce from the generality of the effects that we find about the mechanisms and representations underlying the phenomena of auditory grouping.

First, it seems clear that within a particular perceptual task quantitatively similar results for a given grouping cue are obtained from experiments that use very different types of sound. For example, the tolerance for harmonic mistuning in pitch perception is the same regardless of whether the pitch is that of a flat-spectrum complex (Moore et al., 1985; Darwin and Carlyon, 1995; Hukin and Darwin, 1995b), a vowel (Hukin and Darwin, 1995a) or a natural or stylised musical instrument (Sandell et al., 1995).

Second, however, there are marked parametric differences in the effectiveness of a particular cue across different perceptual tasks. For instance, much less onset asynchrony is needed to segregate a harmonic from a vowel in determining the vowel's quality than in determining the vowel's pitch (Hukin and Darwin, 1995a).

This observation suggests that it would be simplistic to suppose that general mechanisms created immutable groups so that subsequent categorisers had effectively no access to earlier, parametric information.

It seems more likely that each categorising mechanism has access to parametric information about individual grouping cues.

The special case of speech?

The role of auditory grouping mechanisms in speech perception has been the subject of considerable debate (Whalen and Liberman, 1987; Remez et al., 1994). The ability of listeners to understand sentences cued only by three time-varying sinusoids at the formant frequencies is strong evidence that otherwise independent sounds can contribute to a common phonetic percept. Such evidence for what Bregman has called schema-driven grouping mechanisms (which may involve a substantial contribution from vision) does not preclude low-level grouping mechanisms also operating to aid the perception of speech. Given that speech itself can be regarded as a sound produced by multiple sources (the vocal folds vibrating and/or producing aspiration, fricative sound sources, explosive bursts, clicks etc.), general auditory processes may be more useful in producing appropriate abstract auditory features for schema-based processes to work on than in producing a complete speech sound source. For example, it appears to be the case that much less onset asynchrony is needed to segregate a harmonic from a vowel than a formant from a vowel (Darwin, 1981; Darwin, 1984). Low-level grouping cues may be more effective at the level of identifying formant frequencies than at the task of grouping together formants into a vowel percept.

References

Assmann, P. F. and Summerfield, A. Q. (1990). "Modelling the perception of concurrent vowels: Vowels with different fundamental frequencies," J. Acoust. Soc. Am. 88, 680-697.

Assmann, P. F. and Summerfield, A. Q. (1994). "The contribution of waveform interactions to the perception of concurrent vowels," J. Acoust. Soc. Am. 95, 471-484.

Brokx, J. P. L. and Nooteboom, S. G. (1982). "Intonation and the perceptual separation of simultaneous voices," J. Phon. 10, 23-36.

Brokx, J. P. L., Nooteboom, S. G. and Cohen, A. (1979). "Pitch differences and the integrity of speech masked by speech," IPO Annual Progress Report 14, 55-60.

Culling, J. F. and Darwin, C. J. (1994). "Perceptual and computational separation of simultaneous vowels: cues arising from low frequency beating," J. Acoust. Soc. Am. 95, 1559-1569.

Culling, J. F. and Summerfield, Q. (1995). "Perceptual separation of concurrent speech sounds: absence of across-frequency grouping by common interaural delay," J. Acoust. Soc. Am. 98, 785-797.

Darwin, C. J. (1981). "Perceptual grouping of speech components differing in fundamental frequency and onset-time," Quart. J. Exp. Psychol. 33A, 185-208.

Darwin, C. J. (1984). "Perceiving vowels in the presence of another sound: constraints on formant perception," J. Acoust. Soc. Am. 76, 1636-1647.

Darwin, C. J. and Bethell-Fox, C. E. (1977). "Pitch continuity and speech source attribution," J. exp. Psychol.: Hum. Perc. & Perf. 3, 665-672.

Darwin, C. J. and Carlyon, R. P. (1995). "Auditory grouping," in The handbook of perception and cognition, Volume 6, Hearing edited by B. C. J. Moore (Academic, London), pp. 387-424.

Deutsch, D. (1975). "Two-channel listening to musical scales," J. Acoust. Soc. Am. 57, 1156-1160.

Driver, J. (1996). "Enhancement of selective listening by illusory mislocation of speech sounds due to lip-reading," Nature 381, 66-68.

Hukin, R. W. and Darwin, C. J. (1995a). "Comparison of the effect of onset asynchrony on auditory grouping in pitch matching and vowel identification," Percept. Psychophys. 57, 191-196.

Hukin, R. W. and Darwin, C. J. (1995b). "Effects of contralateral presentation and of interaural time differences in segregating a harmonic from a vowel," J. Acoust. Soc. Am. 98, 1380-1387.

Moore, B. C. J., Glasberg, B. R. and Peters, R. W. (1985). "Relative dominance of individual partials in determining the pitch of complex tones," J. Acoust. Soc. Am. 77, 1853-1860.

Opolko, F. and Wapnick, J. (1989). McGill University Master Samples User's Manual (McGill University, Faculty of Music, 555 Sherbrooke Street West, Montreal, Quebec, Canada H3A 1E3).

Remez, R. E., Rubin, P. E., Berns, S. M., Pardo, J. S. and Lang, J. M. (1994). "On the perceptual organization of speech," Psych. Rev. 101, 129-156.

Sandell, G. J., Schloerscheidt, A. and Darwin, C. J. (1995). "Grouping of Harmonics in Natural vs. Synthetic Musical Instrument Tones," Society for Music Perception and Cognition Conference; Berkeley, June 22-25, 1995.

Scheffers, M. T. (1983). "Sifting vowels: Auditory pitch analysis and sound segregation," Groningen University, The Netherlands. Ph.D. dissertation.

Whalen, D. H. and Liberman, A. M. (1987). "Speech perception takes precedence over nonspeech perception," Science 237, 169-171.