C J Darwin
Experimental Psychology
University of Sussex
* explicit properties of the raw auditory signal, as reflected in the signals passing up the auditory nerve; and
* implicit properties that are specific to a single sound source.
Bob Carlyon and I (Darwin and Carlyon, 1995) have produced a recent extensive
review of the simple cues that can be used to group together sounds for the
purpose of making various perceptual judgements. The present talk presents some
new material mainly from my own lab but also from others that has appeared
since that review, and draws some conclusions.
So, for example, as Charles Bethell-Fox and I showed in 1977 (Darwin and
Bethell-Fox, 1977), the silence that cues a stop consonant can emerge as an
implicit property of explicitly continuous speech when it is perceptually
segmented by discontinuities in the pitch contour. The pitch discontinuities
lead to the percept of two voices, and each voice is implicitly silent when the
other voice is speaking.
The interesting question for sounds in general is what mechanisms make the initially implicit properties explicit. In particular for speech, in the terms used by Bregman in his book:
* how much of the segregation of speech sounds can be explained by low-level, general auditory grouping principles, and
* how much the grouping of speech sounds needs recourse to more specific mechanisms that exploit the general schematic properties of speech or the specific properties of individual speech sounds or words.
Our work over the last 10 years has demonstrated that some simple grouping cues powerfully constrain the implicit features considered by speech-specific mechanisms, and also have broadly similar effects with complex but familiar non-speech sounds such as musical instruments.
* Brokx and Nooteboom (1982) showed that identification of the content words of a semantically anomalous sentence, resynthesised by LPC on a monotone, increased from 40% to 60% as its pitch difference from background speech increased from 0 to 3 semitones.
* Scheffers (1983) and later Assmann & Summerfield (1990) showed a much more rapid increase in the identification of simultaneous pairs of synthetic vowels ("double vowels") with change in Fo.
Greg Sandell and I have extended the generality of the last paradigm by asking
listeners to identify not pairs of vowels, but pairs of musical instruments.
These sounds were taken from recordings of real instruments (Opolko and
Wapnick, 1989), had natural pitch and amplitude variations (apart from a
gradual final decay to keep the durations constant at 500 ms), and were
resynthesised using a phase vocoder on average pitches increasing from Bb3
(233.1 Hz). The pattern of results is strikingly similar to that obtained for
double vowels, with most of the improvement in performance taking place over
the first semitone of pitch difference (see later). So it is unlikely that
radically different mechanisms are responsible for the improvement in the case
of vowels and that shown here for musical instruments.
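
For readers who want to relate the semitone differences quoted here to frequencies, the following is a minimal sketch that assumes equal-tempered semitones (a ratio of 2^(1/12) per semitone). Only the 233.1 Hz base comes from the text. The function names and the step sizes in the loop are illustrative and are not the values used in the experiment.

```python
def shifted_frequency(base_hz: float, semitones: float) -> float:
    """Frequency lying `semitones` equal-tempered semitones above `base_hz`."""
    return base_hz * 2.0 ** (semitones / 12.0)


def percent_difference(semitones: float) -> float:
    """A pitch difference in semitones expressed as a percentage change in Fo."""
    return (2.0 ** (semitones / 12.0) - 1.0) * 100.0


BASE_HZ = 233.1  # Bb3, the lowest average pitch quoted in the text

# Illustrative step sizes only (assumption); not the experiment's conditions.
for st in (0.25, 0.5, 1.0, 2.0, 4.0):
    print(f"{st:>4} st: {shifted_frequency(BASE_HZ, st):6.1f} Hz "
          f"(+{percent_difference(st):.1f}%)")
```

On this assumption one semitone corresponds to a change of just under 6% in Fo, which is how semitone and percentage scales of pitch difference relate to each other.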

Identification rate for pairs of instruments as a function of pitch difference.
Such results with artificially constrained steady-state vowels or musical notes
have a different pattern from those using continuous resynthesised natural
speech, where performance continues to increase out to pitch differences of 3
semitones or more (Brokx et al., 1979; Brokx and Nooteboom, 1982).
John Bird and I recently attempted to bridge the gap between the above two
paradigms by looking at how the word identification rate increased with an Fo
difference for sentences masked by other sentences, but where all the sentences
contained only voiced consonants, and very few of these consonants were stops.
We anticipated that the effect of pitch differences between the target sentence
and the masking sentence for such stimuli might be increased by avoiding
exploitable periods of silence in the masking voice. We also expected that,
when we used continuous speech, subjects would not be able to use the rather
exotic cue of waveform interactions (Assmann and Summerfield, 1994; Culling and
Darwin, 1994) that gives such a rapid increase with double vowels.

Stimulus configuration and response regime for Bird & Darwin experiment.
Our results showed a much larger effect of pitch differences (from 20% to 80% correct word identification) than has been found in previous experiments, but one that did not show the initial very steep rise found for the double vowel experiments. In our experiment most of the improvement in identification occurred between 1 and 2 semitones, a range where harmonic spacing becomes comparable with auditory bandwidth, but there was some continuing improvement out to 8 semitones.
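
One possible reading of the remark about harmonic spacing and auditory bandwidth can be sketched numerically. The sketch expresses the frequency offset between corresponding harmonics of two voices in units of the auditory filter bandwidth. It rests on assumptions that are not in the talk: the equivalent rectangular bandwidth (ERB) approximation of Glasberg and Moore (1990) is used as the measure of auditory bandwidth, and the 100 Hz fundamental and the choice of the 10th harmonic are purely hypothetical.

```python
def erb_hz(centre_hz: float) -> float:
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore, 1990).

    Assumption: this approximation stands in for "auditory bandwidth".
    """
    return 24.7 * (4.37 * centre_hz / 1000.0 + 1.0)


def harmonic_offset_in_erbs(f0_hz: float, harmonic: int, semitones: float) -> float:
    """Offset between the same-numbered harmonics of two voices, in ERBs.

    The lower voice has fundamental `f0_hz`; the other voice is `semitones`
    equal-tempered semitones higher.
    """
    lower = harmonic * f0_hz
    upper = lower * 2.0 ** (semitones / 12.0)
    return (upper - lower) / erb_hz(lower)


F0_HZ = 100.0   # hypothetical fundamental, not taken from the experiment
HARMONIC = 10   # hypothetical choice of harmonic (1000 Hz for the lower voice)

for st in (0.5, 1.0, 2.0, 4.0, 8.0):
    ratio = harmonic_offset_in_erbs(F0_HZ, HARMONIC, st)
    print(f"{st:>4} st: offset = {ratio:.2f} ERB")
```

Under these assumptions the offset grows from roughly half an ERB at 1 semitone to nearly one ERB at 2 semitones, which is consistent with, but does not establish, the interpretation given in the text.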

Word recognition rate plotted as a function of relative pitch difference.
The next figure crudely compares our results with those obtained in a typical double vowel experiment, and with those obtained by Brokx & Nooteboom for semantically anomalous sentences.
These crude differences in the results could be due to:
* for double vowels, the initial rapid increase in identification being due to waveform interactions (Culling and Darwin, 1994), which are no use for continuous speech;
* silences in one speaker allowing easy identification of the pitch of the other speaker, and
* pitch differences helping to keep track of a particular voice across time.

Word recognition rate plotted as a function of absolute % Fo difference.
Culling and Summerfield (1995) presented listeners with two simultaneous
"whispered" vowels, each of whose first two formants were represented by a pair
of noise bands. They found that listeners were no better at identifying the
vowel they heard on the left when the noise bands of the target vowel had a
different interaural time difference (ITD) (+390 µs) from those of the other
vowel (-390 µs) than when all four noise bands had the same ITD with the left
ear leading (+390 µs). This result argues against listeners being able to use a
common ITD across different frequencies to perceptually segregate simultaneous
complex sounds.
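
As an illustration of what imposing an ITD on a sound involves, here is a minimal sketch that delays one ear's copy of a mono signal by a whole number of samples. The sign convention (positive means the left ear leads) follows the text. The 44.1 kHz sample rate, the rounding to whole samples, and the function names are assumptions made for the example and say nothing about how the original stimuli were generated.

```python
def itd_to_samples(itd_us: float, sample_rate_hz: float) -> float:
    """Convert an ITD in microseconds to a (fractional) number of samples."""
    return itd_us * 1e-6 * sample_rate_hz


def apply_itd(mono, itd_us: float, sample_rate_hz: float):
    """Return (left, right) channels with the requested ITD.

    Positive `itd_us` means the left ear leads, so the right channel is
    delayed. The delay is rounded to whole samples (a simplification).
    """
    lag = round(abs(itd_to_samples(itd_us, sample_rate_hz)))
    leading = list(mono) + [0.0] * lag    # padded at the end to equal length
    lagging = [0.0] * lag + list(mono)    # delayed by `lag` samples
    if itd_us >= 0:
        return leading, lagging
    return lagging, leading


SAMPLE_RATE_HZ = 44100.0  # assumed sample rate
left, right = apply_itd([1.0, 0.0, 0.0], 390.0, SAMPLE_RATE_HZ)
print(f"390 us is {itd_to_samples(390.0, SAMPLE_RATE_HZ):.2f} samples at 44.1 kHz")
print("left click at sample", left.index(1.0), "right click at sample", right.index(1.0))
```

At this sample rate 390 µs is only about 17 samples, so an accurate implementation would normally use fractional-delay interpolation rather than rounding.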
Rob Hukin and I (Hukin and Darwin, 1995b) partially confirmed Culling &
Summerfield's conclusion using a paradigm which uses shifts in the phoneme
boundary along an /I/-/e/ F1-continuum to estimate the extent to which a single
harmonic in the first formant region is being incorporated into the percept of
the vowel. We found, as predicted by Culling & Summerfield's conclusions, that
ITDs alone were rather ineffective at segregating a single harmonic from a
vowel. However, we also found that they became rather more effective when the
to-be-segregated tone was part of a brief sequence of tones with the same ITD.
The tone sequence itself tended to pull the harmonic from the vowel, but in
addition that effect was larger when those tones all had a different ITD from
the rest of the vowel than when they had the same ITD as the rest of the vowel.
We also found that mixing conditions which had the tone sequence with
conditions that did not increased the segregation due to ITD for the conditions
without the sequence, indicating a possible cueing of the direction of a
potential additional sound source.
More recently Rob Hukin and I have demonstrated the dominance of grouping by
frequency continuity over a common ITD (as suggested by Deutsch's (1975) scale
illusion) in vowel perception. We use the same paradigm as previously. Vowels
(56 ms) from an /I/-/e/ F1-continuum, as illustrated in the following figure,
are preceded by a pure tone (56 ms, ISI = 500 ms) at the fourth harmonic's
frequency (500 Hz).

The preceding tone, the 500-Hz harmonic, and the remainder of the vowel are given ITDs of either +666 µs or -666 µs. The /I/-/e/ phoneme boundaries for the various conditions are shown in the next figures.
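
The stimulus parameters above can be related to one another with a little arithmetic. A fourth harmonic at 500 Hz implies a fundamental of 125 Hz, and for a pure tone an ITD is equivalent to an interaural phase difference of 360 x frequency x ITD degrees. The sketch below simply tabulates this for the first few harmonics. The function names and the number of harmonics listed are illustrative choices, not details of the experiment.

```python
def interaural_phase_deg(freq_hz: float, itd_us: float) -> float:
    """Interaural phase difference, in degrees, of a pure tone with this ITD."""
    return 360.0 * freq_hz * itd_us * 1e-6


def wrap_phase_deg(phase_deg: float) -> float:
    """Wrap a phase difference into the range [-180, 180) degrees."""
    return (phase_deg + 180.0) % 360.0 - 180.0


TONE_HZ = 500.0        # frequency of the preceding tone and of the harmonic
F0_HZ = TONE_HZ / 4.0  # implied fundamental, since 500 Hz is the 4th harmonic
ITD_US = 666.0         # magnitude of the ITDs quoted in the text

for n in range(1, 9):  # first eight harmonics (illustrative range)
    freq = n * F0_HZ
    phase = interaural_phase_deg(freq, ITD_US)
    print(f"harmonic {n} ({freq:6.1f} Hz): {phase:7.2f} deg, "
          f"wrapped {wrap_phase_deg(phase):8.2f} deg")
```

At 500 Hz the 666 µs ITD corresponds to about 120 degrees of interaural phase. Above roughly 750 Hz the unwrapped phase exceeds 180 degrees, so the phase at a single frequency no longer specifies the leading ear unambiguously.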


They show that:
* extending our previous result (Hukin and Darwin, 1995b), in the context of trials with a preceding tone, an ITD is quite effective at segregating a harmonic from a vowel (L vs R);
* a single preceding tone at 500 Hz will remove the 500-Hz harmonic from the vowel more effectively when it is located on the side opposite the vowel than when it is on the same side as the vowel;
* moreover, it does this regardless of the side of the harmonic.
What seems to be happening is that the preceding tone forms a perceptual group with the 500-Hz harmonic regardless of ITDs. This group is then less likely to be incorporated into the vowel when the preceding tone originated from the opposite side to the vowel than when it originated from the same side. ITD is still furnishing a cue that helps segregation of the harmonic from the vowel, but is one whose effectiveness is modulated by grouping according to frequency continuity.
Driver (1996) shows dramatically that interactions between vision and hearing
can have a substantial influence over effects which we have previously tried to
explain in purely acoustic or auditory terms. He played to subjects two
simultaneous triplets of disyllabic words from a single loudspeaker. A TV
monitor showed a face speaking the target triplet. Subjects were instructed to
recall only the target triplet. When the monitor was above the loudspeaker,
subjects recalled 59% of the target words, but when the monitor was above a
dummy loudspeaker 27° away, subjects recalled 78% of the target words. Illusory
relocation of a sound source thus appears to be giving a substantial amount of
"release from masking" with no acoustic or auditory separation. It is an
intriguing question whether a binaural detection task could reveal a similar
advantage for illusory spatial separation. But it is perhaps more likely that
the effect arises at the level of trying to track a particular acoustic source
over time.
First, it seems clear that within a particular perceptual task quantitatively
similar results for a given grouping cue are obtained from experiments that use
very different types of sound. For example, the tolerance for harmonic
mistuning in pitch perception is the same regardless of whether the pitch is
that of a flat-spectrum complex (Moore et al., 1985; Darwin and Carlyon, 1995;
Hukin and Darwin, 1995b), a vowel (Hukin and Darwin, 1995a) or a natural or
stylised musical instrument (Sandell et al., 1995).
Second, however, there are marked parametric differences in the effectiveness
of a particular cue across different perceptual tasks. For instance, much less
onset asynchrony is needed to segregate a harmonic from a vowel in determining
the vowel's quality than in determining the vowel's pitch (Hukin and Darwin,
1995a).
This observation suggests that it would be simplistic to suppose that general mechanisms created immutable groups so that subsequent categorisers had effectively no access to earlier, parametric information.

It seems more likely that each categorising mechanism has access to parametric information about individual grouping cues.
(Whalen and Liberman, 1987; Remez et al., 1994).
The ability of listeners to understand sentences cued only by three
time-varying sinusoids at the formant frequencies is strong evidence that
otherwise independent sounds can contribute to a common phonetic percept. Such
evidence for what Bregman has called schema-driven grouping mechanisms (which
may involve a substantial contribution from vision) does not preclude low-level
grouping mechanisms also operating to aid the perception of speech. Given
that speech itself can be regarded as a sound produced by multiple sources (the
vocal folds vibrating and/or producing aspiration, fricative sound sources,
explosive bursts, clicks etc.), general auditory processes may be more useful in
producing appropriate abstract auditory features for schema-based processes to
work on than in producing a complete speech sound source. For example, it
appears to be the case that much less onset asynchrony is needed to segregate a
harmonic from a vowel than a formant from a vowel (Darwin, 1981; Darwin, 1984).
Low-level grouping cues may be more effective at the level of identifying
formant frequencies than at the task of grouping together formants into a
vowel percept.
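
The sine-wave replicas mentioned above, in which three time-varying sinusoids follow the formant frequencies, can be sketched as follows. This is an illustration of the general idea only, not the synthesis procedure of the cited studies. The formant tracks, the frame length, and the sample rate are hypothetical, each frequency is held constant within a frame, and the amplitude variation that real replicas also carry is omitted.

```python
import math


def sinewave_replica(tracks_hz, sample_rate_hz: float, frame_s: float):
    """Sum of sinusoids whose frequencies follow the given tracks.

    `tracks_hz` is a list of tracks (one per formant), each a list with one
    frequency per frame. Phase is accumulated sample by sample so that each
    sinusoid stays continuous when its frequency changes between frames.
    """
    samples_per_frame = int(round(frame_s * sample_rate_hz))
    n_frames = len(tracks_hz[0])
    out = [0.0] * (n_frames * samples_per_frame)
    for track in tracks_hz:
        phase = 0.0
        i = 0
        for freq in track:
            step = 2.0 * math.pi * freq / sample_rate_hz
            for _ in range(samples_per_frame):
                # Equal weighting keeps the summed signal within [-1, 1].
                out[i] += math.sin(phase) / len(tracks_hz)
                phase += step
                i += 1
    return out


# Hypothetical formant tracks (Hz), one value per 10 ms frame.
F1 = [500.0, 550.0, 600.0, 650.0]
F2 = [1500.0, 1450.0, 1400.0, 1350.0]
F3 = [2500.0, 2500.0, 2500.0, 2500.0]

signal = sinewave_replica([F1, F2, F3], sample_rate_hz=8000.0, frame_s=0.01)
print(len(signal), "samples, peak magnitude", round(max(abs(s) for s in signal), 3))
```

Nothing in such a signal carries harmonicity or a common fundamental, which is why its intelligibility is taken as evidence for grouping that does not depend on low-level cues.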
Assmann, P. F. and Summerfield, A. Q. (1990). "Modelling the perception of concurrent vowels: Vowels with different fundamental frequencies," J. Acoust. Soc. Am. 88, 680-697.
Assmann, P. F. and Summerfield, A. Q. (1994). "The contribution of waveform interactions to the perception of concurrent vowels," J. Acoust. Soc. Am. 95, 471-484.
Brokx, J. P. L. and Nooteboom, S. G. (1982). "Intonation and the perceptual separation of simultaneous voices," J. Phon. 10, 23-36.
Brokx, J. P. L., Nooteboom, S. G. and Cohen, A. (1979). "Pitch differences and the integrity of speech masked by speech," IPO Annual Progress Report 14, 55-60.
Culling, J. F. and Darwin, C. J. (1994). "Perceptual and computational separation of simultaneous vowels: cues arising from low frequency beating," J. Acoust. Soc. Am. 95, 1559-1569.
Culling, J. F. and Summerfield, Q. (1995). "Perceptual separation of concurrent speech sounds: absence of across-frequency grouping by common interaural delay," J. Acoust. Soc. Am. 98, 785-797.
Darwin, C. J. (1981). "Perceptual grouping of speech components differing in fundamental frequency and onset-time," Quart. J. Exp. Psychol. 33A, 185-208.
Darwin, C. J. (1984). "Perceiving vowels in the presence of another sound: constraints on formant perception," J. Acoust. Soc. Am. 76, 1636-1647.
Darwin, C. J. and Bethell-Fox, C. E. (1977). "Pitch continuity and speech source attribution," J. exp. Psychol.: Hum. Perc. & Perf. 3, 665-672.
Darwin, C. J. and Carlyon, R. P. (1995). "Auditory grouping," in The handbook of perception and cognition, Volume 6, Hearing edited by B. C. J. Moore (Academic, London), pp. 387-424.
Deutsch, D. (1975). "Two-channel listening to musical scales," J. Acoust. Soc. Am. 57, 1156-1160.
Driver, J. (1996). "Enhancement of selective listening by illusory mislocation of speech sounds due to lip-reading," Nature 381, 66-68.
Hukin, R. W. and Darwin, C. J. (1995a). "Comparison of the effect of onset asynchrony on auditory grouping in pitch matching and vowel identification," Percept. Psychophys. 57, 191-196.
Hukin, R. W. and Darwin, C. J. (1995b). "Effects of contralateral presentation and of interaural time differences in segregating a harmonic from a vowel," J. Acoust. Soc. Am. 98, 1380-1387.
Moore, B. C. J., Glasberg, B. R. and Peters, R. W. (1985). "Relative dominance of individual partials in determining the pitch of complex tones," J. Acoust. Soc. Am. 77, 1853-1860.
Opolko, F. and Wapnick, J. (1989). McGill University Master Samples User's Manual (McGill University, Faculty of Music, 555 Sherbrooke Street West, Montreal, Quebec, Canada H3A 1E3).
Remez, R. E., Rubin, P. E., Berns, S. M., Pardo, J. S. and Lang, J. M. (1994). "On the perceptual organization of speech," Psych. Rev. 101, 129-156.
Sandell, G. J., Schloerscheidt, A. and Darwin, C. J. (1995). "Grouping of Harmonics in Natural vs. Synthetic Musical Instrument Tones," Society for Music Perception and Cognition Conference; Berkeley, June 22-25, 1995.
Scheffers, M. T. (1983). "Sifting vowels: Auditory pitch analysis and sound segregation," Groningen University, The Netherlands. Ph.D. dissertation.
Whalen, D. H. and Liberman, A. M. (1987). "Speech perception takes precedence over nonspeech perception," Science 237, 169-171.
