PSYCHOPHYSICAL EXPERIMENTS ON
AUDITORY SCENE ANALYSIS
Two short steps toward measuring the segregation of vowels
Pierre L. Divenyi
Speech and Hearing Research Facility
Veterans Affairs Medical Center
Martinez, CA 94553
PDivenyi@ucdavis.edu
ABSTRACT
Auditory segregation of simultaneous vowel-like sounds relies on the perceptual separation of fundamental frequencies and of formant envelopes. These two tasks appear to interact at a variety of levels: the spectral position of the formant relative to the fundamental frequency, the harmonicity of the components, the distance between the two fundamental frequencies, and the fundamental frequency contour. Results indicate a rather clear trade-off between fundamental frequency resolution and formant peak resolution, i.e., between residue pitch and timbre.
1. INTRODUCTION
Auditory scene analysis [1] refers to the perceptual decomposition of a complex auditory display into its elements, i.e., its component events. The process is assumed to work by grouping together spectrally, temporally, and/or spatially coherent parts of the display, with the aim of recovering the outputs of the separate sources that create it. During the past decade, the study of auditory scene analysis has attracted a number of investigators (for a review, see [2]). Research in this area is motivated by interest in solving two puzzles: how the auditory nervous system accomplishes the task, and what algorithm can be designed to do the same. While the two puzzles seem about equally challenging, significantly more effort (and financial support) has been directed toward the engineering/technological solution than toward the behavioral/neurophysiological one.
One issue central to auditory scene analysis is the perceptual separation of concurrently active voice sources, i.e., the cocktail-party effect, a problem area of equal interest to engineers and auditory scientists ever since Cherry [3] coined the term. Perhaps the major stumbling block impeding behavioral experiments in the area is that few listening experiments have been able to investigate voice separation within the framework of objective psychophysical paradigms: most studies in this area are based on introspective evaluation of the auditory stimulus (e.g., [4], [5], [6]). The present paper follows earlier attempts by the author to introduce an objective paradigm for measuring the auditory separation of simultaneous pairs of events ([7], [8]), and focuses on a reduced version of the problem of perceptually separating two concurrent speech sources: the perceptual segregation of pairs of vowel-like sounds.
2. METHODS
The methods must address two practical issues: (1) the reduction of the complex multi-voice stimulus present in the cocktail-party effect to a stimulus configuration that allows the results to be interpreted in terms of known, or suspected, auditory processes, and (2) the choice of an objective psychophysical paradigm that can be portrayed as measuring the separability of simultaneous objects.
2.1. The stimuli
The stimuli selected reduce the problem of simultaneous voices to one of superimposed vowels. Concurrent vowels uttered by different speakers differ with respect to two main attributes: fundamental frequency f0 and formant structure, the latter expressed as the spectral envelope of the vowel. In the stimuli of the present experiments, the formant structure is further reduced to the spectral envelope of a single formant with its peak at frequency f1; the number of harmonic or inharmonic partials defining the formant is a parameter. Inharmonic partials, though not normally occurring in spoken vowels, are included because of their importance for determining pitch strength, as well as because of their theoretical relevance in computational auditory scene analysis (e.g., [9]). Furthermore, f0 is either held constant (Experiment 1) or allowed to vary according to a defined contour pattern (Experiment 2). We will call this one-formant vowel a promel; each stimulus consists of two superimposed promels.
Within each block of trials of Experiment 1, the two promels have fixed f0 values. The f1 formant peak varies according to an adaptive staircase paradigm. The number of partials is fixed at 7, and the frequency difference between neighboring partials is always equal to one of the f0 values. The serial number of the lowest partial is determined by the experimental condition and is chosen to maximize the overlap of the spectral envelopes of the two promels. In some conditions, both promels consist of consecutive harmonic (i.e., integral-multiple) partials of the fundamental. In other conditions, either one or both promels are inharmonic; in these cases they are generated from partials lying midway between consecutive harmonics. The seven partials are summed in pseudo-random phase.
The formant envelopes of both promels vary in accordance with the adaptive method. The formant region chosen is either around 1 or 3 kHz, depending on the condition (see Table 1). The formant peaks vary continuously, i.e., they do not necessarily coincide with any one of the partials. The intensity of the partials is determined by an attenuation function that falls off, linearly on a logarithmic scale, on either side of the peak with a 3-dB/oct slope. On one end of the continuum, the …
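For illustration, the following is a minimal synthesis sketch of a promel as described above. The sampling rate, duration, normalization, and function name are assumptions of ours, not taken from the paper; the envelope implements the 3-dB/oct log-frequency rolloff around the formant peak, and inharmonic promels place the partials midway between consecutive harmonics.

```python
import numpy as np

def promel(f0, lowest_no, peak_hz, inharmonic=False,
           n_partials=7, fs=16000, dur=0.5, rng=None):
    """One-formant 'promel': n_partials components spaced f0 apart, starting
    at partial number lowest_no (plus 0.5 if inharmonic, i.e., midway between
    consecutive harmonics), shaped by an envelope falling off 3 dB/oct on
    either side of the formant peak, summed in pseudo-random phase."""
    rng = np.random.default_rng(0) if rng is None else rng
    t = np.arange(int(fs * dur)) / fs
    nums = lowest_no + np.arange(n_partials) + (0.5 if inharmonic else 0.0)
    freqs = nums * f0
    atten_db = 3.0 * np.abs(np.log2(freqs / peak_hz))      # 3 dB per octave
    amps = 10 ** (-atten_db / 20)
    phases = rng.uniform(0, 2 * np.pi, n_partials)          # pseudo-random phase
    return (amps[:, None] *
            np.sin(2 * np.pi * np.outer(freqs, t) + phases[:, None])).sum(axis=0)

# Harmonic-harmonic condition of Table 1 at the 1-kHz region: lowest partials
# 8 x 107 = 856 Hz and 6 x 136 = 816 Hz; the peak frequencies are arbitrary
# values within the 1-kHz formant region, chosen here only for illustration.
x = promel(107.0, 8, 1000.0) + promel(136.0, 6, 1070.0)
```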
2.2. Psychophysical paradigm
As we argued earlier [8], segregation of two concurrent signals, assumed to be produced by two independent sources, implies that the signals differ along more than one auditory dimension. This definition imposes two requirements for auditory segregation of the sources to occur: (1) a discriminable physical difference between the signals and (2) the ability to associate a value on one dimension with a value on another dimension, as in 'the clarinet played the standard A and the flute the B just above it.' While previous work addressed the segregation of signals differing along pairs of dimensions as divergent as spectrum and azimuthal locus (see [10], [7]), the dimensions in the present experiments, fundamental frequency and vowel quality, are partially interrelated. However, as shown by Singh [11], segregation of signals that differ only with respect to pitch and timbre can also be demonstrated. The paradigm we have been using is based on the common-fate rule [12], which asserts that segregation is the result of integration of common spectral features over time, or of common temporal features across the spectrum. Here, we interpret time as pitch.
The 2AFC paradigm used in our experiments is illustrated schematically in Figure 2. In each of the two intervals, two steady-state signals are presented concurrently, shown as the two horizontal traces. Each signal is characterized by the two parameters f0 and f1, representing pitch and formant peak frequency, respectively. Thus, the two intervals depicted in the figure differ with regard to these two parameters. In Interval 1 of the example, Signal 1 has the values {f01, f11} and Signal 2 the values {f02, f12} on the pitch and formant peak dimensions, whereas in Interval 2 the two signals have the values {f01, f12} and {f02, f11}, respectively. Let the two pitch values be discriminable when presented simultaneously, and assume that f01 is perceived as high and f02 as low. Let us also assume that, of the two formant envelopes, the one with its peak at f11 is heard as lower than the one with its peak at f12. In this case, the stimulus in the trial will be perceived as 'the timbre of the high fundamental pitch went from low to high (while that of the low fundamental pitch moved from high to low).' The difference Δf1 = |f12 - f11| at which this discrimination is possible, for a given fundamental frequency difference Δf0 = |f01 - f02| and at a given d' level, will be regarded as the threshold of auditory segregation. The crucial feature of this definition is that, d' being a statistical measure, both Δf0 and Δf1 should be interpreted as rms, i.e., wide-sense stationary, quantities, at least at threshold.
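As an aside, a minimal sketch (our addition, not part of the paper) of how a d' level maps onto proportion correct in a 2AFC task, under the standard equal-variance Gaussian signal detection model:

```python
import numpy as np
from scipy.stats import norm

def dprime_2afc(p_correct):
    """d' from 2AFC proportion correct: d' = sqrt(2) * Phi^-1(Pc),
    assuming the standard equal-variance Gaussian model."""
    return np.sqrt(2.0) * norm.ppf(p_correct)

# e.g., the 70.7%-correct point tracked by a 2-down/1-up staircase:
print(dprime_2afc(0.707))  # roughly d' = 0.77
```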
3. RESULTS
3.1. Experiment 1
Results of Experiment 1 are shown for two highly experienced subjects in Figures 3a and 3b for the parametric configuration displayed in Table 1, for those five conditions in which the lowest harmonic number of the formant envelope for the low-fundamental-frequency (107-Hz) promel was 8, corresponding to 856 Hz in the 1-kHz region and 2568 Hz in the 3-kHz region. Let us look only at the left-hand points, i.e., those obtained at the small Δf0/f0 ratio (0.271, roughly corresponding to a musical major third). In both the 1-kHz and the 3-kHz regions, the two subjects' results agree in that the Weber fractions of segregation based on formant peak differences are larger when all partials of both promels are harmonic than when one or both promels are inharmonic. Inharmonicity, therefore, helps to segregate two promels, and the effect is larger in the 1-kHz than in the 3-kHz frequency region. Comparing the results obtained at the two fundamental frequency ratios, various trends may be observed. One should keep in mind that the Δf0/f0 ratio of 0.411 roughly corresponds to the musical interval of the tritone, i.e., one that is less consonant (and presumably less fusible) than the major third. The comparison shows a large trade-off in the 1-kHz region when both promels consist of harmonic partials; a similar trade-off in the 3-kHz region is seen only in Subject 1's data. No trade-off, or a negative one, is visible in all conditions in which one or both promels consist of inharmonic partials. In the absence of further data, one may only tentatively conclude that, when inharmonic partials are present, the dimensions of formant frequency difference and fundamental frequency difference are not independent.
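For clarity, the Weber fraction referred to here can be written in the notation of Section 2.2 (the exact normalization frequency is our assumption) as

\[
W_{f_1} \;=\; \left.\frac{\Delta f_1}{f_1}\right|_{d' = \text{criterion}},
\]

i.e., the just-segregable formant peak separation relative to the formant frequency, measured at a fixed Δf0/f0.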
The differences between the results obtained in the 1-kHz and the 3-kHz frequency regions could derive from two sources, since, in the 3-kHz region, both the formant frequency range and the fundamental frequencies were exactly three times those in the 1-kHz region. To examine which of these two sources was more likely to have produced the results, the subjects were tested in two further conditions: with the fundamental frequencies in the 300-Hz range (used in the 3-kHz formant region in Figure 3) and the formant frequencies in the 1-kHz region. To generate this type of stimulus, the harmonic numbers under the formant envelopes had to be moved downward: the lowest harmonic number in the promel with the 321-Hz fundamental was 3, corresponding to 963 Hz. Results are shown in Figures 4a and 4b for the two subjects. Comparing Fig. 4a with Fig. 3a and Fig. 4b with Fig. 3b, it becomes apparent that, although the two subjects exhibited trends different from each other (mainly in the harmonic-harmonic promel condition), their results were internally consistent. This implies that the 1-kHz vs. 3-kHz frequency region differences observed in Figs. 3a and 3b are, in fact, mainly attributable to differences in fundamental frequency rather than in formant frequency.
3.2. Experiment 2
The stimulus used in Experiment 1 has one serious shortcoming: the frequency regions of the two superimposed promels can never be made to coincide exactly, because the two consist of non-overlapping sets of partials of two different fundamental frequencies. Experiment 2 was designed as an attempt to solve this problem by doing away with the fixed f0 values and assigning the two promels two distinct fundamental frequency contours moving between the same extremes but in opposite directions. Figure 5 illustrates the stimulus diagram in the time-frequency plane, showing the movement of the fundamental frequency of the two superimposed promels (Signal 1 and Signal 2). Thus, instead of having the fixed f01 and f02 values, the two promels followed a V-shaped or an inverted-V-shaped contour, with f01 as the highest and f02 as the lowest frequency. The two promels therefore always occupied the same formant region, with only f1 differing, following the scheme presented in Figure 1. Because of the time-varying nature of the stimulus, a comparison of high vs. low f0 is possible within one stimulus interval; therefore, a single-interval, rather than a two-interval, forced-choice procedure was adopted. An adaptive staircase procedure tracked the threshold for the Δf1 difference at which the two promels could be segregated.
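A minimal sketch of a 2-down/1-up adaptive staircase of the general kind used in these experiments follows; the step factor, reversal count, and the trial callback are assumptions of ours, not parameters reported in the paper. A 2-down/1-up rule converges on the 70.7%-correct point of the psychometric function.

```python
import numpy as np

def staircase_2down1up(trial, start=0.2, factor=1.25, n_reversals=12):
    """Track the 70.7%-correct threshold for a stimulus difference delta.
    `trial(delta)` runs one trial and returns True if the response was correct.
    delta shrinks after two consecutive correct responses, grows after an error."""
    delta, correct_run, direction = start, 0, 0
    reversals = []
    while len(reversals) < n_reversals:
        if trial(delta):
            correct_run += 1
            if correct_run == 2:                 # 2 correct in a row -> harder
                correct_run = 0
                if direction == +1:
                    reversals.append(delta)      # track turned downward
                direction = -1
                delta /= factor
        else:                                    # 1 error -> easier
            correct_run = 0
            if direction == -1:
                reversals.append(delta)          # track turned upward
            direction = +1
            delta *= factor
    # threshold estimate: geometric mean of the last reversal points
    return np.exp(np.mean(np.log(reversals[-8:])))
```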
Results are illustrated in Figures 6a and 6b for the two subjects, respectively. Disappointingly, the only deduction one can make from the data is that fundamental frequency modulation makes the task of promel segregation trivial, as long as the normalized fundamental frequency difference is larger than 0.1-0.2, regardless of the formant frequency region and the fundamental frequency. Note that only harmonic-harmonic partials were used; when even one of the two promels consisted of inharmonic partials, the threshold reached ceiling. Thus, although the experiment did provide useful information concerning the importance of frequency modulation for signal segregation, it failed to fulfill the expectation of embodying a clean stimulus paradigm for measuring the quantities measured in Experiment 1.
4. DISCUSSION
The experiments described above attempted to examine analytically the processes responsible for the auditory segregation of simultaneous pairs of vowels. The main objective of this exercise was to relate auditory scene analysis of speech sounds to auditory processes. In this respect, it represents an undertaking linguistically less relevant than the by now classic studies [16], [17], [18], but one that perhaps provides deeper insight into vowel segregation from the auditory standpoint. The results portray the segregation process as exquisitely sensitive both to the inharmonicity of partials and to frequency excursions of the fundamental, thereby confirming results by [9] on the one hand, and by [19] and [20] on the other.
It is also interesting to note that the two difference dimensions, i.e., the difference in pitch and the difference in formant peak, displayed all three relations described in the Methods section. For Subject 1, whenever both promels consisted of inharmonic partials, the Weber fraction for formant peak resolution appeared to be independent of the pitch difference. For Subject 2 most of the time, and for Subject 1 in some conditions containing inharmonic promels, there appeared to be a positive correlation between the two differences, suggesting that the double resolution task may have been handled by a single process. In one case, however, i.e., for pitch rates in the 100-150-Hz range and for formant regions not exceeding 1500 Hz, there was a clear trade-off between pitch resolution and spectral resolution. This instance of trade-off hints at the presence of something analogous to Gabor's Δf·Δt constant [13], which could happen only if pitch were truly encoded as time and the formant peak as frequency.
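As a reminder (the standard form of the limit, not taken from the paper), Gabor's uncertainty relation bounds the product of rms bandwidth and rms duration:

\[
\Delta f \cdot \Delta t \;\ge\; \frac{1}{4\pi},
\]

so that improving resolution along one conjugate dimension necessarily costs resolution along the other, which is the signature of the trade-off observed here.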
Naturally, the results are incomplete in many ways. We will have to resolve the issue of intersubject variability: are there analytic and synthetic listeners? More importantly, however, the formant region will have to explicitly overlap with specific formant regions of vowels (i.e., the F2 and F3 regions) while the fundamental frequency is kept constant. We will also have to explore in more detail the issue of a harmonic vs. inharmonic relationship between the two fundamental frequencies, a question of obvious musical relevance. Finally, the results will have to be compared to the predictions of various auditory models, in order to answer the question of whether the presently investigated instance of primitive segregation can be explained completely by invoking peripheral phenomena, or whether the intervention of a central agent will be deemed indispensable.
TABLE 1
---------------------------------------------------------------------------------
Δf0/f0 // formant   Parameter                          Harmonic-  Harmonic-   Inharmonic-
region                                                 Harmonic   Inharmonic  Inharmonic
---------------------------------------------------------------------------------
0.271 // 1 kHz      f0 (Hz)                            107        107         107
                    lowest harm. no. under envelope    8          8           7.5
                    f0 (Hz)                            136        136         136
                    lowest harm. no. under envelope    6          6.5         5.5
0.411 // 1 kHz      f0 (Hz)                            107        107         107
                    lowest harm. no. under envelope    8          8           7.5
                    f0 (Hz)                            151        151         151
                    lowest harm. no. under envelope    6          6.5         5.5
0.271 // 3 kHz      f0 (Hz)                            321        321         ---
                    lowest harm. no. under envelope    8          8           ---
                    f0 (Hz)                            408        408         ---
                    lowest harm. no. under envelope    6          6.5         ---
0.411 // 3 kHz      f0 (Hz)                            321        321         ---
                    lowest harm. no. under envelope    8          8           ---
                    f0 (Hz)                            453        453         ---
                    lowest harm. no. under envelope    6          6.5         ---
0.271 // 1 kHz      f0 (Hz)                            321        321         ---
                    lowest harm. no. under envelope    3          3           ---
                    f0 (Hz)                            408        408         ---
                    lowest harm. no. under envelope    2          2.5         ---
0.411 // 1 kHz      f0 (Hz)                            321        321         ---
                    lowest harm. no. under envelope    3          3           ---
                    f0 (Hz)                            453        453         ---
                    lowest harm. no. under envelope    2          2.5         ---
---------------------------------------------------------------------------------
REFERENCES
1. Bregman, A.S. (1991). Auditory scene analysis. Cambridge, Mass.: Bradford Books (MIT Press).
2. McAdams, S. and E. Bigand, eds (1993). Thinking in sound: The cognitive psychology of human audition. Oxford: Clarendon Press.
3. Cherry, C. (1953). Some experiments on the recognition of speech, with one and with two ears. Journal of the Acoustical Society of America, 25: 975-979.
4. van Noorden, L.P.A.S. (1975). Temporal coherence in the perception of tone sequences. Doctoral dissertation, Technische Hogeschool Eindhoven, the Netherlands.
5. Brockx, J.P.L. and S.G. Nooteboom (1982). Intonation and the perceptual separation of simultaneous voices. Journal of Phonetics, 10: 23-36.
6. Bregman, A.S. and P. Doehring (1984). Fusion of simultaneous tonal glides: The role of parallelness and simple frequency relations. Perception and Psychophysics, 36: 251-256.
7. Divenyi, P.L. and S.K. Oliver (1989). Resolution of steady-state sounds in simulated auditory space. Journal of the Acoustical Society of America, 85: 2046-2056.
8. Divenyi, P.L. (1995). Auditory segregation of concurrent signals: An operational definition. In IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York.
9. de Cheveigné, A. (1993). Separation of concurrent harmonic sounds: Fundamental frequency estimation and a time-domain cancellation model of auditory processing. Journal of the Acoustical Society of America, 93: 3271-3290.
10. Perrott, D.R. (1984). Concurrent minimum audible angle: A re-examination of the concept of auditory spatial acuity. Journal of the Acoustical Society of America, 75: 1201-1206.
11. Singh, P. (1987). Perceptual organization of complex-tone sequences: A tradeoff between pitch and timbre? Journal of the Acoustical Society of America, 82: 886-899.
12. Bregman, A.S. (1978). The formation of auditory streams. In J. Requin, ed, Attention and Performance. Hillsdale, NJ: Erlbaum, pp. 63-76.
13. Brillouin, L. (1962). Science and information theory, 2nd ed. New York: Academic Press.
14. Ronken, D.A. (1971). Some effects of bandwidth-duration constraints on frequency discrimination. Journal of the Acoustical Society of America, 49: 1232-1242.
15. Terhardt, E. (1974). Pitch, consonance, and harmony. Journal of the Acoustical Society of America, 55: 1061-1069.
16. Assmann, P.F. and A.Q. Summerfield (1989). Modeling the perception of concurrent vowels: Vowels with the same fundamental frequency. Journal of the Acoustical Society of America, 85: 327-338.
17. Assmann, P.F. and A.Q. Summerfield (1990). Modeling the perception of concurrent vowels: Vowels with different fundamental frequencies. Journal of the Acoustical Society of America, 88: 680-697.
18. Summerfield, A.Q. and P.F. Assmann (1991). Perception of concurrent vowels: Effects of harmonic misalignment and pitch-period asynchrony. Journal of the Acoustical Society of America, 89: 1364-1377.
19. McAdams, S. (1989). Segregation of concurrent sounds I: Effects of frequency modulation coherence. Journal of the Acoustical Society of America, 86: 2148-2159.
20. Marin, C.M.H. and S. McAdams (1991). Segregation of concurrent sounds II: Effects of spectral envelope tracing, frequency modulation coherence, and frequency modulation width. Journal of the Acoustical Society of America, 89: 341-351.