PSYCHOPHYSICAL EXPERIMENTS ON

AUDITORY SCENE ANALYSIS

Two short steps toward measuring the segregation of vowels

Pierre L. Divenyi

Speech and Hearing Research Facility

Veterans Affairs Medical Center

Martinez, CA 94553

PDivenyi@ucdavis.edu

ABSTRACT

Auditory segregation of simultaneous vowel-like sounds relies on the perceptual separation of fundamental frequencies and of formant envelopes. These two tasks appear to interact on a variety of levels: the spectral position of the formant with regard to the fundamental frequency, the harmonicity of the components, the distance between the two fundamental frequencies, and the fundamental frequency contour. The results indicate a rather clear trade-off between fundamental frequency resolution and formant peak resolution, i.e., between residue and timbre.

1. INTRODUCTION

Auditory scene analysis ([1]) refers to the perceptual decomposition of a complex auditory display into its elements, i.e., its component events. The process is assumed to work by grouping together spectrally, temporally, and/or spatially coherent parts of the display, with the aim of recovering the output of the separate sources that create it. During the past decade, the study of auditory scene analysis has attracted a number of investigators (for a review, see [2]). Research in this area is motivated by interest in solving two puzzles: how the auditory nervous system accomplishes the task, and what algorithm can be designed to do the same. While the two puzzles seem about equally challenging, significantly more effort (and financial support) has been directed toward the engineering/technological solution than toward the behavioral/neurophysiological one.

One issue central to auditory scene analysis is the perceptual separation of concurrently active voice sources, i.e., the “cocktail-party effect”, a problem area of equal interest to engineers and auditory scientists ever since Cherry [3] coined the term. Perhaps the major stumbling block impeding behavioral experiments in the area is that few listening experiments have been able to investigate voice separation within the framework of objective psychophysical paradigms: most studies in this area are based on introspective evaluation of the auditory stimulus (e.g., [4], [5], [6]). The present paper follows earlier attempts by the author to introduce an objective paradigm for measuring the auditory separation of simultaneous pairs of events ([7], [8]), and focuses on a reduced version of the problem of perceptually separating two concurrent speech sources: the perceptual segregation of pairs of vowel-like sounds.

2. METHODS

The methods must address two practical issues: (1) the reduction of the complex multi-voice stimulus present in the “cocktail-party” effect to a stimulus configuration that allows the results to be interpreted in terms of known, or suspected, auditory processes, and (2) the choice of an objective psychophysical paradigm that can be portrayed as measuring the separability of simultaneous objects.

2.1. The stimuli

The stimuli selected reduce the problem of simultaneous voices to one of superimposed vowels. Concurrent vowels uttered by different speakers differ with respect to two main discrete attributes: fundamental frequency f0 and formant structure, the latter expressed as the spectral envelope of the vowel. In the stimuli of the present experiments, the formant structure is further reduced to the spectral envelope of a single formant with its peak at frequency f1; the number of harmonic or inharmonic partials defining the formant is a parameter. Inharmonic partials, though not normally occurring in spoken vowels, are included because of their importance for determining pitch strength, as well as because of their theoretical relevance in computational auditory scene analysis (e.g., [9]). Furthermore, f0 is either held constant (Experiment 1) or allowed to vary according to a defined contour pattern (Experiment 2). We will call this one-formant vowel a “promel”; each stimulus consists of two superimposed promels.

Within each block of trials of Experiment 1, the two promels have fixed f0 values. The f1 formant peak varies according to an adaptive staircase paradigm. The number of partials is fixed at 7, and the frequency difference between neighboring partials is always equal to one of the f0 values. The serial number of the lowest partial is determined by the experimental condition and is chosen to maximize the overlap of the spectral envelopes of the two promels. In some conditions, both promels consist of consecutive harmonic (i.e., integral-multiple) partials of the fundamental. In other conditions, either one or both promels are inharmonic; in these cases they are generated from partials that lie midway between consecutive harmonics. The seven partials are summed in pseudo-random phase.

The formant envelopes of both promels vary in accordance with the adaptive method. The formant region chosen is either around 1 or around 3 kHz, depending on the condition (see Table 1). The formant peaks vary continuously, i.e., they do not necessarily coincide with any one of the partials. The intensity of the partials is determined by an attenuation function that falls off, linearly on a logarithmic scale, on either side of the peak with a 3-dB/oct slope. At one end of the continuum, the f1 peaks of the two promels are identical (i.e., they are indiscriminable), whereas, at the other end of the continuum, they coincide with the lowest of the seven partials in one promel and the highest partial in the other (i.e., they are maximally discriminable). The formant envelopes of the two simultaneously presented promels are illustrated in Figure 1. Thus, the stimulus in each interval of any given two-alternative forced-choice trial consists of the mixture of two promels, each of which has its own fixed fundamental frequency f01 or f02 and a formant envelope peaking either at frequency f11 or at f12. The total power of the two promels is always made identical. Stimulus duration throughout the experiments was fixed at 100 ms.
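As a concrete illustration, the promel construction described above can be sketched in a few lines of Python (NumPy). The 7-partial count, the 100-ms duration, and the log-linear envelope slope come from the text; the sampling rate, the random seed, and all parameter names are illustrative assumptions, not values from the experiments.

```python
import numpy as np

def promel(f0, f1, n_lowest, n_partials=7, slope_db_oct=3.0,
           dur=0.100, fs=16000, seed=0):
    """Synthesize a one-formant 'promel' (sketch).

    Partials are spaced f0 apart, starting at serial number n_lowest;
    an inharmonic promel is obtained by passing a half-integer
    n_lowest, which places every partial midway between consecutive
    harmonics. The formant envelope falls off linearly on a
    log-frequency scale at slope_db_oct on either side of the peak f1.
    Phases are pseudo-random.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(int(dur * fs)) / fs
    # partial frequencies: (n_lowest + k) * f0, k = 0 .. n_partials-1
    freqs = (n_lowest + np.arange(n_partials)) * f0
    # attenuation in dB grows with the octave distance from the peak
    atten_db = slope_db_oct * np.abs(np.log2(freqs / f1))
    amps = 10.0 ** (-atten_db / 20.0)
    phases = rng.uniform(0, 2 * np.pi, n_partials)
    sig = sum(a * np.sin(2 * np.pi * f * t + p)
              for a, f, p in zip(amps, freqs, phases))
    return sig / np.sqrt(np.mean(sig ** 2))  # normalize to unit RMS

# one trial interval: two superimposed equal-power promels
mix = promel(107, 950, 8) + promel(136, 1050, 6, seed=1)
```

Because both promels are normalized to unit RMS before summation, their total powers are identical, as required by the stimulus design.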

2.2. Psychophysical paradigm

As we argued earlier [8], segregation of two concurrent signals, assumed to be produced by two independent sources, implies that the signals differ along more than one auditory dimension. This definition imposes two requirements for auditory segregation of the sources to occur: (1) a discriminable physical difference between the signals and (2) the ability to associate a value on one dimension to a value on another dimension, such as “The clarinet played the standard `A' and the flute a `B' just above the standard `A'.” While previous work was directed at the problem of segregation of signals differing along pairs of dimensions as divergent as spectrum and azimuthal locus (see [10], [7]), the dimensions in the present experiments, that is, fundamental frequency and vowel quality, are partially interrelated. However, as shown by Singh [11], segregation of signals that differ only with respect to pitch and timbre can also be demonstrated. The paradigm we have been using is based on the “common fate” rule [12] which asserts that segregation is the result of integration of common spectral features over time, or temporal features across spectrum. Here, we interpret time as pitch.

The 2AFC paradigm used in our experiments is illustrated schematically in Figure 2. In each of the two intervals, two steady-state signals are presented concurrently, shown as the two horizontal traces. Each signal is characterized by the two vectors f0 and f1, representing pitch and formant peak frequency, respectively. Thus, the two intervals depicted in the figure differ with regard to the two vectors. In Interval 1 of the example, Signal 1 has values {f01, f11} and Signal 2 values {f02, f12} on the pitch and formant peak dimensions, whereas in Interval 2 the two signals have values {f01, f12} and {f02, f11}, respectively. Let us make the two pitch values discriminable when presented simultaneously and assume that they are perceived as f01 being “high” and f02 being “low”. Let us also assume that, of the two formant envelopes, the one with its peak at f11 is heard as lower than f12. In this case, the stimulus in the trial will be perceived as “the high fundamental pitch went from high to low (while the low fundamental pitch moved from low to high)”. The difference Δf1 = |f12 - f11| at which this discrimination is possible, for a given fundamental frequency difference Δf0 = |f01 - f02| and at a given d' level, will be regarded as the threshold of auditory segregation. The crucial feature of this definition is that, d' being a statistical measure, both Δf0 and Δf1 should be interpreted as rms, i.e., wide-sense stationary, quantities, at least at threshold.
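For readers implementing the paradigm, the standard relation between 2AFC proportion correct and d', together with a generic adaptive staircase, can be sketched as follows. The paper does not specify the staircase rule; the two-down/one-up rule below (which converges on roughly 70.7% correct) is a common assumption, not the author's documented procedure.

```python
from statistics import NormalDist

def dprime_2afc(p_correct):
    """Unbiased 2AFC observer: d' = sqrt(2) * z(proportion correct)."""
    return 2 ** 0.5 * NormalDist().inv_cdf(p_correct)

def staircase(respond, start, step, n_reversals=8):
    """Two-down/one-up adaptive staircase (an assumed rule).

    `respond(level)` returns True on a correct trial; the threshold
    estimate is the mean of the levels at the tracked reversals.
    """
    level, run, last_dir, reversals = start, 0, 0, []
    while len(reversals) < n_reversals:
        if respond(level):
            run += 1
            d = -1 if run == 2 else 0   # two correct: step down
            if run == 2:
                run = 0
        else:
            run, d = 0, +1              # any error: step up
        if d:
            if last_dir and d != last_dir:
                reversals.append(level)  # direction change = reversal
            last_dir = d
            level = max(step, level + d * step)
    return sum(reversals) / len(reversals)
```

With a deterministic oracle such as `respond = lambda lv: lv >= 4`, the track descends to the boundary and then oscillates around it, so the reversal mean lands between the last level answered correctly and the first answered incorrectly.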

As to the relationship between Δf0 and Δf1, there are three possible outcomes, each of which has its proper theoretical interpretation. If the two vectors Δf0 and Δf1 are orthogonal, it follows that Δf0 = kΔf1 with k = 0, and vice-versa. This would mean that the pitch resolution and formant peak (i.e., spectral) resolution processes work independently in an auditory segregation task. If the two vectors are positively correlated, then the same equation should hold with k > 0; this outcome would signify that the two processes are either driven by some common higher-level process or that they are, in fact, one and the same. The third outcome has the greatest theoretical interest. If the results suggest that Δf0·Δf1 = k, this would mean that pitch resolution and formant resolution are governed by some form of trade-off law. Such a relation describes a psychoacoustics-bound projection of the Law of Physical Observation (i.e., the Heisenberg Uncertainty Principle, see [13]), shown to hold for various pairs of auditory dimensions (e.g., [14], [10], [7]), including the trade between what Terhardt [15] describes as “virtual” vs. “spectral” pitch [11]. What such an outcome would mean is that, in the processing of concurrent auditory signals, pitch resolution and spectral resolution are based on the analysis of the very same information packet, although the mechanisms accessing this packet may be different for the two tasks.
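The three outcomes can be separated mechanically from paired threshold data by the slope of Δf1 against Δf0 on log-log coordinates: a slope near 0 indicates independence (k = 0), a positive slope a common process, and a slope near -1 the trade-off Δf0·Δf1 = k. A minimal sketch follows; the least-squares fit and the tolerance are arbitrary choices for illustration, not part of the paper's analysis.

```python
import numpy as np

def classify_relation(df0, df1, tol=0.2):
    """Classify paired (Δf0, Δf1) threshold measurements.

    'independent' : Δf1 does not vary with Δf0 (log-log slope ~ 0)
    'correlated'  : Δf1 grows with Δf0 (positive slope)
    'trade-off'   : Δf0 * Δf1 ~ constant (log-log slope ~ -1)
    """
    df0 = np.asarray(df0, dtype=float)
    df1 = np.asarray(df1, dtype=float)
    # slope of the least-squares line through the log-log data
    slope = np.polyfit(np.log(df0), np.log(df1), 1)[0]
    if abs(slope) < tol:
        return "independent"
    if abs(slope + 1.0) < tol:
        return "trade-off"
    return "correlated" if slope > 0 else "other"
```

For example, the pairs (1, 8), (2, 4), (4, 2) have a constant product and classify as a trade-off, while (1, 3), (2, 3), (4, 3) classify as independent.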

3. RESULTS

3.1. Experiment 1

Results of Experiment 1 are shown for two highly experienced subjects in Figures 3a and 3b for the parametric configuration displayed in Table 1, for those five conditions in which the lowest harmonic number of the formant envelope for the low-fundamental-frequency (107-Hz) promel was 8, corresponding to 856 Hz in the 1-kHz and 2568 Hz in the 3-kHz region. Let us look only at the left-hand points, i.e., those obtained at the small Δf0/f0 ratio (0.271, roughly corresponding to a musical major third). In both the 1- and the 3-kHz regions, the two subjects' results agree in that the Weber fractions of segregation based on formant peak differences are larger when all partials of both promels are harmonic than when they are inharmonic in one or both. Inharmonicity, therefore, helps to segregate two promels, and the effect is larger in the 1-kHz than in the 3-kHz frequency region. Comparing the results obtained at the two fundamental frequency ratios, various trends may be observed. One should keep in mind that the Δf0/f0 ratio of 0.411 roughly corresponds to the musical interval of a tritone, i.e., one that is less consonant (and presumably less fusible) than the major third. The comparison shows the existence of a large trade-off in the 1-kHz region when both promels consist of harmonic partials; a similar trade-off in the 3-kHz region is seen only in Subject 1's data. No trade-off, or a negative one, is visible in all conditions in which one or both promels consist of inharmonic partials. In the absence of further data, one may only conclude that, when inharmonic partials are present, the dimensions of formant frequency difference and fundamental frequency difference are not independent.

The differences between the results obtained in the 1-kHz and in the 3-kHz frequency regions could derive from two sources, since, in the 3-kHz region, both the formant frequency range and the fundamental frequencies were exactly three times those in the 1-kHz region. To examine which of these two sources was more likely to have produced the results, the subjects were tested in two further conditions: with the fundamental frequencies in the 300-Hz range (used with the 3-kHz formant region in Figure 3) and the formant frequencies in the 1-kHz region. To generate this type of stimulus, the harmonic numbers under the formant envelope had to be moved downward: the lowest harmonic number in the promel with the 321-Hz fundamental was 3, corresponding to 963 Hz. Results are shown in Figures 4a and 4b for the two subjects. Comparing Fig. 4a with Fig. 3a and Fig. 4b with Fig. 3b, it becomes apparent that, although the two subjects exhibited trends different from each other (mainly in the harmonic-harmonic promel condition), their results were internally consistent. This implies that the 1-kHz vs. 3-kHz frequency region differences observed in Figs. 3a and 3b are, in fact, mainly attributable to differences in fundamental frequency, rather than formant frequency region.

3.2. Experiment 2

The stimulus used in Experiment 1 has one serious shortcoming: the frequency regions of the two superimposed promels can never be made to coincide exactly, because the two consist of non-overlapping sets of partials of two different fundamental frequencies. Experiment 2 was designed as an attempt to solve this difficult problem by doing away with the fixed f0 values and assigning the two promels two distinct fundamental frequency contours moving between the same extremes but in opposite directions. Figure 5 illustrates the stimulus diagram in a time-frequency plane, showing the movement of the fundamental frequency of the two superimposed promels (“Signal 1” and “Signal 2”). Thus, instead of having the fixed f01 and f02 values, the two promels followed a V-shaped or an inverted-V contour, with f01 as the highest and f02 as the lowest frequency. The two promels, therefore, always occupied the identical formant region, with only f1 being different, following the scheme presented in Figure 1. Because of the time-varying nature of the stimulus, a comparison of high vs. low f0 is possible within one stimulus interval. Therefore, a single-interval, rather than a two-interval, forced-choice procedure was adopted. An adaptive staircase procedure tracked the threshold Δf1 at which the two promels could be segregated.
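The opposed f0 contours can be sketched as follows. The geometric (log-linear) interpolation, the sampling parameters, and the simplified formant weighting are assumptions; the paper specifies only the contour shapes and their shared extremes.

```python
import numpy as np

def contour_f0(f0_hi, f0_lo, shape, dur=0.100, fs=16000):
    """Instantaneous f0 trajectory: 'V' falls from f0_hi to f0_lo and
    back; '^' (inverted V) rises from f0_lo to f0_hi and back.
    Log-linear interpolation is an assumption."""
    n = int(dur * fs)
    half = np.linspace(0.0, 1.0, n // 2)
    leg = np.concatenate([half, half[::-1]])[:n]  # 0 -> 1 -> 0
    lo, hi = np.log(f0_lo), np.log(f0_hi)
    if shape == "V":
        return np.exp(hi + (lo - hi) * leg)   # hi -> lo -> hi
    return np.exp(lo + (hi - lo) * leg)       # lo -> hi -> lo

def fm_promel(f0_traj, f1, fs=16000, slope_db_oct=3.0):
    """Promel whose seven partials ride the f0 trajectory.

    Each partial k accumulates the phase integral of k*f0(t); the
    formant weight is evaluated at the partial's mean frequency
    (a simplification of the continuous envelope of Figure 1).
    """
    phase = 2 * np.pi * np.cumsum(f0_traj) / fs
    sig = np.zeros_like(f0_traj)
    for k in range(1, 8):
        mean_freq = k * f0_traj.mean()
        w = 10 ** (-slope_db_oct * abs(np.log2(mean_freq / f1)) / 20)
        sig += w * np.sin(k * phase)
    return sig / np.max(np.abs(sig))
```

One stimulus interval then superimposes an `fm_promel` built on the V contour with one built on the inverted-V contour, the two differing only in their f1 values.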

Results are illustrated in Figures 6a and 6b for the two subjects, respectively. Disappointingly, the only deduction one can make from the data is that fundamental frequency modulation makes the task of promel segregation trivial, as long as the normalized fundamental frequency difference is larger than 0.1-0.2, regardless of the formant frequency region and the fundamental frequency. Note that only harmonic-harmonic partials were used; when even one of the two promels consisted of inharmonic partials, threshold reached the ceiling. Thus, although the experiment did provide useful information concerning the importance of frequency modulation for signal segregation, it failed to fulfill the expectations of embodying a “clean” stimulus paradigm to measure the quantities measured in Experiment 1.

4. DISCUSSION

The experiments described above attempted to examine analytically the processes responsible for auditory segregation of simultaneous pairs of vowels. The main objective of this exercise was to relate auditory scene analysis of speech sounds to auditory processes. In this respect, it represents an undertaking linguistically less relevant, but perhaps providing deeper insight into vowel segregation from the auditory standpoint, than the by-now classic studies of [16], [17], [18]. The results portray the segregation process as being exquisitely sensitive both to the inharmonicity of partials and to frequency excursions of the fundamental, thereby confirming results by [9] on one hand, and by [19] and [20] on the other.

It is also interesting to note that the two difference dimensions, i.e., the difference in pitch and the difference in formant peak, displayed all three relations described in the Methods section. For Subject 1, whenever both promels consisted of inharmonic partials, the Weber fraction for formant peak resolution appeared to be independent of the pitch difference. For Subject 2 most of the time, and for Subject 1 in some conditions containing inharmonic promels, there appeared to be a positive correlation between the two differences, suggesting that the double resolution task may have been handled by the same process. In one case, however, i.e., for pitch rates in the 100-150-Hz range and for formant regions not exceeding 1500 Hz, there was a clear trade-off between pitch resolution and spectral resolution. This instance of trade-off hints at the presence of something analogous to Gabor's Δf·Δt constant [13], which could happen only if pitch were truly encoded as time and formant peak as frequency.

Naturally, the results are incomplete in many ways. We will have to resolve the issue of intersubject variability: are there “analytic” and “synthetic” listeners? More importantly, however, the formant region will have to explicitly overlap with specific formant regions of vowels (i.e., the F2 and F3 regions), while keeping the fundamental frequency constant. We will also have to explore in more detail the issue of the harmonic vs. inharmonic relationship between the two fundamental frequencies - a question of obvious musical relevance. Finally, the results will have to be compared to the predictions of various auditory models, in order to answer the question of whether the presently investigated instance of “primitive segregation” can be explained completely by invoking peripheral phenomena, or whether the intervention of a central agent will be deemed indispensable.

                                   TABLE 1

Δf0/f0 // freq.   Parameter          Harmonic-   Harmonic-    Inharmonic-
region                               Harmonic    Inharmonic   Inharmonic
------------------------------------------------------------------------
0.271 // 1 kHz    f01 (Hz)           107         107          107
                  lowest harm. no.   8           8            7.5
                  f02 (Hz)           136         136          136
                  lowest harm. no.   6           6.5          5.5

0.411 // 1 kHz    f01 (Hz)           107         107          107
                  lowest harm. no.   8           8            7.5
                  f02 (Hz)           151         151          151
                  lowest harm. no.   6           6.5          5.5

0.271 // 3 kHz    f01 (Hz)           321         321          ---
                  lowest harm. no.   8           8            ---
                  f02 (Hz)           408         408          ---
                  lowest harm. no.   6           6.5          ---

0.411 // 3 kHz    f01 (Hz)           321         321          ---
                  lowest harm. no.   8           8            ---
                  f02 (Hz)           453         453          ---
                  lowest harm. no.   6           6.5          ---

0.271 // 1 kHz    f01 (Hz)           321         321          ---
                  lowest harm. no.   3           3            ---
                  f02 (Hz)           408         408          ---
                  lowest harm. no.   2           2.5          ---

0.411 // 1 kHz    f01 (Hz)           321         321          ---
                  lowest harm. no.   3           3            ---
                  f02 (Hz)           453         453          ---
                  lowest harm. no.   2           2.5          ---

("lowest harm. no.": serial number of the lowest partial under the formant envelope)

REFERENCES

1. Bregman, A.S. (1990). Auditory scene analysis: The perceptual organization of sound. Cambridge, Mass.: Bradford Books (MIT Press).

2. McAdams, S. and E. Bigand, eds. (1993). Thinking in sound: The cognitive psychology of human audition. Oxford: Clarendon Press.

3. Cherry, C. (1953). Some experiments on the recognition of speech with one and with two ears. Journal of the Acoustical Society of America, 25: 975-979.

4. van Noorden, L.P.A.S. (1975). Temporal coherence in the perception of tone sequences. Doctoral dissertation, Technische Hogeschool Eindhoven, the Netherlands.

5. Brockx, J.P.L. and S.G. Nooteboom (1982). Intonation and the perceptual separation of simultaneous voices. Journal of Phonetics, 10: 23-36.

6. Bregman, A.S. and P. Doehring (1984). Fusion of simultaneous tonal glides: The role of parallelness and simple frequency relations. Perception and Psychophysics, 36: 251-256.

7. Divenyi, P.L. and S.K. Oliver (1989). Resolution of steady-state sounds in simulated auditory space. Journal of the Acoustical Society of America, 85: 2046-2056.

8. Divenyi, P.L. (1995). Auditory segregation of concurrent signals: An operational definition. In IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics. New Paltz, New York.

9. de Cheveigné, A. (1993). Separation of concurrent harmonic sounds: Fundamental frequency estimation and a time-domain cancellation model of auditory processing. Journal of the Acoustical Society of America, 93: 3271-3290.

10. Perrott, D.R. (1984). Concurrent minimum audible angle: A re-examination of the concept of auditory spatial acuity. Journal of the Acoustical Society of America, 75: 1201-1206.

11. Singh, P. (1987). Perceptual organization of complex-tone sequences: A tradeoff between pitch and timbre? Journal of the Acoustical Society of America, 82: 886-899.

12. Bregman, A.S. (1978). The formation of auditory streams. In Attention and Performance, J. Requin, ed. Hillsdale, NJ: Erlbaum, pp. 63-76.

13. Brillouin, L. (1962). Science and information theory, 2nd ed. New York: Academic Press.

14. Ronken, D.A. (1971). Some effects of bandwidth-duration constraints on frequency discrimination. Journal of the Acoustical Society of America, 49: 1232-1242.

15. Terhardt, E. (1974). Pitch, consonance, and harmony. Journal of the Acoustical Society of America, 55: 1061-1069.

16. Assmann, P.F. and A.Q. Summerfield (1989). Modeling the perception of concurrent vowels: Vowels with the same fundamental frequency. Journal of the Acoustical Society of America, 85: 327-338.

17. Assmann, P.F. and A.Q. Summerfield (1990). Modeling the perception of concurrent vowels: Vowels with different fundamental frequencies. Journal of the Acoustical Society of America, 88: 680-697.

18. Summerfield, A.Q. and P.F. Assmann (1991). Perception of concurrent vowels: Effects of harmonic misalignment and pitch-period asynchrony. Journal of the Acoustical Society of America, 89: 1364-1377.

19. McAdams, S. (1989). Segregation of concurrent sounds I: Effects of frequency modulation coherence. Journal of the Acoustical Society of America, 86: 2148-2159.

20. Marin, C.M.H. and S. McAdams (1991). Segregation of concurrent sounds II: Effects of spectral envelope tracing, frequency modulation coherence, and frequency modulation width. Journal of the Acoustical Society of America, 89: 341-351.
