3.4 The Source-Filter Model of Speech Production

Because one of the main applications of spectral envelopes will be the improvement of the analysis and synthesis of the singing voice, we will take a look at a somewhat simplified but practical model of speech production. It is well grounded in phonetic and phonological research and also applies to the singing voice. For more details see [CY96].

**Figure 2.31:** Sagittal cross section of the vocal tract

Only pulmonic initiation of speech (generation of an airstream by releasing air from the lungs) is of interest here. We distinguish two types of voicing:

Voiced speech: is generated by the modulation of the airstream of the lungs by periodic opening and closing of the vocal folds in the glottis or larynx. This is used e.g. for vowels and nasal consonants like /m/, /n/.
Unvoiced speech: is generated by a constriction of the vocal tract (see figure 2.31) narrow enough to cause turbulent airflow, which results in noise, e.g. in fricatives like /f/, /s/, or breathy voice (where the constriction is in the glottis). Unvoiced plosives like /p/, /t/, /k/ fall into this category, too.

Of course, these two types can be applied simultaneously, e.g. in voiced fricatives like /z/ or /v/, or in the voiced plosives /b/, /d/, /g/.

**Figure 2.32:** Source-filter model of speech production

From a signal processing point of view, this behaviour can be modeled by a linear system (see section 2.1.4) as shown in figure 2.32. The source is modeled by either an impulse train for the voiced speech component, or a random signal for the unvoiced component, yielding an excitation signal e(n) with z-transform E(z). The actual source is the waveform of the vibration of the vocal folds or the turbulence caused by a constriction. Its spectrum is obtained by filtering the excitation with the source filter S(z). Then, the effect of the shape of the vocal tract is modeled by V(z). Finally, the radiation characteristics of the lips are taken into account by L(z).
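The two excitation types can be sketched in a few lines of Python with NumPy. This is only an illustrative sketch: the function names, the sampling rate of 16 kHz and the fundamental frequency of 100 Hz are assumptions made for the example, not values from the text.

```python
import numpy as np

def impulse_train(n_samples, f0, fs):
    """Voiced excitation: one unit impulse per pitch period.

    The period is rounded to a whole number of samples (a simplification).
    """
    period = int(round(fs / f0))
    e = np.zeros(n_samples)
    e[::period] = 1.0
    return e

def white_noise(n_samples, seed=0):
    """Unvoiced excitation: Gaussian white noise (fixed seed for repeatability)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(n_samples)

fs = 16000   # sampling rate in Hz (assumed)
f0 = 100.0   # fundamental frequency in Hz (assumed)
n = 1600     # 0.1 s of signal, i.e. exactly 10 pitch periods

e_voiced = impulse_train(n, f0, fs)
e_unvoiced = white_noise(n)

# Magnitude spectrum of the voiced excitation: energy only at multiples of f0.
E_voiced = np.abs(np.fft.rfft(e_voiced))
bin_spacing = fs / n                                  # 10 Hz per DFT bin
harmonic_bins = np.flatnonzero(E_voiced > 1e-9)
harmonic_freqs = harmonic_bins * bin_spacing          # 0, 100, 200, ... Hz
```

The impulse train has a flat line spectrum with components at the harmonics of f0 only, while the noise has a flat spectrum on average. In the model, all spectral shaping is therefore left to the filters that follow.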

By the associativity of linear systems (equation (2.16)), these three filters can be combined into a single filter H(z) by multiplying their respective transfer functions:

H(z) = S(z) V(z) L(z)
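The equivalence of the cascade and the combined filter can be checked numerically. The sketch below is illustrative only: the three filters are hypothetical stand-ins (a one-pole low-pass for S(z), a single two-pole resonator for V(z), and a first difference for L(z)), and `iir_filter` is a helper written for this example.

```python
import numpy as np

def iir_filter(b, a, x):
    """Apply H(z) = B(z)/A(z) to x via the direct-form difference equation."""
    b = np.asarray(b, dtype=float)
    a = np.asarray(a, dtype=float)
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y[n] = acc / a[0]
    return y

fs = 16000.0                       # sampling rate in Hz (assumed)
r, fc = 0.97, 500.0                # pole radius and resonance frequency (assumed)
theta = 2.0 * np.pi * fc / fs

# Hypothetical stand-ins for the three filters, as (numerator, denominator).
S = ([1.0], [1.0, -0.9])                              # source: one-pole low-pass
V = ([1.0], [1.0, -2.0 * r * np.cos(theta), r * r])   # vocal tract: one resonance
L = ([1.0, -1.0], [1.0])                              # lips: first difference

# H(z) = S(z) V(z) L(z): multiplying polynomials in z^-1 is convolving coefficients.
b_h = np.convolve(np.convolve(S[0], V[0]), L[0])
a_h = np.convolve(np.convolve(S[1], V[1]), L[1])

e = np.zeros(400)
e[::160] = 1.0                     # impulse-train excitation, 100 Hz at 16 kHz

y_cascade = iir_filter(*L, iir_filter(*V, iir_filter(*S, e)))
y_combined = iir_filter(b_h, a_h, e)
```

Up to rounding error, `y_cascade` and `y_combined` are the same signal, and because the systems are linear and time-invariant the order of the three filters in the cascade does not matter either.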

While the source spectrum S(z) and lip radiation L(z) are mostly constant and well known a priori, the vocal tract transfer function V(z) is the part that characterizes articulation and thus the content of the speech being uttered. It therefore deserves special attention and a closer look at how it can be modeled adequately.

**Figure 2.33:** Acoustic tube model of the vocal tract

The Acoustic Tube Model

The vocal tract can be viewed as an acoustic tube of varying diameter. We can abstract from its curvature and divide it into cylindrical sections of equal width as shown in figure 2.33 (to be rotated around the horizontal axis). Depending on the shape of the acoustic tube (mainly influenced by tongue position), a sound wave travelling through it is reflected in such a way that interference generates resonances at certain frequencies. These resonances are called formants. Their location largely determines the speech sound that is heard.
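How the tube shape enters can be illustrated with two textbook relations that the text does not spell out, so they should be read as assumptions. For a lossless tube, the reflection coefficient at the junction between sections with areas A_k and A_{k+1} is commonly written (A_{k+1} - A_k)/(A_{k+1} + A_k), although sign conventions differ between authors. A uniform tube that is closed at the glottis and open at the lips resonates at odd multiples of c/(4l). The tube length of 17.5 cm, the speed of sound of 350 m/s and the area values below are illustrative choices.

```python
import numpy as np

def reflection_coefficients(areas):
    """Reflection coefficient at each junction between adjacent tube sections."""
    a = np.asarray(areas, dtype=float)
    return (a[1:] - a[:-1]) / (a[1:] + a[:-1])

def uniform_tube_resonances(length_m, n_resonances, c=350.0):
    """Resonances (Hz) of a uniform lossless tube, closed at one end, open at the other."""
    k = np.arange(1, n_resonances + 1)
    return (2 * k - 1) * c / (4.0 * length_m)

# A uniform tube has no internal reflections; its resonances depend on length only.
uniform_areas = [3.0, 3.0, 3.0, 3.0]           # cross-sections in cm^2 (hypothetical)
r_uniform = reflection_coefficients(uniform_areas)
formants = uniform_tube_resonances(0.175, 3)   # about 500, 1500 and 2500 Hz

# A hypothetical non-uniform area function: narrow back cavity, wide front cavity.
r_shaped = reflection_coefficients([1.0, 1.0, 4.0, 4.0])
```

The uniform case roughly corresponds to a neutral vowel. Any change in the area function produces non-zero reflection coefficients at the junctions, which is what moves the resonances away from these evenly spaced values.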

As in every model, some of the complexity of the vocal tract has been omitted. First of all, the nasal tract is completely ignored. This second cavity is very irregularly shaped and, through coupling, introduces additional resonances and anti-resonances (nasal zeros). Fortunately, the zeros are not vital for the recognition of the speech sound. ^3.8 Next, certain speech sounds like laterals (e.g. /l/) have a tongue configuration that is not at all well described by a simple acoustic tube. Also, (non-linear) coupling effects between the vocal tract and the glottis are not taken into account. Finally, the model accounts neither for the viscosity of the walls of the vocal tract nor for the damping that occurs.

Some of these drawbacks also apply to the source-filter model, but taken as a model of the speech sound rather than of speech production, it serves remarkably well. In particular, it provides us with a definition of a spectral envelope for the voice: the spectral envelope is the transfer function of the filter part H(z) of the source-filter model, as in equation (2.29). Thus, to estimate the spectral envelope, we need to separate H(z) from E(z), the z-transform of the excitation signal. Because the final speech signal is the convolution of the source signal with the impulse response of the filter H(z), this separation is called deconvolution. Chapter 3 will present some methods to accomplish this.
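The relation between convolution in time and multiplication in frequency, and hence the idea of deconvolution, can be made concrete with a toy example. The sketch is not one of the estimation methods referred to above: it assumes that the excitation is known exactly, which is never the case for real speech, and the filter (a decaying 500 Hz resonance) and all parameter values are hypothetical.

```python
import numpy as np

fs = 16000.0                     # sampling rate in Hz (assumed)
rng = np.random.default_rng(0)

# Known excitation e(n): white noise (unvoiced case), chosen for this toy example.
e = rng.standard_normal(256)

# Hypothetical filter impulse response h(n): a decaying resonance at 500 Hz.
n = np.arange(64)
h = 0.95 ** n * np.cos(2.0 * np.pi * 500.0 * n / fs)

# Speech signal: convolution of the excitation with the impulse response.
y = np.convolve(e, h)

# With an FFT at least as long as y, convolution becomes multiplication.
n_fft = 512                      # >= len(e) + len(h) - 1 = 319
E = np.fft.rfft(e, n_fft)
H = np.fft.rfft(h, n_fft)
Y = np.fft.rfft(y, n_fft)

# Deconvolution with known E: divide the spectra to recover the filter.
H_est = Y / E
envelope_db = 20.0 * np.log10(np.abs(H_est))   # spectral envelope in dB
```

In practice only the speech signal is observed, so E is unknown and the division is not available. Estimation methods therefore have to rely on assumptions about the excitation, for example that its spectrum is flat compared with the smooth shape of H.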

The source-filter model also applies to a large class of instruments, especially those which use a resonating body to amplify the oscillations of a source. For example, in string instruments like the guitar or violin, the body has resonant frequencies (and anti-resonances), so that it acts as a filter (albeit one that is constant over time, unlike the voice). Likewise, in wind instruments like the trumpet, the spectral envelope is largely independent of the pitch but varies with playing style (dynamics, lip pressure and stiffness), and it determines the timbre to a large degree.


Diemo Schwarz
1998-09-07