Spectral Envelope Estimation and Representation for
Sound Analysis-Synthesis
Diemo Schwarz
(schwarz@ircam.fr) ·
Xavier Rodet (rod@ircam.fr)
Abstract
Spectral envelopes are very useful in sound analysis and synthesis:
- Connection with production and perception models
- Ability to capture and to manipulate important properties of sound using easily understandable "musical" parameters
- Estimation and representation requirements
- Strengths and weaknesses of estimation methods (LPC, cepstrum, discrete cepstrum) and representation methods (filter coefficients, sampled, break-point functions, splines, formants)
- Proposed high-level approach to handling
- Software developed at Ircam makes important applications of spectral envelopes in the domain of additive analysis-synthesis possible.
Properties of Spectral Envelopes
Envelope fit: A spectral envelope is a curve that envelopes the magnitude short-time spectrum (STS), i.e. it wraps tightly around it, linking the peaks of the sinusoidal partials or passing close to the maxima of non-sinusoidal spectra.
Smoothness: A certain smoothness of the curve is required: it should not oscillate erratically (fluctuate too wildly over frequency), but give a general idea of the distribution of the signal's energy over frequency.
Adaptation to fast spectrum variations: A spectral envelope is defined relative to a short segment of the signal (typically between 10 and 50 ms). When the STS varies rapidly from one analysis frame to the next, the spectral envelope should follow it precisely.
Estimation
Requirements
The properties of spectral envelopes must be satisfied, plus the requirement of:
Robustness: The estimation should yield precise and smooth spectral envelopes for a wide range of signals with very different characteristics.
Methods
- Linear Predictive Coding (LPC): Models the signal with an all-pole filter; yields the all-pole filter coefficients.
- Cepstrum: Smooths the STS by low-pass filtering of the log magnitude spectrum; yields cepstrum coefficients.
- Discrete Cepstrum: Computed from distinct points in the frequency-amplitude plane, e.g. the spectral peaks of an STS (sinusoidal partials); a minimal estimation sketch follows after this list. Precision is improved by using a nonlinear frequency scale (e.g. logarithmic or mel) that reflects the frequency resolution of the human ear (coarser at high frequencies).
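The following is a minimal sketch of a discrete cepstrum fit by regularized least squares on the cosine basis of the log-amplitude envelope, assuming peak frequencies and amplitudes from a sinusoidal analysis. The function names, the regularization constant, and the plain linear frequency scale are illustrative choices, not the exact algorithm of the Ircam library (which also offers the mel-scale variant mentioned above).

```python
import numpy as np

def discrete_cepstrum_envelope(peak_freqs_hz, peak_amps, sr, order=12, reg=1e-5):
    """Fit cepstral coefficients c[0..order] to spectral peaks so that
    log|envelope(f)| = c0 + 2 * sum_i c_i * cos(2*pi*i*f/sr)."""
    f = np.asarray(peak_freqs_hz, dtype=float) / sr            # normalized to [0, 0.5]
    log_a = np.log(np.maximum(np.asarray(peak_amps, float), 1e-12))
    i = np.arange(order + 1)
    A = np.cos(2.0 * np.pi * np.outer(f, i))                   # cosine basis matrix
    A[:, 1:] *= 2.0
    # regularized least squares: (A'A + reg*I) c = A' log_a
    c = np.linalg.solve(A.T @ A + reg * np.eye(order + 1), A.T @ log_a)
    return c

def evaluate_envelope(c, freqs_hz, sr):
    """Log-amplitude envelope at arbitrary frequencies (np.exp gives linear amplitude)."""
    f = np.asarray(freqs_hz, dtype=float) / sr
    i = np.arange(len(c))
    B = np.cos(2.0 * np.pi * np.outer(f, i))
    B[:, 1:] *= 2.0
    return B @ c
```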
Comparison
The figure shows the weaknesses of LPC and cepstrum estimation: for high-pitched sounds, both descend into the gaps between the partials. Low-order LPC estimation is too smooth, and the cepstrum averages the spectrum rather than linking the peaks.
These problems are avoided by the discrete cepstrum method. Nevertheless, LPC and cepstrum remain well suited to the residual noise, where the discrete cepstrum cannot be used.
Robustness of the estimation can be improved by using a composite envelope: discrete cepstrum for the voiced part below the maximum voiced frequency, LPC above it [1][2].
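As a rough illustration, two sampled envelopes could be merged around the maximum voiced frequency as sketched below; the crossfade width and function names are assumptions, since the exact blending used in [1][2] is not detailed here.

```python
import numpy as np

def composite_envelope(freqs_hz, env_dcep, env_lpc, mvf_hz, fade_hz=200.0):
    """Merge two envelopes sampled on the same frequency grid: discrete-cepstrum
    values below the maximum voiced frequency, LPC values above it, with a
    short linear crossfade around the boundary."""
    f = np.asarray(freqs_hz, dtype=float)
    w = np.clip((f - (mvf_hz - fade_hz / 2)) / fade_hz, 0.0, 1.0)  # 0 below, 1 above
    return (1.0 - w) * np.asarray(env_dcep) + w * np.asarray(env_lpc)
```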
Representation
A unified high-level representation for use in musical synthesis should fulfill the following requirements:
Preciseness: Describe an arbitrary spectral envelope (from estimation or given manually) as precisely as possible.
Stability: Small changes in the spectral envelope, e.g. due to noise, must not lead to large changes in the representation, but must result in equally small changes.
Locality in frequency: Achieve a local change of a spectral envelope by a simple change of parameters.
Flexibility and ease of manipulation: Allow various manipulations that are easy to specify, have an exactly defined outcome, and whose effect on the spectrum is easily understood.
Speed of synthesis: The representation should be usable for synthesis as directly as possible, without first converting to a different form at high computational cost.
Space in memory: The representation must not take up too much space.
Manual input: The representation should be easy to specify manually or by textual input of parameters.
Proposed Representations
Filter coefficients: Cepstrum coefficients or one of the several types of LPC coefficients.
Sampled representation: The spectral envelope is sampled at n frequency points, equidistant or nonlinearly spaced.
Geometric representations: Piecewise linear functions, splines (quadratic or cubic interpolation), or points placed on the maxima, minima, and inflection points of the envelope.
Formants: Resonances of a resonator (e.g. the vocal tract), appearing as maxima of the spectral envelope. They are combined by multiplication or addition, corresponding to a serial or parallel structure of synthesis filters.
Formant waveforms (FOFs): Represent a formant as an elementary waveform; the FOFs add up to build the spectrum.
Basic formants: A simpler description of the formants of a spectral envelope, using only the parameters center frequency, amplitude, and bandwidth, combined by addition (see the sketch after this list).
Fuzzy formants: Approximate formant locations, given as regions within a sampled spectral envelope where a formant is assumed to exist.
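To make the basic formant representation concrete, the sketch below builds a sampled envelope by adding simple resonance-shaped bumps. The bump shape (a Lorentzian curve) and the example vowel values are illustrative assumptions; the representation itself only prescribes center frequency, amplitude, bandwidth, and addition.

```python
import numpy as np

def formant_envelope(freqs_hz, formants):
    """Spectral envelope as a sum of basic formants, each given as a
    (center_frequency_hz, amplitude, bandwidth_hz) triple."""
    f = np.asarray(freqs_hz, dtype=float)
    env = np.zeros_like(f)
    for fc, amp, bw in formants:
        env += amp / (1.0 + ((f - fc) / (bw / 2.0)) ** 2)  # Lorentzian bump
    return env

# Rough /a/-like vowel on a linear frequency grid (values are only indicative)
grid = np.linspace(0.0, 5000.0, 512)
vowel_a = formant_envelope(grid, [(700, 1.0, 130), (1220, 0.5, 120), (2600, 0.25, 160)])
```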
Comparison of Representations
Scores (++, +, o, -, --) indicating fulfillment of requirements.
Representation | Stability | Locality | Flexibility / Ease of Manipulation | Speed of Synthesis TD/FD | Space | Manual Input
---------------|-----------|----------|------------------------------------|--------------------------|-------|-------------
Filter Coef.   | ++        | -        | -- / -                             | ++ / o                   | +     | --
Sampled        | ++        | ++       | ++ / +                             | - / ++                   | o     | +
Geometric      | -         | +        | + / ++                             | - / +                    | +     | ++
Formants       | -         | +        | ++ / ++                            | + / o                    | ++    | ++
Synthesis
In synthesis from scratch, a spectral envelope is given directly as part of the synthesis parameters.
In resynthesis, an input signal is modified so as to respect the desired spectral envelope.
Methods
For filtering, the spectral envelope has to be converted to filter coefficients (time-domain filtering) or to a transfer function (frequency-domain filtering, e.g. with SuperVP).
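A bare-bones sketch of the frequency-domain case, assuming a sampled envelope given as linear gains at the FFT bin frequencies; real processing (e.g. in SuperVP) works frame by frame with windowing and overlap-add and is considerably more refined.

```python
import numpy as np

def filter_frame_fd(frame, target_gains, fft_size=2048):
    """Impose a spectral envelope on one windowed signal frame by multiplying
    its spectrum with the envelope sampled at the fft_size//2 + 1 bin
    frequencies, then transforming back to the time domain."""
    spectrum = np.fft.rfft(frame, n=fft_size)
    return np.fft.irfft(spectrum * target_gains, n=fft_size)
```

To replace rather than merely impose an envelope, the frame's own envelope would first be divided out (whitening) before multiplying by the target.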
Additive synthesis: Sum of sinusoidal partials, with amplitudes given by the sinusoidal spectral envelope, plus a residual noise whose spectral density is given by the noise spectral envelope (obtained by filtering white Gaussian noise).
FFT⁻¹ method of additive synthesis: Allows a speed gain of 10 to 30. Applying the sinusoidal spectral envelope is straightforward, and synthesizing the residual according to the noise spectral envelope is easy and inexpensive: random values are simply added in the desired frequency bins.
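For orientation, the sketch below implements the plain oscillator-bank version of sinusoids+noise synthesis driven by two spectral envelopes; it is not the FFT⁻¹ method itself. The envelope arguments are assumed to be callables (e.g. an interpolation of a sampled envelope) that accept NumPy arrays, and the single-block noise shaping stands in for frame-by-frame processing with overlap-add.

```python
import numpy as np

def sinusoids_plus_noise(f0, sine_env, noise_env, sr=44100, dur=0.5):
    """Harmonic partials with amplitudes from sine_env(f), plus white Gaussian
    noise shaped in the frequency domain by noise_env(f)."""
    n = int(sr * dur)
    t = np.arange(n) / sr

    # Harmonic part: one partial per harmonic of f0 up to the Nyquist frequency
    harmonics = np.arange(1, int((sr / 2) // f0) + 1) * f0
    sines = sum(sine_env(f) * np.sin(2 * np.pi * f * t) for f in harmonics)

    # Noise part: shape the spectrum of white Gaussian noise by the noise envelope
    spec = np.fft.rfft(np.random.randn(n))
    bins = np.fft.rfftfreq(n, 1.0 / sr)
    noise = np.fft.irfft(spec * noise_env(bins), n=n)

    return sines + noise
```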
Applications
The proposed high-level approach to spectral envelopes can simplify the problem of controlling sinusoidal partials for additive synthesis, and manipulating them in a sensible way [3].
- Drastically reduced number of parameters
- Parameter sets which are easily understandable (e.g. formants)
- Independent frequency and amplitude control
- Modeling the residual noise part by filtering white noise with spectral envelopes renders this component of sound accessible to manipulation.
- Unified high-level handling of noise and harmonic parts
- Manipulation can affect both parts synchronously, if this is desired.
- A function library and programs have been developed at Ircam [4]. They allow spectral envelope estimation and the application of spectral envelopes to sound transformation and synthesis.
- Sinusoidal and noise spectral envelopes are used in the real-time synthesis system jMax using the FFT⁻¹ method [5].
Application to the Singing Voice
Spectral envelopes are necessary for modification and synthesis of the singing voice in a sensible manner.
Many aspects of the expressivity of the singing voice depend on the spectral envelope (e.g. spectral tilt).
A new type of high quality singing voice synthesis is possible:
- To preserve the rapid changes in transients (e.g. plosives), and the noise in fricatives, these are best synthesised with the harmonic sinusoids+noise model, controlled by spectral envelopes in sampled representation.
- For precise formant locations in the steady part of vowels, the formant representation is used.
- With morphing between fuzzy and precise formants, it is then possible to interface the excellent generation of vowels by formant synthesis with the flexibility of general additive synthesis, for instance in the generalized graphical synthesis control program Diphone Studio.
Conclusion
Spectral envelopes allow the timbre of a sound to be influenced to a great degree; composers can obtain a desired effect by using high-level representations.
To the performer, the real-time application of spectral envelope manipulation greatly enhances expressivity through easily understandable and "musical" parameters.
Each representation has its strong points -> use and combine all of them in an object-oriented class hierarchy.
- All the programs developed at Ircam use the standardized, open, and extensible Sound Description Interchange Format (SDIF) [6][7] to facilitate the exchange of data with well-defined semantics between programs, hardware architectures, and institutions. With more and more analysis-synthesis tools being ported to SDIF, this will create important synergetic effects in research and creation.
- See also the chapter Spectral Envelopes and Additive+Residual Analysis-Synthesis in the forthcoming book The Sound of Music, J. Beauchamp editor.
Bibliography
[1] Y. Stylianou, J. Laroche, E. Moulines. High Quality Speech Modification based on a Harmonic+Noise Model. Proc. EUROSPEECH, 1995.
[2] M. Campedel-Oudot. Étude du modèle sinusoïdes et bruit pour le traitement de la parole. Estimation robuste de l'enveloppe spectrale. PhD thesis, ENST, Paris, 1998.
[3] A. Freed, X. Rodet, Ph. Depalle. Synthesis and Control of Hundreds of Sinusoidal Partials on a Desktop Computer without Custom Hardware. ICSPAT, 1992.
[4] D. Schwarz. Spectral Envelopes in Sound Analysis and Synthesis. Diploma thesis, Universität Stuttgart, Fakultät Informatik, Germany, 1998.
[5] F. Déchelle, M. DeCecco, E. Maggi, N. Schnell. jMax Recent Developments. Proc. ICMC, 1999.
[6] D. Virolle, D. Schwarz, X. Rodet. Sound Description Interchange Format. Web page.
http://www.ircam.fr/sdif
[7] M. Wright, A. Chaudhary, A. Freed, S. Khoury, D. Wessel. Audio Applications of the Sound Description Interchange Format Standard. AES 107th convention, 1999.