2.1.1 Digitizing sound
An acoustical signal p(t) can be converted to an electrical signal
e(t) and vice versa using microphones and loudspeakers. The
electrical signal e(t) can be represented in a digital format as a
series of digitally coded integers s(n). To convert e(t) to a
digital signal, two steps are necessary. First, the signal is
sampled at fixed time intervals Ts. The value of the sampling
frequency Fs = 1 / Ts is determined by the highest frequency
component Fm present in the signal. According to the Nyquist
theorem, the sampling frequency should be at least twice the
maximum frequency contained in the signal: Fs > 2Fm. Second, the
values e(nTs) must be quantized to fit an integer value. The
simplest method is to round the values e(nTs) to the nearest
integer to obtain the series s(n). The digitization is realized by
an analog-to-digital converter (ADC). With a digital-to-analog
converter (DAC) the digital signal s(n) can be converted back to an
electrical signal and fed to a loudspeaker
[Mat69].
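As a brief illustration, the following Python sketch (using numpy; the
440 Hz sine and the 16-bit integer range are arbitrary example choices,
not taken from the text) samples a signal at Fs = 44100 Hz and rounds
the values to integers:

    import numpy as np

    Fs = 44100                     # sampling frequency, well above 2*Fm
    Ts = 1.0 / Fs                  # sampling interval
    Fm = 440.0                     # highest (and only) frequency component
    n = np.arange(Fs)              # one second of sample indices
    e = np.sin(2 * np.pi * Fm * n * Ts)          # e(nTs), the sampled signal
    s = np.round(e * 32767).astype(np.int16)     # quantize to 16-bit integers s(n)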
This correspondence between series of numbers and sounds allows us to
use digital methods and devices for the synthesis and manipulation of
sounds. Instead of using existing, digitized sounds, we can apply
digital synthesis techniques to create new, artificial sounds that do
not exist in nature. A whole new field of undiscovered sounds opens
up. The sound samples can be calculated out of real time, i.e. the
sound is computed in its entire length and played back once the
synthesis is finished. Alternatively, the sound can be generated in
real time: the synthesis and the playback are interleaved such that
the sound is heard at the time of the calculation.
More than one sound signal can be generated at the same time. If a
stereo signal is desired, two signals must be synthesized. In the
general case N signals can be generated. Each signal is sometimes
called a channel or a track. The most convenient and most common way
to store a multi-channel sound is to interleave the samples of the
distinct channels. We will call the set of N samples, one for each of
the N channels, a frame.
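A minimal sketch of this interleaved layout for a stereo (N = 2)
signal, assuming numpy and arbitrary test data, could look as follows:

    import numpy as np

    n_samples, n_channels = 1024, 2
    left = np.random.randn(n_samples).astype(np.float32)    # channel 0
    right = np.random.randn(n_samples).astype(np.float32)   # channel 1
    # interleaved storage: each frame holds one sample per channel -> L0 R0 L1 R1 ...
    frames = np.empty(n_samples * n_channels, dtype=np.float32)
    frames[0::n_channels] = left
    frames[1::n_channels] = right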
Since any audible sound can be represented as a series of sampled
values, theoretically any sound can be generated by the computer. The
digital representation of a sound, however, requires an enormous
amount of data, and it is therefore inconceivable to generate this
data by hand. To describe and facilitate the generation of sound,
synthesis models are used. These models take a number of control
parameters and produce a digitized sound signal.
Two main categories of models can be distinguished. The first
category models the sound signal and comes historically first. This
category is generally called signal modeling.
Signal models describe the acoustical structure of the sound. The
desired waveform is described using signal processing techniques such
as oscillators, filters, and amplifiers.
The second category models the acoustical causes of sound
generation. This category is generally called physical modeling.
Physical models start from a description of the mechanical causes that
produce the sound. These models describe the sound production
mechanisms and not the resulting waveform [Ris93],
[DP93].
There are different techniques to translate physical descriptions of
musical instruments into a digital synthesis technique. Three main
approaches can be distinguished: the wave-guide models
[Smi96,Smi97], the modal descriptions of
instruments [MA93,EIC95], and the mechanical
models [CLF93]. In this text we will only consider
the signal models.
2.1.2 Signal models
Among the signal models some of the well known techniques are
[DP93,Roa96]:
- Sampling,
- Wave table synthesis,
- Additive synthesis,
- Phase vocoder,
- Granular synthesis,
- Formantic waveform synthesis,
- Subtractive synthesis,
- Wave shaping,
- Frequency modulation.
The sampling technique stores concrete sounds, often recordings of
musical instruments, into tables. The synthesis consists of
reproducing the sound by reading the values in the table. The
frequency of the sound can be modified by changing the speed at which
the values in the table are read, and an amplitude curve can be
applied to change the dynamics and duration. The technique is very
simple but effective for producing rich sounds quickly.
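A minimal sketch of this idea in Python (using numpy; the stored
recording is replaced here by a generated sine, and the speed factor
and envelope are arbitrary example values):

    import numpy as np

    fs = 44100
    table = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)  # stand-in for a recorded sound
    speed = 1.5                                # read 1.5x faster: higher pitch, shorter sound
    idx = np.arange(0, len(table) - 1, speed)  # fractional read positions
    i = idx.astype(int)
    frac = idx - i
    out = (1 - frac) * table[i] + frac * table[i + 1]      # linear interpolation
    out *= np.linspace(1.0, 0.0, len(out))     # amplitude curve for dynamics and duration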
Like sampling, wave table synthesis reads the values stored in a
table to produce the sound. The table, however, contains only one
period of the waveform of the signal. It is a fast technique to
implement sine-wave, rectangular, or triangular wave oscillators
[Mat69]. The waveform stored in the table can have any
shape, and may vary during the synthesis.
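A sketch of such a table-lookup oscillator (the table size of 512
samples and the 440 Hz output frequency are assumptions made for the
example):

    import numpy as np

    fs = 44100
    table = np.sin(2 * np.pi * np.arange(512) / 512)   # one period of the waveform
    freq, dur = 440.0, 1.0
    phase_inc = freq * len(table) / fs                 # table positions advanced per sample
    phase = (phase_inc * np.arange(int(fs * dur))) % len(table)
    out = table[phase.astype(int)]                     # nearest-sample table lookup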
In additive synthesis, complex sounds are produced by the
superposition of elementary sounds. The goal is for the constituent
sounds to fuse together, and the result to be perceived as a single
sound. Any almost periodic sound can be approximated as a sum of
sinusoids, each sinusoid controlled in frequency and amplitude.
Additive synthesis provides great generality, making it possible to
produce almost any sound. A problem arises, however, because of the
large amount of
data to be specified for each note: two control functions must be
specified for each component. The necessary data can be deduced from
frequency analysis techniques such as the Fourier transform.
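A minimal additive-synthesis sketch, assuming ten harmonic partials
with simple decaying amplitude envelopes (the envelopes and the number
of partials are arbitrary, not derived from an analysis):

    import numpy as np

    fs, dur, f0 = 44100, 1.0, 220.0
    t = np.arange(int(fs * dur)) / fs
    out = np.zeros_like(t)
    for k in range(1, 11):
        amp = (1.0 / k) * np.exp(-3.0 * t)            # amplitude control function of partial k
        out += amp * np.sin(2 * np.pi * k * f0 * t)   # frequency control: fixed at k * f0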
The phase vocoder is an analysis/synthesis technique based upon the
Fourier transform. It is similar to additive synthesis; the phase
vocoder, however, does not require any hypothesis about the analyzed
signal apart from its slow variation. Both additive synthesis and the
phase vocoder separate amplitude and frequency information in time. This
separation allows for such transformations as changing the duration of
the sound without changing its frequency contents, or transposing the
signal but keeping its duration constant [Moo78].
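A much-simplified phase-vocoder time stretch may help make this
concrete; the following is only a sketch (fixed Hann window, analyzed
magnitudes kept as-is, phase advance rescaled to the synthesis hop),
not a complete implementation:

    import numpy as np

    def stretch(x, factor, n_fft=1024, hop_a=256):
        """Change the duration of x by 'factor' without changing its pitch."""
        hop_s = int(round(hop_a * factor))               # synthesis hop size
        win = np.hanning(n_fft)
        n_frames = 1 + (len(x) - n_fft) // hop_a
        bins = np.arange(n_fft // 2 + 1)
        omega = 2 * np.pi * bins * hop_a / n_fft         # expected phase advance per hop
        prev_phase = np.zeros(len(bins))
        acc_phase = np.zeros(len(bins))
        y = np.zeros(n_fft + hop_s * n_frames)
        for t in range(n_frames):
            spec = np.fft.rfft(x[t * hop_a : t * hop_a + n_fft] * win)
            mag, phase = np.abs(spec), np.angle(spec)
            dphi = phase - prev_phase - omega            # deviation from the expected advance
            dphi = (dphi + np.pi) % (2 * np.pi) - np.pi  # wrap to [-pi, pi)
            acc_phase += (omega + dphi) * (hop_s / hop_a)   # rescale the true advance
            prev_phase = phase
            frame = np.fft.irfft(mag * np.exp(1j * acc_phase)) * win
            y[t * hop_s : t * hop_s + n_fft] += frame    # overlap-add at the synthesis hop
        return y

For example, stretch(x, 2.0) yields a signal roughly twice as long with
the same frequency contents.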
Granular synthesis starts from the idea of dividing the sound in time
into a sequence of simple elements called grains. The parameters of
this technique are the waveform of the grain, its duration, and its
temporal location. The first type of granular synthesis consists of
using, as grain waveform, the waveform of real sound segments and then
using the grain waveforms in synthesis in a different order or at
various times. This method is reminiscent of the synthesis of sampled
sounds, except that in this case the elements are no longer complete
sounds but fragments of them. The second type consists of using
waveforms such as frequency-modulated Gaussian functions, which have
the property of localizing the energy in the time-frequency plane.
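A sketch of the second type, with sinusoidal grains under a Gaussian
envelope scattered at random times and frequencies (all numerical
values are arbitrary example choices):

    import numpy as np

    fs = 44100
    rng = np.random.default_rng(0)

    def grain(freq, dur):
        """Sinusoidal grain under a Gaussian envelope, localized in time and frequency."""
        t = np.arange(int(dur * fs)) / fs
        env = np.exp(-0.5 * ((t - dur / 2) / (dur / 6)) ** 2)
        return env * np.sin(2 * np.pi * freq * t)

    out = np.zeros(fs)                                # one second of output
    for _ in range(200):                              # 200 grains at random positions
        g = grain(rng.uniform(200.0, 2000.0), 0.05)
        start = rng.integers(0, len(out) - len(g))
        out[start:start + len(g)] += g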
Formantic waveform synthesis (abbreviated FOF, from the French Forme
d'Onde Formantique) can be considered a granular synthesis technique.
The grain waveforms are sinusoids with a decaying exponential
envelope. This waveform approximates the impulse response of a
second-order filter; it has a formantic spectral envelope. The overall
effect is obtained by the presence of various waveforms of this type.
The waveforms are repeated, synchronous to the pitch of the desired
sound. By varying the repetition period, the frequency of the sound
varies, whereas by varying the basic waveform, the spectral envelope
varies. With the use of FOF generators in parallel one can easily
describe arbitrary time-varying spectra in ways that are at once
simpler and more stable than the equivalent second-order filter bank
[RPB93].
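A rough FOF-style sketch with a single formant region (the formant
frequency, bandwidth, grain duration, and pitch are arbitrary example
values):

    import numpy as np

    fs = 44100

    def fof_grain(formant_freq, bandwidth, dur):
        """Sinusoid with a decaying exponential envelope: roughly the impulse
        response of a second-order resonant filter."""
        t = np.arange(int(dur * fs)) / fs
        return np.exp(-np.pi * bandwidth * t) * np.sin(2 * np.pi * formant_freq * t)

    f0 = 110.0                               # pitch of the synthesized sound
    period = int(fs / f0)                    # repetition period in samples
    g = fof_grain(600.0, 80.0, 0.03)         # one formant around 600 Hz
    out = np.zeros(fs)
    for start in range(0, len(out) - len(g), period):   # repeat synchronously with the pitch
        out[start:start + len(g)] += g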
Many synthesis techniques are based on the transformation of real
signals or signals that have been generated using one of the above
mentioned methods. A first set of transformations consists of linear
transformations. These transformations are based mainly on the use of
digital filters. According to the frequency response of the filter we
can vary the general trend of the spectrum of the input signal. Thus,
the output will combine temporal variations of the input and the
filter. When a particularly rich signal is used and the transfer
function of the filter has a very specific shape, this method of
generating sound is usually called subtractive synthesis or
source/filter synthesis. A common procedure is linear prediction,
which employs an impulse source or noise and a recursive filter
[Mak75]. A closely related technique is cross synthesis in
which the spectral evolution of one sound is imposed upon a second
sound. In the field of linear transformations we also find delay
lines and comb filters. With the combination of delay lines and
digital filters a reverberating effect can be obtained.
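A minimal subtractive-synthesis sketch, sending white noise through a
two-pole resonant filter (the resonance frequency, bandwidth, and the
rough gain normalization are assumed example values):

    import numpy as np

    fs = 44100
    noise = np.random.default_rng(0).standard_normal(fs)   # rich source: white noise
    f0, bw = 800.0, 50.0            # resonance frequency and bandwidth in Hz
    r = np.exp(-np.pi * bw / fs)    # pole radius derived from the bandwidth
    a1, a2 = 2 * r * np.cos(2 * np.pi * f0 / fs), -r * r
    y = np.zeros_like(noise)
    # two-pole resonator: y[n] = (1 - r) * x[n] + a1 * y[n-1] + a2 * y[n-2]
    for n in range(2, len(noise)):
        y[n] = (1 - r) * noise[n] + a1 * y[n - 1] + a2 * y[n - 2]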
The technique of wave shaping is probably the most applied non-linear
transformation. A linear filter can change the amplitude and phase of
a sinusoid, but not its waveform, whereas the aim of wave shaping is to
change the waveform. Wave shaping distorts the signal, introducing new
harmonics, but keeps the period of the signal unchanged. This property
is exploited to generate signals rich in harmonics from a simple
sinusoid. If the function F(x) describes the distortion, the input
signal x(n) is converted into the output y(n) = F(x(n)).
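For example, a cubic distortion function applied to a sinusoid (the
polynomial and the input amplitude are arbitrary example choices) keeps
the period but adds odd harmonics:

    import numpy as np

    fs = 44100
    t = np.arange(fs) / fs
    x = 0.8 * np.sin(2 * np.pi * 220 * t)     # simple sinusoid of amplitude 0.8
    F = lambda v: 1.5 * v - 0.5 * v ** 3      # distortion function F(x)
    y = F(x)                                  # same period as x, but richer in harmonics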
Frequency modulation (FM) has become one of the most widely used synthesis
techniques. The technique consists of modulating the instantaneous
frequency of a sinusoidal carrier according to the behavior of
another signal (modulator), which is usually sinusoidal. Several
variants exist using complex modulators or modulators in cascade.
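A basic FM sketch with a single sinusoidal modulator (the carrier and
modulator frequencies and the modulation index are arbitrary example
values):

    import numpy as np

    fs, dur = 44100, 1.0
    t = np.arange(int(fs * dur)) / fs
    fc, fm, index = 440.0, 220.0, 3.0        # carrier, modulator, modulation index
    y = np.sin(2 * np.pi * fc * t + index * np.sin(2 * np.pi * fm * t))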
2.1.3 Synthesis applications
Signal models try to capture the characteristics of the sound they
intend to produce. These characteristics describe the acoustical or
physical qualities of the sound. Many of the signal synthesis models
can be described in terms of basic units, such as oscillators,
filters, delay lines, put in cascade or in parallel. These basic
modules are often called unit-generators, a name due to Max
Mathews. He developed the first suite of synthesis programs, named
Music I to Music V [Mat69], [MMR74]. This
early work spawned numerous descendants. The term Music N is generally
used to refer to this extended family of programs.
In Music N unit-generators such as oscillators and random generators
generate streams of audio samples, while signal modifiers such as
filters, amplifiers, and modulators process these streams. Networks
of these unit-generators can be patched together. The resulting
network is called an instrument. The action of the instrument
is defined by the connectivity graph of the unit-generators as well as
by the acoustical parameters that control the action of the
unit-generators (parameters such as frequency and amplitude for
oscillators, for example). The activation of instruments is
controlled by note statements that bind the instrument, the
action times, and the acoustical parameters together. A score
is a collection of note statements and instrument definitions
[Loy89].
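The structure can be illustrated with a small sketch that is not Music
N code, only a Python analogy: unit-generators produce and modify
sample streams, an instrument is a fixed patch of them, and note
statements bind the instrument to action times and acoustical
parameters:

    import numpy as np

    fs = 44100

    def oscil(freq, dur):
        """Unit-generator: sine oscillator producing a stream of samples."""
        t = np.arange(int(fs * dur)) / fs
        return np.sin(2 * np.pi * freq * t)

    def env(sig, attack=0.01, release=0.1):
        """Unit-generator: applies a simple attack/release amplitude envelope."""
        a, r = int(attack * fs), int(release * fs)
        e = np.ones(len(sig))
        e[:a] = np.linspace(0.0, 1.0, a)
        e[-r:] = np.linspace(1.0, 0.0, r)
        return sig * e

    def instrument(freq, amp, dur):
        """An 'instrument': a fixed patch of unit-generators with acoustical parameters."""
        return amp * env(oscil(freq, dur))

    # a 'score': note statements binding instrument, action time, and parameters
    score = [(0.0, 440.0, 0.5, 1.0), (0.5, 660.0, 0.3, 1.0)]   # (start, freq, amp, dur)
    out = np.zeros(int(fs * 2.0))
    for start, freq, amp, dur in score:
        note = instrument(freq, amp, dur)
        i = int(start * fs)
        out[i:i + len(note)] += note             # merge the instrument outputs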
Music N creates instrument templates, instantiates them at the right
times, and binds the acoustical parameters to the unit-generators. It
also merges the outputs of the various instruments into the final
output. This output is stored into a file or sent to the sound output
device. In Music V several instances of an instrument can exist
simultaneously. Music V also generates a block of samples for each
unit-generator on each pass, instead of a single sample. This greatly
increases the efficiency.
Two other implementations of the Music N model are cmusic,
written by F.R. Moore, and Csound, written by Barry Vercoe
[Ver86]. Because both are written in C, these
implementations are available on a wide range of platforms.
The approach of signal processing networks is found in most
customizable synthesis environments, such as SynthBuilder
[PSS+98], and Max/FTS [Puc91b,Puc91a].
Both provide a graphical environment to construct patches. The
Max/FTS environment uses a message-passing approach to communicate
between the signal and control objects. Complex patches can be
created that describe the synthesis and the control of synthesis. A
patch can be in two states: in the first state the patch can be
edited; in the second state the patch is executed. The environment is
extensible and programmers can write external objects. Both of the
above mentioned systems are designed for real-time performance.
The CHANT program, developed by Xavier Rodet and his colleagues, offers
a model of vocal synthesis based on the formantic waveform synthesis,
explained above. Apart from a new synthesis technique they worked on
a better control over it. They observed that the Music N patch model
had its limitations: ``patch languages that exists are weak in their
ability to specify the elaborate control levels that resemble
interpretation by an instrumentalist, for example, expressiveness,
context-dependent decisions, timbre quality, intonation, stress and
nuances'' [RPB93]. To overcome this
they adopted a hierarchical synthesis-by-rule methodology. The method
of controlling the collection of FOF generators is through a large set of
global parameters that can be changed by the concurrent execution of
cooperating subroutines that implement the synthesis rules
corresponding to the vocal model under development. All the parameters
begin with default values, which, when executed, produce a normalized
vocal synthesis. Libraries of previously developed routines that
implement various rule sets are available for composers; as well,
there are facilities to create new rule sets and to modify and extend
existing ones. These libraries are called knowledge models of
different productions. It became clear that, as the cooperating rule
sets grew and became more complicated, representing the control flow
was turning more and more into an AI problem. The FORMES
language, based on Lisp, was developed for this purpose, and will be
described later.
Foo is a music composition environment developed and designed by Eckel
& González-Arroyo [EGA94]. Their research focused on composed sound which can be
fully integrated in a compositional process. A musical object will in
general be a complex object, decomposable into simpler elements,
organized under a logical structure. A music piece can be regarded as
a dynamic, compound structure, where behavioral laws and signal
processing patches combine. This structure can be viewed both as a
sound production entity and as a logical object, artistically
meaningful. They do not set a priori a boundary between the level of
sound object definition and that of its musical manipulation. They
wish to embed in one whole environment all actions from the
micro-control of the signal processing to the composition of the score
and search for new perspectives of relationship between sound matter,
musical material and form. Foo consists of two parts: a kernel layer
and a control layer. The kernel layer provides the necessary low level
abstractions to define and execute signal processing patches and is
written in Objective-C. The control layer consists of a set of Scheme
types and procedures for the creation, representation, and
manipulation of sound concepts and musical objects in general. Foo
allows the expression of temporal relationships between different
modules. The
two most important parameters of this time context are the time origin
and the duration.