2.1.1 Digitizing sound
An acoustical signal p(t) can be converted to an electrical signal
e(t) and vice versa using microphones and loudspeakers. The
electrical signal e(t) can be represented in a digital format as a
series of digitally coded integers s(n). To convert e(t) to a
digital signal, two steps are necessary. First, the signal is
sampled at fixed time intervals Ts. The value of the sampling
frequency Fs = 1 / Ts is determined by the highest frequency
component Fm present in the signal. According to the Nyquist
theorem, the sampling frequency should be at least twice the
maximum frequency contained in the signal: Fs > 2Fm. Second, the
values e(nTs) must be quantized to fit an integer value. The
simplest method is to round the values e(nTs) to the nearest
integer to obtain the series s(n). The digitization is realized by
an analog-to-digital converter (ADC). With a digital-to-analog
converter (DAC) the digital signal s(n) can be converted back to an
electrical signal and fed to a loudspeaker
[Mat69].
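As a brief illustration, the following Python sketch (using numpy; the
440 Hz sine and the 16-bit integer range are arbitrary example choices,
not taken from the text) samples a signal at Fs = 44100 Hz and rounds
the values to integers:

    import numpy as np

    Fs = 44100                     # sampling frequency, well above 2*Fm
    Ts = 1.0 / Fs                  # sampling interval
    Fm = 440.0                     # highest (and only) frequency component
    n = np.arange(Fs)              # one second of sample indices
    e = np.sin(2 * np.pi * Fm * n * Ts)          # e(nTs), the sampled signal
    s = np.round(e * 32767).astype(np.int16)     # quantize to 16-bit integers s(n)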
This correspondence between series of numbers and sounds allows us to
use digital methods and devices for the synthesis and manipulation of
sounds. Instead of using existing, digitized sounds, we can apply
digital synthesis techniques to create new, artificial sounds that do
not exist in nature. A whole new field of undiscovered sounds opens
up. The sound samples can be calculated out of real time, i.e. the
sound is computed in its entire length and played back once the
synthesis is finished. Alternatively, the sound can be generated in
real time: the synthesis and the playback are interleaved such that
the sound is heard at the time of the calculation.
More than one sound signal can be generated at the same time. If a
stereo signal is desired, two signals must be synthesized. In the
general case N signals can be generated. Each signal is sometimes
called a channel or a track. The most convenient and most common way
to store a multi-channel sound is to interleave the samples of the
distinct channels. We will call the set of N samples, one for each of
the N channels, a frame.
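A minimal sketch of this interleaved layout for a stereo (N = 2)
signal, assuming numpy and arbitrary test data, could look as follows:

    import numpy as np

    n_samples, n_channels = 1024, 2
    left = np.random.randn(n_samples).astype(np.float32)    # channel 0
    right = np.random.randn(n_samples).astype(np.float32)   # channel 1
    # interleaved storage: each frame holds one sample per channel -> L0 R0 L1 R1 ...
    frames = np.empty(n_samples * n_channels, dtype=np.float32)
    frames[0::n_channels] = left
    frames[1::n_channels] = right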
Since any audible sound can be represented as a series of sampled
values, theoretically any sound can be generated by the computer. The
digital representation of a sound, however, requires an enormous
amount of data, and it is therefore inconceivable to generate this
data by hand. To describe and facilitate the generation of sound,
synthesis models are used. These models take a number of control
parameters and produce a digitized sound signal.
Two main categories of models can be distinguished. The first
category models the sound signal and comes historically first. This
category is generally called signal modeling.
Signal models describe the acoustical structure of the sound. The
desired waveform is described using signal processing techniques such
as oscillators, filters, and amplifiers.
The second category models the acoustical causes of sound
generation. This category is generally called physical modeling.
Physical models start from a description of the mechanical causes that
produce the sound. These models describe the sound production
mechanisms and not the resulting waveform [Ris93],
[DP93].
There are different techniques to translate physical descriptions of
musical instruments into a digital synthesis technique. Three main
approaches can be distinguished: the wave-guide models
[Smi96,Smi97], the modal descriptions of
instruments [MA93,EIC95], and the mechanical
models [CLF93]. In this text we will only consider
the signal models.
2.1.2 Signal models
Among the signal models some of the well known techniques are
[DP93,Roa96]:
- Sampling,
- Wave table synthesis,
- Additive synthesis,
- Phase vocoder,
- Granular synthesis,
- Formantic waveform synthesis,
- Subtractive synthesis,
- Wave shaping,
- Frequency modulation.
The sampling technique stores concrete sounds, often recordings of
musical instruments, into tables. The synthesis consists of
reproducing the sound by reading the values in the table. The
frequency of the sound can be modified by changing the speed at which
the values in the table are read, and an amplitude curve can be
applied to change the dynamics and duration. The technique is very
simple but effective for producing rich sounds quickly.
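A minimal sketch of this idea in Python (using numpy; the stored
recording is replaced here by a generated sine, and the speed factor
and envelope are arbitrary example values):

    import numpy as np

    fs = 44100
    table = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)  # stand-in for a recorded sound
    speed = 1.5                                # read 1.5x faster: higher pitch, shorter sound
    idx = np.arange(0, len(table) - 1, speed)  # fractional read positions
    i = idx.astype(int)
    frac = idx - i
    out = (1 - frac) * table[i] + frac * table[i + 1]      # linear interpolation
    out *= np.linspace(1.0, 0.0, len(out))     # amplitude curve for dynamics and duration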
Like sampling, wave table synthesis reads the values stored in a
table to produce the sound. The table, however, contains only one
period of the waveform of the signal. It is a fast technique to
implement sine-wave, rectangular, or triangular wave oscillators
[Mat69]. The waveform stored in the table can have any
shape, and may vary during the synthesis.
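A sketch of such a table-lookup oscillator (the table size of 512
samples and the 440 Hz output frequency are assumptions made for the
example):

    import numpy as np

    fs = 44100
    table = np.sin(2 * np.pi * np.arange(512) / 512)   # one period of the waveform
    freq, dur = 440.0, 1.0
    phase_inc = freq * len(table) / fs                 # table positions advanced per sample
    phase = (phase_inc * np.arange(int(fs * dur))) % len(table)
    out = table[phase.astype(int)]                     # nearest-sample table lookup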
In additive synthesis, complex sounds are produced by the
superposition of elementary sounds. The goal is for the constituent
sounds to fuse together, and the result to be perceived as a single
sound. Any almost periodic sound can be approximated as a sum of
sinusoids, each sinusoid controlled in frequency and amplitude.
Additive synthesis provides great generality, making it possible to
produce almost any sound. A problem arises, however, because of the
large amount of
data to be specified for each note: two control functions must be
specified for each component. The necessary data can be deduced from
frequency analysis techniques such as the Fourier transform.
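A minimal additive-synthesis sketch, assuming ten harmonic partials
with simple decaying amplitude envelopes (the envelopes and the number
of partials are arbitrary, not derived from an analysis):

    import numpy as np

    fs, dur, f0 = 44100, 1.0, 220.0
    t = np.arange(int(fs * dur)) / fs
    out = np.zeros_like(t)
    for k in range(1, 11):
        amp = (1.0 / k) * np.exp(-3.0 * t)            # amplitude control function of partial k
        out += amp * np.sin(2 * np.pi * k * f0 * t)   # frequency control: fixed at k * f0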
The phase vocoder is an analysis/synthesis technique based upon the
Fourier transform. It is similar to additive synthesis; the phase
vocoder, however, does not require any hypothesis about the analyzed
signal apart from its slow variation. Both additive synthesis and the
phase vocoder separate amplitude and frequency information in time. This
separation allows for such transformations as changing the duration of
the sound without changing its frequency contents, or transposing the
signal but keeping its duration constant [Moo78].
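A much-simplified phase-vocoder time stretch may help make this
concrete; the following is only a sketch (fixed Hann window, analyzed
magnitudes kept as-is, phase advance rescaled to the synthesis hop),
not a complete implementation:

    import numpy as np

    def stretch(x, factor, n_fft=1024, hop_a=256):
        """Change the duration of x by 'factor' without changing its pitch."""
        hop_s = int(round(hop_a * factor))               # synthesis hop size
        win = np.hanning(n_fft)
        n_frames = 1 + (len(x) - n_fft) // hop_a
        bins = np.arange(n_fft // 2 + 1)
        omega = 2 * np.pi * bins * hop_a / n_fft         # expected phase advance per hop
        prev_phase = np.zeros(len(bins))
        acc_phase = np.zeros(len(bins))
        y = np.zeros(n_fft + hop_s * n_frames)
        for t in range(n_frames):
            spec = np.fft.rfft(x[t * hop_a : t * hop_a + n_fft] * win)
            mag, phase = np.abs(spec), np.angle(spec)
            dphi = phase - prev_phase - omega            # deviation from the expected advance
            dphi = (dphi + np.pi) % (2 * np.pi) - np.pi  # wrap to [-pi, pi)
            acc_phase += (omega + dphi) * (hop_s / hop_a)   # rescale the true advance
            prev_phase = phase
            frame = np.fft.irfft(mag * np.exp(1j * acc_phase)) * win
            y[t * hop_s : t * hop_s + n_fft] += frame    # overlap-add at the synthesis hop
        return y

For example, stretch(x, 2.0) yields a signal roughly twice as long with
the same frequency contents.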
Granular synthesis starts from the idea of dividing the sound in time
into a sequence of simple elements called grains. The parameters of
this technique are the waveform of the grain, its duration, and its
temporal location. The first type of granular synthesis consists of
using, as grain waveform, the waveform of real sound segments and then
using the grain waveforms in synthesis in a different order or at
various times. This method is reminiscent of the synthesis of sampled
sounds, except that in this case the elements are no longer complete
sounds but fragments of them. The second type consists of using
waveforms such as frequency-modulated Gaussian functions, which have
the property of localizing the energy in the time-frequency plane.
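A sketch of the second type, with sinusoidal grains under a Gaussian
envelope scattered at random times and frequencies (all numerical
values are arbitrary example choices):

    import numpy as np

    fs = 44100
    rng = np.random.default_rng(0)

    def grain(freq, dur):
        """Sinusoidal grain under a Gaussian envelope, localized in time and frequency."""
        t = np.arange(int(dur * fs)) / fs
        env = np.exp(-0.5 * ((t - dur / 2) / (dur / 6)) ** 2)
        return env * np.sin(2 * np.pi * freq * t)

    out = np.zeros(fs)                                # one second of output
    for _ in range(200):                              # 200 grains at random positions
        g = grain(rng.uniform(200.0, 2000.0), 0.05)
        start = rng.integers(0, len(out) - len(g))
        out[start:start + len(g)] += g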
Formantic waveform synthesis (abbreviated FOF, from the French Forme
d'Onde Formantique) can be considered a granular synthesis technique.
The grain waveforms are sinusoids with a decaying exponential
envelope. This waveform approximates the impulse response of a
second-order filter; it has a formantic spectral envelope. The overall
effect is obtained by the presence of various waveforms of this type.
The waveforms are repeated, synchronous to the pitch of the desired
sound. By varying the repetition period, the frequency of the sound
varies, whereas by varying the basic waveform, the spectral envelope
varies. With the use of FOF generators in parallel one can easily
describe arbitrary time-varying spectra in ways that are at once
simpler and more stable than the equivalent second-order filter bank
[RPB93].
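A rough FOF-style sketch with a single formant region (the formant
frequency, bandwidth, grain duration, and pitch are arbitrary example
values):

    import numpy as np

    fs = 44100

    def fof_grain(formant_freq, bandwidth, dur):
        """Sinusoid with a decaying exponential envelope: roughly the impulse
        response of a second-order resonant filter."""
        t = np.arange(int(dur * fs)) / fs
        return np.exp(-np.pi * bandwidth * t) * np.sin(2 * np.pi * formant_freq * t)

    f0 = 110.0                               # pitch of the synthesized sound
    period = int(fs / f0)                    # repetition period in samples
    g = fof_grain(600.0, 80.0, 0.03)         # one formant around 600 Hz
    out = np.zeros(fs)
    for start in range(0, len(out) - len(g), period):   # repeat synchronously with the pitch
        out[start:start + len(g)] += g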
Many synthesis techniques are based on the transformation of real
signals or signals that have been generated using one of the above
mentioned methods. A first set of transformations consists of linear
transformations. These transformations are based mainly on the use of
digital filters. According to the frequency response of the filter we
can vary the general trend of the spectrum of the input signal. Thus,
the output will combine temporal variations of the input and the
filter. When a particularly rich signal is used and the transfer
function of the filter has a very specific shape, this method of
generating sound is usually called subtractive synthesis or
source/filter synthesis. A common procedure is linear prediction,
which employs an impulse source or noise and a recursive filter
[Mak75]. A closely related technique is cross synthesis in
which the spectral evolution of one sound is imposed upon a second
sound. In the field of linear transformations we also find delay
lines and comb filters. With the combination of delay lines and
digital filters a reverberating effect can be obtained.
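A minimal subtractive-synthesis sketch, sending white noise through a
two-pole resonant filter (the resonance frequency, bandwidth, and the
rough gain normalization are assumed example values):

    import numpy as np

    fs = 44100
    noise = np.random.default_rng(0).standard_normal(fs)   # rich source: white noise
    f0, bw = 800.0, 50.0            # resonance frequency and bandwidth in Hz
    r = np.exp(-np.pi * bw / fs)    # pole radius derived from the bandwidth
    a1, a2 = 2 * r * np.cos(2 * np.pi * f0 / fs), -r * r
    y = np.zeros_like(noise)
    # two-pole resonator: y[n] = (1 - r) * x[n] + a1 * y[n-1] + a2 * y[n-2]
    for n in range(2, len(noise)):
        y[n] = (1 - r) * noise[n] + a1 * y[n - 1] + a2 * y[n - 2]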
The technique of wave shaping is probably the most applied non-linear
transformation. A linear filter can change the amplitude and phase of
a sinusoid, but not its waveform, whereas the aim of wave shaping is to
change the waveform. Wave shaping distorts the signal, introducing new
harmonics, but keeps the period of the signal unchanged. This property
is exploited to generate signals rich in harmonics from a simple
sinusoid. If the function F(x) describes the distortion, the input
signal x(n) is converted into the output y(n) = F(x(n)).
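For example, a cubic distortion function applied to a sinusoid (the
polynomial and the input amplitude are arbitrary example choices) keeps
the period but adds odd harmonics:

    import numpy as np

    fs = 44100
    t = np.arange(fs) / fs
    x = 0.8 * np.sin(2 * np.pi * 220 * t)     # simple sinusoid of amplitude 0.8
    F = lambda v: 1.5 * v - 0.5 * v ** 3      # distortion function F(x)
    y = F(x)                                  # same period as x, but richer in harmonics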
Frequency modulation (FM) has become one of the most widely used synthesis
techniques. The technique consists of modulating the instantaneous
frequency of a sinusoidal carrier according to the behavior of
another signal (modulator), which is usually sinusoidal. Several
variants exist using complex modulators or modulators in cascade.
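A basic FM sketch with a single sinusoidal modulator (the carrier and
modulator frequencies and the modulation index are arbitrary example
values):

    import numpy as np

    fs, dur = 44100, 1.0
    t = np.arange(int(fs * dur)) / fs
    fc, fm, index = 440.0, 220.0, 3.0        # carrier, modulator, modulation index
    y = np.sin(2 * np.pi * fc * t + index * np.sin(2 * np.pi * fm * t))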
2.1.3 Synthesis applications
Signal models try to capture the characteristics of the sound they
intend to produce. These characteristics describe the acoustical or
physical qualities of the sound. Many of the signal synthesis models
can be described in terms of basic units, such as oscillators,
filters, delay lines, put in cascade or in parallel. These basic
modules are often called unit-generators, a name due to Max
Mathews. He developed the first suite of synthesis programs, named
Music I to Music V [Mat69], [MMR74]. This
early work spawned numerous descendants. The term Music N is generally
used to refer to this extended family of programs.
In Music N unit-generators such as oscillators and random generators
generate streams of audio samples, while signal modifiers such as
filters, amplifiers, and modulators process these streams. Networks
of these unit-generators can be patched together. The resulting
network is called an instrument. The action of the instrument
is defined by the connectivity graph of the unit-generators as well as
by the acoustical parameters that control the action of the
unit-generators (parameters such as frequency and amplitude for
oscillators, for example). The activation of instruments is
controlled by note statements that bind the instrument, the
action times, and the acoustical parameters together. A score
is a collection of note statements and instrument definitions
[Loy89].
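The structure can be illustrated with a small sketch that is not Music
N code, only a Python analogy: unit-generators produce and modify
sample streams, an instrument is a fixed patch of them, and note
statements bind the instrument to action times and acoustical
parameters:

    import numpy as np

    fs = 44100

    def oscil(freq, dur):
        """Unit-generator: sine oscillator producing a stream of samples."""
        t = np.arange(int(fs * dur)) / fs
        return np.sin(2 * np.pi * freq * t)

    def env(sig, attack=0.01, release=0.1):
        """Unit-generator: applies a simple attack/release amplitude envelope."""
        a, r = int(attack * fs), int(release * fs)
        e = np.ones(len(sig))
        e[:a] = np.linspace(0.0, 1.0, a)
        e[-r:] = np.linspace(1.0, 0.0, r)
        return sig * e

    def instrument(freq, amp, dur):
        """An 'instrument': a fixed patch of unit-generators with acoustical parameters."""
        return amp * env(oscil(freq, dur))

    # a 'score': note statements binding instrument, action time, and parameters
    score = [(0.0, 440.0, 0.5, 1.0), (0.5, 660.0, 0.3, 1.0)]   # (start, freq, amp, dur)
    out = np.zeros(int(fs * 2.0))
    for start, freq, amp, dur in score:
        note = instrument(freq, amp, dur)
        i = int(start * fs)
        out[i:i + len(note)] += note             # merge the instrument outputs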
Music N creates instrument templates, instantiates them at the right
times, and binds the acoustical parameters to the unit-generators. It
also merges the outputs of the various instruments into the final
output. This output is stored into a file or sent to the sound output
device. In Music V several instances of an instrument can exist
simultaneously. Music V also generates a block of samples for each
unit-generator on each pass, instead of a single sample. This greatly
increases the efficiency.
Two other implementations of the Music N model are cmusic,
written by F.R. Moore, and Csound, written by Barry Vercoe
[Ver86]. Because both are written in C, these
implementations are available on a wide range of platforms.
The approach of signal processing networks is found in most
customizable synthesis environments, such as SynthBuilder
[PSS+98], and Max/FTS [Puc91b,Puc91a].
Both provide a graphical environment to construct patches. The
Max/FTS environment uses a message-passing approach to communicate
between the signal and control objects. Complex patches can be
created that describe the synthesis and the control of synthesis. A
patch can be in two states: in the first state the patch can be
edited; in the second state the patch is executed. The environment is
extensible and programmers can write external objects. Both of the
above mentioned systems are designed for real-time performance.
The CHANT program, developed by Xavier Rodet and his colleagues, offers
a model of vocal synthesis based on the formantic waveform synthesis,
explained above. Apart from a new synthesis technique they worked on
a better control over it. They observed that the Music N patch model
had its limitations: ``patch languages that exists are weak in their
ability to specify the elaborate control levels that resemble
interpretation by an instrumentalist, for example, expressiveness,
context-dependent decisions, timbre quality, intonation, stress and
nuances'' [RPB93]. To overcome this
they adopted a hierarchical synthesis-by-rule methodology. The method
of controlling the collection of FOF generators is through a large set of
global parameters that can be changed by the concurrent execution of
cooperating subroutines that implement the synthesis rules
corresponding to the vocal model under development. All the parameters
begin with default values, which, when executed, produce a normalized
vocal synthesis. Libraries of previously developed routines that
implement various rule sets are available for composers; as well,
there are facilities to create new rule sets and to modify and extend
existing ones. These libraries are called knowledge models of
different productions. It became clear that, as the cooperating rule
sets grew and became more complicated, representing the control flow
was turning more and more into an AI problem. The FORMES
language, based on Lisp, was developed for this purpose, and will be
described later.
Foo is a music composition environment developed and designed by Eckel
& González-Arroyo [EGA94]. Their research focused on composed sound which can be
fully integrated in a compositional process. A musical object will in
general be a complex object, decomposable into simpler elements,
organized under a logical structure. A music piece can be regarded as
a dynamic, compound structure, where behavioral laws and signal
processing patches combine. This structure can be viewed both as a
sound production entity and as a logical object, artistically
meaningful. They do not set a priori a boundary between the level of
sound object definition and that of its musical manipulation. They
wish to embed in one whole environment all actions from the
micro-control of the signal processing to the composition of the score
and search for new perspectives of relationship between sound matter,
musical material and form. Foo consists of two parts: a kernel layer
and a control layer. The kernel layer provides the necessary low level
abstractions to define and execute signal processing patches and is
written in Objective-C. The control layer consists of a set of Scheme
types and procedures for the creation, representation, and
manipulation of sound concepts and musical objects in general. Foo
allows the expression of temporal relationships between different
modules. The
two most important parameters of this time context are the time origin
and the duration.