Detection and modeling of fast attack transients
(to be presented at ICMC 2001, Cuba Sept 2001)
Xavier Rodet & Florent Jaillet. Ircam, Paris
The term attack transient does not have a precise definition. It corresponds to the beginning of notes produced by
an instrument. Attacks, as they are called here, are zones of short duration (a few ms) and fast variation of
the sound signal, with an abrupt increase in short-time energy that is distributed over the whole spectrum and is
particularly noticeable in the high frequencies, since the energy is otherwise usually concentrated in the low ones.
There are many motivations for attack detection and modeling. Improving general analysis techniques is one of
them. Classical additive analysis of a sound, for instance, does not preserve the spectral richness of attacks in the
resynthesized sound. Moreover, a visible pre-echo appears right before the attacks of the resynthesized sound
since, when a window extends over an attack, sinusoids are detected at a time when they are not yet present in the
signal. Therefore, detection of attacks is also necessary to synchronize the analysis windows so that they do not
overlap attacks.
Review of detection methods
The attack detection and modeling method developed in this research:
- Should not use additive analysis results, in order to be usable for other purposes (segmentation, instrument
- Should succeed on every type of sound (particularly polyphonic sounds) with good time accuracy.
- Should be simple to use: analysis parameters should be adjusted automatically as far as possible.
Detection of attacks based on a time-frequency representation
According to the previous objectives, the Short-Time Fourier Transform (STFT) has been chosen, in particular
because of the low computational load permitted by the FFT. The Fast Wavelet Transform could have been used as
well. Let us call |X(k, f)| the magnitude of the STFT at sample k and frequency f. It is computed from the sound
signal on a window of size N and with a step size S.
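The computation of |X(k, f)| can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the function name stft_magnitude, the Hann window and the default values of N and S are assumptions. Frame i of the result corresponds to sample k = i * S.

```python
import numpy as np

def stft_magnitude(x, N=1024, S=256):
    """Return |X(k, f)| as an array of shape (n_frames, N // 2 + 1).

    N is the window size and S the step size, both in samples.
    The Hann window is an assumption; the text does not name a window.
    """
    x = np.asarray(x, dtype=float)
    if len(x) < N:
        x = np.pad(x, (0, N - len(x)))  # zero-pad very short signals
    window = np.hanning(N)
    n_frames = 1 + (len(x) - N) // S
    frames = np.stack([x[i * S : i * S + N] * window for i in range(n_frames)])
    # rfft keeps only the non-negative frequencies of the real signal
    return np.abs(np.fft.rfft(frames, axis=1))
```

A band signal |Xf(k)| is then one column of the result, for example `stft_magnitude(x)[:, f]`.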
Construction of the observation function
For the purpose of detection, the definition of an attack adopted in this work is an area of short duration of
the STFT in which marked energy peaks appear in several frequency bands.
Examining the energy in one frequency band, i.e. fixing f to some value, leads to a signal |Xf(k)| in which peaks
of short duration are sought. The signal |Xf(k)| is then studied in observation windows Wm of length K at
locations m. A peak is assumed to occur in an observation window when |Xf(k)| shows a triangular shape with
a high maximum above the plateaus that precede and follow it.
Therefore, in window Wm, the next step is to approximate |Xf(k)| by such a triangular function and to measure
its height. To keep the computational load low, we avoid classical optimal estimation. Instead, the maximum of a
possible peak is taken to be the maximum value M of |Xf(k)| in Wm, and the edges of the triangle are easily estimated.
Calling Mb (respectively Ma) the mean of |Xf(k)| in the window Wm before (respectively after) the triangle, an
indicator function is computed as (except for some special cases):
I(f,m)=( M-Mb + M-Ma ) / ( Mb+Ma )
I(f,m) takes large values when there is a large peak in the window Wm. The center of gravity of the triangle
is chosen as the precise instant of the attack.
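The indicator can be illustrated for a single observation window as below. This is a sketch under stated assumptions, not the authors' procedure. The text does not say how the triangle edges are estimated, so the rule used here (the triangle extends until the signal falls below drop * M) is hypothetical. The handling of special cases and the small constant eps are also assumptions.

```python
import numpy as np

def peak_indicator(segment, drop=0.5, eps=1e-12):
    """Return (I, centre) for one observation window of |Xf(k)|.

    Hypothetical edge rule: on each side of the maximum M, the triangle
    ends at the first sample that falls below drop * M.
    centre is the centre of gravity of the triangle, in samples from the
    start of the window, or None when no peak is measured.
    """
    seg = np.asarray(segment, dtype=float)
    p = int(np.argmax(seg))
    M = seg[p]
    left = p
    while left > 0 and seg[left] >= drop * M:
        left -= 1
    right = p
    while right < len(seg) - 1 and seg[right] >= drop * M:
        right += 1
    # Assumed special case: the peak touches a window border, so there is
    # no plateau on one side and no indicator is computed.
    if seg[left] >= drop * M or seg[right] >= drop * M:
        return 0.0, None
    Mb = seg[: left + 1].mean()  # mean before the triangle
    Ma = seg[right:].mean()      # mean after the triangle
    I = ((M - Mb) + (M - Ma)) / (Mb + Ma + eps)  # eps avoids division by zero
    centre = (left + p + right) / 3.0  # centroid of the three triangle vertices
    return I, centre
```

For a window such as `[1, 1, 1, 1, 10, 1, 1, 1, 1]`, both plateaus have mean 1 and the peak height is 10, so I is close to 9 and the centre falls on sample 4.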
Selection of aggregates and final decision
A threshold Td is then applied to I(f,m), leading to a thresholded observation function J(f,m).
Non-zero values of J(f,m) indicate peaks in the STFT. Then the areas of the STFT in which several peaks appear at
close temporal and frequency positions are aggregated as one attack.
The weight of an aggregate is the weighted sum of the values of J(f,m) in the aggregate. Only the aggregates
whose weight is higher than a given threshold are preserved and considered to be detected attacks.
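Thresholding and aggregation can be sketched as follows. Several points are assumptions made for illustration: J(f,m) is taken to be I(f,m) where I exceeds Td and zero elsewhere, "close" positions are taken to be the 8 neighbours on the (f, m) grid, and the names detect_aggregates, Tw and freq_weights are hypothetical.

```python
import numpy as np

def detect_aggregates(I, Td, Tw, freq_weights=None):
    """Threshold I(f, m) and group neighbouring peaks into aggregates.

    I has shape (n_freqs, n_windows). Returns a list of
    (weight, [(f, m), ...]) for the aggregates whose weight exceeds Tw.
    """
    I = np.asarray(I, dtype=float)
    J = np.where(I > Td, I, 0.0)  # assumed form of the thresholded function
    if freq_weights is None:
        freq_weights = np.ones(I.shape[0])
    seen = np.zeros(J.shape, dtype=bool)
    kept = []
    for f0, m0 in zip(*np.nonzero(J)):
        if seen[f0, m0]:
            continue
        # Flood fill over the 8 neighbours to collect one aggregate
        seen[f0, m0] = True
        stack, cells = [(f0, m0)], []
        while stack:
            f, m = stack.pop()
            cells.append((int(f), int(m)))
            for df in (-1, 0, 1):
                for dm in (-1, 0, 1):
                    g, n = f + df, m + dm
                    if (0 <= g < J.shape[0] and 0 <= n < J.shape[1]
                            and J[g, n] != 0 and not seen[g, n]):
                        seen[g, n] = True
                        stack.append((g, n))
        # Weight of the aggregate: frequency-weighted sum of J over its cells
        weight = sum(freq_weights[f] * J[f, m] for f, m in cells)
        if weight > Tw:
            kept.append((float(weight), sorted(cells)))
    return kept
```

With uniform weights, an isolated peak only survives if its own value exceeds Tw, whereas a cluster of moderate peaks in neighbouring bands can pass together.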
Data base and choice of parameter values
A data base of 70 recordings of various types has been assembled to test the detection algorithm. It is not large
enough for statistically significant results, but a larger data base would require much time for hand labeling of attacks.
Optimal parameter values have been found to be rather dependent on the type of sound (polyphonic or not, clear/soft
attacks...). The tests nevertheless made it possible to determine parameter ranges that give good results.
Weighting according to frequency
The positions of the non-zero values of J(f,m) are compared to attack marks placed by hand. For a given frequency,
a non-zero value occurring within less than 10 ms of an attack mark is considered a good detection. Otherwise,
it is considered a false alarm. The rates of good detection and false alarm are calculated for each frequency and for
various threshold values. It appears that medium frequency bands give more reliable information than the others and
that the lowest frequency bands cause a great number of false detections. Weights according to frequency are adjusted
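The comparison with hand-placed marks for one frequency band can be sketched as below. The 10 ms tolerance comes from the text. The exact definition of the two rates is an assumption: the good detection rate is taken here as the fraction of marks with at least one detection nearby, and the false alarm rate as the fraction of detections with no mark nearby. The function name is hypothetical.

```python
import numpy as np

def detection_rates(detections, marks, tol=0.010):
    """Return (good detection rate, false alarm rate) for one band.

    detections: times in seconds of the non-zero values of J(f, m).
    marks: times in seconds of the hand-placed attack marks.
    tol: tolerance in seconds (10 ms in the text).
    """
    detections = np.asarray(detections, dtype=float)
    marks = np.asarray(marks, dtype=float)
    # near[i, j] is True when detection i lies within tol of mark j
    near = np.abs(detections[:, None] - marks[None, :]) < tol
    good = float(near.any(axis=0).mean()) if marks.size else 0.0
    false = float((~near.any(axis=1)).mean()) if detections.size else 0.0
    return good, false
```

Running this for each frequency and for several values of Td gives the kind of per-band figures from which frequency weights could be chosen.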
Time-frequency representation and reconstruction of attack transients
For each detected attack, the aggregate, i.e. a subset of the STFT, is the time-frequency representation of the
attack. The complex STFT is used here in order to exactly reconstruct the attack signal. For example, the
reconstructed attack signal can then be subtracted from the original signal to remove the attack in a recording.
This time-frequency representation is an STFT which is zero everywhere, except in the aggregate where it is equal
to the original STFT. Consequently, the method of Griffin and Lim is applied to compute an optimal
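A masked STFT is in general not the STFT of any signal, which is why a least-squares estimate is needed. The sketch below uses the closed-form least-squares inverse of Griffin and Lim, in which windowed inverse FFT frames are overlap-added and divided by the sum of squared windows. That this is the variant the authors used is an assumption. The function names, the Hann window, the default N and S, and the boolean mask representation of the aggregate are hypothetical too.

```python
import numpy as np

def stft(x, N, S, window):
    """Complex STFT, one row per frame."""
    n_frames = 1 + (len(x) - N) // S
    return np.stack([np.fft.rfft(x[i * S : i * S + N] * window)
                     for i in range(n_frames)])

def ls_inverse_stft(Y, N, S, window):
    """Least-squares signal estimate from a possibly modified STFT Y."""
    n_frames = Y.shape[0]
    length = N + (n_frames - 1) * S
    num = np.zeros(length)
    den = np.zeros(length)
    for i in range(n_frames):
        y = np.fft.irfft(Y[i], n=N)
        num[i * S : i * S + N] += window * y   # windowed overlap-add
        den[i * S : i * S + N] += window ** 2  # sum of squared windows
    return num / np.maximum(den, 1e-12)        # guard samples with zero window

def reconstruct_attack(x, mask, N=512, S=128):
    """Keep the STFT only where mask is True, then invert it."""
    window = np.hanning(N)
    X = stft(np.asarray(x, dtype=float), N, S, window)
    return ls_inverse_stft(np.where(mask, X, 0), N, S, window)
```

Subtracting `reconstruct_attack(x, mask)` from `x` then gives a version of the recording with the attack removed, as described above. With a mask that is True everywhere, the original signal is recovered except at the first and last samples, where the Hann window is zero.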
Adjustment of the size of reconstructed attacks
When reconstruction is done from the aggregates formed during detection, reconstructed attacks are of very short
duration. Indeed, to avoid spurious detection, the detection threshold Td is rather high. Therefore, the
reconstruction aggregate is defined with a reconstruction threshold Tr < Td. Adjustment of Tr allows user control
of the size of the reconstructed attack. A further improvement uses the Resonance Modeling analysis
technique to better estimate resonance modes which constitute the attack signal.
Implementation and graphical user interface
A detection, modeling and reconstruction program has been implemented. Its GUI facilitates usage according to
user needs. It allows visualization of STFTs (sonagram), observation functions, aggregates, detected attacks and
original and reconstructed sound signals. It also allows the user to adjust parameter values according to sound or
visual results. Detected attacks' instants and reconstructed attacks are stored in an SDIF file using the marks type
and the time domain samples or the STFT type. This program has been applied to some of the sounds of the
Sound Analysis and Synthesis Panel at ICMC2000, largely improving resynthesis. Other examples will be
proposed at the conference. For instance, a performance of Indian Sarod (strings) and Tabla (percussion) has been
analyzed. Tabla attacks were all correctly detected and modeled to isolate the tabla part.
A large data base of sounds and a systematic study of transients would provide a priori information (probability
distributions) on attacks and make it possible to improve parameter values, to refine the shape of the approximation
function and to optimize weightings according to energy and frequency bands.