This page documents work on real-time audio mosaicking using the MuBu Max/MSP modules and a set of FTM & Co based Max/MSP audio analysis patches.
Audio mosaicking refers to the process of recomposing the temporal evolution of a given target audio file from segments cut out of source audio materials.
In the following, we present different variants of audio mosaicking in separate sub-sections:
1. MFCC Based Frame by Frame Audio Mosaicking
2. Segment Based Audio Mosaicking
3. Gesture Driven Audio Mosaicking
Most of the documented work has been developed in collaboration with the composer Marco Antonio Suarez Cifuentes and has been used in some of his pieces such as Poetry for // dark -/ dolls, Caméléon Kaléidoscope, and Plis. The examples given on this page are meant as technical demonstrations and do not directly relate to Marco Suarez' musical work, even if they partly use a few sound extracts from instrumental parts of his pieces. For more information about the related musical work, please refer to the documentation at Brahms or on Marco Suarez' web site.
The majority of the examples below use the following 17-second extract of recorded speech in French as the target:
target audio stream (0:17) |
Consider listening to this audio file several times in order to memorize it.
The first audio mosaicking variant we implemented uses a frame by frame extraction of mel-frequency cepstral coefficients (MFCC).
For the source materials, the coefficients are extracted offline and stored in a Max/MSP MuBu container. The target can be any incoming audio stream that is analyzed in real-time using the same algorithm as the offline analysis of the source materials. For each analyzed target frame, a corresponding source frame is selected by minimizing a weighted Euclidean distance between the target and source MFCC values. The search for the closest source frame is performed by the MuBu k-nearest-neighbours (KNN) module (mubu.knn) based on a KD-tree representation (see the SMC 2009 publication).
All of the following examples use the first coefficients (about 8) of the 23 coefficients calculated over the whole audio spectrum. The frame size (FFT size) is 46.44 ms and the frame period (hop size) is 11.61 ms. These parameters provide sufficient temporal resolution to represent the articulation of the voice file as well as sufficiently long synthesis frames to preserve some of the original texture of the source materials.
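The actual patches run in Max/MSP using the MuBu analysis modules and mubu.knn; purely to illustrate the matching principle, here is a minimal offline sketch in Python, assuming the librosa and scipy libraries (the file names and the simple overlap-add resynthesis are placeholders, not the method used in the patches):

```python
# Minimal offline sketch of MFCC based frame by frame mosaicking.
import numpy as np
import librosa
from scipy.spatial import cKDTree

SR = 44100
N_FFT = 2048          # ~46.44 ms at 44.1 kHz (frame / FFT size)
HOP = 512             # ~11.61 ms at 44.1 kHz (frame period / hop size)
N_MFCC = 23           # coefficients computed over the whole spectrum
N_USED = 8            # only the first coefficients are used for matching

def mfcc_frames(path):
    y, sr = librosa.load(path, sr=SR, mono=True)
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC, n_fft=N_FFT, hop_length=HOP)
    return y, m.T  # one row per analysis frame

# offline analysis of the source material (stored in a MuBu container in Max)
source_audio, source_mfcc = mfcc_frames("source.wav")
tree = cKDTree(source_mfcc[:, :N_USED])  # KD-tree over the first coefficients

# "real-time" target analysis, here simply done offline frame by frame
target_audio, target_mfcc = mfcc_frames("target.wav")
_, nearest = tree.query(target_mfcc[:, :N_USED])  # closest source frame per target frame

# naive resynthesis: overlap-add the selected source frames at the target frame rate
out = np.zeros(len(target_mfcc) * HOP + N_FFT)
window = np.hanning(N_FFT)
for i, j in enumerate(nearest):
    grain = source_audio[j * HOP:j * HOP + N_FFT]
    out[i * HOP:i * HOP + len(grain)] += window[:len(grain)] * grain
```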
extract(s) of source file | synthesized sound |
farm dogs (2:42) | |
bubbles (0:25) | |
In the above examples of MFCC based frame by frame audio mosaicking, we actually normalize the MFC coefficients of the source materials and the target audio stream before calculating the distance in the matching process. The normalization uses the mean and standard deviation of the coefficients and thus adapts the description space of the target frames to the description space of the source frames. While without normalization the overall timbre of the resulting sound is closer to the target frames, the normalization adapts the range of the timbre of the target stream to the range of the source materials. This way the resulting sound better represents the overall timbral characteristics of the source material while still following the temporal evolution of the target stream.
When the target is not completely known in advance, the normalization can be calculated using a few seconds of audio that are representative of the content of the real-time stream.
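Continuing the sketch above, the normalization can be expressed as standardizing each description space before building and querying the KD-tree (again only an offline illustration, under the same assumptions as before):

```python
# Normalized matching: each description space is standardized by its own
# mean and standard deviation before the KD-tree search.
def normalize(frames):
    mean = frames.mean(axis=0)
    std = frames.std(axis=0) + 1e-9   # avoid division by zero
    return (frames - mean) / std

tree_norm = cKDTree(normalize(source_mfcc)[:, :N_USED])
# when the target is not known in advance, its statistics would be estimated
# from a few representative seconds of the real-time stream
_, nearest_norm = tree_norm.query(normalize(target_mfcc)[:, :N_USED])
```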
The following example uses a downmixed version of György Ligeti's electronic piece Artikulation (3:54) as source material and the voice recording from above as target.
The first version uses unnormalized coefficients and the resulting timbre is closer to the original voice recording.
unnormalised matching |
The second version normalizes the coefficients in the matching process. The resulting sound is still clearly related to the original voice recording, but its overall timbre better represents the audio content of the source material.
normalised matching |
The following series of examples is based on the same source audio file (Artikulation) as the previous examples.
In the different versions, the number of MFC coefficients used for unnormalized matching varies from only the first coefficient to all 23 coefficients. The increasing number of coefficients can be interpreted as representing the spectral envelope in more and more detail and thus approximating the timbre of the target audio stream with more and more precision.
While the first MFCC represents only the energy of the current frame, all 23 coefficients represent the details of the spectral envelope.
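In the offline sketch above, this comparison simply corresponds to restricting the matching to the first k coefficients:

```python
# Unnormalized matching repeated with an increasing number of MFC coefficients,
# reusing the source/target analysis of the first sketch.
for k in (1, 2, 4, 8, 16, 23):
    tree_k = cKDTree(source_mfcc[:, :k])
    _, nearest_k = tree_k.query(target_mfcc[:, :k])
    # nearest_k then drives the same overlap-add resynthesis as before
```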
number of coefficients | synthesized sound |
1 coefficient | |
2 coefficients | |
4 coefficients | |
8 coefficients | |
16 coefficients | |
23 coefficients | |
Note that the representation of the timbral details by the selected source frames depends not only on the number of coefficients, but also on the availability of a frame matching the current target frame in the given source material.
For comparison, here is also an example with all 23 coefficients using normalized matching:
number of coefficients | synthesized sound |
23 coefficients | |
A last series of examples of MFCC based frame by frame audio mosaicking with normalized matching is based on recordings of solo instruments performing extracts of the score of Caméléon Kaléidoscope by Marco Antonio Suarez Cifuentes. These source materials will also be used in further examples of segment based mosaicking below.
The given extracts of the source audio recordings contain only the first few seconds of the original files, which have a total duration of about 2 minutes each. Nevertheless, they should give an approximate idea of the timbre and texture of these materials.
extract of source file | synthesized sound |
bass (2:04) | |
flute (1:45) | |
trombone (2:01) | |
violin (2:17) | |
An additional example uses all four extracts as source audio material, with a total duration of about 8 minutes.
tutti (8:07) |
The second approach to audio mosaicking that we have developed is based on a segment based description. This work has also been documented in an SMC 2010 publication.
For segment based audio mosaicking, the source materials as well as the target stream are automatically segmented. Each segment is described by 15 descriptors roughly representing its duration, energy envelope, pitch content, and timbre.
While frame by frame audio mosaicking allows for the synthesis of sounds that follow the timbre evolution of the target audio stream very closely, this approach aims to better preserve the morphology of the source material and to capture only the essential characteristics of the target stream. The synthesized sound is concatenated from much longer segments.
The complete description of a segment of the real-time target stream is only available at the end of the segment. The corresponding source segment is selected using KNN. Each of the descriptors can be matched with or without normalization. In the examples below, all descriptor values but the segment duration have been normalized. Here too, normalization can be seen as a means to adapt the boundaries and variability of the characteristics of the target stream to the given source materials.
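As a rough illustration of the segment matching, assuming the segments have already been reduced to vectors of 15 descriptors (the descriptor extraction itself is not shown, and the file names and weight values below are placeholders), the selection could be sketched in Python as follows:

```python
import numpy as np
from scipy.spatial import cKDTree

# hypothetical descriptor matrices: one row of 15 descriptors per segment
# (duration, energy envelope, pitch content, timbre)
source_desc = np.load("source_segments.npy")
target_desc = np.load("target_segments.npy")

normalize_mask = np.ones(15, dtype=bool)
normalize_mask[0] = False            # descriptor 0 = duration, kept unnormalized
weights = np.ones(15)                # per-descriptor weights, varied per example

def scale(desc, stats):
    # standardize the normalized descriptors by the given statistics
    # and apply the matching weights
    mean = np.where(normalize_mask, stats.mean(axis=0), 0.0)
    std = np.where(normalize_mask, stats.std(axis=0) + 1e-9, 1.0)
    return weights * (desc - mean) / std

tree = cKDTree(scale(source_desc, source_desc))
# in the real-time case the target statistics would be estimated from a
# representative excerpt of the target stream
_, selected = tree.query(scale(target_desc, target_desc))
```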
The following series of examples uses the same source audio files as the instrumental examples above featuring bass, flute, trombone, and violin extracts.
The analysis of the first few seconds of each recording is visualized in screenshots of the MuBu container for Max/MSP. The visualization shows a sonogram-like representation of the 23 MFC coefficients and the corresponding waveform, as well as temporally aligned plots of the extracted loudness and pitch. The segment description is calculated from these descriptors.
As in the examples above, the extracts of the source materials are limited to a few seconds at the beginning of the pre-analysed audio recordings (the extracts do not match the visualized duration).
For each instrument (as well as the tutti example) three variations have been generated by arbitrarily varying the importance of the different descriptors in the matching.
Note that each of the source files is segmented into only around 200 segments. Given that each segment of the real-time audio stream has to match one of these 200 segments, the synthesized sound will only very roughly approximate the target stream. The examples use the same speech recording as the examples above (the automatic segmentation generates 45 segments for this target stream). The resulting sounds are much more abstract than in the frame by frame audio mosaicking examples.
Since the description of a target segment is available only at the end of each segment, the playback of all matched segments is delayed by the longest possible segment duration, in order to preserve the rhythm/timing of the target sound.
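Schematically, this constant-latency scheduling amounts to the following sketch (the maximum segment duration is a placeholder value):

```python
# A target segment can only be matched at its end, so the matched source
# segment is started a fixed maximum segment duration after the target
# segment's onset, which preserves the target's relative timing.
MAX_SEGMENT_DURATION = 2.0  # seconds, placeholder value

def playback_time(target_onset, target_end):
    """Absolute time at which to start the matched source segment."""
    assert target_end - target_onset <= MAX_SEGMENT_DURATION
    return target_onset + MAX_SEGMENT_DURATION
```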
extract of source file | concatenated sounds (3 variations) |
bass (2:04) | |
flute (1:45) | |
trombone (2:01) | |
violin (2:17) | |
tutti (8:07) | |
Mogees is an interactive gesture-based surface for real-time audio mosaicking.
When the performer touches the surface, Mogees analyses the incoming audio signal and continuously looks for its closest segment within the sound database. These segments are played one after the other over time: this technique is called concatenative synthesis. For instance, with a series of voice samples loaded, grazing the surface could correspond to whispering, while scratching would trigger more shouted sounds.
The wooden surface can be "played" with any tool, or with the hands, and Mogees will always try to find a corresponding sound. The technique can also be applied to other sound sources such as voice or acoustic/electric instruments.
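Very roughly, and assuming the same kind of pre-analysed descriptor database as in the sketches above, the real-time loop looks like this (all helper functions are hypothetical):

```python
# For each chunk of the incoming contact microphone signal, query the
# pre-analysed database and play the closest segment.
# describe(), play_segment() and audio_input_blocks() are hypothetical helpers;
# tree is a KD-tree over the source segment descriptors.
def run(tree, describe, play_segment, audio_input_blocks):
    for block in audio_input_blocks():
        desc = describe(block)          # descriptors of the incoming sound
        _, idx = tree.query(desc)       # closest segment in the database
        play_segment(idx)               # segments concatenated one after another
```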
Mogees has been developed in collaboration with Norbert Schnell and takes full advantage of the MuBu environment for Max/MSP. It is currently used in the Airplay project by the IRCAM composer Lorenzo Pagliei.
Mogees has been presented at the Beam Festival at Brunel University in London on 24-26 June 2011.