Ircam - Centre Georges-Pompidou Équipe Analyse/Synthèse



GDGM - Report from the Third Meeting - 1998


April 30th, 1998 - Salle Stravinsky - IRCAM



Invited lecturer: Ioannis Zannos - Staatliches Institut für Musikforschung - Berlin




Ioannis Zannos - just before his presentation (on April 30th).



Comments:

about two recent papers presented by Dr. Zannos' group at the KANSEI workshop - Genoa, Nov. 97 (ref: M. Wanderley, "KANSEI - mission report" - GDGM internal page)

I. Zannos and P. Modler: gestural capture at the Staatliches Institut für Musikforschung, Berlin, Germany

In the first of the two papers presented at the workshop, the authors describe ongoing research on a real-time system for gesture-controlled music performance. Two frameworks are compared:

The first one is developed by the authors and uses "JET", the Java Environment for TEMA, connected to a prototype of the MIDAS system for real-time open distributed processing (University of York). The idea here is to bring the advantages of open heterogeneous networks to experimental music design and performance systems.

The JET architecture is based on the MIDAS signal processing engine, with low-level drivers (written in C), input and musical data-handling processes (Java), a mid-level GUI (also in Java), and a high-level graphical data representation (written in Java and/or VRML). The Java processes take over the tasks of the GUI, user interaction management and high-level data management. They send configuration and performance instructions to MIDAS, which produces sound and/or animated processes. Both the Java and the MIDAS processes can be distributed over the network. Once sound processing on MIDAS is configured, it can be controlled via MIDI, via serial input from the SensorGlove (see below), or via mouse and computer keyboard.
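
As a rough illustration of this split between Java-side control and network-distributed sound processing, here is a minimal Java sketch of a control process sending configuration and performance messages to a remote engine over UDP. The class name, port and text message format are invented for the example; the report does not describe the actual JET/MIDAS protocol or API.

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

// Hypothetical Java-side control process for a remote sound engine.
public class EngineLink {
    private final DatagramSocket socket;
    private final InetAddress engineHost;
    private final int enginePort;

    public EngineLink(String host, int port) throws Exception {
        this.socket = new DatagramSocket();
        this.engineHost = InetAddress.getByName(host);
        this.enginePort = port;
    }

    // Send a plain-text control message (the message format is an assumption).
    public void send(String message) throws Exception {
        byte[] data = message.getBytes(StandardCharsets.UTF_8);
        socket.send(new DatagramPacket(data, data.length, engineHost, enginePort));
    }

    public static void main(String[] args) throws Exception {
        EngineLink engine = new EngineLink("localhost", 9000);
        engine.send("config fm-voice carrier=440 index=2.0"); // configuration instruction
        engine.send("perf fm-voice index=3.5");               // performance instruction
    }
}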

The second framework is based on the program SuperCollider for the Apple Macintosh and is used to identify which high-level programming features are required. Since SuperCollider is only available for the Macintosh, the authors will later implement this framework in JET; for now they use it as a working paradigm for a programmable real-time sound processing architecture.

Both frameworks represent gesture data input from a data glove as well as data for performance control and sound synthesis configuration. According to the authors, the three main tasks of this work are to (1) provide a flexible prototyping environment so that experiments with different approaches can be designed, (2) devise mappings of the input parameters to sound production such that the relationship between hand movements and the resulting sound is intuitively graspable for the user, but not trivial, and (3) apply feature extraction and gesture recognition techniques to achieve high-level communication with the computer.

The systems are controlled by the SensorGlove, a custom data glove constructed at the Technical University of Berlin and used as a multi-parametric input device for music. It has sensors measuring the bending angle of every joint of each finger (with a resolution of less than 1/10 of a degree for finger joint flexion) as well as hand movement acceleration. Transmission sampling rates reach up to 200 Hz and resolutions range between 7 and 14 bits.
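
A minimal Java sketch of a container for one glove sample, based only on the sensor set described above (finger joint bending angles plus three acceleration values, raw words of 7 to 14 bits at up to 200 Hz); the field names, joint count and scaling routine are illustrative assumptions, not the institute's code.

// Hypothetical data structure for one SensorGlove sample.
public class GloveFrame {
    public static final int JOINT_COUNT = 12;  // finger joint values per frame (the second paper mentions 12)

    public final double[] jointAngles = new double[JOINT_COUNT]; // bending angles in degrees
    public final double[] acceleration = new double[3];          // hand acceleration, x/y/z

    // Scale a raw sensor word of the given bit depth into [0, maxAngle] degrees.
    public static double scaleRaw(int raw, int bits, double maxAngle) {
        int maxRaw = (1 << bits) - 1;
        return (raw / (double) maxRaw) * maxAngle;
    }
}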

Gesture data processing involves three steps in this system: preprocessing (data range calibration and scaling, preliminary feature extraction), gesture recognition and mapping to or generation of performance parameters.
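
These three stages can be pictured as a small pipeline of interfaces; the Java decomposition below is only a reading of the description above, not the actual JET design.

// Hypothetical decomposition of the three processing steps.
interface Preprocessor {
    double[] process(double[] rawFrame);       // calibration, scaling, preliminary feature extraction
}

interface GestureRecognizer {
    double[] recognize(double[] features);     // e.g. one recognition output vector per frame
}

interface PerformanceMapper {
    void map(double[] features, double[] recognition); // drive synthesis/performance parameters
}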

During gesture acquisition, preprocessing is done in order to improve the quality of the input by filtering noise, smoothing rapid jumps, and detecting and undoing wrapping of the input values.
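
A minimal Java sketch of such a preprocessing stage, combining wrap-around correction with a one-pole smoothing filter; the half-range wrap heuristic and the smoothing coefficient are assumptions, since the report does not give the actual filters used.

// Hypothetical per-sensor preprocessing: unwrap, then smooth.
public class GlovePreprocessor {
    private final double range;   // full scale of the raw value, e.g. (1 << bits)
    private final double alpha;   // smoothing coefficient, 0 < alpha <= 1
    private double previousRaw = Double.NaN;
    private double offset = 0.0;
    private double smoothed = 0.0;
    private boolean started = false;

    public GlovePreprocessor(double range, double alpha) {
        this.range = range;
        this.alpha = alpha;
    }

    // Undo wrap-around: a jump of more than half the range between successive
    // samples is taken to be a wrap of the raw value, not a real movement.
    public double unwrap(double raw) {
        if (!Double.isNaN(previousRaw)) {
            double delta = raw - previousRaw;
            if (delta > range / 2)  offset -= range;
            if (delta < -range / 2) offset += range;
        }
        previousRaw = raw;
        return raw + offset;
    }

    // One-pole low-pass filter to attenuate noise and smooth rapid jumps.
    public double smooth(double value) {
        if (!started) { smoothed = value; started = true; }
        else          { smoothed += alpha * (value - smoothed); }
        return smoothed;
    }

    public double process(double raw) {
        return smooth(unwrap(raw));
    }
}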

The GUI consists of a central panel with buttons for operating the glove routines and for recording, storing and replaying gestures; numeric and slider displays of the input parameters; switches for routing the parameters to adjustable MIDI parameters; and a graphic display of the input parameters.

Mapping:

The mapping can consist of a direct mapping of the input parameters to synthesis parameters and/or feature extraction with Time Delay Neural Networks (TDNNs, from the Stuttgart Neural Network Simulator), in order to combine direct control of sound with higher-level semantic processing of gestures.
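
For the first option, a direct one-to-one mapping might look like the Java sketch below, where each normalized sensor value is sent out as its own MIDI controller; the controller numbers and the use of javax.sound.midi are assumptions made for the example, the report only stating that the sound engines are controlled via MIDI.

import javax.sound.midi.MidiSystem;
import javax.sound.midi.Receiver;
import javax.sound.midi.ShortMessage;

// Hypothetical one-to-one mapping of sensor values to MIDI controllers.
public class DirectMapper {
    private final Receiver midiOut;
    private final int firstController; // e.g. CC 20 upward, one controller per sensor (assumed)

    public DirectMapper(int firstController) throws Exception {
        this.midiOut = MidiSystem.getReceiver();
        this.firstController = firstController;
    }

    // Map each normalized sensor value (0.0 .. 1.0) to its own controller.
    public void map(double[] sensors) throws Exception {
        for (int i = 0; i < sensors.length; i++) {
            int value = (int) Math.round(Math.max(0.0, Math.min(1.0, sensors[i])) * 127);
            ShortMessage msg = new ShortMessage();
            msg.setMessage(ShortMessage.CONTROL_CHANGE, 0, firstController + i, value);
            midiOut.send(msg, -1);
        }
    }
}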

In the second paper the authors report that direct mapping of single sensor values (12 finger angles and 3 hand acceleration values) to sound parameters can give good results as far as controlling parameters of the sound algorithm (FM, granular, analog synthesis) is concerned, but that coordinated control of a larger number of connected sensors was difficult. This effect seems natural, since coordinated control of many parameters usually requires a learning phase, and they report that the performer had only a short time to become familiar with the sensor functions. Their conclusion was that conscious control of a large number of parameters is not possible with such directly connected sensors (i.e., a one-to-one mapping).

The next control (mapping) strategy tested was the use of ANNs (artificial neural networks) trained with gestures from the SensorGlove. The glove is connected to a Macintosh and the sound generation is performed by synthesizers, SuperCollider or MAX/ISPW (FTS), controlled via MIDI.

Gesture recognition:

Three hand gestures were used, and the net was trained with a set of 8 gesture recordings for each pattern: two sets of four recordings each, performed at different timings - very slow, slow, fast, very fast.

They report "good" recognition results for the gestures presented (at different velocities). Gestures are presented in real time, and at each sample time the net returns an output vector indicating the recognition of the pattern. They noticed that the net responds more to the moving phases of the gestures than to the stable phases.
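
One plausible way to read such a per-sample output vector is to pick the strongest class when it exceeds a confidence threshold, as in the Java sketch below; the thresholding scheme is an assumption, since the papers do not describe how the output vector is interpreted.

// Hypothetical interpretation of the net's per-frame output vector.
public class GestureDecoder {
    private final double threshold; // minimum activation to accept a recognition

    public GestureDecoder(double threshold) {
        this.threshold = threshold;
    }

    // Return the index of the recognized gesture, or -1 if no output is confident enough.
    public int decode(double[] output) {
        int best = -1;
        double bestValue = threshold;
        for (int i = 0; i < output.length; i++) {
            if (output[i] > bestValue) {
                bestValue = output[i];
                best = i;
            }
        }
        return best;
    }
}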

The authors further propose a distinction between possible hand gestures at a symbolic level and at a parametric level, separating gestures into sub-gestures: (1) five finger gestures - joint angles and spreading - and (2) one hand gesture - translation and rotation of the whole hand. They consider that with this sub-gesture architecture, gesture sequences can be recognized in which only a part of the hand is used as a symbolic sign, while other parts of the hand movement are used as parametric signs. As an example, they propose a straightened finger indicating "mouse down" (symbolic gesture), while movement of the whole hand determines the change of the mouse position (parametric gesture). In this way, different sub-gestures (symbolic or parametric) can be mapped in different ways to sound parameters.
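
A Java sketch of this routing idea, using the mouse example above: a straightened index finger acts as the discrete "mouse down" sign, while whole-hand translation is passed on as a continuous position; the flexion threshold and the Cursor callback are illustrative assumptions.

// Hypothetical routing of symbolic and parametric sub-gestures.
public class SubGestureRouter {
    private static final double STRAIGHT_THRESHOLD = 10.0; // degrees of residual flexion (assumed)

    public interface Cursor {
        void setPressed(boolean pressed); // driven by the symbolic sub-gesture
        void moveTo(double x, double y);  // driven by the parametric sub-gesture
    }

    private final Cursor cursor;

    public SubGestureRouter(Cursor cursor) {
        this.cursor = cursor;
    }

    // indexFlexion in degrees; handX/handY from the whole-hand translation.
    public void route(double indexFlexion, double handX, double handY) {
        cursor.setPressed(indexFlexion < STRAIGHT_THRESHOLD); // straightened finger = "mouse down"
        cursor.moveTo(handX, handY);                          // hand movement = position
    }
}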

Example: Hit detection using sub-gestures.

The net was trained to detect hits of a single finger, using higher-order derivatives to indicate the hit: the second derivative of the finger sub-gesture is proportional to the occurrence of the force and the third derivative to the change of the force.
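
The derivative estimates behind such a detector can be approximated with backward finite differences over the most recent samples, as in the Java sketch below; the four-sample window and the scaling by the 200 Hz sample period are assumptions.

// Hypothetical finite-difference estimates of the 2nd and 3rd derivatives
// of a finger-flexion signal, which would then be fed to the trained net.
public class DerivativeEstimator {
    private final double dt;                         // sample period, e.g. 1.0 / 200
    private final double[] history = new double[4];  // the last four samples
    private int count = 0;

    public DerivativeEstimator(double sampleRateHz) {
        this.dt = 1.0 / sampleRateHz;
    }

    public void push(double sample) {
        System.arraycopy(history, 1, history, 0, history.length - 1);
        history[history.length - 1] = sample;
        count++;
    }

    // Backward difference estimate of the second derivative.
    public double second() {
        if (count < 3) return 0.0;
        return (history[3] - 2 * history[2] + history[1]) / (dt * dt);
    }

    // Backward difference estimate of the third derivative.
    public double third() {
        if (count < 4) return 0.0;
        return (history[3] - 3 * history[2] + 3 * history[1] - history[0]) / (dt * dt * dt);
    }
}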

They finish the paper by discussing topics relating to interactive computer systems, emotions and neural networks.

References:

P. Modler and I. Zannos, "Emotional aspects of gesture recognition by a neural network, using dedicated input devices," in Proc. KANSEI - The Technology of Emotion Workshop, 1997.

I. Zannos, P. Modler, and K. Naoi, "Gesture controlled music performance in a real-time network," in Proc. KANSEI - The Technology of Emotion Workshop, 1997.

