My research area is Statistical Signal Processing. My main research topics include statistical modeling of signals, speech processing, and their applications to music. My research initially focused on the generalization of statistical models for signals, especially hidden Markov models (HMMs). During my PhD [ T ], I studied triplet Markov models, which generalize classical HMMs, with applications in image segmentation. I then directed my research toward speech processing, working on the segmentation of speech signals into phones, voice conversion, language models, and HMM-based speech synthesis. My research on speech is interdisciplinary, combining statistical modeling of signals, natural language processing, and applications to music.
First, I present my PhD work on the unsupervised segmentation of signals based on triplet Markov models. Then, I present my work on speech processing, carried out in the Analysis/Synthesis team at IRCAM.
[ T ] P. Lanchantin, Chaînes de Markov triplets et segmentation non supervisée des signaux / Unsupervised Signal Segmentation using Triplet Markov chains, PhD thesis, Institut National des Télécommunications, December 2006.
Pairwise Partially Markov Chains
The principle of a pairwise Markov chain is to assume that the joint distribution of the observed and hidden processes is that of a Markov chain. In the case of a pairwise partially Markov chain, one directly assumes the Markovianity of the distribution of the hidden process conditionally on the observations. In this context, one of my contributions was to propose an Expectation-Maximization (EM) algorithm for pairwise Markov chains. With W. Pieczynski, I also studied a special case of pairwise partially Markov chain applied to the segmentation of Gaussian processes with long-range correlation [ C3 ]. Experiments on synthetic data showed significant improvements over conventional models when the noise was long-range correlated, while giving similar performance when the noise was independent. Nevertheless, the proposed parameter estimation method was only valid in the centered case, which prevented us from testing the model on real images. We therefore continued this work with J. Lapuyade to refine our methods and make possible the unsupervised segmentation of Gaussian processes that are not necessarily centered [ A2 ].
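As a minimal illustration of restoration in this setting, the sketch below computes exact posterior marginals in a discrete pairwise Markov chain: the pair (X, Y) is Markov, so the usual forward-backward recursions apply directly to the pair kernel. All parameters here are hypothetical toy values (the toy kernel happens to factorize like a classical HMM, but the recursions only ever use the general pair kernel).

```python
def posterior_marginals(obs, init, K, n_states=2):
    """Posterior marginals p(x_n | y_1..N) for a discrete pairwise Markov chain.

    init[(x, y)]          : p(x_1 = x, y_1 = y)
    K[(x, y)][(x2, y2)]   : p(x_{n+1} = x2, y_{n+1} = y2 | x_n = x, y_n = y)
    """
    N = len(obs)
    # forward pass: alpha[n][x] = p(x_n = x, y_1..n)
    alpha = [[0.0] * n_states for _ in range(N)]
    for x in range(n_states):
        alpha[0][x] = init[(x, obs[0])]
    for n in range(1, N):
        for x2 in range(n_states):
            alpha[n][x2] = sum(alpha[n - 1][x] * K[(x, obs[n - 1])][(x2, obs[n])]
                               for x in range(n_states))
    # backward pass: beta[n][x] = p(y_{n+1}..N | x_n = x, y_n)
    beta = [[1.0] * n_states for _ in range(N)]
    for n in range(N - 2, -1, -1):
        for x in range(n_states):
            beta[n][x] = sum(K[(x, obs[n])][(x2, obs[n + 1])] * beta[n + 1][x2]
                             for x2 in range(n_states))
    # normalize alpha * beta into posterior marginals
    post = []
    for n in range(N):
        w = [alpha[n][x] * beta[n][x] for x in range(n_states)]
        z = sum(w)
        post.append([v / z for v in w])
    return post

# hypothetical toy parameters: sticky hidden states, noisy binary observations
init = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
K = {}
for x in range(2):
    for y in range(2):
        row = {}
        for x2 in range(2):
            for y2 in range(2):
                p_trans = 0.9 if x2 == x else 0.1   # sticky transitions
                p_emit = 0.8 if y2 == x2 else 0.2   # mostly faithful observations
                row[(x2, y2)] = p_trans * p_emit
        K[(x, y)] = row

post = posterior_marginals([0, 0, 1, 1, 0], init, K)
```

The MPM restoration then simply picks, at each index, the state maximizing the posterior marginal.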
[ C3 ] W. Pieczynski, P. Lanchantin, Restoring hidden non stationary process using triplet partially Markov chain with long memory noise, IEEE Workshop on Statistical Signal Processing (SSP 05), July 17-20, Bordeaux, France, 2005.
[ A2 ] P. Lanchantin, J. Lapuyade-Lahorgue and W. Pieczynski, Unsupervised segmentation of Triplet Markov chains hidden with long memory noise, Signal Processing, Vol. 88, No. 5, pp. 1134-1151, May 2008.
Triplet Markov Chains
Pairwise Markov chains can be extended to triplet Markov chains [ C1 ]. The principle is to add one, or even several, auxiliary processes such that the joint distribution of the triplet "hidden process, auxiliary process, observed process" is that of a Markov chain. These very general models make it possible to overcome another limitation of conventional models, namely the assumption that the joint distribution is stationary. Indeed, by introducing an auxiliary process controlling the changes in the transition matrices of the hidden process, we showed the effectiveness of such a model in situations where the joint distribution of the hidden and observed processes is nonstationary [ C2, A3 ], and we proposed algorithms for estimating the parameters of the considered Markov chain. This model was applied to the segmentation of synthetic and real images. A first observation is that this model does allow different regimes to be taken into account, resulting in improved segmentation quality for images containing both extensive homogeneous areas and areas with fine details. A second observation is that a realization of the auxiliary process can also be obtained with the MPM estimator. This type of representation can be very useful, especially for the segmentation of textures, which can be precisely modeled by auxiliary processes.
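The regime-switching idea above can be sketched generatively: a slowly switching auxiliary process U selects which transition matrix drives the hidden class process X, and the observation Y is the class corrupted by Gaussian noise. All numerical parameters are hypothetical, chosen only to make the two regimes (homogeneous areas vs. fine details) visible.

```python
import random

def sample_triplet_chain(N, seed=0):
    """Sample a toy triplet Markov chain T = (U, X, Y).

    U : auxiliary regime process (controls the transition matrix of X)
    X : hidden class process
    Y : observed process (class plus Gaussian noise)
    """
    rng = random.Random(seed)
    # auxiliary process: two sticky regimes
    P_U = [[0.98, 0.02], [0.02, 0.98]]
    # regime-dependent transition matrices for X:
    # regime 0 -> large homogeneous areas, regime 1 -> fine details
    A = [[[0.95, 0.05], [0.05, 0.95]],
         [[0.50, 0.50], [0.50, 0.50]]]

    def draw(p):  # draw 0 or 1 from a two-entry probability row
        return 0 if rng.random() < p[0] else 1

    u, x = 0, 0
    U, X, Y = [], [], []
    for _ in range(N):
        u = draw(P_U[u])          # regime evolves first
        x = draw(A[u][x])         # class transition depends on the regime
        U.append(u)
        X.append(x)
        Y.append(x + rng.gauss(0.0, 0.5))  # noisy observation of the class
    return U, X, Y
```

Restoration then estimates both X and U from Y alone, which is what makes the auxiliary-process realization usable as a texture/regime map.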
[ C1 ] W. Pieczynski, D. Benboudjema and P. Lanchantin, Statistical image segmentation using Triplet Markov Fields, SPIEs International Symposium on Remote Sensing, September 22-27, Crete, Greece, 2002.
[ C2 ] P. Lanchantin and W. Pieczynski, Unsupervised non stationary image segmentation using triplet Markov chains, Advanced Concepts for Intelligent Vision Systems (ACIVS 04), Aug. 31-Sept. 3, Brussels, Belgium, 2004.
[ A3 ] P. Lanchantin, J. Lapuyade-Lahorgue and W. Pieczynski, Unsupervised segmentation of randomly switching data hidden with non-Gaussian correlated noise, Signal Processing, Vol. 91, No. 2, pp. 163-175, February 2011.
As part of the study of triplet Markov chains, we also investigated with W. Pieczynski the possibility of extending the classical probabilistic model to an "evidential" model, in which the posterior distribution of the hidden process is given by Dempster-Shafer fusion [ A1, AF1, CF2 ]. We then applied this evidential model to the segmentation of nonstationary processes. The main interest of our approach was to show that although Dempster-Shafer fusion destroys the Markovianity of the hidden evidential chain, Bayesian segmentation remains possible via the triplet Markov chain approach.
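For readers unfamiliar with the evidential setting, the sketch below shows Dempster's rule of combination on a two-class frame, where mass functions may assign belief to subsets of classes, including the whole frame (total ignorance). The mass values are hypothetical; this is the generic fusion rule, not the specific chain model of [ A1 ].

```python
def dempster_combine(m1, m2):
    """Dempster's rule of combination for mass functions on a discrete frame.

    Masses are dicts mapping focal elements (frozensets of classes) to weights.
    Conflicting pairs (empty intersection) are discarded and the result is
    renormalized by 1 - conflict.
    """
    out = {}
    conflict = 0.0
    for a, va in m1.items():
        for b, vb in m2.items():
            inter = a & b
            if inter:
                out[inter] = out.get(inter, 0.0) + va * vb
            else:
                conflict += va * vb
    z = 1.0 - conflict
    return {k: v / z for k, v in out.items()}

# hypothetical masses on the frame {0, 1}: each source is partly ignorant
m1 = {frozenset({0}): 0.6, frozenset({0, 1}): 0.4}
m2 = {frozenset({1}): 0.5, frozenset({0, 1}): 0.5}
fused = dempster_combine(m1, m2)
```

When both mass functions are ordinary probabilities (all focal elements singletons), the rule reduces to the usual Bayesian product-and-normalize update.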
A final contribution of my PhD thesis was to extend the fuzzy Markov chains previously studied by F. Salzenstein to fuzzy Markov trees. Fuzzy segmentation was initially proposed to take into account the imprecision of a site's membership in a thematic class. Thus, in a fuzzy signal, homogeneous areas ("hard" clusters) coexist with fuzzy areas representing intermediate sites that may belong to several hard clusters. The originality of these models lies in the fact that their distribution has both discrete and continuous components: the discrete components are Dirac masses representing the weight assigned to each hard cluster, and the continuous components correspond to the fuzzy classes (Lebesgue measure). We proposed a multisensor fuzzy hidden Markov tree, which we applied to the segmentation of astronomical images [ CF3 ].
[ A1 ] P. Lanchantin and W. Pieczynski, Unsupervised restoration of hidden non stationary Markov chains using evidential priors, IEEE Transactions on Signal Processing, Vol. 53, No. 8, pp 3091-3098, 2005.
[ AF1 ] P. Lanchantin and W. Pieczynski, Chaînes et arbres de Markov évidentiels avec applications à la segmentation des processus non stationnaires, Traitement du Signal, Vol. 22, No. 2, 2005.
[ CF2 ] P. Lanchantin and W. Pieczynski, Arbres de Markov Triplet et théorie de l'évidence, Actes du Colloque GRETSI 03, September 8-11, Paris, France, 2003.
[ CF3 ] P. Lanchantin and F. Salzenstein, Segmentation d'Images Multispectrales par Arbre de Markov caché Flou, Actes du Colloque GRETSI 05, September 6-9, Louvain-la-Neuve, Belgium, 2005.
HMM-based Speech Segmentation
During the ANR (French National Research Agency) project Vivos, I proposed and developed, in collaboration with A. C. Morris and X. Rodet, the software ircamAlign [ C4 ], a system for segmenting speech signals into phones based on the HTK library. It relies on the particular modeling of speech signals by hidden Markov chains used in speech recognition. This modeling can be viewed as a special case of a triplet Markov chain T = (U, X, Y), where U is the language model, X is the process describing the evolution of spectral features over time (the sub-states of each phoneme's HMM), and Y is the observation process (cepstral coefficients).
If the textual transcription is available, the distribution of the process U can be defined as that of a Markov chain whose topology is a graph constructed from the phonetized text, giving the different pronunciations and possible connections. Many options are available for creating this graph: it is possible to allow the omission or repetition of words, or the insertion of short pauses or paraverbal sounds such as breathing or lip noises, for which specific models have been trained. When the text is not available, as in the case of spontaneous speech, the distribution of U is defined as that of a bigram or trigram language model trained on a selected set of French texts.
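A minimal sketch of such a pronunciation graph, with optional short pauses between words, might look as follows. The lexicon, phone symbols, and the "sp" pause label are hypothetical; a real system like ircamAlign builds a far richer graph (pronunciation variants, word skips, paraverbal models).

```python
def build_graph(words, lexicon, allow_pause=True):
    """Build a linear phone graph from a word sequence.

    Returns (arcs, final_state), where each arc is (src, dst, label);
    a None label is an epsilon arc that lets the decoder skip the pause.
    """
    arcs, state = [], 0
    for i, w in enumerate(words):
        for phone in lexicon[w]:
            arcs.append((state, state + 1, phone))
            state += 1
        if allow_pause and i < len(words) - 1:
            arcs.append((state, state + 1, "sp"))   # optional short pause
            arcs.append((state, state + 1, None))   # epsilon: no pause
            state += 1
    return arcs, state

# hypothetical two-word lexicon with toy phone symbols
arcs, final = build_graph(["le", "chat"], {"le": ["l", "@"], "chat": ["S", "a"]})
```

The aligner then searches this graph against the acoustic observations, so every accepted path corresponds to one admissible phonetization of the utterance.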
A set of multispeaker French models was trained on the BREF80 corpus. A confidence index based on posterior probabilities is computed for each phone to facilitate manual correction of the segmentation results.
During segmentation, the structure of the speech (syllables, words, breath groups) is extracted from the transcription and aligned with the speech signal in order to build the unit databases needed for the development of text-to-speech (TTS) synthesis by unit selection [ C7 ]. ircamAlign is used by ircamTTS and by ircamCorpusTools [ AF2 ], a speech unit database management system. ircamAlign was also used in the ANR project Rhapsodie to develop a reference corpus of spontaneous French speech. Finally, ircamAlign has been used by composers at IRCAM. A real-time version was subsequently developed by J. Bloit and implemented in MaxMSP.
[ C4 ] P. Lanchantin, A. C. Morris, X. Rodet, C. Veaux, Automatic Phoneme Segmentation with Relaxed Textual Constraints, in E. L. R. A. (ELRA) (ed.), Proceedings of the Sixth International Language Resources and Evaluation (LREC08), Marrakech, Morocco, 2008.
[ C7 ] C. Veaux, P. Lanchantin and X. Rodet, Joint Prosodic and Segmental Unit Selection for Expressive Speech Synthesis, 7th Speech Synthesis Workshop (SSW7), Kyoto, Japan, 2010
[ AF2 ] G. Beller, C. Veaux, G. Degottex, N. Obin, P. Lanchantin et X. Rodet, IrcamCorpusTools : Plateforme Pour Les Corpus de Parole, Traitement Automatique des Langues, Vol. 49, No. 3, 2008
HMM-based Speech Synthesis
The principle of HMM-based speech synthesis, initially proposed by the Nagoya Institute of Technology (Nitech), is the joint modeling of the spectrum (vocal tract), the fundamental frequency (source), and the durations of each phoneme in context by a hidden Markov chain. During synthesis, a macro-model is built by concatenating the HMMs corresponding to the phones in context of the phonetic sequence to be synthesized. The state durations are generated first, and the trajectory of spectral parameters is then estimated with a specific parameter generation algorithm that takes into account the dependency between static and dynamic parameters. One advantage of this method over unit-selection synthesis is that it only requires storing the model parameters. It also allows precise control of the characteristics of the synthesis. Its disadvantages are artifacts in the synthesized voice due to the modeling of the glottal source, and a lack of naturalness due to the low variability of the prosody. To overcome these shortcomings, we used the vocal tract and glottal source separation method proposed by G. Degottex in [ C5 ]. We also showed with N. Obin the improvement brought by high-level syntactic features [ C6 ] and the possibility of synthesizing speaking styles for different discourse genres [ C11 ].
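The parameter generation step can be illustrated in one dimension: given per-frame Gaussians on the static parameter c_t and on its delta c_t - c_{t-1}, the maximum-likelihood trajectory solves the normal equations (W'ΣW)c = W'Σμ. This is only a toy sketch of the idea (with a naive solver and hypothetical numbers), not the Nitech implementation.

```python
def mlpg_1d(mu_s, var_s, mu_d, var_d):
    """Maximum-likelihood 1-D trajectory under static and delta constraints.

    Minimizes sum_t (c_t - mu_s[t])^2 / var_s[t]
            + sum_{t>=1} ((c_t - c_{t-1}) - mu_d[t])^2 / var_d[t].
    """
    T = len(mu_s)
    A = [[0.0] * T for _ in range(T)]   # normal equations A c = b
    b = [0.0] * T
    for t in range(T):                  # static terms
        A[t][t] += 1.0 / var_s[t]
        b[t] += mu_s[t] / var_s[t]
    for t in range(1, T):               # delta terms: d_t = c_t - c_{t-1}
        w = 1.0 / var_d[t]
        A[t][t] += w
        A[t - 1][t - 1] += w
        A[t][t - 1] -= w
        A[t - 1][t] -= w
        b[t] += mu_d[t] * w
        b[t - 1] -= mu_d[t] * w
    # naive Gauss-Jordan elimination (fine for a toy T)
    for i in range(T):
        piv = A[i][i]
        for j in range(i, T):
            A[i][j] /= piv
        b[i] /= piv
        for k in range(T):
            if k != i and A[k][i] != 0.0:
                f = A[k][i]
                for j in range(i, T):
                    A[k][j] -= f * A[i][j]
                b[k] -= f * b[i]
    return b
```

With zero delta means and very tight delta variances, the generated trajectory is forced to be nearly constant, which shows how the dynamic constraints smooth the frame-wise static means instead of following them point by point.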
[ C5 ] P. Lanchantin, G. Degottex and X. Rodet, A HMM-Based Synthesis System Using a New Glottal Source and Vocal-Tract Separation Method, ICASSP10, Dallas, USA, 2010.
[ C6 ] N. Obin, P. Lanchantin, M. Avanzi, A. Lacheret-Dujour and X. Rodet, Toward Improved HMM-Based Speech Synthesis Using High-Level Syntactical Features, Speech Prosody, Chicago, USA, 2010
[ C11 ] N. Obin, P. Lanchantin, A. Lacheret-Dujour and X. Rodet, Discrete/Continuous Modelling of Speaking Style in HMM-based Speech Synthesis: Design and Evaluation, submitted to Interspeech 2011, Florence, Italy, 2011
Voice Conversion
The principle of voice conversion is to transform the signal of a source speaker's voice so that it seems to have been uttered by a target speaker. The conversion techniques studied at IRCAM by F. Villavicencio, and then by myself within the ANR project Affective Avatars, are based on Gaussian mixture models (GMMs). Typically, the joint distribution of the source and target acoustic features, modeled by a GMM, is estimated from a parallel corpus consisting of synchronous recordings of the source and target speakers. The conversion function is then given by the conditional expectation of the target features given the acoustic features of the source. My studies have focused both on the definition of the transformation function and on its application to improve the quality of the converted speech. Thus, the all-pole modeling of the spectral envelope was improved by the True-Envelope technique, which enhances the quality of the synthesis and the characterization of the speaker's residual. Moreover, using the covariance matrix of the distribution conditional on the source's acoustic features allows a renormalization of the transformed features that improves the quality of the converted signal. Finally, during the AngelStudio project, I proposed a Dynamic Model Selection method (DMS) [ C8, C10 ], which consists in using several models of different complexities and selecting the most appropriate one for each analysis frame during conversion. The voice conversion results obtained are very encouraging: the "personality" of the target speaker is well reproduced after processing, while that of the source speaker has largely disappeared. The main remaining difficulty is some degradation of the sound quality of the voice. However, further improvements that we are currently investigating [ C13 ] are expected to reach a usable, real-time quality, even for very demanding applications such as artistic ones.
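The conditional-expectation conversion function can be sketched in one dimension: for each GMM component, compute its responsibility given the source feature and its conditional mean of the target feature, then average. The parameters below are hypothetical scalars; a real system works on spectral envelope vectors with full or diagonal covariances.

```python
import math

def convert(x, gmm):
    """E[y | x] under a joint GMM on (x, y) pairs, in one dimension.

    Each component is (weight, mu_x, mu_y, sigma_xx, sigma_xy).
    """
    num, den = 0.0, 0.0
    for w, mx, my, sxx, sxy in gmm:
        # responsibility of this component given the source feature x
        g = w * math.exp(-0.5 * (x - mx) ** 2 / sxx) / math.sqrt(2 * math.pi * sxx)
        # component-wise conditional mean of the target feature
        cond = my + (sxy / sxx) * (x - mx)
        num += g * cond
        den += g
    return num / den
```

With a single component and zero cross-covariance, the function simply outputs the target mean; a nonzero cross-covariance shifts the output with the source feature, which is the regression behavior the GMM mapping generalizes mixture by mixture.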
[ C8 ] P. Lanchantin and X. Rodet, Dynamic Model Selection for Spectral Voice Conversion, Interspeech'10, Makuhari, Japan, 2010
[ C10 ] P. Lanchantin and X. Rodet, Objective Evaluation of the Dynamic Model Selection for Spectral Voice Conversion, ICASSP2011, accepted, Prague, Czech Republic, 2011
[ C13 ] P. Lanchantin, N. Obin and X. Rodet, Extended Conditional GMM and Covariance Matrix Correction for Real-Time Spectral Voice Conversion, submitted to Interspeech 2011, Florence, Italy, 2011