INPROC. | tcts:euspico98 [DMD98] |
Author | O. Deroo , F. Malfrere , T. Dutoit |
Title | Comparison of two different alignment systems: speech synthesis vs. Hybrid HMM/ANN |
Booktitle | Proc. European Conference on Signal Processing (EUSIPCO'98) |
Address | Greece |
Year | 1998 |
Pages | 1161--1164 |
Note | www [TCTS99], same content as [MDD98] (but fewer references) |
url | http://tcts.fpms.ac.be/publications/papers/1998/eusipco98_odfmtd.zip |
Abstract | In this paper we compare two different methods for phonetically labeling a French database. The first one is based on the temporal alignment of the speech signal on a high-quality synthetic speech pattern, and the second one uses a hybrid HMM/ANN system. Both systems have been evaluated on manually segmented French read utterances from a single speaker never seen in the training stage of the HMM/ANN system. This study outlines the advantages and drawbacks of both methods. The synthesis-based system has the great advantage that no training stage (hence no labeled database) is needed, while the classical HMM/ANN system easily allows multiple phonetic transcriptions (phonetic lattice). We deduce a method for the automatic constitution of large phonetically and prosodically labeled speech databases, based on using the synthetic speech segmentation tool to bootstrap the training process of our hybrid HMM/ANN system. Such segmentation tools will be a key point for the development of improved speech synthesis and recognition systems. All the experiments reported in this article related to the hybrid HMM/ANN system were carried out with the STRUT [3] software. |
INPROC. | tcts:tsd98 [DMP+98] |
Title | EULER: Multi-Lingual Text-to-Speech Project |
Pages | 27--32 |
Author | T. Dutoit , F. Malfrère , V. Pagel , M. Bagein , P. Mertens , A. Ruelle , A. Gilman |
Booktitle | Proceedings of the First Workshop on Text, Speech, Dialogue --- TSD'98 |
Year | 1998 |
Editor | Petr Sojka , Václav Matousek , Karel Pala , Ivan Kopecek |
Address | Brno, Czech Republic |
Month | September |
Publisher | Masaryk University Press |
Note | www [TCTS99]. Electronic version: tcts/tsd98tdfmvppmmbarag.ps.* |
Remarks | modularity |
Abstract | Text-to-speech systems simultaneously require an abstract linguistic analysis, an acoustic linguistic analysis and a final digital processing stage. The aim of the project presented in this paper is to obtain a set of text-to-speech synthesizers for as many voices, languages and dialects as possible, free to use for non-commercial and non-military applications. This project is an extension of the MBROLA project. MBROLA is a speech synthesizer that is freely distributed for non-commercial purposes. A multi-lingual speech segmentation and prosody transplantation tool called MBROLIGN has also been developed and freely distributed. Other labs have also recently released important speech synthesis tools for free, such as Festival from the University of Edinburgh or the MULTEXT project of the Université de Provence. The purpose of this paper is to present the EULER project, which will try to integrate all these results, to potential Eastern European partners, so as to increase the dissemination of the important results of the MBROLA and MBROLIGN projects and stimulate East/West collaboration on TTS synthesis. |
INPROC. | tcts:icslp98-fmodtd [MDD98] |
Author | F. Malfrere , O. Deroo , T. Dutoit |
Title | Phonetic Alignment: Speech Synthesis Based Vs. Hybrid HMM/ANN |
Booktitle | Proc. International Conference on Speech and Language Processing |
Address | Sydney, Australia |
Year | 1998 |
Pages | 1571--1574 |
Note | www [TCTS99], same content as [DMD98] (with more references) |
url | http://tcts.fpms.ac.be/publications/papers/1998/icslp98_fmodtd.zip |
Abstract | In this paper we compare two different methods for phonetically labeling a speech database. The first approach is based on the alignment of the speech signal on a high quality synthetic speech pattern, and the second one uses a hybrid HMM/ANN system. Both systems have been evaluated on French read utterances from a speaker never seen in the training stage of the HMM/ANN system and manually segmented. This study outlines the advantages and drawbacks of both methods. The high quality speech synthetic system has the great advantage that no training stage is needed, while the classical HMM/ANN system easily allows multiple phonetic transcriptions. We deduce a method for the automatic constitution of phonetically labeled speech databases based on using the synthetic speech segmentation tool to bootstrap the training process of our hybrid HMM/ANN system. The importance of such segmentation tools will be a key point for the development of improved speech synthesis and recognition systems. |
INPROC. | tcts:iscas97 [MD97a] |
Author | |
Title | Speech Synthesis for Text-To-Speech Alignment and Prosodic Feature Extraction |
Booktitle | Proc. ISCAS 97 |
Address | Hong-Kong |
Year | 1997 |
Pages | 2637--2640 |
Note | www [TCTS99] |
url | http://tcts.fpms.ac.be/publications/papers/1997/iscas97_fmtd.zip |
Remarks | Recent developments in prosody generation have highlighted the potential interest of machine learning techniques such as multilayer perceptrons [Tra92], linear regression techniques [SK92], classification and regression trees [Hir91], or statistical techniques [MPH93], based on the automatic analysis of large prosodically labeled corpora. Only the segmental features of the reference signal are used in alignment. Assumption: the segmental and suprasegmental features are approximately uncorrelated. Keep only the perceptually relevant F0 cues: perceptual stylization, based on a model of tonal perception [alessandro95]. Robust cepstrum by sinusoidal weighting [GL88]. Derivative of cepstrum [SR88]. |
Abstract | The aim of this paper is to present a new and promising approach to the text-to-speech alignment problem. For this purpose, an original idea is developed: a high-quality digital speech synthesizer is used to create a reference speech pattern used during the alignment process. The system has been used and tested to extract the prosodic features of read French utterances. The results show a segmentation error rate of about 8%. This system will be a powerful tool for the automatic creation of large prosodically labeled databases and for research on automatic prosody generation. |
INPROC. | tcts:eurosp97 [SDS97] |
Author | Yannis Stylianou , Thierry Dutoit , Juergen Schroeter |
Title | Diphone Concatenation Using a Harmonic Plus Noise Model of Speech |
Booktitle | Proc. Eurospeech '97 |
Address | Rhodes, Greece |
Month | September |
Year | 1997 |
Pages | 613--616 |
Note | www [TCTS99]. Electronic version: tcts/hnmconc.ps.* |
Remarks | Important! HNM (Marine) basis paper, pitch synchronous. Diphone smoothing in region of quasi-stationarity. Additive better for concatenation than PSOLA. References: [DG96] (non pitch-synchronous hybrid harmonic/stochastic synthesis, real-time generation of signals from spectral representation), [SLM95] (phase treatment, modifications), [Mac96] (non pitch synchronous harmonic modeling). |
Abstract | In this paper we present a high-quality text-to-speech system using diphones. The system is based on a Harmonic plus Noise (HNM) representation of the speech signal. HNM is a pitch-synchronous analysis-synthesis system but does not require pitch marks to be determined, as is necessary in PSOLA-based methods. HNM assumes the speech signal to be composed of a periodic part and a stochastic part. As a result, different prosody and spectral envelope modification methods can be applied to each part, yielding more natural-sounding synthetic speech. The fully parametric representation of speech using HNM also provides a straightforward way of smoothing diphone boundaries. Informal listening tests, using natural prosody, have shown that the synthetic speech quality is close to the quality of the original sentences, without smoothing problems and without buzziness or other oddities observed with other speech representations used for TTS. |