
4   CSLU Speech Synthesis Research Group

MISC cslu:www [CSLU99]
Key: CSLU
Title: CSLU Speech Synthesis Research Group, Oregon Graduate Institute of Science and Technology
Howpublished: WWW page
Year: 1999
URL: http://cslu.cse.ogi.edu/tts
Pub-URL: http://cslu.cse.ogi.edu/tts/publications
Note: http://cslu.cse.ogi.edu/tts


ARTICLE cslu:ieeetsap98 [KMS98]
Author: F. Kossentini, M. Macon, M. Smith
Title: Audio coding using variable-depth multistage quantization
Journal: IEEE Transactions on Speech and Audio Processing
Volume: 6
Year: 1998
Note: www [CSLU99]


INPROC. cslu:esca98mm [MCW98]
Author: M. W. Macon, A. E. Cronk, J. Wouters
Title: Generalization and Discrimination in tree-structured unit selection
Booktitle: Proceedings of the 3rd ESCA/COCOSDA International Speech Synthesis Workshop
Month: November
Year: 1998
Note: www [CSLU99]
Remarks: Great overview of several unit selection methods; comprehensive bibliography: origin of unit selection? [Sag88]. Festival unit selection [HB96, BC95]. Classification and regression trees [BFOS84a]. Clustering and decision trees [BT97b, WCIS93, Nak94]. Mahalanobis distance [Don96]. Decision trees for speech recognition [NGY97] and for speech synthesis [HAea96]. Data-driven direct mapping with ANNs [KCG96, TR]. Distance measures for coding [QBC88], for ASR [NSRK85, HJ88], in general [GS97], and for concatenative speech synthesis [HC98, WM98]. PLP: [HM94]. Linear regression and correlation, Fisher transform: [Edw93]. Tree pruning: [CM98]. Masking effects: [Moo89].
Abstract: Concatenative ``selection-based'' synthesis from large databases has emerged as a viable framework for TTS waveform generation. Unit selection algorithms attempt to predict the appropriateness of a particular database speech segment using only linguistic features output by text analysis and prosody prediction components of a synthesizer. All of these algorithms have in common a training or ``learning'' phase in which parameters are trained to select appropriate waveform segments for a given feature vector input. One approach to this step is to partition available data into clusters that can be indexed by linguistic features available at runtime. This method relies critically on two important principles: discrimination of fine phonetic details using a perceptually-motivated distance measure in training and generalization to unseen cases in selection. In this paper, we describe efforts to systematically investigate and improve these parts of the process.
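
The cluster-partitioning step described in the abstract can be sketched concretely: a cluster of database units is split on a binary linguistic feature only if the split reduces within-cluster acoustic distortion. This is a toy illustration under invented data and a squared-error impurity, not the paper's implementation:

```python
import numpy as np

def impurity(units):
    """Within-cluster distortion: summed squared distance to the cluster mean."""
    if len(units) == 0:
        return 0.0
    mean = units.mean(axis=0)
    return float(((units - mean) ** 2).sum())

def split_gain(units, feature):
    """Decrease in impurity when the cluster is split on a binary linguistic feature."""
    yes, no = units[feature], units[~feature]
    return impurity(units) - (impurity(yes) + impurity(no))

# toy acoustic feature vectors for six candidate units
units = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                  [1.0, 1.0], [1.1, 1.0], [1.0, 1.1]])
# hypothetical binary feature, e.g. "is the unit in a stressed syllable?"
stressed = np.array([True, True, True, False, False, False])
print(split_gain(units, stressed) > 0)  # True: a predictive feature reduces impurity
```

A tree-growing procedure applies this test recursively, keeping the feature with the largest gain at each node.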


INPROC. cslu:esca98kain [KM98a]
Author: A. Kain, M. W. Macon
Title: Personalizing a speech synthesizer by voice adaptation
Booktitle: Proceedings of the 3rd ESCA/COCOSDA International Speech Synthesis Workshop
Month: November
Year: 1998
Pages: 225--230
Note: www [CSLU99]
Abstract: A voice adaptation system enables users to quickly create new voices for a text-to-speech system, allowing for the personalization of the synthesis output. The system adapts to the pitch and spectrum of the target speaker, using a probabilistic, locally linear conversion function based on a Gaussian Mixture Model. Numerical and perceptual evaluations reveal insights into the correlation between adaptation quality and the amount of training data and the number of free parameters. A new joint density estimation algorithm is compared to a previous approach. Numerical errors are studied on the basis of broad phonetic categories. A data augmentation method for training data with incomplete phonetic coverage is investigated and found to maintain high speech quality while partially adapting to the target voice.
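
The probabilistic, locally linear conversion function mentioned in the abstract can be sketched in one dimension: each mixture component contributes a linear map, and the maps are blended by the posterior component probabilities. A minimal sketch with invented model parameters (the real system operates on spectral parameter vectors):

```python
import numpy as np

def gaussian(x, mu, var):
    """Univariate Gaussian density."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def convert(x, weights, mu_x, var_x, mu_y, cov_xy):
    """Locally linear conversion: y = sum_i P(i|x) * (mu_y_i + cov_xy_i/var_x_i * (x - mu_x_i))."""
    resp = np.array([w * gaussian(x, m, v) for w, m, v in zip(weights, mu_x, var_x)])
    resp /= resp.sum()                          # posterior P(component | x)
    local = mu_y + cov_xy / var_x * (x - mu_x)  # per-component linear map
    return float(resp @ local)

# hypothetical two-component model mapping a source parameter to a target parameter
weights = np.array([0.5, 0.5])
mu_x = np.array([0.0, 4.0]); var_x = np.array([1.0, 1.0])
mu_y = np.array([1.0, 9.0]); cov_xy = np.array([1.0, 1.0])
print(convert(0.0, weights, mu_x, var_x, mu_y, cov_xy))  # close to 1.0: nearest component dominates
```

In the paper the component parameters are fit by joint density estimation over aligned source/target frames; here they are simply asserted.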


INPROC. cslu:icslp98cronk [CM98]
Author: Andrew E. Cronk, Michael W. Macon
Title: Optimized Stopping Criteria for Tree-Based Unit Selection in Concatenative Synthesis
Oldtitle: Optimization of stopping criteria for tree-structured unit selection
Booktitle: Proc. of International Conference on Spoken Language Processing
Volume: 5
Month: November
Year: 1998
Pages: 1951--1955
Note: www [CSLU99]
Remarks: Summary: Method for growing an optimal clustering tree (CART, as in [BFOS84a]). Instead of stopping with thresholds, the tree is grown completely (until no splittable clusters are left) and then pruned by recombining clusters with a greedy algorithm. Gives V-fold cross-validation as an evaluation measure for tree quality. Clusters represent units with equivalent target cost. The best split of a cluster maximizes the decrease in data impurity (lower within-cluster variance of acoustic features). N.B.: Clustering of units is not classification, as the classes are not known in advance, and the method is unsupervised! Weighting in the distortion measure uses the Mahalanobis distance, i.e. the inverse of the variance. References: [BC95], [BT97b], [BFOS84a], [Don96], [Fuk90] (CART tree evaluation criterion), [NGY97], [Nak94], [WCIS93].
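
The greedy recombination step summarized in the remark can be sketched as follows: after growing the tree fully, repeatedly merge the pair of leaf clusters whose recombination increases total data impurity the least. A toy sketch with invented one-dimensional data and squared-error impurity, not the paper's algorithm:

```python
import numpy as np

def impurity(cluster):
    """Summed squared distance of cluster members to the cluster mean."""
    mean = cluster.mean(axis=0)
    return float(((cluster - mean) ** 2).sum())

def greedy_merge(leaves):
    """One pruning step: recombine the pair of leaf clusters whose merge
    increases total impurity the least."""
    best = None
    for i in range(len(leaves)):
        for j in range(i + 1, len(leaves)):
            merged = np.vstack([leaves[i], leaves[j]])
            cost = impurity(merged) - impurity(leaves[i]) - impurity(leaves[j])
            if best is None or cost < best[0]:
                best = (cost, i, j, merged)
    _, i, j, merged = best
    return [c for k, c in enumerate(leaves) if k not in (i, j)] + [merged]

# three leaves: two nearly identical, one far away
leaves = [np.array([[0.0], [0.1]]), np.array([[0.05]]), np.array([[5.0], [5.1]])]
pruned = greedy_merge(leaves)
print(len(pruned))  # 2: the two nearby clusters are recombined
```

In the paper the merging continues until a V-fold cross-validation estimate of tree quality stops improving; the sketch shows only a single step.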


INPROC. cslu:icslp98kain [KM98b]
Author: A. Kain, M. W. Macon
Title: Text-to-speech voice adaptation from sparse training data
Booktitle: Proc. of International Conference on Spoken Language Processing
Month: November
Year: 1998
Pages: 2847--2850
Note: www [CSLU99]


INPROC. cslu:icslp98-paper [WM98]
Author: J. Wouters, M. W. Macon
Title: A Perceptual Evaluation of Distance Measures for Concatenative Speech Synthesis
Booktitle: Proc. of International Conference on Spoken Language Processing
Month: November
Year: 1998
Note: www [CSLU99]
Abstract: In concatenative synthesis, new utterances are created by concatenating segments (units) of recorded speech. When the segments are extracted from a large speech corpus, a key issue is to select segments that will sound natural in a given phonetic context. Distance measures are often used for this task. However, little is known about the perceptual relevance of these measures. More insight into the relationship between computed distances and perceptual differences is needed to develop accurate unit selection algorithms, and to improve the quality of the resulting computer speech. In this paper, we develop a perceptual test to measure subtle phonetic differences between speech units. We use the perceptual data to evaluate several popular distance measures. The results show that distance measures that use frequency warping perform better than those that do not, and minimal extra advantage is gained by using weighted distances or delta features.
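
The role of frequency warping can be illustrated with a minimal mel-warped spectral distance. The Hz-to-mel formula is standard; the resampling scheme below is an invented simplification, not one of the measures evaluated in the paper:

```python
import numpy as np

def mel(f):
    """Standard Hz-to-mel warping."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def warped_distance(spec_a, spec_b, sr=16000, n_bands=20):
    """Euclidean distance between two log-magnitude spectra after resampling the
    frequency axis onto a mel scale, so low frequencies (where the ear
    discriminates best) receive proportionally more sample points."""
    freqs = np.linspace(0, sr / 2, len(spec_a))
    mel_pts = np.linspace(mel(freqs[0]), mel(freqs[-1]), n_bands)
    warped_freqs = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)  # inverse warp
    a = np.interp(warped_freqs, freqs, spec_a)
    b = np.interp(warped_freqs, freqs, spec_b)
    return float(np.sqrt(((a - b) ** 2).sum()))

# identical spectra are at zero distance; differing ones are not
flat = np.zeros(257)
tilted = np.linspace(0.0, 1.0, 257)
print(warped_distance(flat, flat), warped_distance(flat, tilted) > 0)
```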


INPROC. cslu:cslutoolkit [SCdV+98]
Author: S. Sutton, R. Cole, J. de Villiers, J. Schalkwyk, P. Vermeulen, M. Macon, Y. Yan, E. Kaiser, B. Rundle, K. Shobaki, P. Hosom, A. Kain, J. Wouters, D. Massaro, M. Cohen
Title: Universal Speech Tools: the CSLU Toolkit
Booktitle: Proc. of International Conference on Spoken Language Processing
Month: November
Year: 1998
Note: www [CSLU99]


INCOLL. cslu:german98 [MKC+98]
Author: M. W. Macon, A. Kain, A. E. Cronk, H. Meyer, K. Mueller, B. Saeuberlich, A. W. Black
Title: Rapid Prototyping of a German TTS System
Booktitle: Tech. Rep. CSE-98-015
Publisher: Department of Computer Science, Oregon Graduate Institute of Science and Technology
Address: Portland, OR
Month: September
Year: 1998
Note: www [CSLU99]


INPROC. cslu:icassp98mm [MMLV98]
Author: M. W. Macon, A. McCree, W. M. Lai, V. Viswanathan
Title: Efficient Analysis/Synthesis of Percussion Musical Instrument Sounds Using an All-Pole Model
Booktitle: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing
Volume: 6
Month: May
Year: 1998
Pages: 3589--3592
Note: www [CSLU99]
Abstract: It is well-known that an impulse-excited, all-pole filter is capable of representing many physical phenomena, including the oscillatory modes of percussion musical instruments like woodblocks, xylophones, or chimes. In contrast to the more common application of all-pole models to speech, however, practical problems arise in music synthesis due to the location of poles very close to the unit circle. The objective of this work was to develop algorithms to find excitation and filter parameters for synthesis of percussion instrument sounds using only an inexpensive all-pole filter chip (TI TSP50C1x). The paper describes analysis methods for dealing with pole locations near the unit circle, as well as a general method for modeling the transient attack characteristics of a particular sound while independently controlling the amplitudes of each oscillatory mode.
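
The issue the abstract raises (poles very close to the unit circle) can be illustrated with an impulse-excited two-pole resonator: the pole radius sets the decay time of one oscillatory mode. This is a generic sketch, not the paper's analysis algorithm:

```python
import numpy as np

def allpole_impulse_response(radius, freq, n, sr=8000):
    """Impulse response of a two-pole resonator: a decaying sinusoid whose decay
    is set by the pole radius; a radius near 1.0 gives the long ring of a
    struck percussion mode (woodblock, chime)."""
    w = 2 * np.pi * freq / sr
    a1, a2 = -2 * radius * np.cos(w), radius ** 2  # denominator coefficients
    y = np.zeros(n)
    for t in range(n):
        x = 1.0 if t == 0 else 0.0  # impulse excitation
        y[t] = x - a1 * (y[t - 1] if t >= 1 else 0.0) - a2 * (y[t - 2] if t >= 2 else 0.0)
    return y

ring = allpole_impulse_response(0.999, 440.0, 4000)   # pole nearly on the unit circle
damped = allpole_impulse_response(0.90, 440.0, 4000)  # pole well inside
print(abs(ring[-100:]).max() > abs(damped[-100:]).max())  # True: the closer pole rings longer
```

The same closeness to the unit circle that produces the ring also makes the filter numerically sensitive, which is why the paper needs special analysis methods.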


INPROC. cslu:icassp98kain [KM98c]
Author: Alexander Kain, Michael W. Macon
Title: Spectral Voice Conversion for Text-to-Speech Synthesis
Year: 1998
Booktitle: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP'98)
Pages: 285--288
Note: www [CSLU99]
Abstract: A new voice conversion algorithm that modifies a source speaker's speech to sound as if produced by a target speaker is presented. It is applied to a residual-excited LPC text-to-speech diphone synthesizer. Spectral parameters are mapped using a locally linear transformation based on Gaussian mixture models whose parameters are trained by joint density estimation. The LPC residuals are adjusted to match the target speaker's average pitch. To study effects of the amount of training on performance, data sets of varying sizes are created by automatically selecting subsets of all available diphones by a vector quantization method. In an objective evaluation, the proposed method is found to perform more reliably for small training sets than a previous approach. In perceptual tests, it was shown that nearly optimal spectral conversion performance was achieved, even with a small amount of training data. However, speech quality improved with an increase in training set size.


INCOLL. cslu:ogireslpc97 [MCWK97]
Author: M. W. Macon, A. E. Cronk, J. Wouters, A. Kain
Title: OGIresLPC: Diphone synthesizer using residual-excited linear prediction
Booktitle: Tech. Rep. CSE-97-007
Publisher: Department of Computer Science, Oregon Graduate Institute of Science and Technology
Month: September
Year: 1997
Address: Portland, OR
Note: www [CSLU99]


INPROC. cslu:aes97 [MJLO+97a]
Author: M. W. Macon, L. Jensen-Link, J. Oliverio, M. Clements, E. B. George
Title: Concatenation-based MIDI-to-singing voice synthesis
Booktitle: 103rd Meeting of the Audio Engineering Society
Address: New York
Year: 1997
Note: www [CSLU99]
Abstract: In this paper, we propose a system for synthesizing the human singing voice and the musical subtleties that accompany it. The system, Lyricos, employs a concatenation-based text-to-speech method to synthesize arbitrary lyrics in a given language. Using information contained in a regular MIDI file, the system chooses units, represented as sinusoidal waveform model parameters, from an inventory of data collected from a professional singer, and concatenates these to form arbitrary lyrical phrases. Standard MIDI messages control parameters for the addition of vibrato, spectral tilt, and dynamic musical expression, resulting in a very natural-sounding singing voice.


ARTICLE cslu:trsap97 [MC97]
Author: M. W. Macon, M. A. Clements
Title: Sinusoidal modeling and modification of unvoiced speech
Journal: IEEE Transactions on Speech and Audio Processing
Volume: 5
Month: November
Year: 1997
Pages: 557--560
Number: 6
Note: www [CSLU99]
Abstract: Although sinusoidal models have been shown to be useful for time-scale and pitch modification of voiced speech, objectionable artifacts often arise when such models are applied to unvoiced speech. This correspondence presents a sinusoidal model-based speech modification algorithm that preserves the natural character of unvoiced speech sounds after pitch and time-scale modification, eliminating commonly-encountered artifacts. This advance is accomplished via a perceptually-motivated modulation of the sinusoidal component phases that mitigates artifacts in the reconstructed signal after time-scale and pitch modification.
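
Why phase modulation matters can be shown with a toy experiment: summing closely spaced sinusoids with coherent phases yields a pulse-train-like waveform (the tonal artifact), while randomizing the phases yields a noise-like one. This sketch uses simple uniform phase randomization, not the paper's perceptually-motivated scheme:

```python
import numpy as np

def synth(phases, freqs, n, sr=8000):
    """Sum of constant-amplitude sinusoids with the given phases."""
    t = np.arange(n) / sr
    return sum(np.cos(2 * np.pi * f * t + p) for f, p in zip(freqs, phases))

rng = np.random.default_rng(0)
freqs = np.arange(100, 3000, 100)  # closely spaced components, as in unvoiced frames
coherent = synth(np.zeros(len(freqs)), freqs, 2048)
modulated = synth(rng.uniform(0, 2 * np.pi, len(freqs)), freqs, 2048)

def crest(x):
    """Peak-to-RMS ratio; high values correspond to periodic, buzzy pulses."""
    return float(abs(x).max() / np.sqrt((x ** 2).mean()))

print(crest(coherent) > crest(modulated))  # True: phase modulation removes the pulse train
```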


INPROC. cslu:icassp97 [MJLO+97b]
Author: Michael Macon, Leslie Jensen-Link, James Oliverio, Mark A. Clements, E. Bryan George
Title: A Singing Voice Synthesis System Based on Sinusoidal Modeling
Year: 1997
Booktitle: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP'97)
Pages: 435--438
Note: www [CSLU99]
Abstract: Although sinusoidal models have been demonstrated to be capable of high-quality musical instrument synthesis, speech modification, and speech synthesis, little exploration of the application of these models to the synthesis of singing voice has been undertaken. In this paper, we propose a system framework similar to that employed in concatenation-based text-to-speech synthesizers, and describe its extension to the synthesis of singing voice. The power and flexibility of the sinusoidal model used in the waveform synthesis portion of the system enables high-quality, computationally-efficient synthesis and the incorporation of musical qualities such as vibrato and spectral tilt variation. Modeling of segmental phonetic characteristics is achieved by employing a ``unit selection'' procedure that selects sinusoidally-modeled segments from an inventory of singing voice data collected from a human vocalist. The system, called Lyricos, is capable of synthesizing very natural-sounding singing that maintains the characteristics and perceived identity of the analyzed vocalist.


INPROC. cslu:icassp96 [MC96]
Address: Atlanta, USA
Author: Michael W. Macon, Mark A. Clements
Booktitle: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP'96)
Title: Speech Concatenation and Synthesis Using an Overlap--Add Sinusoidal Model
Year: 1996
Volume: 1
Pages: 361--364
Note: www [CSLU99]
Abstract: In this paper, an algorithm for the concatenation of speech signal segments taken from disjoint utterances is presented. The algorithm is based on the Analysis-by-Synthesis/Overlap-Add (ABS/OLA) sinusoidal model, which is capable of performing high quality pitch- and time-scale modification of both speech and music signals. With the incorporation of concatenation and smoothing techniques, the model is capable of smoothing the transitions between separately-analyzed speech segments by matching the time- and frequency-domain characteristics of the signals at their boundaries. The application of these techniques in a text-to-speech system based on concatenation of diphone sinusoidal models is also presented.
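
The overlap-add principle underlying the ABS/OLA model can be sketched with the standard constant-overlap-add property of a periodic Hann window at 50% overlap. This is a generic OLA illustration, not the ABS/OLA algorithm itself:

```python
import numpy as np

def overlap_add(frames, hop):
    """Overlap-add synthesis: window each analysis frame and sum the frames at
    hop-spaced offsets.  A periodic Hann window at 50% overlap sums to a
    constant, so adjacent frames cross-fade smoothly at their boundaries."""
    n_frames, frame_len = frames.shape
    n = np.arange(frame_len)
    window = 0.5 - 0.5 * np.cos(2 * np.pi * n / frame_len)  # periodic Hann
    out = np.zeros((n_frames - 1) * hop + frame_len)
    for i, frame in enumerate(frames):
        out[i * hop : i * hop + frame_len] += window * frame
    return out

# chop a ramp signal into 50%-overlapping frames and resynthesize it
frame_len, hop = 64, 32
signal = np.linspace(0.0, 1.0, 10 * hop + frame_len)
frames = np.array([signal[i * hop : i * hop + frame_len] for i in range(11)])
resynth = overlap_add(frames, hop)
err = abs(resynth[frame_len:-frame_len] - signal[frame_len:-frame_len]).max()
print(err < 1e-9)  # True: interior samples reconstruct exactly (COLA property)
```

Concatenative synthesis exploits this by letting the two frames at a diphone boundary come from different utterances, with the cross-fade hiding the seam.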


INPROC. cslu:jasa95 [MC95]
Author: M. W. Macon, M. A. Clements
Title: Speech synthesis based on an overlap-add sinusoidal model
Booktitle: J. of the Acoustical Society of America
Volume: 97, Pt. 2
Month: May
Year: 1995
Pages: 3246
Number: 5
Note: www [CSLU99]


