This is the doctoral thesis of Stefan Huber, Université Pierre et Marie Curie (Paris 6), France; it runs to 210 pages.


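Voice Conversion (VC) aims to transform the characteristics of a source speaker's voice so that it is perceived as having been uttered by a target speaker. The principle of VC is to define a mapping function that converts one source speaker's voice into one target speaker's voice. The transformation functions of common state-of-the-art (START) VC systems adapt instantaneously to the characteristics of the source voice. Although recent VC systems have made considerable progress over the conversion quality of the initial approaches, the quality is still not sufficient. Considerable improvements are required before VC techniques can be used in a professional industrial environment.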












Voice Conversion (VC) aims at transforming the characteristics of a source speaker's voice in such a way that it will be perceived as being uttered by a target speaker. The principle of VC is to define mapping functions for the conversion from one source speaker's voice to one target speaker's voice. The transformation functions of common STAte-of-the-ART (START) VC systems adapt instantaneously to the characteristics of the source voice. While recent VC systems have made considerable progress over the conversion quality of initial approaches, the quality is nevertheless not yet sufficient. Considerable improvements are required before VC techniques can be used in a professional industrial environment. The objective of this thesis is to improve the quality of Voice Conversion far enough to make it reasonably applicable in industry. The intrinsic advantages and shortcomings of different START algorithms for Voice Conversion are discussed. Experimental evaluations of one GMM-based START VC approach lead to the conclusion that most VC systems relying on statistical models are, because of the averaging effect of the linear regression, poorly suited to reaching the similarity to the target speaker that industrial usage requires.

The contributions established throughout the work for this thesis lie in the extended means to a) model the glottal excitation source, b) model a voice descriptor set using a novel speech system based on an extended source-filter model, and c) further advance IRCAM's novel VC system by combining it with the contributions of a) and b).

a) This thesis presents improvements in estimating the shape of the deterministic part of the glottal excitation source from speech signals. A state-of-the-art method based on phase minimization, which estimates the shape parameter Rd of the LF glottal source model, has been considerably enhanced. First, adapting and extending the Rd parameter range avoids inconsistencies in the frame-based estimator. Second, Viterbi smoothing suppresses unnatural jumps of the estimated glottal source parameter contour within short-time segments. Third, exploiting the correlation with other co-varying voice descriptors to additionally steer the Viterbi algorithm increases the estimator's robustness, especially in segments with few stable harmonic sinusoids available, where the phase-minimization-based paradigm is more error prone.

b) The estimate of the glottal excitation source is used to extract the contribution of the Vocal Tract Filter (VTF) from the spectral envelope, by dividing the spectral envelope by that of the glottal pulse. This makes it possible to alter the voice quality of a given speech phrase by exciting the VTF with altered glottal pulse shapes. A novel speech system is presented which allows for the analysis, transformation and synthesis of different voice descriptors such as glottal excitation source, intensity, fundamental frequency and the voiced / unvoiced frequency boundary. The name of the proposed speech framework, PSY, derives from Parametric Speech SYnthesis and indicates its fully parametric design for constructing a speech phrase for synthesis. PSY is based on the separate processing of the voiced deterministic part and the unvoiced stochastic part of a speech signal. Each voice descriptor, and the VTF or spectral envelope required for synthesis, can be taken from the same speaker or from different speakers. This flexibility allows for many voice modification possibilities, or the generation of a human voice avatar.

c) Please note that this part of the abstract is confidential and therefore cannot be shown for the time being. It is related to IRCAM's novel VC system, which is currently patent pending.
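The Viterbi smoothing mentioned under a) can be illustrated with a minimal sketch. The abstract does not give the thesis's actual cost functions, so everything below is an assumption made for illustration: the function name `viterbi_smooth`, the candidate grid of Rd values, the per-frame error matrix `frame_cost` (standing in for the phase-minimization error of each candidate), and a transition penalty proportional to the size of the jump between consecutive frames.

```python
import numpy as np


def viterbi_smooth(frame_cost, grid, jump_weight=1.0):
    """Pick one candidate per frame, trading frame error against contour jumps.

    frame_cost[t, k] is a hypothetical estimation error of candidate grid[k]
    at frame t; jump_weight scales the assumed penalty |grid[i] - grid[j]|.
    """
    frame_cost = np.asarray(frame_cost, dtype=float)
    grid = np.asarray(grid, dtype=float)
    n_frames, n_cand = frame_cost.shape

    # transition[i, j]: penalty for moving from grid[i] to grid[j].
    transition = jump_weight * np.abs(grid[:, None] - grid[None, :])

    total = frame_cost[0].copy()                  # best cost ending in each candidate
    back = np.zeros((n_frames, n_cand), dtype=int)
    for t in range(1, n_frames):
        cand = total[:, None] + transition        # rows: previous, columns: current
        back[t] = np.argmin(cand, axis=0)         # best predecessor per candidate
        total = cand[back[t], np.arange(n_cand)] + frame_cost[t]

    # Backtrack from the cheapest final candidate.
    path = np.empty(n_frames, dtype=int)
    path[-1] = int(np.argmin(total))
    for t in range(n_frames - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return grid[path]


# Toy data (invented): the frame-wise minimum jumps to 2.5 in the third frame.
grid = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
frame_cost = np.array([
    [0.5, 0.0, 0.5, 1.0, 1.5],
    [0.5, 0.0, 0.5, 1.0, 1.5],
    [0.6, 0.4, 0.5, 0.5, 0.0],
    [0.5, 0.0, 0.5, 1.0, 1.5],
    [0.5, 0.0, 0.5, 1.0, 1.5],
])
raw = grid[np.argmin(frame_cost, axis=1)]         # independent frame-based estimate
smoothed = viterbi_smooth(frame_cost, grid, jump_weight=1.0)
```

In this toy case the frame-based estimate `raw` contains the isolated jump, while `smoothed` stays at 1.0 for all frames because the jump penalty outweighs the small gain in frame error. Steering the search with co-varying voice descriptors, as described in the third improvement, could be modelled as an additional term added to `frame_cost`; that too is an assumption, not the formulation used in the thesis.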








6 PSY: A Flexible Parametric Speech Synthesis System

7 coVoC: Concatenative Voice Conversion





