
[Repost] [Information Technology] [2016.06] Voice Conversion by Modelling and Transformation of Extended Voice Characteristics

270 views 2020-1-20 17:21 | Category: Research Notes | Source: Repost

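This post presents the doctoral thesis of Stefan Huber, submitted to Université Pierre et Marie Curie (Paris 6), France. The thesis runs to 210 pages.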

 

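Voice Conversion (VC) aims to transform the characteristics of a source speaker's voice so that it is perceived as having been uttered by a target speaker. The principle of VC is to define mapping functions that convert one source speaker's voice into one target speaker's voice. The transformation functions of common state-of-the-art (START) VC systems adapt instantaneously to the characteristics of the source voice. Although recent VC systems have made considerable progress over the conversion quality of the initial approaches, the quality is still not sufficient. Considerable improvements are required before VC techniques can be used in a professional industrial environment.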

 

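The objective of this thesis is to improve the quality of voice conversion so that it becomes industrially applicable to a reasonable extent. The basic properties of different START algorithms for voice conversion are discussed with respect to their intrinsic advantages and shortcomings. Based on experimental evaluations of one GMM-based START VC approach, the conclusion is that most VC systems relying on statistical models are, because of the averaging effect of linear regression, poorly suited to reaching the degree of similarity to the target speaker that industrial use requires.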
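To make the averaging effect concrete, here is a minimal sketch of the classic joint-density GMM conversion function, in which every converted frame is a posterior-weighted sum of per-component linear regressions. It is a generic textbook formulation for one-dimensional features, not necessarily the system evaluated in the thesis. The function name `gmm_convert` and all parameter values are hypothetical.

```python
import numpy as np

def gmm_convert(x, weights, mu_x, mu_y, var_x, cov_yx):
    """Convert scalar source features x with a 1-D joint-density GMM.

    Each component m contributes the linear regression
        mu_y[m] + cov_yx[m] / var_x[m] * (x - mu_x[m]),
    weighted by the posterior p(m | x).
    """
    x = np.asarray(x, dtype=float)[:, None]                 # shape (T, 1)
    weights, mu_x, mu_y, var_x, cov_yx = (
        np.asarray(a, dtype=float) for a in (weights, mu_x, mu_y, var_x, cov_yx)
    )
    # Gaussian likelihood of each frame under each component, shape (T, M)
    lik = np.exp(-0.5 * (x - mu_x) ** 2 / var_x) / np.sqrt(2 * np.pi * var_x)
    post = weights * lik
    post /= post.sum(axis=1, keepdims=True)                 # posteriors p(m | x)
    regress = mu_y + cov_yx / var_x * (x - mu_x)            # per-component regression
    return (post * regress).sum(axis=1)

# Hypothetical two-component model with no source/target correlation:
# every output is a weighted average of the target means, so it can never
# leave the interval spanned by mu_y, whatever the input looks like.
demo = gmm_convert(np.linspace(-3, 3, 7),
                   weights=[0.5, 0.5], mu_x=[-1.0, 1.0], mu_y=[-2.0, 2.0],
                   var_x=[1.0, 1.0], cov_yx=[0.0, 0.0])
print(demo)
```

Because the output is always a blend of component regressions, fine speaker-specific detail tends to be smoothed out, which is one way to read the similarity limitation described above.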

 

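The main contributions of this thesis are: a) modelling the glottal excitation source; b) modelling a voice descriptor set using a novel speech system based on an extended source-filter model; c) further advancing IRCAM's novel VC system by combining it with the contributions of a) and b).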

 

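a) The thesis presents improvements for estimating the shape of the deterministic part of the glottal excitation source from speech signals. A state-of-the-art method based on phase minimization, which estimates the shape parameter Rd of the LF glottal source model, has been considerably enhanced. First, adapting and extending the Rd parameter range avoids inconsistencies in the frame-based estimator. Second, Viterbi smoothing suppresses unnatural jumps of the estimated glottal source parameter contour within short-time segments. Third, exploiting the correlation with other co-varying voice descriptors to additionally steer the Viterbi algorithm makes the estimator more robust, especially in segments with few stable harmonic sinusoids, where the phase-minimization paradigm is more error-prone.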
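The Viterbi smoothing step can be sketched as follows, assuming each frame supplies an error value for every candidate Rd on a fixed grid (for instance the phase-minimization error) and that jumps between consecutive frames are penalized in proportion to their size. The grid, the cost matrix and the `jump_penalty` weight are illustrative assumptions, not the settings used in the thesis.

```python
import numpy as np

def viterbi_smooth(frame_cost, grid, jump_penalty=1.0):
    """Pick one grid value per frame, minimizing frame cost plus jump cost.

    frame_cost: array (T, K), error of candidate k at frame t (lower is better).
    grid:       array (K,), candidate parameter values (e.g. Rd candidates).
    """
    frame_cost = np.asarray(frame_cost, dtype=float)
    grid = np.asarray(grid, dtype=float)
    T, K = frame_cost.shape
    # trans[i, j]: cost of moving from candidate i to candidate j
    trans = jump_penalty * np.abs(grid[:, None] - grid[None, :])
    acc = frame_cost[0].copy()               # best accumulated cost per state
    back = np.zeros((T, K), dtype=int)       # backpointers
    for t in range(1, T):
        total = acc[:, None] + trans         # rows: previous state, cols: current
        back[t] = total.argmin(axis=0)
        acc = total.min(axis=0) + frame_cost[t]
    # Trace the cheapest path backwards from the last frame
    path = np.empty(T, dtype=int)
    path[-1] = acc.argmin()
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return grid[path]

# Hypothetical example: all frames prefer Rd = 1.0, but frame 2 has a
# spurious, slightly lower error at Rd = 2.5.
rd_grid = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
cost = np.tile((rd_grid - 1.0) ** 2, (5, 1))
cost[2, 4] = -0.1
print(rd_grid[cost.argmin(axis=1)])   # frame-wise picks
print(viterbi_smooth(cost, rd_grid))  # smoothed contour
```

Steering the search with other voice descriptors, as described in the third improvement, could be modelled by adding a further term to `frame_cost`; that extension is not shown here.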

 

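b) The estimate of the glottal excitation source is used to extract the contribution of the Vocal Tract Filter (VTF) from the spectral envelope, by dividing the spectral envelope by the spectral envelope of the glottal pulse. This makes it possible to alter the voice quality of a given speech phrase by exciting the VTF with altered glottal pulse shapes. A novel speech system is presented that allows the analysis, transformation and synthesis of different voice descriptors such as the glottal excitation source, intensity, fundamental frequency and the voiced/unvoiced frequency boundary. The proposed speech framework PSY takes its name from Parametric Speech SYnthesis, indicating its fully parametric design for constructing a speech phrase for synthesis. PSY is based on the separate processing of the voiced deterministic part and the unvoiced stochastic part of a speech signal. Each voice descriptor, and the VTF or spectral envelope required for synthesis, can be taken from the same speaker or from different speakers. This flexibility allows many voice modification possibilities, or the generation of a human voice avatar.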
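The division step can be illustrated with a small sketch, assuming the spectral envelope and the glottal pulse magnitude spectrum are sampled on the same frequency bins. The function names `extract_vtf` and `apply_glottal`, the `eps` floor and the toy spectra are hypothetical and only stand in for the actual PSY processing.

```python
import numpy as np

def extract_vtf(envelope, glottal_mag, eps=1e-12):
    """Divide the spectral envelope by the glottal pulse magnitude spectrum."""
    envelope = np.asarray(envelope, dtype=float)
    glottal_mag = np.asarray(glottal_mag, dtype=float)
    # The floor avoids division by zero in bins where the pulse has no energy
    return envelope / np.maximum(glottal_mag, eps)

def apply_glottal(vtf, new_glottal_mag):
    """Excite the VTF with a (possibly altered) glottal pulse spectrum."""
    return np.asarray(vtf, dtype=float) * np.asarray(new_glottal_mag, dtype=float)

# Toy magnitude spectra on four frequency bins (made-up values)
envelope = np.array([4.0, 3.0, 1.5, 0.4])
glottal = np.array([2.0, 1.0, 0.5, 0.2])
tenser_glottal = np.array([2.0, 1.2, 0.8, 0.4])   # hypothetical flatter tilt

vtf = extract_vtf(envelope, glottal)
print(apply_glottal(vtf, tenser_glottal))
```

Keeping the VTF fixed while swapping the glottal spectrum is what lets the voice quality change without touching the articulation encoded in the filter.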

 

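c) Please note that this part of the abstract is confidential and therefore cannot be shown for the time being. It relates to IRCAM's novel VC system, which is currently patent pending.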

 

Voice Conversion (VC) aims at transforming the characteristics of a source speaker's voice in such a way that it will be perceived as being uttered by a target speaker. The principle of VC is to define mapping functions for the conversion from one source speaker's voice to one target speaker's voice. The transformation functions of common STAte-of-the-ART (START) VC systems adapt instantaneously to the characteristics of the source voice. While recent VC systems have made considerable progress over the conversion quality of initial approaches, the quality is nevertheless not yet sufficient. Considerable improvements are required before VC techniques can be used in a professional industrial environment.

The objective of this thesis is to augment the quality of Voice Conversion to facilitate its industrial applicability to a reasonable extent. The basic properties of different START algorithms for Voice Conversion are discussed with respect to their intrinsic advantages and shortcomings. Based on experimental evaluations of one GMM-based START VC approach, the conclusion is that most VC systems which rely on statistical models are, due to the averaging effect of the linear regression, less appropriate to achieve a high enough similarity score to the target speaker required for industrial usage.

The contributions established throughout the work for this thesis lie in the extended means to a) model the glottal excitation source, b) model a voice descriptor set using a novel speech system based on an extended source-filter model, and c) further advance IRCAM's novel VC system by combining it with the contributions of a) and b).

a) Improvements to estimate the shape of the deterministic part of the glottal excitation source from speech signals are presented in this thesis. A STAte-of-the-ART method based on phase minimization to estimate the shape parameter Rd of the glottal source model LF has been considerably enhanced. First, the adaptation and extension of the utilized Rd parameter range avoids inconsistencies in the frame-based estimator. Second, the utilization of Viterbi smoothing suppresses unnatural jumps of the estimated glottal source parameter contour within short-time segments. Third, the exploitation of the correlation of other co-varying voice descriptors to additionally steer the Viterbi algorithm augments the estimator's robustness, especially in segments with few stable harmonic sinusoids available, where the phase-minimization-based paradigm is more error-prone.

b) The estimation of the glottal excitation source is utilized to extract the contribution of the Vocal Tract Filter (VTF) from the spectral envelope by means of dividing by the spectral envelope of the glottal pulse. This facilitates altering the voice quality of a given speech phrase by means of exciting the VTF with altered glottal pulse shapes. A novel speech system is presented which allows for the analysis, transformation and synthesis of different voice descriptors such as glottal excitation source, intensity, fundamental frequency and the voiced/unvoiced frequency boundary. The proposed speech framework PSY derives from Parametric Speech SYnthesis to indicate its fully parametric design to construct a speech phrase for synthesis. PSY is based on the separate processing of the voiced deterministic and the unvoiced stochastic part of a speech signal. Each voice descriptor and VTF or spectral envelope required for synthesis can be introduced from the same or different speakers. This flexibility allows for many voice modification possibilities or the generation of a human voice avatar.

c) Please note that this part of the abstract is confidential and can therefore not be shown for the time being. It is related to IRCAM's novel VC system which is currently patent pending.

 

 

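1 Introduction to and review of voice conversion (VC)

2 State of the art in speech signal processing

3 State of the art in glottal excitation source modelling

4 State of the art in voice conversion

5 Contributions of this thesis to glottal excitation source modelling

6 PSY: a flexible parametric speech synthesis system

7 coVoC: concatenative voice conversion

8 Conclusions and future work

Appendix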





http://blog.sciencenet.cn/blog-69686-1214999.html
