Blog: 大工至善|大学至真 http://blog.sciencenet.cn/u/lcj2212916


[Repost] [Information Technology] [2018.02] Robust Phase-Based Speech Signal Processing

Read 1310 times | Posted 2020-2-11 19:47 | Category: Research Notes | Source: Repost

This is a doctoral thesis from the University of Sheffield, UK (author: Erfan Loweimi), 304 pages in total.

 

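Fourier analysis plays a key role in speech signal processing. The Fourier transform is complex-valued, so it can be expressed in polar form through the magnitude and phase spectra. The magnitude spectrum is widely used in almost every area of speech processing. The phase spectrum, however, is not an obviously appealing starting point: unlike the magnitude spectrum, whose fine and coarse structures relate clearly to speech perception, the phase spectrum is difficult to interpret and manipulate, and it shows no meaningful trend or extrema that would facilitate modelling. Nonetheless, the speech phase spectrum has recently gained renewed attention, and a growing body of work shows that it can be usefully employed in many speech processing applications. Now that the potential of phase-based speech processing has been established, a fundamental model is needed to help understand how phase encodes speech information.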
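The polar form mentioned above can be illustrated with a few lines of Python. This is only a minimal sketch using the standard library: the helper names (`dft`, `to_polar`, `from_polar`) and the eight-sample `frame` are made up for illustration and do not come from the thesis.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform of a short sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def to_polar(spectrum):
    """Split a complex spectrum into magnitude and principal-value phase."""
    magnitude = [abs(c) for c in spectrum]
    phase = [cmath.phase(c) for c in spectrum]  # each value lies in [-pi, pi]
    return magnitude, phase


def from_polar(magnitude, phase):
    """Rebuild the complex spectrum from its magnitude and phase."""
    return [cmath.rect(m, p) for m, p in zip(magnitude, phase)]


# Hypothetical toy frame; a real speech frame would be windowed and much longer.
frame = [0.0, 1.0, 0.5, -0.3, -0.8, 0.2, 0.4, -0.1]
spectrum = dft(frame)
magnitude, phase = to_polar(spectrum)
rebuilt = from_polar(magnitude, phase)

# Magnitude and phase together carry all the information in the spectrum.
max_error = max(abs(a - b) for a, b in zip(spectrum, rebuilt))
print(f"max reconstruction error: {max_error:.2e}")
```

The phase values returned here are wrapped to the principal interval, which is one reason the raw phase spectrum looks unstructured compared with the magnitude spectrum.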

 

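This thesis proposes a novel phase-domain source-filter model that allows the vocal tract (filter) and excitation (source) components of speech to be deconvolved through phase processing. The model uses the Hilbert transform, shows how the excitation and vocal tract elements mix in the phase domain, and provides a framework for efficiently separating the source and filter components through phase manipulation. To assess the approach, a set of features is extracted from the filter part of the phase for automatic speech recognition (ASR), and the source part of the phase is used for fundamental frequency estimation. Accuracy and robustness in both cases are illustrated and discussed. The approach is further improved by replacing the log with a generalised logarithmic function in the Hilbert transform and by computing the group delay via a regression filter.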
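The abstract does not spell out the model, so the following is not the thesis's method. It is a sketch of one textbook relation that links phase and magnitude through the Hilbert transform: for a minimum-phase signal, the phase spectrum can be recovered from the log-magnitude spectrum, here implemented by folding the real cepstrum. The function names, the test filter `h` and the FFT size `N` are assumptions chosen for illustration.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) DFT; the inverse transform includes the 1/N scaling."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def minimum_phase_from_magnitude(magnitude):
    """Phase of the minimum-phase signal with the given magnitude spectrum.

    Assumes an even number of bins and strictly positive magnitudes.
    """
    n = len(magnitude)
    log_mag = [math.log(m) for m in magnitude]
    real_cepstrum = [c.real for c in dft(log_mag, inverse=True)]
    # Fold the even real cepstrum into a causal sequence; this is the
    # discrete counterpart of applying the Hilbert transform to log|X|.
    folded = [0.0] * n
    folded[0] = real_cepstrum[0]
    folded[n // 2] = real_cepstrum[n // 2]
    for q in range(1, n // 2):
        folded[q] = 2.0 * real_cepstrum[q]
    # DFT of the folded cepstrum is log|X| + j * phase.
    return [v.imag for v in dft(folded)]


N = 64
# Hypothetical minimum-phase filter: its single zero at z = -0.5 is inside the unit circle.
h = [1.0, 0.5] + [0.0] * (N - 2)
H = dft(h)
true_phase = [cmath.phase(v) for v in H]
estimated_phase = minimum_phase_from_magnitude([abs(v) for v in H])
phase_error = max(abs(a - b) for a, b in zip(true_phase, estimated_phase))
print(f"max phase error: {phase_error:.2e}")
```

Because convolution in time becomes addition in the log-spectral and cepstral domains, a relation of this kind gives a route from a sum of source and filter terms in log magnitude to a corresponding sum in phase; how the thesis actually separates the two is described in its Chapter 4, not here.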

 

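The statistical distribution of the phase spectrum and of its representations along the feature extraction pipeline is also studied. The phase spectrum is shown to have a bell-shaped distribution. Statistical normalisation methods such as mean-variance normalisation, Laplacianisation, Gaussianisation and histogram equalisation are successfully applied to the phase-based features and lead to a significant robustness improvement.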
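Two of the normalisation methods named above can be sketched in a few lines. This is a generic illustration, not the configuration used in the thesis: `mean_variance_normalise` standardises one feature dimension, and `gaussianise` is a rank-based mapping onto standard normal quantiles (a form of histogram equalisation with a Gaussian target). The feature values are invented.

```python
from statistics import NormalDist, fmean, pstdev


def mean_variance_normalise(values):
    """Shift and scale one feature dimension to zero mean and unit variance."""
    mu = fmean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]


def gaussianise(values):
    """Map each value to the standard normal quantile of its empirical rank.

    Ties are not treated specially in this sketch.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    std_normal = NormalDist()
    out = [0.0] * n
    for rank, idx in enumerate(order):
        # (rank + 0.5) / n keeps the probability strictly inside (0, 1).
        out[idx] = std_normal.inv_cdf((rank + 0.5) / n)
    return out


# Hypothetical values of a single feature dimension across eight frames.
features = [0.2, 1.5, -0.7, 3.1, 0.9, -1.2, 0.4, 2.2]
mvn = mean_variance_normalise(features)
gauss = gaussianise(features)
print([round(v, 3) for v in mvn])
print([round(v, 3) for v in gauss])
```

Mean-variance normalisation only matches the first two moments, whereas the rank-based mapping reshapes the whole distribution while preserving the ordering of the values.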

 

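The robustness gains obtained from statistical normalisation and the generalised logarithmic function encouraged the use of more advanced model-based statistical techniques such as vector Taylor series (VTS). In its original formulation, VTS assumes that the log function is used for compression. To take advantage of VTS and the generalised logarithmic function simultaneously, a new formulation is first developed that merges both into a unified framework called generalised VTS (gVTS). To make full use of the gVTS framework, a novel channel noise estimation method is proposed, and the extension of the gVTS framework and the channel estimation to the group delay domain is then explored. The problems this raises are analysed and discussed, some solutions are proposed, and the corresponding formulae are derived. The effects of additive noise and channel distortion in the phase and group delay domains are also examined, and the results are used in deriving the gVTS equations. Experimental results on the Aurora-4 ASR task, with an HMM/GMM setup and a DNN-based bottleneck system in clean and multi-style training modes, confirm the efficacy of the proposed approach in dealing with both additive and channel noise.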
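The gVTS equations themselves are not given in the abstract and are not reproduced here. As background only, the sketch below shows the two ingredients it names: a generalised logarithm, assumed here to take the Box-Cox form (x^alpha - 1) / alpha, and the standard log-domain VTS idea of linearising the noisy-speech model y = log(exp(x + h) + exp(n)) around an expansion point. All function names and numeric values are illustrative assumptions.

```python
import math


def generalised_log(x, alpha):
    """Box-Cox style generalised logarithm (assumed form); equals ln(x) in the limit alpha -> 0."""
    if alpha == 0.0:
        return math.log(x)
    return (x ** alpha - 1.0) / alpha


def mismatch(x, n, h=0.0):
    """Log-domain model of noisy speech: y = log(exp(x + h) + exp(n)).

    x is clean speech, n additive noise and h the channel, all in the log domain.
    """
    return x + h + math.log1p(math.exp(n - x - h))


def vts_first_order(x, n, x0, n0, h=0.0):
    """First-order Taylor expansion of mismatch() around (x0, n0), with h held fixed."""
    g = 1.0 / (1.0 + math.exp(n0 - x0 - h))  # partial derivative dy/dx at (x0, n0)
    return mismatch(x0, n0, h) + g * (x - x0) + (1.0 - g) * (n - n0)


# The generalised log approaches the natural log as alpha shrinks.
for alpha in (0.5, 0.1, 1e-6):
    print(alpha, generalised_log(2.0, alpha), math.log(2.0))

# The linearisation is exact at the expansion point and approximate nearby.
x0, n0 = 2.0, 0.0
exact = mismatch(2.2, 0.0)
approx = vts_first_order(2.2, 0.0, x0, n0)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

With log compression the mismatch function has this closed form, which is what the original VTS formulation relies on; replacing the log by a generalised logarithm changes the function being linearised, which is the gap the gVTS framework is said to address.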

 

Fourier analysis plays a key role in speech signal processing. As a complex quantity, the Fourier transform can be expressed in polar form using the magnitude and phase spectra. The magnitude spectrum is widely used in almost every corner of speech processing. However, the phase spectrum is not an obviously appealing starting point for processing the speech signal. In contrast to the magnitude spectrum, whose fine and coarse structures have a clear relation to speech perception, the phase spectrum is difficult to interpret and manipulate. In fact, it has no meaningful trend or extrema that might facilitate the modelling process. Nonetheless, the speech phase spectrum has recently gained renewed attention. An expanding body of work is showing that it can be usefully employed in a multitude of speech processing applications.

Now that the potential of phase-based speech processing has been established, there is a need for a fundamental model to help understand the way in which phase encodes speech information. In this thesis a novel phase-domain source-filter model is proposed that allows for deconvolution of the speech vocal tract (filter) and excitation (source) components through phase processing. This model utilises the Hilbert transform, shows how the excitation and vocal tract elements mix in the phase domain and provides a framework for efficiently segregating the source and filter components through phase manipulation. To investigate the efficacy of the suggested approach, a set of features is extracted from the filter part of the phase for automatic speech recognition (ASR) and the source part of the phase is utilised for fundamental frequency estimation. Accuracy and robustness in both cases are illustrated and discussed. In addition, the proposed approach is improved by replacing the log with the generalised logarithmic function in the Hilbert transform and also by computing the group delay via a regression filter.

Furthermore, the statistical distribution of the phase spectrum and its representations along the feature extraction pipeline are studied. It is illustrated that the phase spectrum has a bell-shaped distribution. Some statistical normalisation methods such as mean-variance normalisation, Laplacianisation, Gaussianisation and histogram equalisation are successfully applied to the phase-based features and lead to a significant robustness improvement.

The robustness gain achieved through using statistical normalisation and the generalised logarithmic function encouraged the use of more advanced model-based statistical techniques such as vector Taylor series (VTS). VTS in its original formulation assumes usage of the log function for compression. In order to simultaneously take advantage of VTS and the generalised logarithmic function, a new formulation is first developed to merge both into a unified framework called generalised VTS (gVTS). Also, in order to leverage the gVTS framework, a novel channel noise estimation method is developed. The extensions of the gVTS framework and the proposed channel estimation to the group delay domain are then explored. The problems this presents are analysed and discussed, some solutions are proposed and finally the corresponding formulae are derived. Moreover, the effects of additive noise and channel distortion in the phase and group delay domains are scrutinised and the results are utilised in deriving the gVTS equations. Experimental results on the Aurora-4 ASR task in an HMM/GMM setup, along with a DNN-based bottleneck system, in the clean and multi-style training modes confirmed the efficacy of the proposed approach in dealing with both additive and channel noise.

 

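1. Introduction

2. Background and Related Work

3. Phase Information

4. Source-Filter Separation in the Phase Domain

5. Generalised VTS in the Phase/Group Delay Domain for Robust ASR

6. Conclusion and Future Work

Appendix: Hilbert Transform

Appendix: Generalised Vector Taylor Series (gVTS) Approach for Robust ASR

Appendix: Channel Noise Estimation Based on Generalised Vector Taylor Series

Appendix: Deep Neural Networks for ASR

Appendix: Description of the Databases Used

Appendix: Review of Feature Extraction Techniques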





https://blog.sciencenet.cn/blog-69686-1218064.html
