This is a PhD thesis from the University of Toronto, Canada (author: Navdeep Jaitly), 110 pages.
This thesis makes three main contributions to the area of speech recognition with Deep Neural Network - Hidden Markov Models (DNN-HMMs). Firstly, we explore the effectiveness of features learnt from speech databases using Deep Learning for speech recognition. This contrasts with prior works that have largely confined themselves to using traditional features such as Mel Cepstral Coefficients and Mel log filter banks for speech recognition. We start by showing that features learnt on raw signals using Gaussian-ReLU Restricted Boltzmann Machines can achieve accuracy close to that achieved with the best traditional features. These features are, however, learnt using a generative model that ignores domain knowledge. We develop methods to discover features that are endowed with meaningful semantics that are relevant to the domain using capsules. To this end, we extend previous work on transforming autoencoders and propose a new autoencoder with a domain-specific decoder to learn capsules from speech databases. We show that capsule instantiation parameters can be combined with Mel log filter banks to produce improvements in phone recognition on TIMIT. On WSJ the word error rate does not improve, even though we get strong gains in classification accuracy. We speculate this may be because of the mismatched objectives of word error rate over an utterance and frame error rate on the sub-phonetic class for a frame.

Secondly, we develop a method for data augmentation in speech datasets. Such methods result in strong gains in object recognition, but have largely been ignored in speech recognition. Our data augmentation encourages the learning of invariance to vocal tract length of speakers. The method is shown to improve the phone error rate on TIMIT and the word error rate on a 14-hour subset of WSJ.

Lastly, we develop a method for learning and using a longer-range model of targets, conditioned on the input. This method predicts the labels for multiple frames together and uses a geometric average of these predictions during decoding. It produces state-of-the-art results on phone recognition with TIMIT and also produces significant gains on WSJ.
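The third contribution — predicting the labels for several frames at once and combining the overlapping predictions by a geometric average at decode time — can be sketched roughly as follows. This is a minimal illustration, not the thesis's exact formulation: the window layout (each window anchored at frame t predicting frames t, t+1, …) and the smoothing constant are assumptions made here for the sketch.

```python
import math

def combine_multiframe_predictions(window_probs):
    """Combine overlapping per-window label predictions by geometric mean.

    window_probs[t][k] is the label distribution (a list of probabilities)
    that the window anchored at frame t predicts for frame t + k.
    Returns one renormalized distribution per frame, suitable for use as
    frame-level scores during decoding.
    """
    num_frames = len(window_probs)
    num_labels = len(window_probs[0][0])
    log_sum = [[0.0] * num_labels for _ in range(num_frames)]
    count = [0] * num_frames
    for t, window in enumerate(window_probs):
        for k, dist in enumerate(window):
            if t + k < num_frames:
                for j, p in enumerate(dist):
                    # sum logs so the average below is a geometric mean;
                    # the small constant guards against log(0)
                    log_sum[t + k][j] += math.log(p + 1e-12)
                count[t + k] += 1
    result = []
    for t in range(num_frames):
        geo = [math.exp(s / count[t]) for s in log_sum[t]]
        z = sum(geo)  # renormalize: a geometric mean of distributions
        result.append([g / z for g in geo])  # need not sum to one
    return result
```

The geometric (rather than arithmetic) average corresponds to averaging in the log domain, which fits naturally with the log-probability scores used by an HMM decoder.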
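The second contribution's augmentation — encouraging invariance to speakers' vocal tract length — amounts to warping the frequency axis of each training utterance by a random factor. A minimal sketch of one such piecewise-linear warp is below; the cutoff frequency `f_hi`, the Nyquist frequency `f_max`, and the warp-factor range are illustrative values chosen here, not taken from the thesis.

```python
import random

def vtlp_warp_frequencies(freqs, alpha, f_hi=4800.0, f_max=8000.0):
    """Piecewise-linear frequency warp for vocal-tract-length perturbation.

    Frequencies below a cutoff are scaled by the warp factor alpha;
    the remaining band is mapped linearly so that f_max stays fixed.
    """
    cutoff = f_hi * min(alpha, 1.0) / alpha
    warped = []
    for f in freqs:
        if f <= cutoff:
            warped.append(alpha * f)
        else:
            # map [cutoff, f_max] linearly onto [alpha * cutoff, f_max]
            slope = (f_max - alpha * cutoff) / (f_max - cutoff)
            warped.append(f_max - slope * (f_max - f))
    return warped

def random_warp_factor(low=0.9, high=1.1):
    """Draw one warp factor per utterance during training."""
    return random.uniform(low, high)
```

In practice the warped frequencies would be used to reposition the mel filter-bank centers before computing features, so each pass over the training set sees slightly different "speakers".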
Download the original English thesis at:
http://page2.dfpan.com/fs/7l5cejc2a2218209166/