
Article Recommendation | Emotion Recognition from Speech: An Unsupervised Learning Approach

Editor's Introduction

The rapid development of information technology has brought humans and computers ever closer, and emotion recognition, as one of the key technologies of intelligent human-computer interaction, plays an important role. In this field, modeling for expressive speech synthesis and recognition is typically carried out with supervised classifiers. This requires labeling and data preprocessing in advance, whose cost grows with the size of the database, and it also carries a risk of labeling errors. Therefore, to avoid the cost of annotation while reducing the risk of overfitting caused by a lack of data, unsupervised learning appears to be a suitable alternative. Researchers from the University of Genoa (Italy) and the University of Tunis El Manar (Tunisia) published an article entitled "Emotion Recognition from Speech: An Unsupervised Learning Approach" in the International Journal of Computational Intelligence Systems (IJCIS), studying the recognition of emotional features in speech based on unsupervised learning.

Highlights

Emotion recognition from speech relies on established psychological models. For example, Ekman's model [9] identifies six basic emotions, namely neutral, anger, fear, surprise, joy and sadness, which can be recognized regardless of language, culture or modality (speech, facial expression, etc.). More detailed emotion models rely on continuous dimensions rather than atomic basic emotions. Russell's circumplex model [10] suggests that emotions can be represented in a two-dimensional space, where the x-axis represents valence and the y-axis represents arousal (see Figure 1). In addition, Plutchik proposed a three-dimensional model [11] that combines the basic and two-dimensional models, so that outer (complex) emotions are obtained as combinations of inner (basic) ones.


Figure 1. Valence/arousal model of the emotion classes in the emotional speech database (EMO-DB) [20]. Figure taken from [8].
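
To make the circumplex representation concrete, the short sketch below places the EMO-DB emotion classes at rough, illustrative valence/arousal coordinates. The numeric values are assumptions for illustration only, not values taken from the paper or from [20]:

```python
# Illustrative (assumed) valence/arousal placement of the EMO-DB emotion classes
# on Russell's circumplex; both axes are scaled to [-1, 1].
EMOTION_VA = {
    "anger":    (-0.6,  0.8),   # negative valence, high arousal
    "fear":     (-0.7,  0.6),
    "disgust":  (-0.6,  0.3),
    "sadness":  (-0.7, -0.5),   # negative valence, low arousal
    "boredom":  (-0.3, -0.6),
    "neutral":  ( 0.0,  0.0),
    "joy":      ( 0.7,  0.7),   # positive valence, high arousal
}

def quadrant(valence: float, arousal: float) -> str:
    """Return the circumplex quadrant of a (valence, arousal) point."""
    v = "positive" if valence >= 0 else "negative"
    a = "high" if arousal >= 0 else "low"
    return f"{v} valence / {a} arousal"

for emotion, (v, a) in EMOTION_VA.items():
    print(f"{emotion:8s} -> {quadrant(v, a)}")
```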

The goal of this work is to identify emotional features from speech signals using a workflow based on unsupervised learning. The adopted approach includes the following (a minimal code sketch follows the list):

a) combined feature-analysis techniques, namely feature embedding with autoencoders and feature selection with analysis of variance (ANOVA) or mutual information (MI);

b) different clustering methods, such as crisp clustering with K-means and fuzzy clustering with the probabilistic, possibilistic and graded possibilistic c-means;

c) a new method for analyzing emotion recognition using the membership matrix.
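
As announced above, here is a minimal sketch of how such a pipeline could be assembled in Python with scikit-learn. It runs on synthetic placeholder data; the feature dimensionality, hidden-layer size, number of selected features and number of clusters are illustrative assumptions, and the single-hidden-layer reconstruction network only approximates the autoencoders used in the paper:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 384))      # placeholder for hand-crafted acoustic features
y = rng.integers(0, 7, size=500)     # placeholder emotion labels (7 classes)

X = StandardScaler().fit_transform(X)

# (a) feature embedding: hidden-layer activations of a network trained to reconstruct its input
ae = MLPRegressor(hidden_layer_sizes=(96,), activation="relu", max_iter=500, random_state=0)
ae.fit(X, X)
Z = np.maximum(0.0, X @ ae.coefs_[0] + ae.intercepts_[0])   # ReLU hidden code

# (a) feature selection on the original features: ANOVA (f_classif) or mutual information
X_sel = SelectKBest(f_classif, k=96).fit_transform(X, y)
# X_sel = SelectKBest(mutual_info_classif, k=96).fit_transform(X, y)  # MI variant

# (b) crisp clustering with K-means; either Z or X_sel could be clustered, Z is used here
labels = KMeans(n_clusters=21, n_init=10, random_state=0).fit_predict(Z)
print(labels[:10])
```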

The main contribution of the article is a new clustering-based approach to analyzing the emotional content of speech, using unsupervised learning methods such as autoencoders for feature extraction. As far as the authors know, this is the first work that relies entirely on unsupervised learning, both for feature extraction and for speech clustering. The work is an extension of results presented at the 11th Conference of the European Society for Fuzzy Logic and Technology [7].

Section 2 of the article reviews the state of the art in speech emotion recognition, including databases, feature sets, emotion representation models, and applications of unsupervised learning. Section 3 describes the speech material used in this study, including the selected expressive speech database, the standard feature sets adopted, and the psychological emotion model employed. Section 4 details the methodology. Section 5 reports the experimental results and their interpretation. Finally, the conclusion (Section 6) offers some remarks and perspectives.


Figure 4. Visualization of the distribution of individual and grouped emotion classes, using principal component analysis (PCA) to project the features into three dimensions: (a) for individual emotions, the 3-D projection shows that the standard speech features are not discriminative enough; (b) grouping the emotions reduces feature scattering.


Figure 5. Experimental process: preprocessing includes hand-crafted feature computation, feature embedding and feature selection; clustering uses crisp (K-means) and soft/fuzzy methods, and cluster labeling is used to recover the emotion classes within the clusters. Evaluation is based on the fuzziness and on the sums of the membership matrix.
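
Since the workflow in Figure 5 relies on fuzzy memberships, the following is a generic, textbook-style NumPy implementation of the standard (probabilistic) fuzzy c-means, given for reference only. It is not the authors' code; the possibilistic and graded possibilistic variants used in the paper modify the membership update shown here:

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Standard probabilistic fuzzy c-means.

    Returns (centers, U), where U[i, k] is the membership of sample i in
    cluster k and each row of U sums to 1.
    """
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m                                    # fuzzified memberships
        centers = (W.T @ X) / W.sum(axis=0)[:, None]  # weighted cluster centers
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) + 1e-12
        inv = d2 ** (-1.0 / (m - 1.0))                # standard FCM membership update
        U_new = inv / inv.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return centers, U

# Toy example: two well-separated blobs, two clusters.
X = np.vstack([np.random.default_rng(1).normal(0, 1, (50, 4)),
               np.random.default_rng(2).normal(5, 1, (50, 4))])
centers, U = fuzzy_c_means(X, n_clusters=2)
print(U[:3].round(3))   # memberships of the first three samples
```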


Figure 6. Clustering results for the 7 classes, with 21 clusters and 96 features selected by analysis of variance (ANOVA), for the K-means and graded possibilistic c-means (GPCM) methods, shown as two-dimensional principal component analysis (PCA) projections: although the 2-D projection does not show clearly disjoint clusters, the K-means and GPCM methods appear able to separate some classes, such as class 1 and class 2.
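
A 2-D PCA projection like the one in Figure 6 can be reproduced on any feature matrix with a few lines of scikit-learn and matplotlib; the data below is synthetic and purely illustrative:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
Z = rng.normal(size=(300, 96))                     # placeholder embedded/selected features
clusters = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(Z)

P = PCA(n_components=2).fit_transform(Z)           # 2-D projection for plotting only
plt.scatter(P[:, 0], P[:, 1], c=clusters, s=10, cmap="tab10")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("Cluster assignments in a 2-D PCA projection")
plt.show()
```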


Figure 7. Confusion matrix scores for (a) K-means and (b) graded possibilistic c-means (GPCM), with labeling after clustering, using 192 features selected by analysis of variance (ANOVA).
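
The "labeling after clustering" step can be illustrated with a simple majority-vote assignment (a hedged sketch; the paper does not necessarily use this exact rule), after which a standard confusion matrix is computed:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def label_clusters_by_majority(cluster_ids, y_true):
    """Map each cluster id to the most frequent true class among its samples."""
    cluster_ids, y_true = np.asarray(cluster_ids), np.asarray(y_true)
    mapping = {c: np.bincount(y_true[cluster_ids == c]).argmax()
               for c in np.unique(cluster_ids)}
    return np.array([mapping[c] for c in cluster_ids])

# Tiny synthetic demo: 3 clusters over 2 emotion classes.
clusters = np.array([0, 0, 1, 1, 2, 2, 2])
truth    = np.array([0, 0, 1, 0, 1, 1, 1])
pred = label_clusters_by_majority(clusters, truth)
print(confusion_matrix(truth, pred))
```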

Conclusion: This work focuses on an emotion recognition method based on fuzzy clustering. The main idea is to cluster speech according to the basic emotions, using (a) unsupervised learning for feature extraction, with autoencoders for more accurate feature embedding, and (b) recent developments in fuzzy clustering, such as the possibilistic and graded possibilistic c-means, together with the probabilistic c-means, to recognize emotions from speech. In addition, for evaluation purposes, a crisp approach was included using the K-means algorithm. Several adjustments were made to the models, including feature embedding with autoencoders, feature selection with ANOVA and MI analysis, and finally varying the parameters of the possibilistic models. Moreover, using a larger number of classes helps improve the recognition rate, and the choice of optimal parameter values also affects the performance of the possibilistic c-means and graded possibilistic c-means models.

The experimental results confirm the effectiveness of fuzzy clustering as an alternative to supervised learning for emotion recognition. Whether for individual emotions or for groups of emotions, crisp and fuzzy clustering perform almost identically, with an overall accuracy close to 60%, accuracies above 80% for emotions such as anger and sadness, and comparable recall. This could be a very useful alternative for emotion recognition on large and, in particular, unlabeled speech datasets.

Building on the classical confusion matrix, a new representation based on the membership matrix is proposed. The analysis of the fuzzy membership matrices shows behavior similar to that of the confusion matrix: emotions that are well recognized tend to have high membership in their own original emotion, whereas misclassified emotions tend to share their total membership with other emotions. This representation makes it possible to study the dependence of each basic emotion on the others, and can be a useful tool for emotion analysis, helping to understand how speech signals convey emotions and how those emotions are perceived.
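
As a concrete, simplified illustration of this membership-based representation (not the authors' exact formulation), the sketch below aggregates a fuzzy membership matrix U by true emotion class, so each row shows how the total membership of one emotion is spread over the clusters:

```python
import numpy as np

def membership_confusion(U, y_true, n_classes):
    """Rows: true emotion classes; columns: clusters; entries: summed membership."""
    U, y_true = np.asarray(U), np.asarray(y_true)
    M = np.zeros((n_classes, U.shape[1]))
    for c in range(n_classes):
        M[c] = U[y_true == c].sum(axis=0)
    return M

# Toy example: 4 utterances, 2 clusters, 2 true classes.
U = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.2, 0.8]])
y = np.array([0, 0, 1, 1])
print(membership_confusion(U, y, n_classes=2))
# A well-recognized class concentrates its membership in one column; a confused
# class spreads its membership across several columns.
```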

References

[1] B. Schuller, B. Vlasenko, F. Eyben, G. Rigoll, A. Wendemuth, Acoustic emotion recognition: a benchmark comparison of performances, in IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU 2009), IEEE, Merano, Italy, 2009, pp. 552–557.

[2] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, et al., IEMOCAP: interactive emotional dyadic motion capture database, Lang. Res. Eval. 42 (2008), 335.

[3] H.M. Fayek, M. Lech, L. Cavedon, Evaluating deep learning architectures for speech emotion recognition, Neural Netw. 92 (2017), 60–68.

[4] E. Avots, T. Sapiński, M. Bachmann, D. Kamińska, Audiovisual emotion recognition in wild, Mach. Vis. Appl. 30 (2019), 975–985.

[5] M. Belkin, S. Ma, S. Mandal, To understand deep learning we need to understand kernel learning, arXiv preprint, arXiv:1802.01396, 2018.

[6] M. Anthony, P.L. Bartlett, Neural Network Learning: Theoretical Foundations, Cambridge University Press, Cambridge, U.K., 1999, 389 pp., ISBN 0-521-57353-X.

[7] S. Rovetta, Z. Mnasri, F. Masulli, A. Cabri, Emotion recognition from speech signal using fuzzy clustering, in 2019 Conference of the International Fuzzy Systems Association and the European Society for Fuzzy Logic and Technology (EUSFLAT 2019), Atlantis Press, Prague, September 9-13, 2019.

[8] G. Ulutagay, E. Nasibov, Fuzzy and crisp clustering methods based on the neighborhood concept: a comprehensive review, J. Intell. Fuzzy Syst. 23 (2012), 271–281.

[9] P. Ekman, An argument for basic emotions, Cogn. Emot. 6 (1992), 169–200.

[10] J.A. Russell, A circumplex model of affect, J. Pers. Soc. Psychol. 39 (1980), 1161.

[11] R. Plutchik, The nature of emotions: human emotions have deep evolutionary roots, a fact that may explain their complexity and provide tools for clinical practice, Am. Sci. 89 (2001), 344–350.

[12] T.L. Nwe, S.W. Foo, L.C. De Silva, Speech emotion recognition using hidden Markov models, Speech Commun. 41 (2003), 603–623.

[13] V. Hozjan, Z. Kačič, Context-independent multilingual emotion recognition from speech signals, Int. J. Speech Technol. 6 (2003), 311–320.

[14] J. Nicholson, K. Takahashi, R. Nakatsu, Emotion recognition in speech using neural networks, Neural Comput. Appl. 9 (2000), 290–296.

[15] B. Schuller, G. Rigoll, M. Lang, Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture, in Acoustics, Speech, and Signal Processing, Proceedings (ICASSP’04), IEEE, Montreal, Canada, 2004, vol. 1, pp. 1–577.

[16] M.E. Ayadi, M.S. Kamel, F. Karray, Survey on speech emotion recognition: features, classification schemes, and databases, Pattern Recognit. 44 (2011), 572–587.

[17] J. Kim, R. Saurous, Emotion recognition from human speech using temporal information and deep learning, in Annual Conference of the International Speech Communication Association (Interspeech 2018), Hyderabad, India, 2018.

[18] J.H.L. Hansen, S.E. Bou-Ghazale, Getting started with SUSAS: a speech under simulated and actual stress database, in Fifth European Conference on Speech Communication and Technology, Rhodes, Greece, 1997.

[19] M.B. Akçay, K. Oğuz, Speech emotion recognition: emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers, Speech Commun. 116 (2020), 56–76.

[20] F. Burkhardt, A. Paeschke, M. Rolfes, W.F. Sendlmeier, B. Weiss, A database of German emotional speech, in Ninth European Conference on Speech Communication and Technology, Lisbon, Portugal, 2005.

[21] B. Schuller, S. Steidl, A. Batliner, The Interspeech 2009 emotion challenge, in Tenth Annual Conference of the International Speech Communication Association, Brighton, United Kingdom, September 6-10, 2009.

[22] B. Schuller, S. Steidl, A. Batliner, F. Burkhardt, L. Devillers, C. Müller, S. Narayanan, The Interspeech 2010 paralinguistic challenge, in Proceedings INTERSPEECH 2010, Makuhari, Japan, 2010, pp. 2794–2797.

[23] F. Eyben, K.R. Scherer, B.W. Schuller, J. Sundberg, E. André, C. Busso, et al., The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing, IEEE Trans. Affect. Comput. 7 (2016), 190–202.

[24] D. Ververidis, C. Kotropoulos, Emotional speech recognition: resources, features, and methods, Speech Commun. 48 (2006), 1162–1181.

[25] C.M. Lee, S.S. Narayanan, R. Pieraccini, Classifying emotions in human-machine spoken dialogs, in Proceedings, IEEE International Conference on Multimedia and Expo, IEEE, Lausanne, Switzerland, 2002, vol. 1, pp. 737–740.

[26] D. Ververidis, C. Kotropoulos, I. Pitas, Automatic emotional speech classification, in 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, IEEE, Montreal, Canada, 2004, vol. 1, pp. 1–593.

[27] M. You, C. Chen, J. Bu, J. Liu, J. Tao, Emotion recognition from noisy speech, in 2006 IEEE International Conference on Multimedia and Expo, IEEE, Toronto, Canada, 2006, pp. 1653–1656.

[28] M. You, C. Chen, J. Bu, J. Liu, J. Tao, A hierarchical framework for speech emotion recognition, in 2006 IEEE International Symposium on Industrial Electronics, IEEE, Montreal, Canada, 2006, vol. 1, pp. 515–519.

[29] X. Mao, L. Chen, L. Fu, Multi-level speech emotion recognition based on HMM and ANN, in 2009 WRI World Congress on Computer Science and Information Engineering, IEEE, Los Angeles, CA, USA, 2009, vol. 7, pp. 225–229.

[30] L. Chen, X. Mao, Y. Xue, L.L. Cheng, Speech emotion recognition: features and classification models, Digital Signal Process. 22 (2012), 1154–1160.

[31] F. Eyben, S. Buchholz, N. Braunschweiler, J. Latorre, V. Wan, M.J.F. Gales, K. Knill, Unsupervised clustering of emotion and voice styles for expressive TTS, in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2012), IEEE, Kyoto, Japan, 2012, pp. 4009–4012.

[32] B. Schuller, S. Steidl, A. Batliner, A. Vinciarelli, K. Scherer, F. Ringeval, et al., The Interspeech 2013 computational paralinguistics challenge: social signals, conflict, emotion, autism, in Proceedings INTERSPEECH 2013, 14th Annual Conference of the International Speech Communication Association, Lyon, France, 2013.

[33] N. Andrew, Sparse autoencoder, 2011. https://web.stanford.edu/class/cs294a/sparseAutoencoder_2011new.pdf

[34] C. Song, F. Liu, Y. Huang, L. Wang, T. Tan, Auto-encoder based data clustering, in: J. Ruiz-Shulcloper, G. Sanniti di Baja (Eds.), Iberoamerican Congress on Pattern Recognition, Springer, Berlin, Heidelberg, Germany, 2013, pp. 117–124.

[35] J. Xie, R. Girshick, A. Farhadi, Unsupervised deep embedding for clustering analysis, in International Conference on Machine Learning, 2016, pp. 478–487.

[36] F. Tian, B. Gao, Q. Cui, E. Chen, T.-Y. Liu, Learning deep representations for graph clustering, in Twenty-Eighth AAAI Conference on Artificial Intelligence, July 27–31, Québec City, Québec, Canada, 2014.

[37] L. Salwinski, C.S. Miller, A.J. Smith, F.K. Pettit, J.U. Bowie, D. Eisenberg, The database of interacting proteins: 2004 update, Nucleic Acids Res. 32 (2004), D449–D451.

[38] C. Stark, B.-J. Breitkreutz, A. Chatr-Aryamontri, L. Boucher, R. Oughtred, M.S. Livstone, et al., The biogrid interaction database: 2011 update, Nucleic Acids Res. 39 (2010), D698–D704.

[39] A. Asuncion, D.H. Newman, UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences, 2007. http://archive.ics.uci.edu/ml

[40] E. Székely, J.P. Cabral, P. Cahill, J. Carson-Berndsen, Clustering expressive speech styles in audiobooks using glottal source parameters, in 12th Annual Conference of the International Speech Communication Association, Florence, Italy, August 27-31, 2011.

[41] S. Ridella, S. Rovetta, R. Zunino, K-winner machines for pattern classification, IEEE Trans. Neural Netw. 12 (2001), 371–385.

[42] R. Rovetta, F. Masulli, Soft clustering: why and how to, in The 12th International Workshop on Fuzzy Logic and Applications (WILF 2018), 2018.

[43] R. Babuška, H.B. Verbruggen, An overview of fuzzy modeling for control, Control Eng. Pract. 4 (1996), 1593–1606.

[44] M. Miyamoto, M. Mukaidono, Fuzzy C-Means as a regularization and maximum entropy approach, in Proceedings of the Seventh IFSA World Congress, Prague, Czech Republic, 1997, pp. 86–91.

[45] J.C. Bezdek, Pattern recognition with fuzzy objective function algorithms, Kluwer Academic Publishers, Norwell, MA, USA, 1981.

[46] R. Krishnapuram, J.M. Keller, A possibilistic approach to clustering, IEEE Trans. Fuzzy Syst. 1 (1993), 98–110.

[47] R. Krishnapuram, J.M. Keller, The possibilistic C-Means algorithm: insights and recommendations, IEEE Trans. Fuzzy Syst. 4 (1996), 385–393.

[48] F. Masulli, S. Rovetta, Soft transition from probabilistic to possibilistic fuzzy clustering, IEEE Trans. Fuzzy Syst. 14 (2006), 516–527.

[49] K. Rose, E. Gurewitz, G. Fox, A deterministic annealing approach to clustering, Pattern Recognit. Lett. 11 (1990), 589–594.

[50] M.-J. Caraty, C. Montacié, Detecting speech interruptions for automatic conflict detection, in: F. D'Errico, I. Poggi, A. Vinciarelli, L. Vincze (Eds.), Conflict and Multimodal Communication, Springer, Cham, Switzerland, 2015, pp. 377–401.

[51] F. Eyben, M. Wöllmer, B. Schuller, openSMILE: the Munich versatile and fast open-source audio feature extractor, in Proceedings of the 18th ACM International Conference on Multimedia, ACM, Firenze, Italy, 2010.

[52] L. Guo, L. Wang, J. Dang, L. Zhang, H. Guan, X. Li, Speech emotion recognition by combining amplitude and phase information using convolutional neural network, in INTERSPEECH, 2018, pp. 1611–1615.

[53] Z.-W. Huang, W.-T. Xue, Q.-R. Mao, Speech emotion recognition with unsupervised feature learning, Front. Inf. Technol. Electron. Eng. 16 (2015), 358–366.

Original Article

S. Rovetta, Z. Mnasri, F. Masulli, A. Cabri, "Emotion Recognition from Speech: An Unsupervised Learning Approach", International Journal of Computational Intelligence Systems, 2020, DOI: 10.2991/ijcis.d.201019.002.


Read the full English text at:

https://www.atlantis-press.com/journals/ijcis/125945494/

About the Journal

 


Impact Factor: 1.838, CiteScore: 3.59

The International Journal of Computational Intelligence Systems (IJCIS) is the official journal of the European Society for Fuzzy Logic and Technology (EUSFLAT). It publishes original research on all aspects of applied computational intelligence, in particular research papers and reviews on techniques and methods that demonstrably use computational intelligence theory. The journal is co-edited by Professor Luis Martínez Lopez of the University of Jaén, Spain, and Professor Jie Lu of the University of Technology Sydney, Australia. It is currently indexed in DOAJ, Science Citation Index Expanded (SCIE), Ei Compendex and Scopus, among other databases.




Original blog post: https://blog.sciencenet.cn/blog-3453320-1267756.html
