Atlantis Press China's blog: a digital publishing platform and open-access pioneer


Recommended Reading | Emotion Recognition from Speech: An Unsupervised Learning Approach

6117 views | 2021-1-18 11:25 | Category: Recommended Reading | System category: Paper Exchange



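The rapid development of information technology has brought humans and computers ever closer, and emotion recognition, as one of the key technologies of intelligent human-computer interaction, plays an important role. Modeling for expressive speech synthesis and recognition is usually done with supervised classifiers. This means that the data must be labeled and preprocessed in advance, at a cost that grows with the size of the database and with a risk of labeling errors. Unsupervised learning therefore appears to be a suitable alternative: it avoids the cost of annotation and reduces the risk of overfitting caused by a lack of data. Researchers from the University of Genoa (Italy) and the University of Tunis El Manar (Tunisia) published an article titled "Emotion Recognition from Speech: An Unsupervised Learning Approach" in the International Journal of Computational Intelligence Systems (IJCIS), investigating the recognition of emotional features in speech with unsupervised learning.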




Figure 1. Valence/arousal model of the emotion classes in the emotional database (EMO-DB) [20]. Figure taken from [8].








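Figure 4. Distribution of individual and grouped emotion classes, visualized by projecting the features into three dimensions with principal component analysis (PCA): (a) for individual emotions, the 3-D projection shows that standard features are not sufficiently discriminative across utterances; (b) grouping the emotions reduces the scatter of the features.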


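Figure 5. Experimental procedure: preprocessing comprises handcrafted feature computation, feature embedding, and feature selection; clustering uses crisp (K-means) and/or fuzzy methods, and cluster labeling is used to recover the emotion classes within the clusters. The evaluation produces the confusion matrix and the sum of the membership matrix.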
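The clustering and cluster-labeling steps of this pipeline can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the post does not say which rule assigns an emotion to a cluster, so a simple majority vote is assumed here, and the function names (`kmeans`, `label_clusters`), the toy 2-D points, and the fixed initial centroids are all hypothetical.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centroids, n_iter=20):
    """Plain (crisp) K-means starting from the given initial centroids."""
    centroids = [list(c) for c in centroids]
    assignment = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


def label_clusters(assignment, true_labels):
    """Assumed rule: a cluster takes the most frequent label of its members."""
    votes = {}
    for cluster, label in zip(assignment, true_labels):
        votes.setdefault(cluster, Counter())[label] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in votes.items()}


# Toy feature vectors (hypothetical), one per utterance.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
true_labels = ["sadness", "sadness", "sadness", "anger", "anger", "sadness"]

assignment, centroids = kmeans(points, centroids=[points[0], points[3]])
cluster_to_emotion = label_clusters(assignment, true_labels)
predicted = [cluster_to_emotion[a] for a in assignment]
print(cluster_to_emotion)
```

The labels are used only after clustering, to name the clusters. Because the mapping is many-to-one, the same rule also works when there are more clusters than emotion classes.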


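Figure 6. Clustering results for 7 classes with K-means and graded possibilistic c-means (GPCM), using 21 clusters and 96 features selected by analysis of variance (ANOVA), shown as a two-dimensional principal component analysis (PCA) projection: although the 2-D projection does not show clearly disjoint clusters, both K-means and GPCM appear able to separate some classes, for example class 1 and class 2.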
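ANOVA-based feature selection, as mentioned in this caption, can be illustrated by ranking each feature by its one-way ANOVA F statistic and keeping the top-ranked ones. The sketch below is an assumption-laden example, not the authors' procedure: the post does not say how the groups for the ANOVA are defined, so emotion labels are assumed as the grouping variable, and the names `anova_f` and `select_top_features` and the toy data are hypothetical.

```python
def anova_f(values, labels):
    """One-way ANOVA F statistic of one feature across the label groups."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Between-group sum of squares: spread of the group means.
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                  for g in groups.values())
    # Within-group sum of squares: spread inside each group.
    within = sum((v - sum(g) / len(g)) ** 2
                 for g in groups.values() for v in g)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return (between / (k - 1)) / (within / (n - k))


def select_top_features(rows, labels, n_keep):
    """Return sorted indices of the n_keep highest-scoring features, plus all scores."""
    columns = list(zip(*rows))
    scores = [anova_f(list(col), labels) for col in columns]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep]), scores


# Toy data (hypothetical): 6 utterances, 3 features each.
rows = [
    (1.0, 0.5, 2.0),
    (1.2, 0.1, 2.5),
    (0.9, 0.9, 1.8),
    (3.0, 0.4, 2.2),
    (3.1, 0.8, 2.9),
    (2.9, 0.2, 2.4),
]
labels = ["neutral"] * 3 + ["anger"] * 3

selected, scores = select_top_features(rows, labels, n_keep=2)
print(selected)
```

A feature whose group means differ strongly relative to its within-group spread gets a large F value. In the toy data the first feature separates the two groups best and the second barely at all.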


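Figure 7. Confusion-matrix scores for (a) K-means and (b) graded possibilistic c-means (GPCM) after cluster labeling, using 192 features selected by analysis of variance (ANOVA).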
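For readers unfamiliar with this kind of evaluation, a confusion matrix after cluster labeling can be computed as in the following minimal sketch. It assumes that every utterance has already received a predicted emotion from its cluster. The function names and the toy labels are hypothetical, and whether the scores in the figure are raw counts or row-normalized rates is not stated in this post, so both are shown.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true classes, columns are predicted classes, entries are counts."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix


def row_normalize(matrix):
    """Turn counts into per-class rates; an all-zero row stays all zero."""
    rates = []
    for row in matrix:
        total = sum(row)
        rates.append([x / total if total else 0.0 for x in row])
    return rates


# Toy labels (hypothetical): predictions come from cluster labeling.
classes = ["anger", "neutral", "sadness"]
true_labels = ["anger", "anger", "neutral", "neutral", "sadness", "sadness"]
predicted = ["anger", "anger", "neutral", "sadness", "sadness", "sadness"]

counts = confusion_matrix(true_labels, predicted, classes)
rates = row_normalize(counts)
# Overall accuracy is the diagonal sum divided by the number of utterances.
accuracy = sum(counts[i][i] for i in range(len(classes))) / len(true_labels)
print(counts, accuracy)
```

Off-diagonal entries show which emotions are mistaken for each other. Here one neutral utterance falls into a cluster labeled as sadness.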




References

[1] B. Schuller, B. Vlasenko, F. Eyben, G. Rigoll, A. Wendemuth, Acoustic emotion recognition: a benchmark comparison of performances, in IEEE Workshop on Automatic Speech Recognition Understanding (ASRU 2009), IEEE, Merano, Italy, 2009, pp. 552–557.

[2] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, et al., Iemocap: interactive emotional dyadic motion capture database, Lang. Res. Eval. 42 (2008), 335.

[3] H.M. Fayek, M. Lech, L. Cavedon, Evaluating deep learning architectures for speech emotion recognition, Neural Netw. 92 (2017), 60–68.

[4] E. Avots, T. Sapiński, M. Bachmann, D. Kamińska, Audiovisual emotion recognition in wild, Mach. Vis. Appl. 30 (2019), 975–985.

[5] M. Belkin, S. Ma, S. Mandal, To understand deep learning we need to understand kernel learning, arXiv preprint, arXiv:1802.01396, 2018.

[6] M. Anthony, P.L. Bartlett, Neural Network Learning: Theoretical Foundations, Cambridge University Press, Cambridge, U.K., 1999, 389 pp., ISBN 0-521-57353-X.

[7] S. Rovetta, Z. Mnasri, F. Masulli, A. Cabri, Emotion recognition from speech signal using fuzzy clustering, in 2019 Conference of the International Fuzzy Systems Association and the European Society for Fuzzy Logic and Technology (EUSFLAT 2019), Atlantis Press, Prague, September 9-13 2019.

[8] G. Ulutagay, E. Nasibov, Fuzzy and crisp clustering methods based on the neighborhood concept: a comprehensive review, J. Intell. Fuzzy Syst. 23 (2012), 271–281.

[9] P. Ekman, An argument for basic emotions, Cogn. Emot. 6 (1992), 169–200.

[10] J.A. Russell, A circumplex model of affect, J. Pers. Soc. Psychol. 39 (1980), 1161.

[11] R. Plutchik, The nature of emotions: human emotions have deep evolutionary roots, a fact that may explain their complexity and provide tools for clinical practice, Am. Sci. 89 (2001), 344–350.

[12] T.L. Nwe, S.W. Foo, L.C. De Silva, Speech emotion recognition using hidden markov models, Speech Commun. 41 (2003), 603–623.

[13] V. Hozjan, Z. Kačič, Context-independent multilingual emotion recognition from speech signals, Int. J. Speech Technol. 6 (2003), 311–320.

[14] J. Nicholson, K. Takahashi, R. Nakatsu, Emotion recognition in speech using neural networks, Neural Comput. Appl. 9 (2000), 290–296.

[15] B. Schuller, G. Rigoll, M. Lang, Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture, in Acoustics, Speech, and Signal Processing, Proceedings (ICASSP’04), IEEE, Montreal, Canada, 2004, vol. 1, pp. 1–577.

[16] M.E. Ayadi, M.S. Kamel, F. Karray, Survey on speech emotion recognition: features, classification schemes, and databases, Pattern Recognit. 44 (2011), 572–587.

[17] J. Kim, R. Saurous, Emotion recognition from human speech using temporal information and deep learning, in Annual Conference of the International Speech Communication Association (Interspeech 2018), Hyderabad, India, 2018.

[18] J.H.L. Hansen, S.E. Bou-Ghazale, Getting started with susas: a speech under simulated and actual stress database, in Fifth European Conference on Speech Communication and Technology, Rhodes, Greece, 1997.

[19] M.B. Akçay, K. Oğuz, Speech emotion recognition: emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers, Speech Commun. 116 (2020), 56–76.

[20] F. Burkhardt, A. Paeschke, M. Rolfes, W.F. Sendlmeier, B. Weiss, A database of German emotional speech, in Ninth European Conference on Speech Communication and Technology, Lisbon, Portugal, 2005.

[21] B. Schuller, S. Steidl, A. Batliner, The interspeech 2009 emotion challenge, in Tenth Annual Conference of the International Speech Communication Association, Brighton, United Kingdom, September 6-10, 2009.

[22] B. Schuller, S. Steidl, A. Batliner, F. Burkhardt, L. Devillers, C. Müller, S. Narayanan, The interspeech 2010 paralinguistic challenge, in Proceeding INTERSPEECH 2010, Makuhari, Japan, 2010, pp. 2794–2797.

[23] F. Eyben, K.R. Scherer, B.W. Schuller, J. Sundberg, E. André, C. Busso, et al., The Geneva minimalistic acoustic parameter set (gemaps) for voice research and affective computing, IEEE Trans. Affect. Comput. 7 (2016), 190–202.

[24] D. Ververidis, C. Kotropoulos, Emotional speech recognition: resources, features, and methods, Speech Commun. 48 (2006), 1162–1181.

[25] C.M. Lee, S.S. Narayanan, R. Pieraccini, Classifying emotions in human-machine spoken dialogs, in Proceedings, IEEE International Conference on Multimedia and Expo, IEEE, Lausanne, Switzerland, 2002, vol. 1, pp. 737–740.

[26] D. Ververidis, C. Kotropoulos, I. Pitas, Automatic emotional speech classification, in 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, IEEE, Montreal, Canada, 2004, vol. 1, pp. 1–593.

[27] M. You, C. Chen, J. Bu, J. Liu, J. Tao, Emotion recognition from noisy speech, in 2006 IEEE International Conference on Multimedia and Expo, IEEE, Toronto, Canada, 2006, pp. 1653–1656.

[28] M. You, C. Chen, J. Bu, J. Liu, J. Tao, A hierarchical framework for speech emotion recognition, in 2006 IEEE International Symposium on Industrial Electronics, IEEE, Montreal, Canada, 2006, vol. 1, pp. 515–519.

[29] X. Mao, L. Chen, L. Fu, Multi-level speech emotion recognition based on HMM and ANN, in 2009 WRI World Congress on Computer Science and Information Engineering, IEEE, Los Angeles, CA, USA, 2009, vol. 7, pp. 225–229.

[30] L. Chen, X. Mao, Y. Xue, L.L. Cheng, Speech emotion recognition: features and classification models, Digital Signal Process. 22 (2012), 1154–1160.

[31] F. Eyben, S. Buchholz, N. Braunschweiler, J. Latorre, V. Wan, M.J.F. Gales, K. Knill, Unsupervised clustering of emotion and voice styles for expressive TTS, in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2012), IEEE, Kyoto, Japan, 2012, pp. 4009–4012.

[32] B. Schuller, S. Steidl, A. Batliner, A. Vinciarelli, K. Scherer, F. Ringeval, et al., The interspeech 2013 computational paralinguistics challenge: social signals, conflict, emotion, autism, in Proceedings INTERSPEECH 2013, 14th Annual Conference of the International Speech Communication Association, Lyon, France, 2013.

[33] N. Andrew, Sparse autoencoder, 2011. class/cs294a/sparseAutoencoder_2011new.pdf

[34] C. Song, F. Liu, Y. Huang, L. Wang, T. Tan, Auto-encoder based data clustering, in: J. Ruiz-Shulcloper, G. Sanniti di Baja (Eds.), Iberoamerican Congress on Pattern Recognition, Springer, Berlin, Heidelberg, Germany, 2013, pp. 117–124.

[35] J. Xie, R. Girshick, A. Farhadi, Unsupervised deep embedding for clustering analysis, in International Conference on Machine Learning, 2016, pp. 478–487.

[36] F. Tian, B. Gao, Q. Cui, E. Chen, T.-Y. Liu, Learning deep representations for graph clustering, in Twenty-Eighth AAAI Conference on Artificial Intelligence, July 27–31, Québec City, Québec, Canada, 2014.

[37] L. Salwinski, C.S. Miller, A.J. Smith, F.K. Pettit, J.U. Bowie, D. Eisenberg, The database of interacting proteins: 2004 update, Nucleic Acids Res. 32 (2004), D449–D451.

[38] C. Stark, B.-J. Breitkreutz, A. Chatr-Aryamontri, L. Boucher, R. Oughtred, M.S. Livstone, et al., The biogrid interaction database: 2011 update, Nucleic Acids Res. 39 (2010), D698–D704.

[39] A. Asuncion, D.H. Newman, UCI Machine Learning Repository, 2007. University of California, Irvine, School of Information and Computer Sciences.

[40] E. Székely, J.P. Cabral, P. Cahill, J. Carson-Berndsen, Clustering expressive speech styles in audiobooks using glottal source parameters, in 12th Annual Conference of the International Speech Communication Association, Florence, Italy, August 27-31, 2011.

[41] S. Ridella, S. Rovetta, R. Zunino, K-winner machines for pattern classification, IEEE Trans. Neural Netw. 12 (2001), 371–385.

[42] R. Rovetta, F. Masulli, Soft clustering: why and how to, in The 12th International Workshop on Fuzzy Logic and Applications (WILF 2018), 2018.

[43] R. Babuška, H.B. Verbruggen, An overview of fuzzy modeling for control, Control Eng. Pract. 4 (1996), 1593–1606.

[44] M. Miyamoto, M. Mukaidono, Fuzzy C-Means as a regularization and maximum entropy approach, in Proceedings of the Seventh IFSA World Congress, Prague, Czech Republic, 1997, pp. 86–91.

[45] J.C. Bezdek, Pattern recognition with fuzzy objective function algorithms, Kluwer Academic Publishers, Norwell, MA, USA, 1981.

[46] R. Krishnapuram, J.M. Keller, A possibilistic approach to clustering, IEEE Trans. Fuzzy Syst. 1 (1993), 98–110.

[47] R. Krishnapuram, J.M. Keller, The possibilistic C-Means algorithm: insights and recommendations, IEEE Trans. Fuzzy Syst. 4 (1996), 385–393.

[48] F. Masulli, S. Rovetta, Soft transition from probabilistic to possibilistic fuzzy clustering, IEEE Trans. Fuzzy Syst. 14 (2006), 516–527.

[49] K. Rose, E. Gurewitz, G. Fox, A deterministic annealing approach to clustering, Pattern Recognit. Lett. 11 (1990), 589–594.

[50] M.-J. Caraty, C. Montacié, Detecting speech interruptions for automatic conflict detection, in: F. D’Errico, I. Poggi, A. Vinciarelli, L. Vincze (Eds.), Conflict and Multimodal Communication, Springer, Cham, Switzerland, 2015, pp. 377–401.

[51] F. Eyben, M. Wöllmer, B. Schuller, Opensmile: the munich versatile and fast open-source audio feature extractor, in Proceedings of the 18th ACM international conference on Multimedia, ACM, Firenze, Italy, 2010.

[52] L. Guo, L. Wang, J. Dang, L. Zhang, H. Guan, X. Li, Speech emotion recognition by combining amplitude and phase information using convolutional neural network, in INTERSPEECH, 2018, pp. 1611–1615.

[53] Z.-W. Huang, W.-T. Xue, Q.-R. Mao, Speech emotion recognition with unsupervised feature learning, Front. Inf. Technol. Electron. Eng. 16 (2015), 358–366.


S. Rovetta, Z. Mnasri, F. Masulli, A. Cabri, "Emotion Recognition from Speech: An Unsupervised Learning Approach", International Journal of Computational Intelligence Systems, 2020, DOI: 10.2991/ijcis.d.201019.002.






Impact Factor: 1.838, CiteScore: 3.59

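The International Journal of Computational Intelligence Systems (IJCIS) is the official journal of the European Society for Fuzzy Logic and Technology (EUSFLAT). It publishes original research on all aspects of applied computational intelligence, in particular research papers and reviews that demonstrate the use of techniques and methods from computational intelligence theory. Its co-editors-in-chief are Prof. Luis Martínez Lopez of the University of Jaén, Spain, and Prof. Jie Lu of the University of Technology Sydney, Australia. The journal is currently indexed in DOAJ, Science Citation Index Expanded (SCIE), Ei Compendex, and Scopus.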
