大工至善|大学至真 http://blog.sciencenet.cn/u/lcj2212916

[Repost] [Information Technology] [2014.12] Diplophonia: Definition, Models, and Detection

288 reads | 2019-5-17 18:56 | Category: Research notes | Source: Repost


This post presents the doctoral thesis (154 pages) of Dipl.-Ing. Philipp Aichinger, Graz University of Technology, Austria.

 

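Voice disorders need to be better understood because they can reduce job opportunities and lead to social isolation. Tackling these problems requires correct treatment indication and measurement of treatment effects, both of which must rely on robust outcome measures for clinical intervention studies. Diplophonia is a severe and often misunderstood sign of voice disorders. Depending on the underlying etiology, diplophonic patients typically receive treatment such as logopedic therapy or phonosurgery. In current clinical practice, diplophonia is determined auditively by the physician, which is problematic from the viewpoints of evidence-based medicine and scientific methodology. The aim of this thesis is to work towards objective (i.e., automatic) detection of diplophonia.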

 

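A database of 40 euphonic, 40 diplophonic, and 40 dysphonic subjects was acquired. The collected material consists of laryngeal high-speed videos and simultaneous high-quality audio recordings. All material was annotated for data quality, and a non-destructive data pre-selection was applied. Diplophonic vocal fold vibration patterns (i.e., glottal diplophonia) are identified, and procedures for their automated detection from laryngeal high-speed videos are proposed. Frequency Image Bimodality is based on frequency analysis of pixel intensity time series. It is obtained fully automatically and yields classification accuracies of 78% for the euphonic negative group and 75% for the dysphonic negative group. Frequency Plot Bimodality is based on frequency analysis of glottal edge trajectories. It processes spatially segmented videos, which are obtained with manual intervention. Frequency Plot Bimodality achieves slightly higher classification accuracies: 82.9% for the euphonic negative group and 77.5% for the dysphonic negative group.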
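
As a rough illustration of what frequency analysis of a pixel intensity time series can look like, the sketch below computes a naive DFT of one pixel's intensity over time and lists the dominant spectral peaks. It is not the thesis's Frequency Image Bimodality measure: the frame rate, the peak threshold, the synthetic signals, and all function names are assumptions made for this example.

```python
import cmath
import math


def magnitude_spectrum(samples):
    """Naive DFT magnitudes for bins 0..N/2-1, computed after removing the mean."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [s - mean for s in samples]
    mags = []
    for k in range(n // 2):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(centered))
        mags.append(abs(acc))
    return mags


def dominant_peaks(mags, rel_threshold=0.3):
    """Indices of local maxima that reach rel_threshold times the largest magnitude."""
    top = max(mags)
    peaks = []
    for k in range(1, len(mags) - 1):
        is_local_max = mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]
        if is_local_max and mags[k] >= rel_threshold * top:
            peaks.append(k)
    return peaks


def synth_pixel(fs, n, freqs, amps):
    """Synthetic pixel intensity: a sum of sinusoids (stand-in for real video data)."""
    return [sum(a * math.sin(2 * math.pi * f * t / fs) for f, a in zip(freqs, amps))
            for t in range(n)]


def peak_freqs(samples, fs):
    """Frequencies in Hz of the dominant spectral peaks of one pixel's time series."""
    n = len(samples)
    return [k * fs / n for k in dominant_peaks(magnitude_spectrum(samples))]


fs = 4000  # hypothetical frame rate in frames per second
n = 400    # 0.1 s of video, so DFT bins are 10 Hz apart

one_freq_pixel = synth_pixel(fs, n, [200.0], [1.0])
two_freq_pixel = synth_pixel(fs, n, [200.0, 270.0], [1.0, 0.8])

print(peak_freqs(one_freq_pixel, fs))
print(peak_freqs(two_freq_pixel, fs))
```

Under these assumptions a pixel driven by a single vibration frequency shows one dominant peak, while a pixel driven by two frequencies shows two. A real detector would have to aggregate such evidence over all pixels and cope with noise and harmonics.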

 

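A two-oscillator waveform model for analyzing acoustic and glottal area diplophonic waveforms is proposed and evaluated. The model is used to build a detection algorithm for secondary oscillators in the waveform and to define the physiologically interpretable "Diplophonia Diagram". The Diplophonia Diagram yields a classification accuracy of 87.2% when distinguishing diplophonia from severely dysphonic voices. In contrast, conventional hoarseness features perform poorly on this task. Latent class analysis is used to evaluate the ground truth from a probabilistic point of view. The expert annotations used as ground truth achieve very high sensitivity (96.5%) and perfect specificity (100%). The Diplophonia Diagram is the best available automatic method for detecting diplophonic phonation intervals from speech.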
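
For readers less familiar with the diagnostic measures quoted above, the following minimal sketch shows how sensitivity, specificity, and accuracy are computed from confusion-matrix counts. The function name and the counts are hypothetical and chosen only for illustration; they are not the thesis data.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts.

    tp: positive cases detected as positive
    fn: positive cases missed
    tn: negative cases detected as negative
    fp: negative cases wrongly flagged as positive
    """
    sensitivity = tp / (tp + fn)                 # share of true positives found
    specificity = tn / (tn + fp)                 # share of true negatives found
    accuracy = (tp + tn) / (tp + fn + tn + fp)   # share of all cases classified correctly
    return sensitivity, specificity, accuracy


# Hypothetical counts for illustration only, NOT taken from the thesis.
sens, spec, acc = diagnostic_metrics(tp=8, fn=2, tn=9, fp=1)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} accuracy={acc:.3f}")
```

A specificity of 100% corresponds to `fp == 0`, that is, no negative case is flagged as positive.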

 

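The Diplophonia Diagram is based on model structure optimization, audio waveform modeling, and analysis-by-synthesis, which enables a more suitable description of diplophonic signals than conventional hoarseness features. Analysis-by-synthesis and waveform modeling had already been used in voice research, but the systematic investigation of model structure optimization with respect to perceived voice quality is novel. For diplophonia, the switch between one and two oscillators is crucial. The optimal model structure is a qualitative outcome that can be interpreted physiologically, and one may conjecture that model structure optimization is also useful for describing voice phenomena other than diplophonia. The resulting descriptors might be more readily accepted by clinicians than conventional ones.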
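
The idea of switching between one and two oscillators can be sketched as a small analysis-by-synthesis loop: fit one oscillator, fit a second one to the residual, and keep the second only if it explains enough additional signal energy. The sketch below is a heavily simplified stand-in, not the thesis's model. It assumes purely sinusoidal oscillators, a fixed grid of candidate frequencies that fit the analysis window exactly, and an arbitrary energy threshold `min_gain`; all names are hypothetical.

```python
import math


def fit_sinusoid(signal, freq, fs):
    """Least-squares fit of a sin/cos pair at one frequency; returns the fitted component.

    Assumes the window holds a whole number of periods of `freq`, so that the
    sine and cosine are orthogonal and can be fitted independently.
    """
    n = len(signal)
    s = [math.sin(2 * math.pi * freq * t / fs) for t in range(n)]
    c = [math.cos(2 * math.pi * freq * t / fs) for t in range(n)]
    a = sum(x * y for x, y in zip(signal, s)) / sum(y * y for y in s)
    b = sum(x * y for x, y in zip(signal, c)) / sum(y * y for y in c)
    return [a * si + b * ci for si, ci in zip(s, c)]


def energy(x):
    return sum(v * v for v in x)


def best_component(signal, candidates, fs):
    """Return (residual energy, frequency, residual) for the best-fitting candidate."""
    best = None
    for f in candidates:
        comp = fit_sinusoid(signal, f, fs)
        resid = [x - y for x, y in zip(signal, comp)]
        e = energy(resid)
        if best is None or e < best[0]:
            best = (e, f, resid)
    return best


def choose_oscillator_count(signal, candidates, fs, min_gain=0.1):
    """Return (1 or 2, fitted frequencies) by comparing one- and two-oscillator fits."""
    total = energy(signal)
    e1, f1, r1 = best_component(signal, candidates, fs)
    e2, f2, _ = best_component(r1, [f for f in candidates if f != f1], fs)
    # Keep the second oscillator only if it explains at least min_gain of the total energy.
    if (e1 - e2) / total >= min_gain:
        return 2, sorted([f1, f2])
    return 1, [f1]


def synth(fs, n, parts):
    """Sum of sinusoids given as (frequency in Hz, amplitude) pairs."""
    return [sum(a * math.sin(2 * math.pi * f * t / fs) for f, a in parts)
            for t in range(n)]


fs, n = 4000, 400                       # hypothetical sampling rate and window length
candidates = list(range(100, 400, 10))  # candidate frequencies in Hz, on the 10 Hz bin grid

one = synth(fs, n, [(200, 1.0), (330, 0.1)])  # one strong oscillator plus a weak disturbance
two = synth(fs, n, [(200, 1.0), (270, 0.8)])  # two strong oscillators

print(choose_oscillator_count(one, candidates, fs))
print(choose_oscillator_count(two, candidates, fs))
```

In this toy setting the weak 330 Hz disturbance stays below the threshold, so the first signal is described with one oscillator, whereas the second signal needs two. The chosen structure (one versus two oscillators) is the qualitative outcome; how such a threshold should be set for real voice signals is outside the scope of this sketch.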

 

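Useful definitions of diplophonia focus on the levels of perception, acoustics, and glottal vibration. Because of its subjectivity, the perceptual definition should not be used on its own in clinical voice assessment. The glottal vibration level connects with distal causes, which is of high clinical interest but difficult to assess. The definition at the acoustic level via two-oscillator waveform models is favored and is used for in vivo testing. Updating the definitions and terminology of voice phenomena with respect to the different levels of description is suggested.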

 

Original English abstract:

Voice disorders need to be better understood because they may lead to reduced job chances and social isolation. Correct treatment indication and treatment effect measurements are needed to tackle these problems. They must rely on robust outcome measures for clinical intervention studies. Diplophonia is a severe and often misunderstood sign of voice disorders. Depending on its underlying etiology, diplophonic patients typically receive treatment such as logopedic therapy or phonosurgery. In the current clinical practice diplophonia is determined auditively by the medical doctor, which is problematic from the viewpoints of evidence-based medicine and scientific methodology. The aim of this thesis is to work towards objective (i.e., automatic) detection of diplophonia.

A database of 40 euphonic, 40 diplophonic and 40 dysphonic subjects has been acquired. The collected material consists of laryngeal high-speed videos and simultaneous high-quality audio recordings. All material has been annotated for data quality and a non-destructive data pre-selection is applied. Diplophonic vocal fold vibration patterns (i.e., glottal diplophonia) are identified and procedures for automated detection from laryngeal high-speed videos are proposed. Frequency Image Bimodality is based on frequency analysis of pixel intensity time series. It is obtained fully automatically and yields classification accuracies of 78 % for the euphonic negative group and 75 % for the dysphonic negative group. Frequency Plot Bimodality is based on frequency analysis of glottal edge trajectories. It processes spatially segmented videos, which are obtained via manual intervention. Frequency Plot Bimodality obtains slightly higher classification accuracies of 82.9 % for the euphonic negative group and 77.5 % for the dysphonic negative group.

A two-oscillator waveform model for analyzing acoustic and glottal area diplophonic waveforms is proposed and evaluated. The model is used to build a detection algorithm for secondary oscillators in the waveform and to define the physiologically interpretable "Diplophonia Diagram". The Diplophonia Diagram yields a classification accuracy of 87.2 % when distinguishing diplophonia from severely dysphonic voices. In contrast, the performance of conventional hoarseness features is low on this task. Latent class analysis is used to evaluate the used ground truth from a probabilistic point of view. The used expert annotations achieve very high sensitivity (96.5 %) and perfect specificity (100 %). The Diplophonia Diagram is the best available automatic method for detecting diplophonic phonation intervals from speech.

The Diplophonia Diagram is based on model structure optimization, audio waveform modeling and analysis-by-synthesis, which enables a more suitable description of diplophonic signals than conventional hoarseness features. Analysis-by-synthesis and waveform modeling had already been carried out in voice research, but systematic investigation of model structure optimization with respect to perceived voice quality is novel. For diplophonia, the switch between one and two oscillators is crucial. Optimal model structure is a qualitative outcome that may be interpreted physiologically and one may conjecture that model structure optimization is also useful for describing other voice phenomena than diplophonia. The obtained descriptors might be more easily accepted by clinicians than the conventional ones.

Useful definitions of diplophonia focus on the levels of perception, acoustics and glottal vibration. Due to its subjectivity, it is suggested to avoid the sole use of the perceptual definition in clinical voice assessment. The glottal vibration level connects with distal causes, which is of high clinical interest but difficult to assess. The definition at the acoustic level via two-oscillator waveform models is favored and used for in vivo testing. Updating definitions and terminology of voice phenomena with respect to different levels of description is suggested.

 

 

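Introduction

A database of laryngeal high-speed videos with simultaneous high-quality audio recordings

Spatial analysis and models of vocal fold vibration

Two-oscillator waveform models for analyzing diplophonia

Diagnostic tests and their interpretation

Discussion, conclusions, and outlook for future research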


Download the original English text:

http://page2.dfpan.com/fs/fl7c3jd26211f2e9160/ 





http://blog.sciencenet.cn/blog-69686-1179641.html
