GE Wanying, ZHANG Tianqi
Abstract: Monaural speech enhancement algorithms obtain enhanced speech by estimating and suppressing the noise components in noisy speech. However, noise estimation algorithms tend to over-estimate, so that some estimated noise energy values are larger than the actual values. Although the over-estimated values can be removed by compensation, the error introduced by the compensation also degrades the overall quality of the enhanced speech. To address this problem, a time-frequency mask estimation and optimization algorithm based on Computational Auditory Scene Analysis (CASA) was proposed. Firstly, the Decision Directed (DD) algorithm was used to estimate the a priori Signal-to-Noise Ratio (SNR) and to calculate the initial mask. Secondly, the Inter-Channel Correlation (ICC) coefficient between the noise and the noisy speech in each Gammatone filterbank channel was used to calculate the noise presence probability; combining this probability with the power spectrum of the noisy speech gave a new noise estimate in which the over-estimated components of the original estimate were reduced. Thirdly, the initial mask was processed iteratively by an optimization algorithm to reduce the error caused by noise over-estimation and to increase the target speech components in the mask; the iteration stopped when the stopping conditions were met, giving the new mask. Finally, the enhanced speech was synthesized with the new mask. Experimental results show that, under different background noises, the enhanced speech obtained with the new mask achieves higher Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility measure (STOI) values than that obtained with the mask before optimization, which indicates improved intelligibility and listening quality.
Key words: computational auditory scene analysis; speech enhancement; time-frequency mask; noise estimation; mask optimization; speech intelligibility
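0 Introduction
In recent years, by imitating the way the human ear processes sound signals, researchers have proposed Computational Auditory Scene Analysis (CASA), in which the Gammatone filterbank is an auditory model that simulates the human cochlea. Speech signals processed by the filterbank give better results than those processed by traditional methods. CASA-based speech enhancement algorithms usually construct a mask that separates target speech from background noise according to features such as the pitch period [4], and then obtain the enhanced speech signal. In monaural speech enhancement algorithms the noise energy has to be estimated; however, because noise is random, over-estimation occurs during estimation and reduces the overall quality of the enhanced speech [6-7]. Reference [6] exploits the nonlinear frequency characteristics of the Gammatone filterbank: it computes the cross-correlation coefficient between the noise and the noisy speech within each channel of the filterbank, reduces the over-estimated components of the estimated noise, and then obtains an estimate of the speech power spectrum iteratively with a convex optimization algorithm. However, after the speech power spectrum has been obtained, that algorithm still needs a further clustering step, and the enhanced speech is recovered with the mask computed from the clusters. Because of the limited accuracy of clustering, the recovered enhanced speech is usually deficient in listening quality and intelligibility.
To address the above problems, this paper proposes a time-frequency mask estimation and optimization algorithm that combines the Decision Directed (DD) algorithm [8] with the Inter-Channel Correlation (ICC) coefficient [6]. First, an initial mask estimate is obtained with the DD algorithm. Next, the cross-correlation coefficient between the noise and the noisy speech within each channel is computed to obtain the noise presence probability. Then, an objective function is determined according to the properties of the mask and, together with the results of the first two steps, an optimization algorithm is used to reduce the error in the initial mask. Finally, the new mask is used to remove the noise from the noisy speech, giving the enhanced speech.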
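The first step, the DD estimate of the a priori SNR followed by an initial mask, can be sketched as below. This is only an illustrative sketch of the classical decision-directed recursion, not the exact formulation used in this paper: the function name `dd_initial_mask`, the smoothing factor `alpha = 0.98`, the floor `xi_min`, and the use of the Wiener-type gain xi / (1 + xi) as the initial mask are all assumptions.

```python
def dd_initial_mask(noisy_power, noise_power, alpha=0.98, xi_min=1e-3):
    """Decision-directed a priori SNR and a Wiener-type initial mask.

    noisy_power, noise_power: lists of frames; each frame is a list of
    per-channel energies. All names and defaults here are assumptions.
    """
    n_channels = len(noisy_power[0])
    prev_speech = [0.0] * n_channels      # speech energy estimated in the previous frame
    prev_noise = list(noise_power[0])     # noise energy of the previous frame
    xi_all, mask_all = [], []
    for y_frame, n_frame in zip(noisy_power, noise_power):
        xi_frame, mask_frame = [], []
        for c in range(n_channels):
            noise_now = max(n_frame[c], 1e-12)
            gamma = y_frame[c] / noise_now                   # a posteriori SNR
            xi = (alpha * prev_speech[c] / max(prev_noise[c], 1e-12)
                  + (1.0 - alpha) * max(gamma - 1.0, 0.0))   # DD recursion
            xi = max(xi, xi_min)                             # floor the a priori SNR
            gain = xi / (1.0 + xi)                           # assumed Wiener-type mask
            xi_frame.append(xi)
            mask_frame.append(gain)
            # The gain acts on amplitude, so the speech energy is gain^2 * noisy energy.
            prev_speech[c] = gain * gain * y_frame[c]
        prev_noise = list(n_frame)
        xi_all.append(xi_frame)
        mask_all.append(mask_frame)
    return xi_all, mask_all
```

With a noise-only input the a priori SNR stays at the floor and the mask stays close to 0; with a strong signal the mask rises toward 1 over a few frames, which is the smoothing behaviour the DD recursion is meant to provide.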
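1 Principle of Speech Enhancement
After being filtered by the Gammatone filterbank, the acoustic signal is divided into 64 channels of different bandwidths; the center frequency and bandwidth of each channel are determined by the Equivalent Rectangular Bandwidth (ERB) method [9]. The signal in each channel is windowed and divided into frames to obtain a sequence of time-frequency (T-F) units, and computing the energy of each T-F unit gives the power spectrum of the signal [4]. Assuming that the noise and the target speech are mutually independent, the energy of the filtered noisy signal in the T-F unit at time frame t and channel center frequency f is expressed as the sum of the target speech energy and the noise energy in that unit.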
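The channel layout and the T-F unit energies described above can be illustrated with a short sketch. It is not this paper's implementation: the ERB formulas are the commonly used Glasberg and Moore ones, while the 80 Hz to 5000 Hz range, the 20 ms Hamming window with a 10 ms hop at 16 kHz, and all function names are assumptions. The Gammatone filtering itself is omitted; `channel_signal` stands for the output of one filter.

```python
import math

def erb_bandwidth(fc_hz):
    """Equivalent rectangular bandwidth in Hz at center frequency fc_hz."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)

def hz_to_erb_rate(f_hz):
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)

def erb_rate_to_hz(erb_rate):
    return (10.0 ** (erb_rate / 21.4) - 1.0) * 1000.0 / 4.37

def erb_center_frequencies(n_channels=64, f_low=80.0, f_high=5000.0):
    """Center frequencies equally spaced on the ERB-rate scale (range is assumed)."""
    lo, hi = hz_to_erb_rate(f_low), hz_to_erb_rate(f_high)
    step = (hi - lo) / (n_channels - 1)
    return [erb_rate_to_hz(lo + i * step) for i in range(n_channels)]

def tf_unit_energies(channel_signal, frame_len=320, hop=160):
    """Energy of each Hamming-windowed frame of one filterbank channel."""
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    energies = []
    for start in range(0, len(channel_signal) - frame_len + 1, hop):
        frame = channel_signal[start:start + frame_len]
        energies.append(sum((w * x) ** 2 for w, x in zip(window, frame)))
    return energies
```

Stacking the output of `tf_unit_energies` for all 64 channels gives a frames-by-channels energy array, which is the representation the masks operate on. Low channels are narrow and high channels are wide, which is the nonlinear frequency resolution the Gammatone filterbank borrows from the cochlea.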
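CASA-based speech enhancement uses a mask together with the noisy speech to synthesize the enhanced speech in the time domain [5-6,10]. The Ideal Binary Mask (IBM) is a commonly used mask: its value is 1 when speech energy dominates the current T-F unit and 0 otherwise. The enhanced speech obtained with the IBM retains the parts dominated by the target speech and removes the rest. Another mask is the Ideal Ratio Mask (IRM), whose values lie between 0 and 1 and are larger in speech-dominated units than in noise-dominated units.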
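The two ideal masks can be computed directly when the speech and noise energies of each T-F unit are known, as in the sketch below. The 0 dB local criterion `lc_db` for the IBM and the energy-ratio form S / (S + N) for the IRM are common choices but are assumptions here, as is the function name `ideal_masks`.

```python
def ideal_masks(speech_energy, noise_energy, lc_db=0.0):
    """Return (IBM, IRM) for energies given as lists of frames of channels.

    IBM: 1 where the local SNR exceeds lc_db, else 0 (threshold is assumed).
    IRM: S / (S + N), one common definition (assumed, not taken from the paper).
    """
    threshold = 10.0 ** (lc_db / 10.0)
    ibm, irm = [], []
    for s_frame, n_frame in zip(speech_energy, noise_energy):
        ibm_frame, irm_frame = [], []
        for s, n in zip(s_frame, n_frame):
            ibm_frame.append(1 if s > threshold * n else 0)
            total = s + n
            irm_frame.append(s / total if total > 0.0 else 0.0)
        ibm.append(ibm_frame)
        irm.append(irm_frame)
    return ibm, irm
```

The IBM makes a hard decision per unit, whereas the IRM keeps a fraction of every unit, so a unit with equal speech and noise energy is discarded by the IBM at 0 dB but retained with weight 0.5 by the IRM.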
2 Mask Estimation and Optimization
2.1 Overall Framework of the Algorithm
2.2 Mask Estimation
2.3.3 Mask Optimization
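Because the speech energy takes values in (0, +∞) and differs greatly between T-F units, computing Ŝ(t, f) at every iteration is computationally very expensive. In addition, to avoid the influence of clustering accuracy and of binary masking on the algorithm, this paper uses ratio-mask values in place of the energy values in Eq. (14) as the optimization target.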
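3 Experiments and Result Analysis
3.1 Experimental Parameters and Evaluation Metrics
The simulation experiments use speech signals from the TIMIT database [14]. The signals are sampled at 16 kHz with 16-bit quantization and are about 2 s long. The noises are taken from the NOISEX-92 database [15]: Babble noise, Engine noise and White noise. The proposed algorithm is tested at input SNRs from -5 dB to 5 dB in steps of 1 dB.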
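Test signals at a given input SNR are usually produced by scaling the noise before adding it to the clean speech. The sketch below shows one way to do this over the -5 dB to 5 dB grid; the function name `mix_at_snr` and the choice of measuring power over the whole utterance are assumptions, since the mixing procedure is not detailed here.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that speech power / noise power equals snr_db, then add.

    Power is averaged over the whole (truncated) signal, which is an assumption.
    Returns the noisy mixture and the gain applied to the noise.
    """
    n = min(len(speech), len(noise))
    speech, noise = speech[:n], noise[:n]
    p_speech = sum(x * x for x in speech) / n
    p_noise = sum(x * x for x in noise) / n
    if p_speech <= 0.0 or p_noise <= 0.0:
        raise ValueError("speech and noise must both have non-zero power")
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    noisy = [s + gain * v for s, v in zip(speech, noise)]
    return noisy, gain

# Input SNR grid used in the experiments: -5 dB to 5 dB in 1 dB steps.
snr_grid_db = list(range(-5, 6))
```

Calling `mix_at_snr(speech, noise, snr)` for each `snr` in `snr_grid_db` and each of the three noise types yields the 33 test conditions per utterance.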
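This paper uses the algorithm of Reference [16] to estimate the noise signal in the time domain, and takes that algorithm and the algorithm of Reference [6] as the comparison algorithms. Besides the segmental Signal-to-Noise Ratio (segSNR), the evaluation metrics are the Perceptual Evaluation of Speech Quality (PESQ) [17] and the Short-Time Objective Intelligibility measure (STOI) [18]. The segSNR is the average of the SNRs computed for each frame of the signal; a higher value indicates better noise suppression. PESQ reflects the perceived listening quality of the enhanced speech; a higher score indicates better listening quality. STOI reflects the degree of distortion of the enhanced speech; a larger value indicates less distortion introduced by the algorithm and higher speech intelligibility.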
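The segSNR described above, a per-frame SNR followed by an average, can be sketched as follows. The 20 ms non-overlapping frames (320 samples at 16 kHz), the clamping of each frame SNR to the range -10 dB to 35 dB, and the skipping of silent frames are widely used conventions but are assumptions here, not settings reported in this paper.

```python
import math

def segmental_snr(clean, enhanced, frame_len=320, lo_db=-10.0, hi_db=35.0):
    """Mean of per-frame SNRs in dB between clean and enhanced speech.

    Frame length and the clamping range are assumed conventions.
    """
    n = min(len(clean), len(enhanced))
    frame_snrs = []
    for start in range(0, n - frame_len + 1, frame_len):
        c = clean[start:start + frame_len]
        e = enhanced[start:start + frame_len]
        signal = sum(x * x for x in c)
        if signal == 0.0:
            continue                                   # skip silent frames
        error = sum((x - y) ** 2 for x, y in zip(c, e))
        snr = 10.0 * math.log10(signal / max(error, 1e-12))
        frame_snrs.append(min(max(snr, lo_db), hi_db))  # clamp outliers
    if not frame_snrs:
        raise ValueError("no non-silent frame of the requested length")
    return sum(frame_snrs) / len(frame_snrs)
```

Clamping keeps a few near-perfect or badly corrupted frames from dominating the average, which is why it is commonly applied before the mean is taken.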
3.2 Results and Analysis
4 Conclusion
This work is partially supported by the National Natural Science Foundation of China (61671095, 61702065, 61701067, 61771085), the Project of Key Laboratory of Signal and Information Processing of Chongqing (CSTC2009CA2003), the Chongqing Graduate Research and Innovation Project (CYS17219), and the Research Project of Chongqing Educational Commission (KJ1600427, KJ1600429).
GE Wanying, born in 1994, M. S. candidate. His research interests include signal processing and speech enhancement.
ZHANG Tianqi, born in 1971, Ph. D., professor. Her research interests include spread spectrum communications, blind signal processing, and speech signal processing.