Monaural speech enhancement algorithm based on mask estimation and optimization

GE Wanying, ZHANG Tianqi

Journal of Computer Applications (计算机应用), 2019, No. 10. Posted 2019-11-15 04:49.

Abstract: Monaural speech enhancement algorithms obtain enhanced speech by estimating and suppressing the noise components in noisy speech. However, noise estimation algorithms tend to over-estimate, so that part of the estimated noise energy exceeds the actual value. Although these over-estimates can be removed by compensation, the error this introduces also degrades the overall quality of the enhanced speech. To address this problem, a time-frequency mask estimation and optimization algorithm based on Computational Auditory Scene Analysis (CASA) is proposed. Firstly, the Decision Directed (DD) algorithm is used to estimate the a priori Signal-to-Noise Ratio (SNR) and to calculate the initial mask. Secondly, the Inter-Channel Correlation (ICC) factor between the noise and the noisy speech in each Gammatone filterbank channel is used to calculate the noise presence probability, and a new noise estimate is obtained by combining this probability with the power spectrum of the noisy speech, which reduces the over-estimated components of the original noise estimate. Thirdly, the initial mask is processed iteratively by an optimization algorithm to reduce the error caused by noise over-estimation and to increase the target speech components in the mask; the iteration stops once the stopping conditions are met, giving the new mask. Finally, the enhanced speech is synthesized with the new mask. Experimental results show that, under different background noises, the new mask gives the enhanced speech higher Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI) values than the mask before optimization, improving both the listening quality and the intelligibility of the speech.

Key words: computational auditory scene analysis; speech enhancement; time-frequency mask; noise estimation; mask optimization; speech intelligibility

CLC number: TN912.35

Document code: A

0 Introduction

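As a front-end processing technique, speech enhancement aims to extract the target speech from speech corrupted by noise. According to the number of receiving microphones, speech enhancement methods are divided into single-channel and multi-channel methods. Compared with multi-channel methods, single-channel methods are cheaper and easier to implement, and they are widely used in communication, speech recognition and other fields. Traditional single-channel speech enhancement methods include spectral subtraction [1], Wiener filtering [2] and subspace algorithms [3].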

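In recent years, by simulating how the human ear processes sound signals, researchers have proposed Computational Auditory Scene Analysis (CASA), in which the Gammatone filterbank is an auditory model that simulates the human cochlea. Speech signals processed by the filterbank yield better results than those processed in traditional ways. CASA-based speech enhancement algorithms usually construct a mask that separates the target speech from the background noise according to features such as pitch period [4], and then obtain the enhanced speech signal. Single-channel speech enhancement algorithms need to estimate the noise energy; however, because noise is random, over-estimation occurs during estimation, which degrades the overall quality of the enhanced speech [6-7]. Reference [6] exploits the nonlinear frequency characteristics of the Gammatone filterbank: it computes the cross-correlation coefficient between the noise and the noisy speech in each channel of the filterbank, reduces the over-estimated components of the estimated noise, and then obtains an estimate of the speech energy spectrum iteratively with a convex optimization algorithm. However, after obtaining the speech energy spectrum, that algorithm still needs a further clustering step, and the enhanced speech is recovered with the resulting mask. Limited by the accuracy of the clustering, the recovered enhanced speech is usually deficient in listening quality and intelligibility.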

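To address the above problems, this paper proposes a time-frequency mask estimation and optimization algorithm that combines the Decision Directed (DD) algorithm [8] with the Inter-Channel Correlation (ICC) coefficient [6]. First, an initial mask estimate is obtained with the DD algorithm. Next, the cross-correlation coefficient between the noise and the noisy speech in each channel is computed to obtain the noise presence probability. Then, an objective function is determined from the properties of the mask and, combined with the results of the first two steps, an optimization algorithm reduces the error in the initial mask. Finally, the new mask is used to remove the noise from the noisy speech, giving the enhanced speech.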

1 Principles of speech enhancement

In general, noisy speech is the sum of the target speech and additive noise (equation (1)).

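After being filtered by the Gammatone filterbank, the sound signal is divided into 64 channels with different bandwidths; the centre frequency and bandwidth of each channel are determined by the Equivalent Rectangular Bandwidth (ERB) method [9]. The signal in each channel is windowed and framed into a sequence of time-frequency units, and computing the energy of each time-frequency unit gives the energy spectrum of the sound signal [4]. Assuming that the noise and the target speech are independent of each other, the energy of the filtered signal in the time-frequency unit at time frame t and channel centre frequency f is given by equation (2).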
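The filterbank front end described above can be sketched in Python with NumPy. This is a minimal illustration, not the authors' implementation. The gammatone impulse response with bandwidth 1.019·ERB, the Glasberg-Moore ERB formula, the 80-5000 Hz range, the Hann window and the unit-energy normalization are all assumptions; the paper itself only specifies a 4th-order, 64-channel filterbank with 20 ms frames and 50% overlap. All function names are hypothetical.

```python
import numpy as np

def erb(f_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg-Moore formula, assumed)."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def erb_space(f_low, f_high, n_channels):
    """Centre frequencies equally spaced on the ERB-rate scale."""
    lo = 21.4 * np.log10(4.37 * f_low / 1000.0 + 1.0)
    hi = 21.4 * np.log10(4.37 * f_high / 1000.0 + 1.0)
    e = np.linspace(lo, hi, n_channels)
    return (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37

def gammatone_ir(fc, fs, order=4, duration=0.064):
    """Impulse response of one gammatone filter, normalized to unit energy."""
    t = np.arange(int(duration * fs)) / fs
    b = 1.019 * erb(fc)  # bandwidth scaling (assumed)
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return g / np.sqrt(np.sum(g ** 2))

def cochleagram_energy(x, fs=16000, n_channels=64, frame_ms=20, overlap=0.5,
                       f_low=80.0, f_high=5000.0):
    """Energy of each time-frequency unit, shape (n_frames, n_channels)."""
    frame_len = int(fs * frame_ms / 1000)
    hop = int(frame_len * (1 - overlap))
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    energy = np.zeros((n_frames, n_channels))
    for c, fc in enumerate(erb_space(f_low, f_high, n_channels)):
        # Filter the signal with channel c, keeping the original length.
        band = np.convolve(x, gammatone_ir(fc, fs))[:len(x)]
        for t in range(n_frames):
            seg = band[t * hop:t * hop + frame_len] * window
            energy[t, c] = np.sum(seg ** 2)
    return energy
```

Calling `cochleagram_energy` on the noisy speech and on the estimated noise gives the two energy maps on which a mask can be computed.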

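CASA-based speech enhancement uses a mask together with the noisy speech to synthesize the enhanced speech in the time domain [5-6,10]. The Ideal Binary Mask (IBM) is a commonly used mask: its value is 1 when speech energy dominates the current time-frequency unit and 0 otherwise. Enhanced speech obtained with the IBM retains the parts dominated by the target speech and removes the rest. Another mask is the Ideal Ratio Mask (IRM), which takes values between 0 and 1, with larger values in speech regions than in noise regions.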
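The difference between the two masks can be illustrated with a small NumPy sketch that computes both from known speech and noise energies. The 0 dB dominance criterion for the IBM and the plain energy ratio for the IRM are assumptions for illustration (some papers apply a square root to the ratio), and the function names are hypothetical.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero and log of zero

def ideal_binary_mask(speech_energy, noise_energy, lc_db=0.0):
    """1 where the local SNR exceeds lc_db (speech dominates), else 0."""
    snr_db = 10.0 * np.log10((speech_energy + EPS) / (noise_energy + EPS))
    return (snr_db > lc_db).astype(float)

def ideal_ratio_mask(speech_energy, noise_energy):
    """Share of the unit's energy that belongs to speech, in [0, 1]."""
    return speech_energy / (speech_energy + noise_energy + EPS)

# Toy energies for a 2 x 2 grid of time-frequency units.
S = np.array([[4.0, 1.0], [0.0, 9.0]])
N = np.array([[1.0, 4.0], [1.0, 1.0]])
ibm = ideal_binary_mask(S, N)  # [[1, 0], [0, 1]]
irm = ideal_ratio_mask(S, N)   # approximately [[0.8, 0.2], [0.0, 0.9]]
```

In the unit where S = 1 and N = 4, the IBM discards the speech entirely while the IRM keeps a fifth of the energy, which is the weak-speech retention the text refers to.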

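Compared with the IBM, the IRM preserves weak speech components embedded in the noise, so the enhanced speech has higher quality. The mask computed in this paper is therefore the ideal ratio mask, given by equation (3).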
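The displayed formulas of this section are not reproduced in this copy. Under the usual CASA conventions they can plausibly be written as follows; this is a hedged reconstruction, not the authors' exact notation. The symbols y, s, n (time-domain noisy speech, target speech and noise), Y, S, N (the corresponding energies in a time-frequency unit) and M (the ratio mask) are assumptions, and some papers define the ratio mask with an additional square root.

```latex
\begin{align}
y(k)   &= s(k) + n(k)                          \tag{1} \\
Y(t,f) &= S(t,f) + N(t,f)                      \tag{2} \\
M(t,f) &= \frac{S(t,f)}{S(t,f) + N(t,f)}       \tag{3}
\end{align}
```

With these forms, M lies between 0 and 1, and substituting (2) into (3) gives S(t,f) = M(t,f) Y(t,f), so the speech energy can be recovered from the mask and the noisy energy.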

2 Mask estimation and optimization

2.1 Overall framework of the algorithm

The overall framework of the proposed algorithm is shown in Fig. 1.

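The algorithm consists of two parts: mask estimation and mask optimization. In the mask estimation part, the noise is estimated and the a posteriori SNR is computed; the a priori SNR is then obtained by maximum likelihood estimation, and the initial mask is calculated. In the mask optimization part, the noisy speech and the estimated noise signal, after passing through the Gammatone filterbank, are framed and windowed and then transformed by the discrete Fourier transform, and the cross-correlation coefficient between the noisy speech and the noise signal is computed in each channel. To correct the effect of noise over-estimation on the initial mask, the cross-correlation coefficient is taken as the noise presence probability and combined with the noisy speech to form the optimization target. The initial mask is then processed iteratively against this target, which reduces the bias caused by over-estimation while increasing the target speech components contained in the mask. Finally, the optimized mask is used to synthesize the enhanced speech.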
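The per-channel correlation step can be sketched as follows. This is an illustration under stated assumptions rather than the definition used in the paper or in reference [6]: the coefficient here is the uncentred normalized correlation of the DFT magnitude spectra of the noisy speech and the estimated noise, computed per frame and per channel, and the Hamming window and function names are hypothetical choices. The inputs are assumed to be band signals already filtered by the Gammatone filterbank, with shape (n_channels, n_samples).

```python
import numpy as np

def frame_spectra(band_signals, frame_len=320, hop=160):
    """Magnitude spectra, shape (n_channels, n_frames, n_bins)."""
    n_samples = band_signals.shape[1]
    n_frames = 1 + (n_samples - frame_len) // hop
    # Index matrix that selects every frame in one step.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = band_signals[:, idx] * np.hamming(frame_len)
    return np.abs(np.fft.rfft(frames, axis=-1))

def inter_channel_correlation(noisy_bands, noise_bands,
                              frame_len=320, hop=160, eps=1e-12):
    """Normalized correlation in [0, 1], shape (n_frames, n_channels)."""
    Y = frame_spectra(noisy_bands, frame_len, hop)
    N = frame_spectra(noise_bands, frame_len, hop)
    num = np.sum(Y * N, axis=-1)
    den = np.sqrt(np.sum(Y ** 2, axis=-1) * np.sum(N ** 2, axis=-1)) + eps
    return (num / den).T
```

A value near 1 means the noisy spectrum in that unit looks like the noise estimate, so under this reading the coefficient can serve directly as a noise presence probability.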

2.2 Mask estimation

The relation between the mask and the speech energy follows from equations (2)~(3).
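As an illustration of how an initial mask can be derived from an a priori SNR estimate, the sketch below uses the classic decision-directed recursion, not the modified DD variant of reference [8]. It assumes the ratio mask is the speech share of the noisy energy, so that M = ξ/(1+ξ) with a priori SNR ξ = S/N and the previous speech estimate is M·Y. The smoothing factor 0.98, the SNR floor and the function name are assumptions.

```python
import numpy as np

def dd_initial_mask(noisy_energy, noise_energy, alpha=0.98, xi_min=1e-3, eps=1e-12):
    """Initial ratio mask from energies of shape (n_frames, n_channels)."""
    n_frames = noisy_energy.shape[0]
    gamma = noisy_energy / (noise_energy + eps)  # a posteriori SNR
    xi_ml = np.maximum(gamma - 1.0, 0.0)         # maximum likelihood estimate
    mask = np.zeros_like(noisy_energy, dtype=float)
    for t in range(n_frames):
        if t == 0:
            xi = xi_ml[0]                        # no history for the first frame
        else:
            # Speech energy estimated in the previous frame.
            prev_speech = mask[t - 1] * noisy_energy[t - 1]
            xi = (alpha * prev_speech / (noise_energy[t - 1] + eps)
                  + (1.0 - alpha) * xi_ml[t])
        xi = np.maximum(xi, xi_min)              # floor the a priori SNR
        mask[t] = xi / (1.0 + xi)
    return mask
```

The heavy weight on the previous frame smooths the estimate over time, which limits musical noise but also makes the mask sensitive to any over-estimation in `noise_energy`, the issue the optimization stage is meant to correct.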

2.3.3 Mask optimization

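Because the speech energy ranges over (0, +∞) and differs greatly between time-frequency units, computing Ŝ(t, f) in every iteration is very expensive. In addition, to remove the influence of clustering accuracy and of binary masking on the algorithm, this paper uses ratio mask values in place of the energy values in equation (14) as the optimization target.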
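The paper's objective function and update rule are not reproduced in this copy, so the following is only a generic sketch of iterating on bounded mask values instead of unbounded energies. The quadratic objective, the target 1 − p built from a noise presence probability p, the weight `lam`, the step size and the stopping tolerance are all hypothetical stand-ins and do not correspond to the paper's equations or parameter values.

```python
import numpy as np

def refine_mask(initial_mask, noise_prob, lam=0.5, step=0.1, tol=1e-6, max_iter=1000):
    """Projected gradient descent on
        J(M) = 0.5 * ||M - (1 - p)||^2 + 0.5 * lam * ||M - M0||^2
    with M kept inside [0, 1]. Returns the refined mask and iteration count."""
    initial_mask = np.asarray(initial_mask, dtype=float)
    target = 1.0 - np.asarray(noise_prob, dtype=float)  # hypothetical target mask
    mask = initial_mask.copy()
    n_iter = 0
    while n_iter < max_iter:
        grad = (mask - target) + lam * (mask - initial_mask)
        new_mask = np.clip(mask - step * grad, 0.0, 1.0)  # projection onto [0, 1]
        n_iter += 1
        change = np.max(np.abs(new_mask - mask))
        mask = new_mask
        if change < tol:  # stop once the update is negligible
            break
    return mask, n_iter
```

Because every variable lives in [0, 1], a single step size works for all time-frequency units, whereas energies spanning many orders of magnitude would need per-unit scaling or many more iterations.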

3 Experiments and result analysis

3.1 Experimental parameters and evaluation metrics

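The simulation experiments use speech signals from the TIMIT database [14], sampled at 16 kHz with 16-bit quantization and about 2 s long. The noises are taken from the NOISEX-92 database [15]: Babble, Engine and White noise. The proposed algorithm is tested at input SNRs from -5 dB to 5 dB in steps of 1 dB.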
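Test signals at a prescribed input SNR are usually produced by scaling the noise before adding it to the clean speech. The sketch below shows one common way to do this; it is an assumption about the experimental procedure, which the paper does not spell out, and the function name is hypothetical.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech + noise has the requested global SNR in dB.
    Returns the noisy mixture and the scaled noise."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Choose the scale so that p_speech / (scale^2 * p_noise) = 10^(snr_db / 10).
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    scaled_noise = scale * noise
    return speech + scaled_noise, scaled_noise

# The eleven test conditions described in the text: -5 dB to 5 dB in 1 dB steps.
snr_grid = list(range(-5, 6))
```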

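The experiments use a 4th-order, 64-channel Gammatone filterbank, with 20 ms frames and 50% frame overlap. The parameters are set as follows: a=-2, c=2.7 and ζ=0.015 in equation (10); Gmin=0.178 in equation (12); λ=0.02 in equation (19); μ=0.01 in equation (22); θ=0.3 in equation (23); n1=1 and n2=1 in equation (24).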

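This paper uses the algorithm of reference [16] to estimate the noise signal in the time domain, and takes that algorithm and the algorithm of reference [6] as the comparison algorithms. Besides the segmental Signal-to-Noise Ratio (segSNR), the evaluation metrics are the Perceptual Evaluation of Speech Quality (PESQ) [17] and the Short-Time Objective Intelligibility measure (STOI) [18]. segSNR is the average of the per-frame SNRs of the signal; a higher value indicates better noise suppression. PESQ represents the subjective listening quality of the enhanced speech; a higher score indicates better listening quality. STOI reflects the degree of distortion of the enhanced speech; a larger value indicates less distortion introduced by the algorithm and higher intelligibility.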
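The segSNR definition above (per-frame SNR, then the mean) can be written as a short NumPy function. The frame length and hop below match 20 ms frames with 50% overlap at 16 kHz; the optional clipping range is a common convention in the literature and is an assumption, since the paper does not say whether frames were clipped. The function name is hypothetical.

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=320, hop=160, clip_db=None, eps=1e-12):
    """Mean of per-frame SNRs in dB between a clean reference and an enhanced signal.
    Both signals must be time-aligned and at least one frame long."""
    n = min(len(clean), len(enhanced))
    values = []
    for start in range(0, n - frame_len + 1, hop):
        s = clean[start:start + frame_len]
        err = s - enhanced[start:start + frame_len]  # residual noise and distortion
        snr = 10.0 * np.log10((np.sum(s ** 2) + eps) / (np.sum(err ** 2) + eps))
        if clip_db is not None:
            snr = np.clip(snr, clip_db[0], clip_db[1])
        values.append(snr)
    return float(np.mean(values))
```

For example, `segmental_snr(clean, enhanced, clip_db=(-10.0, 35.0))` limits the influence of silent frames, whose SNR would otherwise dominate the average.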

3.2 Results and analysis

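Tables 1~2 give the average PESQ and STOI values obtained before and after mask optimization under the three background noises. The comparison shows that both the listening quality and the intelligibility of the enhanced speech improve after mask optimization, and the improvement in both metrics is most pronounced under Engine noise.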

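Table 3 gives the average segSNR values obtained with the two masks. Under Babble and White noise, the segSNR after mask optimization is lower than before optimization, that is, the proposed mask optimization algorithm cannot effectively suppress noise. The reason is that, although the estimated inter-channel cross-correlation coefficient resembles the true value (see Fig. 4), its range differs in the low-frequency region, so the computed noise presence probability is always larger than the actual probability. As a result, compared with the initial mask, the optimized mask retains more noise components when the enhanced speech is synthesized.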

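Comparing the PESQ values of the three algorithms in Table 4 shows that the proposed algorithm obtains relatively higher PESQ values than the comparison algorithms, except under Babble noise, where its result is lower than that of the algorithm of reference [6]. This is because reference [6] uses a mask that combines the IBM and the IRM, with the IBM accounting for the larger share of the final mask, which lets that algorithm remove excess noise components in babble-like environments and gives better subjective listening quality. In the other noise environments, the proposed algorithm gives higher subjective listening quality than the comparison algorithms.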

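Table 6 gives the segSNR values of the three algorithms. The algorithm of reference [16] has the best noise suppression performance, whereas the algorithm of reference [6] and the proposed algorithm, both of which use the cross-correlation coefficient within Gammatone channels, obtain lower segSNR values. One reason is the difference between the inter-channel cross-correlation coefficient and its true value. The other is that the improved DD algorithm does not distinguish between the low-frequency and high-frequency components of speech, and its estimate of prop is inaccurate when the instantaneous SNR is low, so the initial mask still retains some noise components. Solving this problem is part of the next stage of this research. Nevertheless, speech enhancement algorithms place more emphasis on improving listening quality and intelligibility than on noise suppression [6,19], and according to the PESQ and STOI results the proposed algorithm outperforms the comparison algorithms in both respects.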

4 Conclusion

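To address the problem that over-estimation in traditional noise estimation algorithms degrades the overall quality of the enhanced speech in single-channel speech enhancement, this paper proposes a single-channel speech enhancement algorithm based on time-frequency mask estimation and optimization. After obtaining the initial mask, the algorithm uses iterative optimization to increase the target speech components in it. Experimental results show that, although the algorithm does not improve the noise suppression performance of the initial mask, it improves clearly on the comparison algorithms in the other two key metrics (PESQ and STOI), indicating that it can effectively improve the listening quality and intelligibility of the enhanced speech.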

References

[1] CAO L, ZHANG T Q, GAO H X, et al. Multi-band spectral subtraction method for speech enhancement based on masking property of human auditory system[J]. Computer Engineering and Design, 2013, 34(1): 235-240. (in Chinese)

[2] LI J B, MA Y B, XIA J, et al. An improved Wiener filtering speech enhancement algorithm based on modified cepstrum smooth technology[J]. Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition), 2016, 28(4): 462-467. (in Chinese)

[3] BOROWICZ A, PETROVSKY A. Signal subspace approach for psychoacoustically motivated speech enhancement[J]. Speech Communication, 2011, 53(2): 210-219.

[4] HU K, WANG D. Unvoiced speech segregation from nonspeech interference via CASA and spectral subtraction[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2011, 19(6): 1600-1609.

[5] WANG Y, NARAYANAN A, WANG D, et al. On training targets for supervised speech separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014, 22(12): 1849-1858.

[6] BAO F, ABDULLA W H. Noise masking method based on an effective ratio mask estimation in Gammatone channels[J]. APSIPA Transactions on Signal and Information Processing, 2018, 7(e5): 1-12.

[7] SUN M, LI Y, GEMMEKE J F, et al. Speech enhancement under low SNR conditions via noise estimation using sparse and low-rank NMF with Kullback-Leibler divergence[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015, 23(7): 1233-1242.

[8] NAHMA L, YONG P C, DAM H H, et al. Convex combination framework for a priori SNR estimation in speech enhancement[C]// Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway, NJ: IEEE, 2017: 4975-4979.

[9] JIANG Y, LIU R S, FENG Z M. Dual-microphone speech enhancement algorithm based on the auditory features for a close-talk system[J]. Journal of Tsinghua University (Science and Technology), 2014, 54(9): 1179-1183. (in Chinese)

[10] BAO F, ABDULLA W H. A new ratio mask representation for CASA-based speech enhancement[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019, 27(1): 7-19.

[11] YONG P C, NORDHOLM S, DAM H H, et al. On the optimization of sigmoid function for speech enhancement[C]// Proceedings of the 19th European Signal Processing Conference. Piscataway, NJ: IEEE, 2011: 211-215.

[12] CHEN Z, HOHMANN V. Online monaural speech enhancement based on periodicity analysis and a priori SNR estimation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015, 23(11): 1904-1916.

[13] ZHENG C, TAN Z, PENG R, et al. Guided spectrogram filtering for speech dereverberation[J]. Applied Acoustics, 2018, 134(5): 154-159.

[14] GAROFOLO J S, LAMEL L F, FISHER W M, et al. TIMIT Acoustic-Phonetic Continuous Speech Corpus[EB/OL]. [2019-01-12]. https://catalog.ldc.upenn.edu/LDC93S1.

[15] VARGA A, STEENEKEN H J M. Assessment for automatic speech recognition II: NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems[J]. Speech Communication, 1993, 12(3): 247-251.

[16] GERKMANN T, HENDRIKS R C. Unbiased MMSE-based noise power estimation with low complexity and low tracking delay[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2012, 20(4): 1383-1393.

[17] International Telecommunications Union (ITU). Perceptual Evaluation of Speech Quality (PESQ): an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs[EB/OL]. [2019-01-12]. https://www.itu.int/rec/T-REC-P.862-200102-I/en.

[18] TAAL C H, HENDRIKS R C, HEUSDENS R, et al. An algorithm for intelligibility prediction of time-frequency weighted noisy speech[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2011, 19(7): 2125-2136.

[19] LOIZOU P C, KIM G. Reasons why current speech-enhancement algorithms do not improve speech intelligibility and suggested solutions[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2011, 19(1): 47-56.

This work is partially supported by the National Natural Science Foundation of China (61671095, 61702065, 61701067, 61771085), the Project of Key Laboratory of Signal and Information Processing of Chongqing (CSTC2009CA2003), the Chongqing Graduate Research and Innovation Project (CYS17219), and the Research Project of Chongqing Educational Commission (KJ1600427, KJ1600429).

GE Wanying, born in 1994, M. S. candidate. His research interests include signal processing and speech enhancement.

ZHANG Tianqi, born in 1971, Ph. D., professor. Her research interests include spread spectrum communications, blind signal processing and speech signal processing.