LI Yin-guo, PU Fu-an, Thomas Fang ZHENG
(1. Automotive Electronics and Embedded Systems Engineering R&D Center, Chongqing University of Posts and Telecommunications, Chongqing 400065, P.R. China; 2. Center for Speech and Language Technologies, Division of Technical Innovation and Development, Tsinghua National Laboratory for Information Science and Technology, Beijing 100084, P.R. China)
A main issue in practical speech recognition is to improve robustness against the mismatch between training and testing environments [1]. The performance of speech recognition systems degrades rapidly in the presence of channel distortion, background noise, acoustic echo, or a variety of interfering signals. A great deal of attention has been paid to this problem [2] in an effort to deploy the technology in the field. When these mismatches occur, the speech recognizer can become unstable. In this paper, we focus on the environment in which clean speech is corrupted by background noise.
Many techniques have been developed to alleviate this performance degradation, such as model adaptation, speech enhancement, and feature compensation [3-4]. Model adaptation is powerful because many parameters in the system can be changed to reduce the acoustic mismatch. Because the way these parameters can be modified is strongly constrained by a simple model of the degradation, the adaptation should not require much speech data; nevertheless, it has been found that, due to inaccuracies in the model, using more data typically yields higher accuracy. An advantage of this approach is that it degrades gracefully under very severe mismatch conditions. One of its problems is the large amount of computation required to adapt the models, which makes it unsuitable for the rapid adaptation needed in quickly changing environments such as telephone applications.
Speech enhancement is the simplest way to remove noise from the input noisy speech signal in the front-end module before feature extraction, and numerous techniques have been studied [5]. Some of them are based on the well-known spectral subtraction (SS) approach, which is suitable for enhancing speech embedded in stationary noise and is attractive for its relatively simple implementation and computational efficiency. However, these methods usually leave some level of residual and unnatural background noise called musical noise. Another popular fundamental approach in speech enhancement is the Wiener filter, which estimates clean speech from noisy speech in the minimum mean square error (MMSE) sense, given statistical characteristics. Experiments show that the amount of noise attenuation is generally proportional to the amount of speech degradation [6]; in other words, the more the noise is reduced, the more the speech is distorted. Both musical noise and speech distortion degrade recognition accuracy. For this reason, many systems use the enhanced speech for voice activity detection (VAD) but the noisy, unprocessed speech for feature extraction and decoding.
Approaches that compensate the input features have the limitation of making transformations without the acoustic knowledge used in the search process, possibly using an inaccurate correction vector. However, they typically require little computation, achieve rapid environment adaptation, and, if the mismatch is not very severe, can perform as well as the model adaptation approaches. Several well-known feature normalization methods have been developed in the feature domain. Cepstral mean subtraction (CMS) is a simple but effective way to remove the convolutional noise introduced by the transmission channel. A natural extension of CMS is cepstral mean and variance normalization (CMVN) [7], which improves robustness to additive noise as well as to channel effects. Not only does it provide an error rate reduction under mismatched conditions, it has also been shown to yield a small decrease in error rate under matched conditions. These benefits, together with the fact that it is very simple to implement, are the reason why many current systems have adopted it.
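As an illustration, CMVN can be implemented in a few lines. The following NumPy sketch (ours, for illustration; not necessarily the implementation used in this system) standardizes each cepstral dimension over an utterance:

```python
import numpy as np

def cmvn(features):
    """Cepstral mean and variance normalization (CMVN).

    features: (num_frames, num_coeffs) array of cepstral features.
    Subtracting the per-dimension mean removes convolutional channel
    effects; dividing by the per-dimension standard deviation reduces
    the variance mismatch caused by additive noise.
    """
    mean = features.mean(axis=0)
    std = np.maximum(features.std(axis=0), 1e-8)  # guard against zero variance
    return (features - mean) / std
```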
In this paper, we focus on a group of techniques called statistical matching techniques. The motivation of our statistical thresholding on mean and variance normalization (STMVN) approach is based on the following points. First, as reported in [8-9], the distribution of each speech feature dimension can be well approximated by a Gaussian density model in noisy environments. Second, STMVN has a direct physical meaning and reduces the distance between clean-speech and noisy-speech features, which is easily achievable.
In a discrete signal, an isolated sample that differs greatly from its neighbors can be considered an impulse corresponding to high spatial frequencies. One general way to remove this kind of noise is statistical thresholding over a small local region of the discrete signal, so that out-of-range noise can be suppressed. The following is a statistical thresholding approach:
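$$\tilde{s}[m]=\begin{cases}\mu[m], & \left|s[m]-\mu[m]\right|>T\,\sigma[m]\\ s[m], & \text{otherwise}\end{cases}\tag{1}$$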
where s[m] is a discrete sampled signal at moment m, T is a user-specified threshold value, and μ[m] and σ[m] are the mean and the standard deviation of a small local region centered at m, respectively, which in turn are given by
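$$\mu[m]=\frac{1}{2L+1}\sum_{l=-L}^{L}s[m+l]\tag{2}$$

$$\sigma[m]=\sqrt{\frac{1}{2L+1}\sum_{l=-L}^{L}\bigl(s[m+l]-\mu[m]\bigr)^{2}}\tag{3}$$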
where 2L+1 is the length of the local region. We can see that this thresholding operation in the spatial domain corresponds to low-pass filtering in the spatial frequency domain.
The approach in Eq. (1) can suppress isolated out-of-range noise, but substituting out-of-range samples with the mean alone cannot represent the true distribution of the signal. Therefore, we combine Eq. (1) with the standard deviation:
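$$\tilde{s}[m]=\begin{cases}\mu[m]+\operatorname{sign}\bigl(s[m]-\mu[m]\bigr)\,T\,\sigma[m], & \left|s[m]-\mu[m]\right|>T\,\sigma[m]\\ s[m], & \text{otherwise}\end{cases}\tag{4}$$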
where sign(x) is the sign function given by
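$$\operatorname{sign}(x)=\begin{cases}1, & x\ge 0\\ -1, & x<0\end{cases}\tag{5}$$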
Equation (4) shows that isolated out-of-range noise is replaced by the mean plus an offset whose sign follows the raw data and whose magnitude is T times the standard deviation.
An example is shown in Fig. 1. The solid line is C4 of the noisy speech MFCC (mel-frequency cepstral coefficients), and the circles mark the out-of-range noise, which will be replaced by the upper or lower border (dotted line).
Fig. 1 C4 of noisy speech MFCC
Normalizing Eq. (4) by removing the mean shift and dividing by the standard deviation, we obtain
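$$\hat{s}[m]=\frac{\tilde{s}[m]-\mu[m]}{\sigma[m]}=\begin{cases}\operatorname{sign}\bigl(s[m]-\mu[m]\bigr)\,T, & \left|s[m]-\mu[m]\right|>T\,\sigma[m]\\ \dfrac{s[m]-\mu[m]}{\sigma[m]}, & \text{otherwise}\end{cases}\tag{6}$$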
where ŝ[m] is the normalized signal.
Given an ordered sequence of K-dimensional MFCC feature vectors x(m), m = 1, …, M, the k-th time trajectory of x(m) is denoted as
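$$y_k(m)=x(m,k),\quad m=1,\dots,M\tag{7}$$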
where x(m,k) is the k-th component of x(m) at time m. We now threshold the k-th time-trajectory signal y_k(m) by Eq. (6), as follows:
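$$\hat{y}_k(m)=\begin{cases}\operatorname{sign}\bigl(y_k(m)-\mu_k(m)\bigr)\,T, & \left|y_k(m)-\mu_k(m)\right|>T\,\sigma_k(m)\\ \dfrac{y_k(m)-\mu_k(m)}{\sigma_k(m)}, & \text{otherwise}\end{cases}\tag{8}$$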
where μ_k(m) and σ_k(m) can be obtained by Eq. (2) and Eq. (3), respectively. Eq. (8) is similar to traditional MVN except for the statistical thresholding operation, and it is called statistical thresholding on cepstral mean and variance normalization (STCMVN). Based on the analysis of Eq. (8), STCMVN can be split into the following two steps (a code sketch follows the steps):
Step 1: Cepstral parameters are processed by CMVN.
Step 2: A thresholding operation is performed with a threshold value T.
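The two steps can be realized compactly. The following NumPy sketch (illustrative; it assumes utterance-level CMVN statistics, as in Step 1) implements STCMVN:

```python
import numpy as np

def stcmvn(features, T=3.2):
    """Statistical thresholding on CMVN (STCMVN), following the two steps.

    features: (num_frames, num_coeffs) array of cepstral features.
    T: statistical threshold; experiments suggest values between 2 and 4.
    """
    # Step 1: CMVN -- each dimension to zero mean and unit variance.
    mean = features.mean(axis=0)
    std = np.maximum(features.std(axis=0), 1e-8)
    normalized = (features - mean) / std
    # Step 2: statistical thresholding. After CMVN the mean is 0 and the
    # standard deviation is 1, so Eq. (8) reduces to replacing any value
    # outside [-T, T] by sign(x) * T, i.e. clipping at the threshold.
    return np.clip(normalized, -T, T)
```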
Fig. 2 shows C4 of the clean (dashed) and noisy (solid) speech after CMVN. The out-of-range noise (circles) is substituted by the threshold value T or -T (T = 2). Fig. 3 compares the average per-frame Euclidean distance between clean and noisy speech for CMVN and STCMVN. Here 1 000 voiced utterances are processed by CMVN and STCMVN against a background of white noise at different SNR levels. Fig. 3 shows that the average distance of STCMVN is smaller than that of CMVN at all SNR levels. Combining Fig. 2 and Fig. 3, we find that the features of noisy speech become more similar to those of the corresponding clean speech after the thresholding operation, alleviating the mismatch between training and testing environments.
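The distance measure used in Fig. 3 is straightforward; a minimal sketch (variable names are ours) is:

```python
import numpy as np

def avg_frame_distance(clean_feats, noisy_feats):
    """Average per-frame Euclidean distance between two equal-length
    feature streams of shape (num_frames, num_coeffs)."""
    return np.linalg.norm(clean_feats - noisy_feats, axis=1).mean()
```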
Threshold determination plays a vital role in the proposed approach: T should be neither too high nor too low. We examine this situation from a statistical point of view.
In probability theory, Chebyshev's inequality, which has great utility because it applies to completely arbitrary distributions (unknown except for mean and variance), guarantees that in any probability distribution no more than 1/T² of the distribution's values can be more than T standard deviations away from the mean [11]:
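$$\Pr\bigl(\left|X-\mu\right|\ge T\sigma\bigr)\le\frac{1}{T^{2}}$$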
If T is too high, only a few of the distribution's values exceed Tσ, so STCMVN loses its effect; if T is too small, a great number of values are out of range and are substituted by ±T, causing serious distortion. For example, by Chebyshev's inequality at most 1/4 of the values can lie outside the range for T = 2, but at most 1/16 for T = 4. Experiments show that the choice of T is usually between 2 and 4.
The speech corpora for the initial experiments included 2 600 Chinese isolated utterances produced by 31 female and 25 male speakers. The speech signals were recorded in a normal laboratory environment at a 16 kHz sampling rate and encoded with 16-bit quantization. In the testing phase, different types of background noise, namely babble, leopard, pink, volvo, and white noise from the NOISEX database [12], were added to the clean speech waveforms at various SNRs (SNR = -5, 0, 5, 10, 15, 20 dB). Training utterances were not used in testing.
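For reference, mixing a noise recording into clean speech at a prescribed SNR can be done as in the following sketch (an illustrative implementation; the scaling follows the standard power-ratio definition of SNR):

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Mix noise into clean speech at a target SNR in dB.

    The noise is looped or truncated to the speech length, then scaled
    so that 10 * log10(P_speech / P_noise) equals snr_db.
    """
    noise = np.resize(noise, speech.shape)              # match lengths
    p_speech = np.mean(np.asarray(speech, float) ** 2)
    p_noise = np.mean(np.asarray(noise, float) ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```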
In the speaker-independent isolated word recognition experiments, a model with 4 states and a mixture of 6 Gaussian pdfs per state was estimated for each word from training utterances spoken in a noise-free environment. An automatic end-pointing algorithm based on frame powers and zero crossings is used to determine the starting and ending points of the training utterances. Models are trained by CDCPM [13], a simplified HMM with a left-to-right topology and without an initial state distribution π or a state transition probability distribution A. The acoustic model employs 26-dimension features (containing 13 MFCC coefficients plus the logarithmic frame energy, as well as their first-order derivatives).
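The end-pointing step can be sketched as follows (a crude illustration only: the frame length and thresholds are hypothetical, and the actual algorithm used here may combine the two cues differently):

```python
import numpy as np

def endpoint_detect(signal, frame_len=400, power_thresh=1e-3, zcr_thresh=0.3):
    """Crude end-pointing based on frame power and zero-crossing rate.

    A frame is marked as speech when its power or zero-crossing rate
    exceeds its threshold; the first and last such frames delimit the
    utterance. Real systems combine the two cues more carefully.
    """
    n_frames = len(signal) // frame_len
    frames = np.asarray(signal, float)[:n_frames * frame_len]
    frames = frames.reshape(n_frames, frame_len)
    power = np.mean(frames ** 2, axis=1)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    voiced = np.flatnonzero((power > power_thresh) | (zcr > zcr_thresh))
    if voiced.size == 0:
        return None  # no speech found
    return voiced[0] * frame_len, (voiced[-1] + 1) * frame_len
```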
Experimental results for clean speech and four kinds of artificial noisy data are shown in Tab. 1 and Tabs. 2-5, respectively. The clean and noisy data are first evaluated using the MFCC feature as the baseline. Then CMS, CMVN, and the proposed STCMVN method, with a threshold value of 3.2, are applied and compared.
Tab. 1 Word accuracy (%) for clean speech
Tab. 2 Word accuracy (%) for the Car noisy data
Tab. 3 Word accuracy (%) for the Babble noisy data
Tab. 4 Word accuracy (%) for the White noisy data
Tab. 5 Word accuracy (%) for the Pink noisy data
It can be seen that the proposed approach and CMVN significantly outperform the baseline (MFCC) in all experimental environments, while the CMS approach improves word accuracy only at higher SNRs (SNR ≥ 10 dB). In the higher-SNR noisy environments (SNR ≥ 10 dB), the improvement of the proposed method is slight, because almost all features are only slightly corrupted. For the lower-SNR noisy environments (SNR ≤ 5 dB), the proposed method performs better than the other methods, except for the Car noisy data, because the car noise is stationary and narrow-band.
In particular, the STCMVN method proves effective in the lower-SNR noisy environments (SNR ≤ 5 dB), where the average word accuracy over all kinds of noisy environments is more than 25%.
In this paper, an effort has been made to develop a new approach for robust speech recognition in noisy environments, using a statistical threshold value to threshold the CMVN features and thereby reduce the mismatch between training and testing environments. The experimental results show that the proposed approach is superior to the other robust features (CMS, CMVN), especially in lower-SNR cases (-5 dB ≤ SNR ≤ 5 dB).
[1] GONG Yifan. Speech Recognition in Noise Environments: A Survey [J]. Speech Communication, 1995, 16(3): 261-291.
[2] ACERO A. Acoustical and Environmental Robustness in Automatic Speech Recognition [M]. UK: Kluwer Academic Publishers, 1993.
[3] HUANG X D, ACERO A, HON H W. Spoken Language Processing: A Guide to Theory, Algorithm, and System Development [M]. New Jersey, USA: Prentice Hall, 2001.
[4] VIIKKI O, LAURILA K. Cepstral domain segmental feature vector normalization for noise robust speech recognition [J]. Speech Communication, 1998, 25(1-3): 133-147.
[5] ACERO A. Acoustical and Environmental Robustness in Automatic Speech Recognition [D]. Pittsburgh: Carnegie Mellon University, 1990.
[6] BENESTY J, MAKINO S, CHEN J. Speech Enhancement (Signals and Communication Technology) [M]. Berlin, Heidelberg: Springer-Verlag, 2005.
[7] CHEN Chia-Ping, BILMES J A. MVA Processing of Speech Features [J]. IEEE Transactions on Audio, Speech, and Language Processing, 2007, 15(1): 257-272.
[8] GAZOR S, ZHANG W. Speech Probability Distribution [J]. IEEE Signal Processing Letters, 2003, 10(7): 204-207.
[9] DU Jun, WANG R H. Cepstral Shape Normalization (CSN) for Robust Speech Recognition [C]// Proc. of ICASSP. [s.l.]: IEEE Press, 2008: 4389-4392.
[10] Internet Center for Management and Business Administration, Inc. Statistics [EB/OL]. [2012-03-10]. http://www.quickmba.com/stats/dispersion.
[11] Chebyshev's inequality [EB/OL]. Wikipedia. [2012-03-11]. http://en.wikipedia.org/wiki/Chebyshev's_inequality.
[12] VARGA A P, STEENEKEN H J M, TOMLINSON M, et al. The NOISEX-92 study on the effect of additive noise on automatic speech recognition [R]. Malvern, UK: Speech Research Unit, Defence Research Agency, 1992.
[13] ZHENG Fang, CHAI H-X, SHI Z-J. A Real-World Speech Recognition System Based on CDCPMs [C]// Proc. 1997 Int. Conf. on Computer Processing of Oriental Languages (ICCPOL'97), Apr. 2, 1997, Hong Kong: [s.n.], 1997: 204-207.