LI Lan, LIN Guoliang, MA Shaobin
(School of Digital Media, Lanzhou University of Arts and Science, Lanzhou, Gansu 730000, China)
Abstract: To address the limited receptive field, low resolution, high complexity and loss of edge information in super-resolution reconstruction methods based on residual learning, a dilated residual convolutional neural network is proposed. Firstly, a sawtooth dilated convolution is designed on the basis of ResNet to expand the receptive field of the network and eliminate the “zero filling” of the network, and the image features are transferred to deeper layers by adding skip connections. Secondly, a residual image with the same size as the original image is obtained through the last convolution layer. Finally, the input LR image and the residual image are linearly fused to output the final super-resolution image. The experimental data on Set5 and Set14 show that, compared with existing algorithms, the proposed algorithm has a better reconstruction effect and better learning performance.
Key words: residual network; dilated convolution; deep learning; image super-resolution reconstruction
The idea of super-resolution image reconstruction (SRIR or SR) is to use a group of low-quality, low-resolution (LR) images to obtain single-frame or multi-frame high-quality, high-resolution images through computer and image processing technology[1,2]. SR is an important research direction in digital image processing and has a wide range of applications in computer vision, such as intelligent transportation, safety monitoring, image generation and medical imaging[3]. At present, SR methods are mainly divided into 3 categories: interpolation-based methods[4], reconstruction-based methods[5] and learning-based methods[6,7]. The interpolation-based method treats the image as points on a plane and uses prior knowledge to fit the unknown information with a predefined transformation function or interpolation, so as to calculate the high-resolution image. Its main disadvantage is that staircase (sawtooth) artifacts and edge blurring appear easily. The reconstruction-based method is one of the most widely studied. This kind of method mainly uses under-sampling technology to fuse multi-frame information of low pixel accuracy from one or more low-resolution images and to reconstruct an image with higher resolution. It has high reconstruction accuracy, but it can only make use of the relationship between high-resolution and low-resolution images, so it is difficult to build a mathematical model, and the texture of the reconstructed image is not clear.
In recent years, the deep learning-based method has become a research hotspot. It mainly uses the internal similarity of the same image and a large number of training samples to find the mapping relationship between high-resolution and low-resolution image pairs and to complete the transformation of high-resolution image features, so as to realize the SR process. This method places high demands on the modeling data. Dong et al[8] were the first to use a deep learning method. A three-layer convolutional neural network (CNN) was set up to realize super-resolution reconstruction and achieved good results. Since then, there has been an upsurge in using deep learning to realize SR. In the process of training, as the number of network layers increases, problems such as too many hyperparameters and gradient vanishing or explosion arise. The reconstructed image is usually too smooth and loses high-frequency details, and the image quality still needs to be improved.
Kim et al[9] proposed a deep residual network (VDSR) model, which uses residual learning to accelerate the convergence of the network. It was shown that this method can improve super-resolution performance, but it increases the computational complexity and the gradient vanishing problem. Yang et al[10] used the sparsity of the image to constrain the sparse representation under the dictionaries corresponding to the high-resolution and low-resolution images to realize image super-resolution reconstruction. The reconstruction effect of this method is good, but dictionary training takes a long time and there is noise at the edges of the image. Tai et al[11] proposed the DRRN model, which uses a recursive network module with weight sharing to increase the network depth to 52 layers and reduce the parameters of the model. However, each recursive unit is not optimized enough and the improvement in reconstruction is not obvious. Yu et al[12] and Chen et al[13] proposed the dilated convolution. By changing the size of the convolution kernel without adding parameters to the network, a larger receptive field and more of the original image information can be obtained, which achieves good results in reconstruction. However, this method exhibits the “gridding” phenomenon, and more image information is lost after convolution.
In this paper, we improve the above methods and propose an image super-resolution reconstruction method combining a residual network and sawtooth dilated convolution. The model uses a dilated convolution network to extract image features, then uses a residual network combined with sawtooth dilated convolution to carry out the nonlinear mapping of the image, and then applies a convolution layer to obtain a residual image with the same size as the input image. The final super-resolution image is obtained by linear fusion of the low-resolution image and the residual image. Adaptive moment estimation (Adam) is used to speed up the convergence of the network. Dilated convolution is used to enlarge the receptive field of the image features and recover the texture information of the image with high quality, which improves the visual effect of the reconstructed image.
Dilated convolution is a data sampling method on the feature map. The expansion rate is increased by inserting zero-valued entries between the elements of an ordinary convolution kernel, which effectively enlarges the receptive field without increasing the model parameters or the amount of calculation. Dilated convolution can be applied to tasks that need global image information, or to speech and text tasks that depend on long-sequence information[10].
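The zero-insertion view of dilated convolution described above can be sketched in a few lines of plain Python. This is only an illustration: the function name `dilate_kernel` and the sample kernel values are assumptions made for the example, not part of the paper.

```python
def dilate_kernel(kernel, rate):
    """Insert (rate - 1) zeros between neighbouring entries of a square kernel."""
    k = len(kernel)
    size = k + (k - 1) * (rate - 1)  # side length after zero insertion
    out = [[0.0] * size for _ in range(size)]
    for i in range(k):
        for j in range(k):
            # original weights land on a grid with spacing `rate`
            out[i * rate][j * rate] = kernel[i][j]
    return out


# Hypothetical 3x3 kernel, used only to make the zero pattern visible.
kernel = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
dilated = dilate_kernel(kernel, rate=2)
for row in dilated:
    print(row)
```

The dilated kernel still holds nine learnable weights; only its spatial extent grows, which is why the parameter count is unchanged.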
For a 3×3 convolution, the expansion rate and receptive field are shown in Fig 1.
Fig 1 3×3 convolution with different dilation rates
Fig 1(a) is an ordinary convolution (a dilated convolution with expansion rate 1): the data shown is a 3×3 convolution on the original image and the receptive field is 3×3, which is equivalent to no expansion, $r_{rate}=1$. Fig 1(b) is the dilated convolution with expansion rate $r_{rate}=2$, whose receptive field is 7×7. Fig 1(c) is the dilated convolution with expansion rate $r_{rate}=4$, whose receptive field is 15×15. The improvement of dilated convolution over ordinary convolution is that a larger receptive field is obtained. The calculation of the receptive field is expressed by formula (1).
$$v = (k_{size} + 1)(r_{rate} - 1) + k_{size} \tag{1}$$
where $k_{size}$ represents the size of the convolution kernel, $r_{rate}$ represents the expansion rate of the dilated convolution, and $v$ represents the size of the receptive field.
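As a cross-check on the receptive field sizes quoted in the text, the following sketch computes two standard quantities for stride-1 convolutions: the spatial extent of a single dilated kernel, and the receptive field of several dilated layers stacked in series. The function names are assumptions made for illustration. Under these assumptions, the 3×3, 7×7 and 15×15 values quoted for Fig 1 correspond to stacking layers with rates 1, 2 and 4, while a single layer with rate 2 spans 5×5.

```python
def effective_kernel(ksize, rate):
    """Spatial extent of one dilated kernel: ksize + (ksize - 1) * (rate - 1)."""
    return ksize + (ksize - 1) * (rate - 1)


def stacked_receptive_field(ksize, rates):
    """Receptive field of stride-1 dilated convolutions applied in series."""
    rf = 1
    for rate in rates:
        rf += (ksize - 1) * rate  # each layer widens the field by (ksize - 1) * rate
    return rf


print(effective_kernel(3, 2))                 # single layer, rate 2
print(stacked_receptive_field(3, [1]))        # one ordinary 3x3 layer
print(stacked_receptive_field(3, [1, 2]))     # rates 1 then 2
print(stacked_receptive_field(3, [1, 2, 4]))  # rates 1, 2, 4
print(stacked_receptive_field(3, [1, 2, 3]))  # sawtooth rates 1, 2, 3
```

With the sawtooth rates 1, 2, 3 the stacked receptive field is 13×13.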
With dilated convolution, the internal feature information of the image is retained and the loss of resolution caused by pooling is avoided. However, it still has a defect: within a layer that uses a single expansion rate, adjacent output pixels are computed from independent subsets of the input, and the lack of mutual dependence and the resulting information discontinuity cause the “grid” effect. The specific convolution is shown in Fig 2.
Fig 2 Dilated convolution with the same expansion rate
In Fig 2(a), the convolution layer has expansion rate $r_{rate}=2$ and the receptive field is 5×5; Fig 2(b) shows the second convolution layer, whose receptive field is 5×5. Fig 2(c) is the result of one convolution with the dilated convolution of expansion rate $r_{rate}=2$, and the receptive field is 9. If all the dilated convolutions use the same expansion rate, the sampling pattern resembles a chessboard and there is no dependence between distant data. As the depth of the network increases further, important information will be lost. To solve this problem, this paper proposes a sawtooth mixed dilated convolution, as shown in Fig 3.
Fig 3 Convolution results with different expansion rates
As can be seen from Fig 3, with the sawtooth dilated convolution the receptive field is enlarged while the whole area is covered with no blind spots, so features can be extracted effectively and the accuracy of reconstruction is improved. Because the zero-valued entries are not involved in the convolution operation, the complexity remains unchanged.
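The difference between a fixed expansion rate and a sawtooth schedule can be checked with a small one-dimensional sketch. It enumerates which input offsets can influence a single output after stacking 3-tap dilated convolutions. The function names and the specific rate lists (2, 2, 2 versus 1, 2, 3) are assumptions chosen for illustration.

```python
from itertools import product


def reachable_offsets(rates):
    """Input offsets (1-D) that influence one output after stacking
    3-tap, stride-1 dilated convolutions with the given rates."""
    offsets = set()
    for taps in product((-1, 0, 1), repeat=len(rates)):
        offsets.add(sum(t * r for t, r in zip(taps, rates)))
    return offsets


def coverage(rates):
    """Return (covered positions, receptive field width, list of holes)."""
    reach = sum(rates)
    hit = reachable_offsets(rates)
    holes = [o for o in range(-reach, reach + 1) if o not in hit]
    return len(hit), 2 * reach + 1, holes


print(coverage([2, 2, 2]))  # same rate everywhere: odd offsets are never sampled
print(coverage([1, 2, 3]))  # sawtooth rates: every offset in the field is sampled
```

With the same rate in every layer, only every second position inside the 13-wide field contributes, which is the chessboard pattern described above; the sawtooth schedule fills the same width without holes.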
For a CNN, simply increasing the depth will lead to gradient vanishing or gradient explosion. He et al[14] proposed the residual learning network ResNet, which directly connects shallow layers and deep layers by adding skip connections (identity mapping). In ResNet, there is a batch normalization (BN) layer after each convolution layer. Reference[15] points out that the BN layer can improve the generalization ability of the network and accelerate the convergence of training. However, it destroys the spatial information of the image and increases the training parameters to a certain extent, which leads to poor network performance. Therefore, ResNet is improved in this paper: the convolution layers in the residual unit adopt dilated convolution and the BN layer is removed, which helps to obtain better image super-resolution reconstruction results.
Fig 4 Residual network unit
If the same expansion rate is used for every convolution, the convolution kernel will produce the “grid” phenomenon. At the same time, if the convolution kernel is too large, it will carry redundant information and occupy memory unnecessarily. Therefore, the sawtooth dilated residual unit is designed in this paper, and a feature map with the same structure as that of the standard convolution is obtained by cyclic operation, which increases the receptive field of the network while retaining the detailed information of the image, so it can better fit the target boundary and improve the image reconstruction quality. The structures of the sawtooth dilated residual unit and the ResNet residual unit are shown in Fig 4.
The idea of this algorithm is to obtain more image information with a smaller convolution kernel, without increasing the number of layers or the complexity of the network, which speeds up the convergence of the network and at the same time avoids the “blind area” phenomenon caused by holes. Firstly, the input image is passed through the residual network, which is trained with dilated convolutions of different expansion rates, and then the super-resolution image is reconstructed by linear addition of the input low-resolution image and the residual image output by the residual network.
As the depth of the network increases, we find that the deeper the CNN, the better the performance. However, once the network reaches a certain depth, increasing the number of layers affects the convergence speed and decreases the receptive field, which reduces the context information and degrades the reconstruction effect. He et al[14] proposed the deep residual network in 2015. In image classification on ImageNet, the classification accuracy can be improved by increasing the network depth, and the problems of gradient vanishing and network performance degradation can be solved by residual learning.
Fig 5 Network structure of single image super-resolution reconstruction method
Based on reference[13], a residual network with sawtooth dilated convolution is constructed in this paper. The overall structure of the network is shown in Fig 5 and includes 20 convolution layers. According to function, it is divided into 5 parts: the low-resolution image input part; the convolution feature extraction part; then 6 dilated residual blocks, each of which contains 3 residual units; and next the residual image feature layer. With the dilated residual blocks, the feature map size output by each convolution layer remains unchanged.
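The layer count can be sanity-checked with a short sketch. It assumes one ordinary 3×3 feature extraction convolution, 6 residual blocks of three 3×3 dilated convolutions with rates 1, 2, 3, and one ordinary 3×3 convolution that produces the residual image. The layer names, the rate of the first and last layers, and the `build_layer_plan` helper are assumptions made for illustration, not details confirmed by the paper.

```python
def build_layer_plan(num_blocks=6, block_rates=(1, 2, 3)):
    """List (layer name, expansion rate) pairs for the assumed architecture."""
    plan = [("feature_extraction", 1)]           # assumed ordinary convolution
    for b in range(num_blocks):
        for r in block_rates:
            plan.append((f"block{b + 1}_rate{r}", r))
    plan.append(("residual_image", 1))           # assumed ordinary convolution
    return plan


plan = build_layer_plan()
num_layers = len(plan)  # 1 + 6 * 3 + 1
# Stride-1 3x3 kernels: each layer adds 2 * rate to the receptive field.
receptive_field = 1 + sum(2 * rate for _, rate in plan)

print(num_layers, receptive_field)
```

Under these assumptions the plan contains 20 convolution layers, which agrees with the count stated above.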
The network input is a single color channel of the image data, preprocessed by bicubic interpolation.
The dilated residual sawtooth convolutional neural network super-resolution reconstruction proposed in this paper mainly includes the following three steps:
(1) Feature extraction layer (Conv+ReLU layer). Except that the size of the convolution kernel in the last layer is 1 pixel, the convolution feature size of the other middle layers is 3×3×64 pixels. In the feature extraction layer, the input low-resolution image x is convolved with a convolution kernel of size 3×3×1 pixels to obtain n1 feature maps (here n1=224), where 1 represents the number of channels. The feature extraction layer contains a convolution layer and an activation function, and each neuron obtained from this layer is transferred to the residual unit module. The feature extraction process is expressed as follows:
$$B_0 = f(W_1 * x + b_1)$$
where $*$ is the convolution operation; $W_1$ is the convolution kernel of the first convolution layer; $b_1$ is the bias term, whose dimension is consistent with that of the convolution kernel; $B_0$ is the input feature of the first residual block, extracted from the input $x$; and $f$ is the ReLU activation function.
In this paper, the use of ReLU in the network accelerates training and shortens the convergence time of the model, and at the same time it restrains gradient vanishing to a certain extent. Its performance is better than that of the traditional Sigmoid activation function[16] (with which gradient explosion and gradient vanishing are caused by gradient back-propagation in deep convolutional networks). The expression is as follows:
$$f(x_i) = \max(0, x_i)$$
where $x_i$ is the input of the ReLU function and $f(x_i)$ is the output of the ReLU function.
(2) Nonlinear mapping. For the input low-resolution image, the proposed sawtooth residual network is used for training. The residual module is composed of 6 dilated residual units, each of which is composed of 3 dilated convolution layers and 3 ReLU nonlinear activation functions. Each residual unit has two parts: skip connection and identity mapping. In this way, the residual information is retained and the image features are passed on through the skip connection, which helps to maintain the diversity of features.
A convolution kernel of 3×3 pixels is used, and 3 convolution layers with expansion rates 1, 2 and 3 are connected in series to form a residual block. To keep the size of the convolution kernel unchanged, a 13×13 feature map output is guaranteed at the end of the network calculation. After the dilated residual block, the receptive field is expanded and more of the original image information is obtained. The expression of each residual unit is as follows:
$$H_m = F(H_{m-1}) + H_{m-1}$$
where $H_{m-1}$ and $H_m$ are the input and output of the m-th residual unit respectively, and $F$ is the learned residual mapping, that is:
where $W_i^m$, $i=1,2,3$, is the learned weight of the i-th convolution layer (the bias term is omitted to simplify the formula), $f$ is the ReLU function and $*$ is the convolution operation.
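A one-dimensional sketch may help make the residual unit concrete. It assumes a form consistent with the symbol definitions above, namely $F(H_{m-1}) = f(W_3^m * f(W_2^m * f(W_1^m * H_{m-1})))$ with a ReLU after each of the three convolutions, zero padding so that the length is preserved, and 3-tap kernels in place of 3×3 kernels. The placement of the activations, the padding scheme and all function names are assumptions for illustration, not the authors' implementation.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def dilated_conv1d(signal, weights, rate):
    """3-tap dilated convolution (cross-correlation, as in deep learning
    frameworks) with zero padding, so the output length equals the input."""
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(weights):   # taps at offsets -rate, 0, +rate
            j = i + (k - 1) * rate
            if 0 <= j < n:                # positions outside the signal count as zero
                acc += w * signal[j]
        out.append(acc)
    return out


def sawtooth_residual_unit(h_prev, layer_weights, rates=(1, 2, 3)):
    """H_m = F(H_{m-1}) + H_{m-1}, with F built from three dilated layers."""
    h = h_prev
    for weights, rate in zip(layer_weights, rates):
        h = relu(dilated_conv1d(h, weights, rate))
    return [a + b for a, b in zip(h, h_prev)]  # skip connection


h_in = [1.0, -2.0, 3.0, 0.5, 4.0]
identity_taps = [[0.0, 1.0, 0.0]] * 3          # hypothetical weights
print(sawtooth_residual_unit(h_in, identity_taps))
```

Because of the skip connection, a unit whose weights are all zero simply returns its input, which is the property that makes deep stacks of such units easier to train.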
(3) Image reconstruction and output. Firstly, the output feature map of the residual network is taken as the input of the last convolution layer, where it is convolved with a 3×3 pixel convolution kernel to generate a residual image of the same size as the input image. This residual image is then linearly superimposed on the interpolated input image to output the final super-resolution image. The structure of the image reconstruction phase is shown in Fig 6.
Fig 6 Image reconstruction phase
The sawtooth dilated residual is applied in the network so that the output image has the same size as the input image, and the problem of network convergence is solved. Here $x$ represents the input image after bicubic interpolation, $f_{Res}(x)$ represents the residual image obtained after the input image has passed through the whole neural network, $y$ represents the original high-resolution image, and $y^*$ represents the super-resolution image predicted by the network, that is:
$$y^* = x + f_{Res}(x)$$
In this paper, the mean square error (MSE) function is used as the loss function of the whole network. The optimal solution is obtained by minimizing the MSE between the generated image and the original high-resolution image. The loss function is calculated as:
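$$L(\Theta) = \frac{1}{n}\sum_{i=1}^{n}\left\| y_i - f(x_i;\Theta) \right\|^2$$
where $n$ is the number of samples, $y_i$ is the high-resolution image, $f(x_i;\Theta)$ is the predicted output image of the network, and $\Theta=\{w_1,w_2,\cdots,b_1,b_2,\cdots\}$.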
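The global residual connection and the MSE loss can be sketched in plain Python. Images are represented as nested lists, and the loss is taken as the squared error summed over each image and averaged over the samples, which is one common convention and an assumption here. The function names `reconstruct` and `mse_loss` are illustrative only.

```python
def reconstruct(x, residual):
    """Predicted image: interpolated input plus the network's residual image."""
    return [[a + b for a, b in zip(row_x, row_r)] for row_x, row_r in zip(x, residual)]


def mse_loss(targets, predictions):
    """Squared error summed over each image, averaged over the n samples."""
    total = 0.0
    for y, p in zip(targets, predictions):
        total += sum((a - b) ** 2 for row_y, row_p in zip(y, p) for a, b in zip(row_y, row_p))
    return total / len(targets)


# Hypothetical 2x2 example: bicubic input, predicted residual, ground truth.
x = [[10.0, 20.0], [30.0, 40.0]]
residual = [[1.0, -1.0], [0.0, 2.0]]
y = [[11.0, 20.0], [30.0, 42.0]]

y_star = reconstruct(x, residual)
print(y_star, mse_loss([y], [y_star]))
```

Because the network only has to predict the residual, a residual of all zeros already reproduces the interpolated input, so training starts from the bicubic baseline rather than from scratch.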
The experiment environment of the image super-resolution reconstruction algorithm is shown in Tab 1.
Tab 1 Experiment environment
Because the network is relatively deep, the algorithm needs larger training sets to get better training results. The same 291 images as in reference[13] are selected as the training set; they consist of the 91 images proposed in reference[18] and 200 images from BSD (Berkeley Segmentation Dataset)[13]. To make full use of the deep network, the dataset images were flipped horizontally, vertically, and both horizontally and vertically, and scaled by the coefficients 0.9 and 0.8. The images were then saved, giving a total of 5 820 images, with an image size of no more than 512×512.
When training with this network, the training images are transformed into YCbCr space[19]. Human beings are more sensitive to changes in brightness (the Y channel) than to changes in color (the Cb and Cr channels), so the network designed in this paper only deals with the Y channel. The convolution kernel size is 3×3, and the number of feature map channels is 64. The optimization algorithm used in training is Adam[20]. Compared with stochastic gradient descent (SGD), the Adam optimization method is more flexible and adaptive; it keeps the learning rate of each iteration within a certain range, making the parameter learning more stable. The initial learning rate of the network is set to $10^{-4}$ and is reduced to half of its value every 100 000 iterations; the momentum parameter is 0.9, and mini-batch training is adopted with a batch size of 64.
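Two pieces of this training setup are easy to sketch. The first is extracting a luminance value from RGB, here using the full-range BT.601 weights, which is an assumption since the text does not say which YCbCr variant is used. The second is a step schedule that starts at $10^{-4}$ and halves every 100 000 iterations, which is one reading of the training description. Function and parameter names are illustrative.

```python
def rgb_to_y(r, g, b):
    """Luminance (Y) from 8-bit RGB using full-range BT.601 weights (assumed)."""
    return 0.299 * r + 0.587 * g + 0.114 * b


def learning_rate(iteration, base_lr=1e-4, step=100_000, factor=0.5):
    """Step decay: multiply the rate by `factor` once every `step` iterations."""
    return base_lr * factor ** (iteration // step)


print(rgb_to_y(255, 255, 255))
for it in (0, 99_999, 100_000, 250_000):
    print(it, learning_rate(it))
```

Only the Y channel would be fed to the network in this setup; the Cb and Cr channels can be upscaled by plain interpolation and recombined afterwards, which is a common practice and an assumption here.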
In the experiment, Set5 and Set14 are used as test sets, the original high-resolution image is X, the magnification is set to s=2, 3, 4, and the preprocessed image is used as the input of the network.
The evaluation of image super-resolution reconstruction verifies whether the reconstruction method meets expectations. At present, there are two ways to evaluate the quality of reconstructed images: subjective evaluation and objective evaluation.
Subjective evaluation mainly refers to judging the perceptual difference of the reconstructed image by the human eye and prior knowledge, such as the similarity between the texture, color and other features of the image and those of the original high-definition image. Subjective evaluation mainly depends on human aesthetic standards, and various factors introduce errors into the judgment of image quality.
Objective evaluation quantitatively analyzes image resolution and evaluates image quality using specific indicators. The common objective evaluation indexes include the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM). PSNR measures whether there is distortion in the reconstructed image by calculating the ratio of the maximum pixel value to the power of the additive noise. The larger the value is, the closer the reconstructed image is to the original high-resolution image in performance and authenticity. This calculation is directly related to the mean square error (MSE) of the image. The calculation formulas are given in (7) and (8).
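$$MSE = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W}\left(I(i,j) - I_{SR}(i,j)\right)^2 \tag{7}$$
$$PSNR = 10\log_{10}\left(\frac{L^2}{MSE}\right) \tag{8}$$
where $H$ and $W$ are the height and width of the image, $MSE$ is the mean square error, $I_{SR}$ is the reconstructed image, $I$ is the original image, and $L$ is usually taken as 255. The larger the PSNR value is, the smaller the distortion of the output image is, and the closer it is to the original image.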
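A minimal PSNR computation following the standard definition, with images given as nested lists of pixel values and a peak value of 255, might look as follows. The function names and the tiny test images are assumptions for illustration.

```python
import math


def mse(img_a, img_b):
    """Mean square error over all H x W pixels."""
    h, w = len(img_a), len(img_a[0])
    total = sum((img_a[i][j] - img_b[i][j]) ** 2 for i in range(h) for j in range(w))
    return total / (h * w)


def psnr(original, reconstructed, peak=255.0):
    """PSNR in dB; identical images give infinity."""
    err = mse(original, reconstructed)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / err)


# Hypothetical 2x2 images that differ by one gray level in two pixels.
original = [[100, 100], [100, 100]]
reconstructed = [[101, 99], [100, 100]]
print(mse(original, reconstructed), psnr(original, reconstructed))
```

In super-resolution benchmarks PSNR is usually computed on the Y channel only, and often after cropping a border of a few pixels; neither convention is stated in the text, so both are left out of this sketch.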
SSIM is a comprehensive measure of the similarity in structure, brightness and contrast between two images. The closer the value is to 1, the better the output image quality is. The specific formula is shown in formula (9).
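$$SSIM(I, I_{SR}) = \frac{(2\mu_I \mu_{I_{SR}} + C_1)(2\sigma_{I I_{SR}} + C_2)}{(\mu_I^2 + \mu_{I_{SR}}^2 + C_1)(\sigma_I^2 + \sigma_{I_{SR}}^2 + C_2)} \tag{9}$$
where $\mu_I$ and $\mu_{I_{SR}}$ are the mean values of the original image $I$ and the reconstructed image $I_{SR}$; $\sigma_I^2$ and $\sigma_{I_{SR}}^2$ are the variances of the original image $I$ and the reconstructed image $I_{SR}$; $\sigma_{I I_{SR}}$ is the covariance of the two; and $C_1$ and $C_2$ are constants.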
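The SSIM definition can be sketched with whole-image statistics. Note that practical SSIM implementations usually compute these statistics in local sliding windows and average the results; the global version below is a simplification. The constants $C_1=(0.01L)^2$ and $C_2=(0.03L)^2$ are the commonly used defaults and are an assumption here, as are the function and parameter names.

```python
def ssim_global(img_a, img_b, peak=255.0, k1=0.01, k2=0.03):
    """SSIM from whole-image mean, variance and covariance (simplified)."""
    a = [v for row in img_a for v in row]
    b = [v for row in img_b for v in row]
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    var_a = sum((v - mu_a) ** 2 for v in a) / n
    var_b = sum((v - mu_b) ** 2 for v in b) / n
    cov = sum((p - mu_a) * (q - mu_b) for p, q in zip(a, b)) / n
    c1, c2 = (k1 * peak) ** 2, (k2 * peak) ** 2   # assumed default constants
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


# Hypothetical 2x2 images.
img = [[50.0, 100.0], [150.0, 200.0]]
flat = [[125.0, 125.0], [125.0, 125.0]]
print(ssim_global(img, img), ssim_global(img, flat))
```

An image compared with itself scores 1, while a structureless image with the same mean scores much lower, since the covariance term vanishes.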
The algorithm in this paper is compared with representative methods in the field of single image super-resolution reconstruction, namely the Bicubic algorithm, the FSRCNN algorithm and the VDSR algorithm. The subjective comparison is shown in Fig 7~Fig 9, where the red box marks the region of interest.
Fig 7 Reconstruction results of different algorithms on butterfly
Fig 8 Reconstruction results of different algorithms on baboon
Fig 9 Reconstruction results of different algorithms on woman
A comparison of Fig 7~Fig 9 shows that the local receptive field of the Bicubic algorithm is small and the available regional features are limited, so its high-resolution reconstructed image still suffers from poor edge definition and the reconstruction effect is poor. The results of the FSRCNN and VDSR algorithms on the baboon’s hair and mouth are worse than the original high-resolution image.
The algorithm of this paper is effective in distinguishing image edges and improving texture details, and it clearly improves the clarity of the texture details of the woman’s head ornament.
Tab 2 PSNR and SSIM results of different methods on Set5 dataset (magnification=2, 3, 4)
PSNR and SSIM are used to evaluate the proposed algorithm objectively, and it is compared with the Bicubic, FSRCNN and VDSR algorithms respectively. The input images are magnified by factors of 2, 3 and 4 on the Set5 and Set14 test sets. The comparison results of the different methods are listed in Tab 2.
It can be seen from Tab 2 that, compared with the Bicubic, FSRCNN and VDSR algorithms, the average PSNR of the proposed algorithm is increased by 0.33 dB, 0.338 dB and 0.324 dB respectively when the magnification factor is 2, by 0.282 dB, 0.208 dB and 0.898 dB when the magnification factor is 3, and by 0.4 dB, 0.464 dB and 1.048 dB when the magnification factor is 4. SSIM is increased by 0.041 3, 0.009 6 and 0.023 2 when the magnification factor is 2, by 0.016 0, 0.006 6 and 0.005 2 when the magnification factor is 3, and by 0.011 6, 0.007 3 and 0.011 2 when the magnification factor is 4.
Tab 3 PSNR and SSIM results of different methods on Set14 dataset (magnification=2, 3, 4)
It can be seen from Tab 3 that, compared with the Bicubic, FSRCNN and VDSR algorithms, the average PSNR of the proposed algorithm is increased by 0.45 dB, 0.35 dB and 0.23 dB respectively when the magnification factor is 2, by 1.8 dB, 0.67 dB and 0.05 dB when the magnification factor is 3, and by 1.3 dB, 1.0 dB and 0.8 dB when the magnification factor is 4. SSIM is increased by 0.05, 0.04 and 0.02 when the magnification factor is 2, by 0.8, 0.009 6 and 0.001 7 when the magnification factor is 3, and by 0.039 9, 0.027 and 0.015 when the magnification factor is 4. The quantitative analysis shows that the sawtooth dilated residual convolution can, to a certain extent, enhance the texture information and reconstruction effect of the reconstructed image, reduce the operation time and improve the efficiency of image reconstruction.
In this paper, an improved image super-resolution reconstruction method based on the VDSR algorithm is proposed. Firstly, the low-resolution image is interpolated and fed into the network, and the image features are extracted by a single convolution. Secondly, the nonlinear mapping of the image features is realized by 6 consecutive sawtooth dilated residual convolutions, which effectively avoids the “grid” phenomenon caused by using the same expansion rate. The receptive field is expanded without changing the image size, the output residual image is obtained, and the problems of gradient vanishing and gradient explosion in the network are solved. Finally, the original low-resolution image and the residual image are linearly added to output the final high-resolution image, which improves the quality and efficiency of reconstruction. The experimental results show that, compared with the other three methods, the proposed algorithm achieves better PSNR and SSIM results for a single image. Compared with the original color high-resolution image, there is still a gap in texture and clarity. In future work, the sawtooth dilated convolution will be applied to a better deep learning framework, and more combinations of expansion rates will be tried to achieve image super-resolution reconstruction and make better use of image features.