Image-to-Image Style Transfer Based on the Ghost Module

Computers, Materials & Continua, 2021, Issue 9

Yan Jiang, Xinrui Jia, Liguo Zhang2,*, Ye Yuan, Lei Chen and Guisheng Yin

1 College of Computer Science and Technology, Harbin Engineering University, Harbin, 150001, China

2 Heilongjiang Hengxun Technology Co., Ltd., Harbin, 150001, China

3 College of Engineering and Information Technology, Georgia Southern University, Statesboro, GA 30458, USA

Abstract: The technology for image-to-image style transfer (a prevalent image processing task) has developed rapidly. The purpose of style transfer is to extract a texture from the source image domain and transfer it to the target image domain using a deep neural network. However, the existing methods typically have a large computational cost. To achieve efficient style transfer, we introduce a novel Ghost module into the GANILLA architecture to produce more feature maps from cheap operations. Then we utilize an attention mechanism to transform images with various styles. We optimize the original generative adversarial network (GAN) by using more efficient calculation methods for image-to-illustration translation. The experimental results show that the images generated by our proposed method accord with human visual perception while maintaining image quality. Moreover, our proposed method avoids the high computational cost and resource consumption of existing style transfer methods. Comparisons of subjective and objective evaluation indicators show that our proposed method outperforms existing methods.

Keywords: Style transfer; generative adversarial networks; Ghost module; attention mechanism; human visual habits

1 Introduction

Deep learning has shown excellent performance in various image processing tasks, e.g., image generation [1], object detection [2] and tracking [3], image classification [4], scene text recognition [5], and style transfer. Generally, the goal of style transfer is to learn and extract the styles of the source image, and then apply the extracted styles to the target image. Early style transfer research mainly adopted a supervised learning strategy. Style transfer based on supervised learning requires paired training data, i.e., source and target images that share the same content but differ in style. However, obtaining a large amount of paired data is a time-consuming and costly operation. Hence, researchers have proposed semi-supervised or unsupervised learning-based methods for style transfer to solve the above problems.

Zhu et al. [6] proposed a cycle-consistent adversarial network (CycleGAN) for cross-domain image style transfer. It breaks the strict requirement of supervised learning-based methods for paired training data. CycleGAN can use unpaired training data to complete the style transfer, and the image contents do not need to be the same. Since unpaired data can be used in style transfer, time is saved when collecting data. A number of methods based on unpaired data have since been proposed. For instance, the dual generative adversarial network (DualGAN) [7] was proposed to achieve style transfer based on unpaired training data. Chen et al. [8] proposed a novel GAN-based framework named CartoonGAN for the cartoonization of photos using unpaired training data. Unlike CycleGAN, CartoonGAN introduced a semantic content loss and an edge-promoting adversarial loss for coping with the abundant style variation within photos and cartoons and for preserving clear edges, respectively. Both CycleGAN and DualGAN can effectively transfer the different styles of images. However, they cannot transfer the style and content of the image simultaneously. To address this problem, Hicsonmez et al. [9] proposed a GAN architecture (named GANILLA) for image-to-illustration translation. However, the above methods have a common shortcoming: a large computational cost. Besides the computational cost, the quality of the generated image, the number of parameters, and the number of floating-point operations (FLOPs) also need to be considered.

Figure 1: Sample with different styles obtained by our proposed method

To achieve efficient style transfer using unpaired training data, we adopted GANILLA, which uses low-level features to retain content while transforming styles. Then we made the following improvements. 1) We redesigned the convolutional module of Residual Neural Network-18 (ResNet-18) [10], where a cheap linear transformation (i.e., the convolution operation of GhostNet [11]) was used to build a lightweight network architecture. This improvement reduces the number of parameters and FLOPs. 2) An attention mechanism was introduced into our proposed network. By adding an attention layer to each of the second through fourth layers of the network, we enhance the useful features and suppress the less useful ones. Most existing style transfer studies mainly compare oil painting style results. In this paper, we present several different styles of generated images, e.g., seasonal style transfer, stick figure style transfer, and cartoon style transfer; these are shown in Fig. 1.

The rest of this paper is organized as follows. In Section 2, related work is described. In Section 3, we introduce the Ghost module, the attention mechanism, the network architecture of our proposed method, and the loss function used. In Section 4, the complexity analysis and implementation details are described, and images generated in various styles are demonstrated. In Section 5, we adopt two evaluation indicators, i.e., subjective evaluation and objective evaluation, to evaluate the results of our method and the comparison methods. Finally, in Section 6, conclusions are presented and future research is discussed.

2 Related Work

In recent years, the GAN [12] has been widely employed in the field of deep learning [13]. It consists of a generator and a discriminator. The purpose of the generator is to learn the feature distribution of the training data. The discriminator is employed as a classifier to determine whether the data is generated by the generator or comes from real samples. Hence, the training process of the GAN can be regarded as an adversarial game. The adversarial training is complete once the generator can output data whose distribution is the same as that of the real data, i.e., the discriminator cannot distinguish the generated data from real data. Benefiting from its strong generating ability, the GAN has also been employed in style transfer [14].

Style transfer is a hot topic in computer vision. The existing methods can be divided into two strategies according to the training data used: paired training data or unpaired training data. Style transfer methods based on supervised learning directly require paired training data. For example, Isola et al. [15] explored a GAN suitable for image-to-image translation tasks; their method is called Pix2Pix. Pix2Pix differs from prior works in its generator and discriminator architectures: the U-Net and PatchGAN classifiers are employed as the generator and the discriminator of Pix2Pix, respectively. To address the unstable training and unsatisfactory generated image quality of Pix2Pix, Wang et al. [16] proposed Pix2PixHD. They used a coarse-to-fine generator and a multi-scale discriminator architecture, and modified the adversarial loss to achieve style transfer. Experimental results indicated that Pix2PixHD could effectively generate high-resolution images. Although the above methods can effectively transfer different styles, obtaining paired data is very difficult, time-consuming, and laborious. Compared with paired data, unpaired data is easier to obtain. Hence, researchers have proposed many methods based on unpaired data.

CycleGAN is a pioneering method that performs unpaired image style transfer based on the idea of unsupervised learning. Besides the adversarial loss of the original GAN, CycleGAN also utilizes a cycle consistency loss, which consists of forward cycle consistency and backward cycle consistency. By combining the adversarial loss and the cycle consistency loss, CycleGAN has achieved good performance on several tasks, e.g., collection style transfer, photo enhancement, season transfer, and object transfiguration. However, CycleGAN faces problems with poor quality, mapping ambiguity, and model sensitivity. Li et al. [17] proposed an asymmetric GAN (AsymGAN) to solve these problems. AsymGAN uses an auxiliary variable, which can provide more information when transferring images from an information-rich domain to an information-poor domain. AsymGAN can generate better quality images and mitigate the sensitivity and convergence problems. After CycleGAN was proposed, several style transfer methods based on the GAN and using unpaired data followed. For instance, CartoonGAN made the GAN's architecture simpler and more effective. Moreover, two novel loss functions were designed, i.e., the semantic content loss and the edge-promoting adversarial loss. CartoonGAN can be trained on photos and cartoon images directly, and hence is simple to use. This method not only constructs sparse regularization in the VGG network [18] and realizes the conversion between photos and cartoons, but also makes the photos clearer. Later, a unified quality-aware GAN (QGAN) [19] was designed to solve the data underrepresentation problem. The QGAN uses a multi-precision quantization based on the expectation-maximization algorithm, which provides the optimal bit configuration under a given quality loss. Emami et al. [20] proposed a spatial attention GAN (SPAGAN) model that introduced the attention mechanism to the GAN architecture. SPAGAN uses the attention mechanism to help the generator pay attention to the most discriminative regions between the source and target domains.

Although the above style transfer methods show significant progress, they still cannot solve the complex trade-off between image style and content. CycleGAN is very successful in transferring style, but it is not as successful in preserving content; CartoonGAN is successful in preserving image content, but it has shortcomings in delivering style. To this end, Hicsonmez et al. developed GANILLA, which can produce obvious styles while still retaining content. By migrating the style of a given illustration, it achieves the transition from natural images to painting-style illustrations. Furthermore, low-level and high-level features are merged by using skip connections and upsampling. Overall, GANILLA is a relatively successful example of style transfer using unpaired data. It overcomes the shortcomings of earlier methods and can maintain content while transferring the style. However, the style transfer process of GANILLA needs a large number of parameters and FLOPs. Therefore, we employed the Ghost module to construct a lightweight style transfer network that reduces the number of parameters and FLOPs.

3 Proposed Method

3.1 Ghost Module for More Features

A well-trained convolutional neural network (CNN) usually includes rich feature maps to ensure a superior semantic understanding of the input data. We adopt the convolution operation of GhostNet to generate more feature maps with fewer parameters, as shown in Fig. 2. The number of ordinary convolution layers needs to be strictly controlled. We use a series of simple linear operations to produce more feature maps from the intrinsic feature maps of the ordinary convolution layers. The linear operation is a depthwise convolution [11]. Unlike the ordinary convolution operation, depthwise convolution operates on each channel separately; hence, the number of filters is the same as the number of channels. In contrast, in the ordinary convolution operation, each filter operates on every channel of the input simultaneously. New per-channel feature maps are obtained after completing the convolution operation in each channel. Then we perform a standard 1×1 cross-channel convolution operation on this new batch of channel feature maps. Utilizing the depthwise convolution can effectively reduce the number of parameters and the computational complexity without changing the size of the output feature maps.

Figure 2: An illustration of the Ghost module

Let X ∈ R^(c×h×w) be the input feature maps, where c denotes the number of input channels, and h and w denote the maps' height and width, respectively. The following formula illustrates an ordinary convolution that generates n feature maps:

Y = X * f + b,  (1)

where Y ∈ R^(n×h′×w′) represents the output feature maps with n channels, f ∈ R^(c×k×k×n) refers to the convolution filters of the current layer, b is a bias term, and * represents the convolution operation. Additionally, the height and width of the output feature maps are represented by h′ and w′, respectively, while k×k is the kernel size of the convolution filters. In such a convolution, the number of FLOPs is n × h′ × w′ × c × k × k. This number is usually very large because the number of filters n and the number of channels c are typically large.

As indicated by the above formula, the number of parameters (in f and b) is dominated by the dimensions of the input and output feature maps. There is usually significant redundancy in the output feature maps, which leads to a decrease in computational efficiency. Many of these redundant output feature maps resemble cheap transformations of a small set of intrinsic feature maps. The Ghost module therefore generates only the intrinsic feature maps with ordinary convolution filters, so their number is relatively small. Specifically, the primary convolution generates m intrinsic feature maps Y′ ∈ R^(m×h′×w′):

Y′ = X * f′,  (2)

where f′ ∈ R^(c×k×k×m) is the filter used, m ≤ n, and the bias terms are omitted for simplicity. To keep the spatial size of the output feature maps consistent, the hyper-parameters (e.g., filter kernel size, stride, and padding) are the same as those of the ordinary convolution. To acquire the required n feature maps, we apply cheap linear operations to each intrinsic feature map in Y′ to obtain s Ghost feature maps:

y_ij = Φ_i,j(y′_i),  i = 1, ..., m,  j = 1, ..., s,  (3)

where y′_i represents the i-th intrinsic feature map in Y′, and Φ_i,j is the j-th (except the last) linear operation used to generate the j-th Ghost feature map y_ij. In other words, each intrinsic feature map y′_i can yield one or more Ghost feature maps {y_ij}, j = 1, ..., s. The last operation Φ_i,s is the identity mapping used to keep the intrinsic feature maps, as shown in Fig. 2. From Eq. (3), we obtain n = m × s feature maps Y = [y_11, y_12, ..., y_ms] as the output of the Ghost module; this is also shown in Fig. 2. Note that the cost of performing the linear operation Φ on every channel is significantly lower than that of the ordinary convolution.
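To make the construction concrete, the following is a minimal PyTorch sketch of a Ghost module. It assumes a 3×3 depthwise convolution as the cheap linear operation Φ and a ratio s = 2 by default; the layer names, the use of instance normalization, and the default hyper-parameters are illustrative assumptions rather than the exact implementation of our network.

```python
import math
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Minimal Ghost module sketch: a primary convolution produces m intrinsic
    feature maps, a cheap depthwise convolution generates the ghost maps, and
    the two sets are concatenated to give the required n output channels."""
    def __init__(self, in_channels, out_channels, ratio=2, kernel_size=1, dw_size=3):
        super().__init__()
        self.out_channels = out_channels
        intrinsic = math.ceil(out_channels / ratio)    # m intrinsic maps
        ghost = intrinsic * (ratio - 1)                # m * (s - 1) ghost maps

        # Primary (ordinary) convolution: the expensive part, kept small.
        self.primary = nn.Sequential(
            nn.Conv2d(in_channels, intrinsic, kernel_size,
                      padding=kernel_size // 2, bias=False),
            nn.InstanceNorm2d(intrinsic),
            nn.ReLU(inplace=True),
        )
        # Cheap operation Phi: a depthwise convolution applied per channel.
        self.cheap = nn.Sequential(
            nn.Conv2d(intrinsic, ghost, dw_size,
                      padding=dw_size // 2, groups=intrinsic, bias=False),
            nn.InstanceNorm2d(ghost),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        y_intrinsic = self.primary(x)                  # intrinsic maps, cf. Eq. (2)
        y_ghost = self.cheap(y_intrinsic)              # ghost maps, cf. Eq. (3)
        out = torch.cat([y_intrinsic, y_ghost], dim=1)
        return out[:, :self.out_channels, :, :]

# Example: x = torch.randn(1, 64, 56, 56); GhostModule(64, 128)(x).shape -> (1, 128, 56, 56)
```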

3.2 Network Architecture

For the entire generator network, we use the same overall architecture as GANILLA to merge low-level features with high-level features while transforming styles. The model consists of two stages, down-sampling and up-sampling, and the down-sampling stage uses a modified ResNet-18 network. However, the parameters and computations of the ResNet-18 network are extensive. To address this problem, several approaches have been proposed to compress deep neural networks (DNNs), e.g., network pruning [21,22], low-bit quantization [23,24], and knowledge distillation [25,26]. Redesigning an efficient network architecture is also an effective solution; recently, there has been considerable success with networks such as MobileNet [27,28], ShuffleNet [29], and GhostNet. Inspired by GhostNet, we apply the Ghost module to style transfer and redesign the convolution module of ResNet-18. In this way, a lightweight network architecture is built.

As shown in Fig. 3, the down-sampling stage starts with a Ghost module layer, followed by instance normalization (IN) [30], a rectified linear unit (ReLU), and a max-pooling layer. It then contains four layers (Layer-I to Layer-IV), each of which consists of two residual blocks (RBs). In Layer-I, each RB starts with one Ghost module layer, followed by IN and ReLU, and then a second Ghost module and instance normalization layer. In Layers II-IV, an SE layer (see Section 3.3) is added after each RB. Finally, the concatenated feature maps are fed to the last convolution and ReLU.
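As a sketch of how these pieces compose, the following residual block reuses the GhostModule class from the sketch in Section 3.1 and takes an optional SE layer (used in Layers II-IV). Applying the SE layer after the skip connection is our reading of Fig. 3, so treat the exact placement as an assumption.

```python
import torch.nn as nn

class GhostResidualBlock(nn.Module):
    """Residual block of the down-sampling stage: Ghost module -> IN -> ReLU ->
    Ghost module -> IN, plus an identity skip connection. Assumes the
    GhostModule class defined in the Section 3.1 sketch."""
    def __init__(self, channels, se_layer=None):
        super().__init__()
        self.body = nn.Sequential(
            GhostModule(channels, channels, kernel_size=3),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            GhostModule(channels, channels, kernel_size=3),
            nn.InstanceNorm2d(channels),
        )
        # SE layer is used in Layers II-IV; Layer-I uses the identity instead.
        self.se = se_layer if se_layer is not None else nn.Identity()

    def forward(self, x):
        return self.se(self.body(x) + x)
```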

Figure 3: The generator network architecture of our proposed method

Our proposed method first performs down-sampling to extract structural features, and then up-sampling to generate the image. The down-sampling stage is based on the modified ResNet-18 network, and the up-sampling stage combines low- and high-level features through skip connections. In each layer of the down-sampling stage, the features of the previous layers are connected to integrate low-level features. These lower layers ensure that the output image contains the input's content, i.e., morphological features, edges, shapes, and other information. In the up-sampling stage, the outputs of each down-sampling layer are used to feed the lower-level features to the summation layers. Unlike down-sampling, up-sampling is conducted on the output of Layer-IV using long skip connections that add lower-level features from the down-sampling stage; these connections contribute to the content of the generated image. All filters in the up-sampling stage have a 1×1 kernel, and the final 3-channel translated image is output by a convolution layer with a 7×7 kernel. Our method uses a 70×70 PatchGAN [15] as the discriminator, which is composed of three blocks. For the first block, the number of filters is set to 64; for each consecutive block, it is set to 128.

3.3 Attention Mechanism

We introduce squeeze-and-excitation networks (SENet) [31] into each residual block of the second, third, and fourth layers of the down-sampling stage. This improves network performance by explicitly modeling the interdependence between the feature channels. Rather than introducing a new spatial dimension for the fusion of the feature channels, we utilize a feature recalibration strategy. Through this learning strategy, the network automatically learns the importance of each feature channel, and thus promotes useful features and suppresses less useful ones.

Fig. 4 is a schematic diagram of the SE module. Given an input X ∈ R^(C′×H′×W′), where C′ is the number of feature channels, the feature maps U ∈ R^(C×H×W) are obtained after a general transformation such as convolution (F_tr). The SE module differs from a traditional CNN in that the previously obtained features are recalibrated through three operations. The first operation is the squeeze: features are compressed along the spatial dimensions, and every two-dimensional feature channel (H×W) is transformed into a real number. This real number has, to a certain extent, a global receptive field, and the output dimension matches the number of input feature channels. It describes the global distribution of the feature channel, which is very useful in style transfer. The second operation is excitation, which is similar to the self-gating operation of a recurrent neural network (RNN) [32]. A parameter w is applied to each feature channel to generate weights, and it is learned to explicitly model the correlation between the feature channels. Finally, there is a reweighting operation: the output weights of the excitation are regarded as the importance of each feature channel, and the normalized weights are applied to the features by multiplying them channel by channel, which introduces the attention mechanism.
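The following is a minimal PyTorch sketch of the squeeze, excitation, and reweighting steps described above; the reduction ratio of 16 and the two-layer gating network are standard SENet choices and are assumptions here, not values reported in this paper.

```python
import torch
import torch.nn as nn

class SELayer(nn.Module):
    """Squeeze-and-excitation block: global average pooling (squeeze), a small
    gating network (excitation), and channel-wise reweighting."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)       # H x W -> 1 x 1 per channel
        self.excitation = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),                            # per-channel weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.squeeze(x).view(b, c)               # global channel descriptors
        w = self.excitation(w).view(b, c, 1, 1)      # learned channel importance
        return x * w                                 # reweight each channel
```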

3.4 Loss Function

We next describe how the generator and discriminator of our proposed method are optimized. Our loss function is similar to that of GANILLA and consists of two components: adversarial loss and cycle consistency loss. These losses were first applied in CycleGAN to achieve the transformation between domain X and domain Y, as shown in Fig. 5.

1) CycleGAN contains two mapping functions G: X → Y and F: Y → X, with the corresponding discriminators D_Y and D_X. D_Y encourages G to translate X into the target domain Y; similarly, D_X encourages F to translate Y into the X domain. CycleGAN introduces the cycle consistency loss to regularize the mappings, so that after images from X and Y are converted from the source domain to the target domain, they can be mapped back from the target domain to the source domain.

Figure 4: Squeeze-and-excitation block

Figure 5: Structure diagram for CycleGAN

2) Forward cycle consistency loss: X → G(X) → F(G(X)) ≈ X

3) Backward cycle consistency loss: Y → F(Y) → G(F(Y)) ≈ Y

In CycleGAN, the adversarial loss is used to match the distribution of the generated images to that of the target images, and the cycle consistency loss is used to prevent the learned mappings G and F from contradicting each other. In our experiments, we use not only the adversarial loss and cycle consistency loss described above, but also the identity loss and the L1 distance function. We aim to minimize the sum of these four loss functions.
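A minimal sketch of the generator-side objective is given below. It assumes a least-squares adversarial loss as in CycleGAN and weighting factors lambda_cyc = 10 and lambda_idt = 5; these weights, and the discriminator loss (not shown), are illustrative assumptions rather than the exact settings used in our experiments.

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()   # least-squares adversarial loss (assumed, as in CycleGAN)
l1 = nn.L1Loss()

def generator_objective(G, F, D_X, D_Y, real_x, real_y,
                        lambda_cyc=10.0, lambda_idt=5.0):
    """Adversarial + cycle consistency + identity losses for generators G and F."""
    fake_y = G(real_x)
    fake_x = F(real_y)

    # Adversarial loss: translated images should be classified as real.
    pred_fake_y = D_Y(fake_y)
    pred_fake_x = D_X(fake_x)
    loss_adv = mse(pred_fake_y, torch.ones_like(pred_fake_y)) + \
               mse(pred_fake_x, torch.ones_like(pred_fake_x))

    # Forward and backward cycle consistency: F(G(x)) ~ x and G(F(y)) ~ y.
    loss_cyc = l1(F(fake_y), real_x) + l1(G(fake_x), real_y)

    # Identity loss: feeding a target-domain image should change it little.
    loss_idt = l1(G(real_y), real_y) + l1(F(real_x), real_x)

    return loss_adv + lambda_cyc * loss_cyc + lambda_idt * loss_idt
```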

4 Experiment

4.1 Complexity Analysis

To decrease the computational cost, we use the Ghost module to replace the ordinary convolutional layer while obtaining the same number of feature maps. Hence, the Ghost module can be plugged into current network architectures. This cuts memory usage and speeds up operation: there is one identity mapping and m × (s − 1) = (n/s) × (s − 1) linear operations, and the average kernel size of each linear operation is d × d. We use linear operations of the same size (e.g., 3×3 or 5×5) to ensure an efficient implementation within a single Ghost module. The speed-up ratio obtained by upgrading the ordinary convolution to the Ghost module is as follows:

r_s = (n × h′ × w′ × c × k × k) / ((n/s) × h′ × w′ × c × k × k + (s − 1) × (n/s) × h′ × w′ × d × d) ≈ (s × c) / (c + s − 1) ≈ s,

where d × d has a magnitude similar to k × k, and s ≪ c. Similarly, the compression ratio can be expressed as

r_c = (n × c × k × k) / ((n/s) × c × k × k + (s − 1) × (n/s) × d × d) ≈ (s × c) / (c + s − 1) ≈ s,

which is equal to the speed-up ratio obtained by utilizing the Ghost module.
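As a quick numeric check of these ratios, the following Python snippet evaluates the simplified speed-up expression; the values c = 256, k = d = 3, and s = 2 are illustrative assumptions.

```python
# Worked example of the speed-up ratio r_s under assumed values.
c, k, d, s = 256, 3, 3, 2                       # channels, kernel sizes, ghost ratio
r_s = (s * c * k * k) / (c * k * k + (s - 1) * d * d)
print(round(r_s, 2))                            # -> 1.99, i.e., roughly s times fewer FLOPs
```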

Table 1: Experimental environment configuration

4.2 Implementation Details

We used the content data set and the oil painting data set from the CycleGAN training data. The oil painting data set contains more than 8000 images and includes four artist styles: Monet, Ukiyoe, Van Gogh, and Cezanne. The cartoon data set was taken from CartoonGAN, and we collected stick figure images from the internet and books. In our experiments, we used CartoonGAN, CycleGAN, and GANILLA as comparison methods, and compared the different styles of images generated by these three generator models with those generated by our proposed method. The size of all training images (i.e., natural images and style images) was set to 256 × 256 pixels. We trained our models for 200 epochs with the Adam optimizer [33], and the learning rate was set to 0.0002 for the whole training process. PyTorch [34] was employed to implement our proposed method. The experimental environment configuration is shown in Tab. 1.
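For concreteness, the snippet below sketches this training configuration in PyTorch; the two single-layer modules are only placeholders standing in for the Ghost-module generator and the 70×70 PatchGAN discriminator, and any optimizer settings beyond the learning rate are PyTorch defaults rather than values reported here.

```python
import torch
import torch.nn as nn
from torchvision import transforms

# All training images are resized to 256 x 256 pixels.
transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
])

G = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))            # placeholder generator
D = nn.Sequential(nn.Conv2d(3, 1, 4, stride=2, padding=1))  # placeholder discriminator

opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)  # fixed learning rate of 0.0002
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
num_epochs = 200
```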

Table 2: Comparison in terms of the number of parameters and FLOPs

Figure 6: Oil painting style results generated by CycleGAN, GANILLA, and our proposed method

5 Experimental Results

5.1 Style Transfer Results

To prove the effectiveness of our proposed method, we compared the number of parameters and the total FLOPs of the generator and discriminator for the different generative models, i.e., CartoonGAN, CycleGAN, GANILLA, and our proposed method. As shown in Tab. 2, our proposed method has the lowest values for both indexes, which shows that our generative model can effectively save computational costs. This benefit is due to the use of the Ghost module as the conversion network to generate more feature maps from cheap operations. Furthermore, the attention mechanism allows the network to focus on more useful features, so the network remains efficient and lightweight.

Fig. 6 shows the oil painting style results generated by CycleGAN, GANILLA, and our proposed method. Fig. 7 shows the cartoon style images generated by the three methods. We found that most of the results generated by our proposed method capture content and style successfully.

Figure 7: Cartoon style results generated by CartoonGAN, GANILLA, and our proposed method

We selected the Cezanne style for testing and trained for 200 epochs. Fig. 8 shows the images generated by the proposed model in different training periods: from left to right are the original images, followed by the generated images at 5, 50, 100, 150, and 200 epochs. These experimental results indicate that the best results are achieved between 50 and 100 epochs; in this range, the model not only retains the content information but also transfers the style.

Figure 8: Cezanne style results by our proposed method in different training periods

Based on the above style transfer results, we carried out a series of experiments on three additional styles: animation, stick figure, and season. Fig. 9 compares the images generated by our proposed method with those generated by GANILLA.

5.2 Evaluation

We adopted two evaluation indicators in the assessment of style transfer. One is subjective evaluation, which judges whether an image is well generated according to personal cognition and aesthetics. The other is objective evaluation; however, since there is no clear objective evaluation standard for style transfer, most researchers compare their generated results with other experimental results, and we also adopt this approach.

5.2.1 Subjective Evaluation

The main factors affecting subjective evaluation are personal aesthetics and preferences. To this end, we designed a questionnaire and sent it to 200 participants, all of whom were computer science graduate students with a foundation in image processing. The questionnaire focused on aesthetics and on the similarity between the generated image and the original image. We listed the different images generated by CycleGAN, GANILLA, CartoonGAN, and our proposed method, and then let the participants choose the one they thought was best. Tabs. 3-5 show that our model was evaluated as producing the best images. However, for cartoon style transfer, our model is not as good as CartoonGAN in visual aesthetics.

Figure 9: Different style results: (a) animation, (b) autumn, (c) winter, and (d) stick figure

5.2.2 Objective Evaluation

So far, there is no clear objective evaluation standard for style transfer because it is difficult to obtain quantitative data as an evaluation indicator of image style transfer. To evaluate the generated images more objectively, we apply the peak signal-to-noise ratio (PSNR) to compare the generated image with the original image. The PSNR is a common index for evaluating images and can measure the similarity between original images and generated images. Usually, the mean square error (MSE) is needed to calculate the PSNR. The MSE can be expressed as follows:

MSE = (1 / (m × n)) Σ_{i=1}^{m} Σ_{j=1}^{n} [X(i, j) − Y(i, j)]^2,

where Y and X denote the generated and the original images of size m × n, respectively, and X(i, j) and Y(i, j) are the pixel values of X and Y, respectively. The PSNR is then given as

PSNR = 10 × log10(MAX_I^2 / MSE),

where MAX_I represents the maximum pixel value of the images to be calculated; smaller MSE values (i.e., bigger PSNR values) indicate better image quality.

We use the structural similarity (SSIM) [35] as another evaluation index to measure the similarity of two digital images. Compared with the PSNR, the SSIM is more consistent with human judgments of image quality. The SSIM is expressed as follows:

SSIM(X, Y) = [(2 × μ_X × μ_Y + c_1)(2 × σ_XY + c_2)] / [(μ_X^2 + μ_Y^2 + c_1)(σ_X^2 + σ_Y^2 + c_2)],

where μ_X and μ_Y are the mean values of X and Y, respectively; σ_X^2 and σ_Y^2 are the variances of X and Y, respectively; σ_XY denotes the covariance between X and Y; and c_1 and c_2 are two small constants that ensure stability when the denominator becomes zero.
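The following sketch shows how these metrics can be computed in Python; the PSNR function follows the definitions above, while the SSIM line assumes scikit-image (version 0.19 or later for the channel_axis argument) is available, which is an assumption about tooling rather than part of our method.

```python
import numpy as np

def psnr(original, generated, max_val=255.0):
    """PSNR between two same-sized images, using the MSE/PSNR formulas above."""
    mse = np.mean((original.astype(np.float64) - generated.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")          # identical images
    return 10.0 * np.log10((max_val ** 2) / mse)

# SSIM via scikit-image (assumed dependency); channel_axis=-1 treats the last
# dimension of an RGB image as the color channel.
# from skimage.metrics import structural_similarity
# ssim_val = structural_similarity(original, generated, channel_axis=-1)
```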

Table 3: Questionnaire results on oil painting style transfer

Table 4: Questionnaire results on cartoon style transfer

From Tabs. 6 and 7, we can see that our proposed method has the largest SSIM and PSNR values. Hence, our proposed method is better than the other methods. These results clearly illustrate that the images generated by our proposed method have lower distortion and better image quality.

Table 5: Participants' voting results for seasonal style transfer and stick figure style transfer

Table 6: Quantitative results for oil painting style transfer

Table 7: Quantitative results for cartoon style transfer

6 Conclusion

In this paper, we proposed a lightweight style transfer network based on the Ghost module, which reduces the number of parameters and FLOPs while maintaining the quality of the generated images. We also introduced an attention mechanism into our model to focus on more important content during the transfer process. The experimental results show that our proposed method achieves performance comparable to other methods; moreover, in terms of both efficiency and accuracy, it outperforms state-of-the-art lightweight neural architectures. Therefore, employing our architecture can significantly improve performance in practical applications. In the future, we believe that designing a universal and efficient generator architecture for image processing is worthy of study.

Funding Statement: This work was funded by the China Postdoctoral Science Foundation (No. 2019M661319), the Heilongjiang Postdoctoral Scientific Research Developmental Foundation (No. LBH-Q17042), the Fundamental Research Funds for the Central Universities (3072020CFQ0602, 3072020CF0604, 3072020CFP0601), and the 2019 Industrial Internet Innovation and Development Engineering (KY1060020002, KY10600200008).

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.