Removing highlights from a single image via an attention-auxiliary generative adversarial network

Journal of University of Chinese Academy of Sciences, 2022, No. 4

ZHAO Xinchi, JIANG Ce, HE Wei

(1 Key Lab of Wireless Sensor Network and Communication, Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, Shanghai 201800, China; 2 University of Chinese Academy of Sciences, Beijing 100049, China; 3 Chengdu ZhongKeWei Information Technology Research Institute Co., Ltd, Chengdu 610000, China)

Abstract Highlights in an image degrade its quality to some extent. In this paper, we focus on visually removing highlights from degraded images and generating clean images. To solve this problem, we present an attention-auxiliary generative adversarial network. It mainly consists of a convolutional long short-term memory network with a squeeze-and-excitation (SE) block and a map-auxiliary module. The map-auxiliary module instructs the autoencoder to generate clean images. The injection of the SE block and the map-auxiliary module into the generator is the main contribution of this paper, and our deep learning-based approach can be easily ported to other similar image restoration problems. Experiments show that the proposed network architecture is effective.

Keywords GAN (generative adversarial networks); attention map-auxiliary; squeeze-and-excitation block; image restoration; highlights-removal

With the rapid development of smartphones, the camera function of mobile phones has become more and more powerful. People are happy to take photos of their favorite scenery anytime and anywhere. However, taking a good photo requires taking multiple factors such as lighting conditions into account; in particular, highlights can interfere with buildings or people in an image. Principally, the degradation occurs because the areas affected by highlights contain different imagery from those without highlights.

In this paper, we address the particular situation where the image is impaired by highlights. Our goal is to remove the highlights and produce a clean image as shown in Fig.1. Using our method, people can take their own satisfactory photos more easily.

Fig.1 Demonstration of our highlight removal method

So far, many papers have dealt with image highlights-removal using traditional methods such as analyzing pixel color (highlight color)[1] or using bilateral filtering[2]. Few papers have addressed the highlights-removal problem with neural networks. We believe that neural networks reduce the limitations on scene and lighting types compared with traditional algorithms, which is why we choose to use a GAN (generative adversarial network) to remove highlights. We find that image highlights-removal is similar to image raindrop-removal in algorithm flow (locate, then restore), and several methods proposed for raindrop detection and removal are helpful to us. Methods such as those in Refs.[3-5] are dedicated to detecting raindrops. Other methods detect and remove raindrops using a stereo camera[6], video[7-8], or a specifically designed optical shutter[9], and are thus not applicable to a single input image taken by a normal camera. Recently, a network proposed by Qian et al.[10] performs well; it can both detect and remove raindrops.

Generally, the highlights-removal problem is intractable, for two main reasons. One is that the regions occluded by highlights are not given; the other is that the information about the background scene in the occluded regions is almost lost. Naturally, there are two corresponding steps to solve this problem. The first step is to locate the area affected by the highlights, and the second step is to restore the occluded area.

In our network, we propose an improved GAN[11], where the generator is assessed by the discriminator to ensure that it produces good outputs. The first part of the generator locates the area affected by highlights, so we build the attention module. It mainly consists of an SE-ResNet[12-13] combined with a convolutional long short-term memory network (ConvLSTM)[14]. The second part removes the degraded area with our attention-auxiliary autoencoder. The output of the attention model instructs the network to remove highlights more accurately. We use the encoder to extract higher-order features and the decoder to predict clean images. Finally, our discriminative network checks whether the result is real enough. This training procedure is like a game, which makes both the generator and the discriminator stronger.

Overall, based on existing networks, we make two main contributions: one is applying a deep-learning method (GAN) to image highlights-removal, and the other is the injection of the SE block into the generative network.

1 The proposed network

We think the attentive GAN[10] is effective for image rain-removal. However, for image highlights-removal, this network is not well suited because of the different characteristics of raindrops and highlights. Therefore, we improve the generator to fit our task better. Figure 2 shows an overview of our proposed attention-auxiliary GAN.

Fig.2 Overview of our proposed attention-auxiliary GAN

1.1 Attention model

We employ a recurrent network to find the areas influenced by highlights. In our architecture, each time step consists primarily of an SE-ResNet and a ConvLSTM network.

1.1.1 SE-ResNet

SE-ResNet is one of the most essential parts of the attention model; it can be seen as a combination of SE blocks[12] and residual networks[13]. Since the features of highlights are very different from those of the background, the weights of the feature maps should also differ, and SE blocks can do this very well.

An SE block is divided into three steps: squeeze, excitation, and scale. First, squeeze is achieved by using global average pooling to generate channel-wise statistics. Formally, a statistic $z \in \mathbb{R}^{C}$ is generated by shrinking $U$ through its spatial dimensions $H \times W$, such that the $c$-th element of $z$ is calculated by

$$z_c = F_{\mathrm{sq}}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j). \tag{1}$$

To make use of the information aggregated in the squeeze operation, we follow it with a second operation which aims to fully capture channel-wise dependencies. We opt to employ a simple gating mechanism with a sigmoid activation:

$$s = F_{\mathrm{ex}}(z, W) = \sigma(g(z, W)) = \sigma(W_2 \delta(W_1 z)), \tag{2}$$

where $\delta$ refers to the ReLU function, $\sigma$ is the sigmoid function, $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$, and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$, with $r$ the reduction ratio. Finally, the scale step rescales each channel of $U$ with the corresponding activation in $s$:

$$\tilde{x}_c = F_{\mathrm{scale}}(u_c, s_c) = s_c\, u_c. \tag{3}$$
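For reference, the SE block of Eqs. (1)-(3) can be written compactly with TensorFlow/Keras layers. The sketch below is illustrative only: the layer arrangement of the residual branch is not specified in the paper, and the default reduction ratio simply follows the value r = 16 given in Section 2.1.2.

```python
import tensorflow as tf
from tensorflow.keras import layers

def se_block(u, r=16):
    """Squeeze-and-excitation on a feature map u of shape (N, H, W, C)."""
    c = u.shape[-1]
    z = layers.GlobalAveragePooling2D()(u)            # squeeze, Eq. (1)
    s = layers.Dense(c // r, activation="relu")(z)    # W1 followed by delta, Eq. (2)
    s = layers.Dense(c, activation="sigmoid")(s)      # W2 followed by sigma, Eq. (2)
    s = layers.Reshape((1, 1, c))(s)
    return u * s                                      # scale, Eq. (3)

def se_resnet_block(x, filters, r=16):
    """An illustrative SE-ResNet block: a residual block whose main branch
    ends with an SE block (x is assumed to already have `filters` channels)."""
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = se_block(y, r)
    return layers.ReLU()(layers.Add()([x, y]))
```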

1.1.2 ConvLSTM network

The long short-term memory network (LSTM)[16] is a special recurrent neural network (RNN) capable of learning long-term dependencies. The key to the LSTM is the cell state, which acts like a conveyor belt: it runs straight down the entire chain with only some minor linear interactions, so it is very easy for information to flow along it unchanged. Each time step consists of an input gate $i_t$, a forget gate $f_t$, an output gate $o_t$, and a cell state $C_t$. The ConvLSTM is essentially the same as the LSTM, taking the output of the previous layer as the input of the next layer. Different from the normal LSTM, the ConvLSTM adds a convolution operation, which can not only capture temporal relationships but also extract spatial features; the operations between states are replaced by convolutions. The key equations of the ConvLSTM are shown below, where '$*$' denotes the convolution operator, '$\circ$' denotes the Hadamard product, $X_t$ is the input, and $H_t$ represents the output:

$$i_t = \sigma(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \circ C_{t-1} + b_i), \tag{4}$$

$$f_t = \sigma(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \circ C_{t-1} + b_f), \tag{5}$$

$$C_t = f_t \circ C_{t-1} + i_t \circ \tanh(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c), \tag{6}$$

$$o_t = \sigma(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \circ C_t + b_o), \tag{7}$$

$$H_t = o_t \circ \tanh(C_t). \tag{8}$$
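Eqs. (4)-(8) translate almost line-for-line into code. The following is a minimal sketch of a single ConvLSTM step in TensorFlow; in practice one could also rely on tf.keras.layers.ConvLSTM2D, and the dictionary of kernels W below is only for readability.

```python
import tensorflow as tf

def conv_lstm_step(x_t, h_prev, c_prev, W):
    """One ConvLSTM time step following Eqs. (4)-(8).
    W maps names such as 'xi', 'hi' to convolution kernels and
    'ci', 'bi', ... to peephole/bias tensors."""
    conv = lambda name, v: tf.nn.conv2d(v, W[name], strides=1, padding="SAME")
    i_t = tf.sigmoid(conv("xi", x_t) + conv("hi", h_prev) + W["ci"] * c_prev + W["bi"])  # Eq. (4)
    f_t = tf.sigmoid(conv("xf", x_t) + conv("hf", h_prev) + W["cf"] * c_prev + W["bf"])  # Eq. (5)
    c_t = f_t * c_prev + i_t * tf.tanh(conv("xc", x_t) + conv("hc", h_prev) + W["bc"])   # Eq. (6)
    o_t = tf.sigmoid(conv("xo", x_t) + conv("ho", h_prev) + W["co"] * c_t + W["bo"])     # Eq. (7)
    h_t = o_t * tf.tanh(c_t)                                                             # Eq. (8)
    return h_t, c_t
```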

For this project, we set the time step of the ConvLSTM to 6. Feeding the original degraded image into this module, we obtain the final attention map $M$, a 2D matrix with values ranging from 0 to 1; the higher the value, the more attention should be paid. The loss function of this part is defined as the mean squared error (MSE) between $M$ and $D$, where $D$ is the difference between the ground truth and the degraded image.
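As a rough sketch, this attention loss could be computed as follows; the way $D$ is binarized from the image difference (and the threshold value) is an assumption, since the paper only states that $D$ is the difference between the ground truth and the degraded image.

```python
import tensorflow as tf

def attention_loss(final_map, degraded, groundtruth, threshold=0.05):
    """MSE between the final attention map M and D, where D marks the regions
    in which the degraded image differs from the ground truth."""
    diff = tf.reduce_mean(tf.abs(groundtruth - degraded), axis=-1, keepdims=True)
    d = tf.cast(diff > threshold, tf.float32)   # assumed binarization of D
    return tf.reduce_mean(tf.square(final_map - d))
```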

1.2 Map-auxiliary autoencoder

The purpose of this module is to use the previously obtained attention map $M$ to remove highlights and generate a clean image. The module consists of two parts: one is a normal autoencoder with skip connections, and the other is an attention-auxiliary network. Figure 3 illustrates the architecture of our map-auxiliary autoencoder.

As shown in Fig.3, the darker part is the autoencoder module, while the grey part is the map-auxiliary module. We believe the final attention map generated by the attention model is very useful for helping the autoencoder remove the highlights. To make better use of it, instead of concatenating it with the input image directly, we feed the final attention map into the attention-auxiliary network, which is trained independently to generate three auxiliary maps. These three auxiliary maps of different scales instruct the autoencoder to remove the highlights through Hadamard products. We describe this module by the following formula:

Fig.3 The architecture of our map-auxiliary network

$$C = \Phi(A \circ I) + (E - A) \circ I, \tag{9}$$

where $C$ is the clean image that we want, $A$ is the attention matrix learned from the attention-auxiliary module, $I$ is the degraded image, and $E$ denotes the all-ones matrix of the same size as $A$, so that $(E - A)$ complements $A$ element-wise. $\Phi$ represents a transformation process that can restore the highlight area; this transformation is exactly what our autoencoder and attention-auxiliary module learn. The loss function of this module contains a perceptual loss[17] and a multi-scale loss[10].
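Operationally, Eq. (9) keeps the unattended pixels of $I$ and replaces the attended pixels with restored content. A minimal sketch of this composition (treating the map $A$ as already broadcast to the image shape) is:

```python
import tensorflow as tf

def compose_clean_image(autoencoder, degraded, attention):
    """Eq. (9): C = Phi(A ∘ I) + (E - A) ∘ I.
    `autoencoder` plays the role of Phi; `attention` is A broadcast to the
    shape of the degraded image I."""
    restored = autoencoder(attention * degraded)      # Phi(A ∘ I)
    return restored + (1.0 - attention) * degraded    # + (E - A) ∘ I
```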

1.3 Discriminator

We employ the attentive discriminator[10] as our discriminator. Different from conventional discriminators, the attentive discriminator uses the final attention map generated by our attention model: its loss function is based on both the CNN output and the attention map.
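A rough sketch of such an attention-guided discriminator loss is given below. The interior attention branch of the discriminator, the zero target for real images, and the weight gamma follow the attentive-GAN formulation of Ref. [10] and are assumptions rather than details reported here.

```python
import tensorflow as tf

def discriminator_loss(d_real, d_fake, d_map_real, d_map_fake, attention_map, gamma=0.05):
    """Adversarial loss plus an attention-guidance term (attentive-GAN style).
    d_real / d_fake: discriminator probabilities for real and generated images.
    d_map_*: attention maps extracted inside the discriminator.
    attention_map: the final map M produced by the generator's attention model."""
    adv = -tf.reduce_mean(tf.math.log(d_real + 1e-8) + tf.math.log(1.0 - d_fake + 1e-8))
    # Guide the discriminator to focus on highlight regions of generated images
    # and to find no such regions in real images.
    guide = tf.reduce_mean(tf.square(d_map_fake - attention_map)) + \
            tf.reduce_mean(tf.square(d_map_real))
    return adv + gamma * guide
```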

2 Experiments

2.1 Dataset and implementation

2.1.1 Dataset

Like other current deep learning methods, our method requires a relatively large amount of data with ground truths for training. However, since no such dataset exists for our project, we create our own. In our experiment, we need a set of image pairs, where each pair contains exactly the same background scene, yet one image is degraded by highlights and the other is free from highlights. Importantly, we need to manage any other causes of misalignment, such as camera motion, when taking the two images, and ensure that the atmospheric conditions as well as the background objects are static during the acquisition process. In the process of capturing image pairs, we found it difficult to shoot a pair of photos that differ only in lighting.

Therefore, we decide to use Photoshop to simulate highlights in images. First, we download 861 clean pictures taken with Canon cameras from the Internet. Then we add highlights to the images randomly and generate 861 image pairs for training, which gives us an ideal training set. In addition, we use the same method to generate 200 image pairs as the test set.
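A simple way to feed such pairs during training is a tf.data pipeline over matched file lists, as sketched below; the image size and batch size are illustrative and not reported in the paper.

```python
import tensorflow as tf

def load_pair(degraded_path, clean_path, size=(480, 720)):
    def _read(path):
        img = tf.io.decode_image(tf.io.read_file(path), channels=3,
                                 expand_animations=False)
        img = tf.image.resize(img, size)
        return tf.cast(img, tf.float32) / 255.0
    return _read(degraded_path), _read(clean_path)

def make_dataset(degraded_files, clean_files, batch_size=1):
    ds = tf.data.Dataset.from_tensor_slices((degraded_files, clean_files))
    ds = ds.shuffle(len(degraded_files))
    ds = ds.map(load_pair, num_parallel_calls=tf.data.AUTOTUNE)
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)
```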

2.1.2 Model implementation

The model is implemented with the TensorFlow framework and trained from scratch in a fully supervised manner. The learning rate is 0.002. Our attention-auxiliary module has 9 convolution-ReLU blocks, and the reduction ratio r of the SE block is 16.
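For orientation, a single training step consistent with this setup might look like the sketch below; the choice of the Adam optimizer and the way the loss terms are combined are assumptions, since the paper specifies only the framework and the learning rate (the attention, perceptual, and multi-scale losses of Section 1 would be added to the generator loss).

```python
import tensorflow as tf

gen_opt = tf.keras.optimizers.Adam(learning_rate=0.002)   # learning rate as stated
disc_opt = tf.keras.optimizers.Adam(learning_rate=0.002)

@tf.function
def train_step(generator, discriminator, degraded, clean):
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        output = generator(degraded, training=True)
        d_real = discriminator(clean, training=True)
        d_fake = discriminator(output, training=True)
        # Stand-in reconstruction + adversarial terms for the generator.
        g_loss = tf.reduce_mean(tf.abs(output - clean)) \
                 - tf.reduce_mean(tf.math.log(d_fake + 1e-8))
        d_loss = -tf.reduce_mean(tf.math.log(d_real + 1e-8)
                                 + tf.math.log(1.0 - d_fake + 1e-8))
    gen_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                                generator.trainable_variables))
    disc_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                                 discriminator.trainable_variables))
    return g_loss, d_loss
```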

2.2 Results and analysis

2.2.1 Comparison results

There are four models in our experiment. The first model uses CycleGAN[18], which does not add an attention mechanism. The second is the original attentive GAN[10], which contains neither the SE block nor the map-auxiliary module. The third is an improved version of the original attentive GAN and is part of our own network; in this model, we give the channels different weights by applying the SE block. The last one is the whole attention-auxiliary GAN, which employs both the SE block and the map-auxiliary module. Table 1 shows the numerical comparison of these four models.

Table 1 Quantitative evaluation results

As shown in the table, our model achieves a significant numerical improvement over the original model. We can also see that direct mapping under the CycleGAN framework does not perform well numerically. Figure 4 shows the visual comparison of CycleGAN and our algorithm.

Fig. 4 compares the recovery results of CycleGAN and our algorithm, and illustrates that CycleGAN recovers details poorly. Our attention-auxiliary GAN, by contrast, can accurately locate and restore the areas affected by highlights. Thus, we can infer that the attention network improves the algorithm's ability to recover details by finding the region of interest. Figure 5 shows the visual comparison of the three models with attention mechanisms.

Fig.4 Results of comparing CycleGAN and our algorithm

In addition, in order to display the effect of our attention module in a real scene, we feed the real images degraded by highlights into our network.

Fig.5 Results of comparing a few different methods

Figure 6 shows the region of interest extracted by the attention module in a real scene. The red area in the figure indicates the area affected by the highlights, which is also the area we should pay attention to.

2.2.2 Application performance

In order to measure the performance in practical applications, we also test our model with the Google Vision API, which reflects the recognition performance on our outputs. As shown in Fig.7, the Google Vision API cannot frame the building in the original degraded image, while it can frame the building in our output. Therefore, our method clearly improves target recognition.

In addition, our algorithm can remove highlights in indoor scenes, thereby improving the accuracy of indoor video detection algorithms such as fall detection. Figure 8 shows the effect of our algorithm in indoor scenes, which is useful in real life.

Fig.6 The region of interest extracted by the attention module

Fig.7 The performance of a pair of images judged by Google Vision API

Fig.8 The performance of our algorithm in indoor scenes (fall detection)

3 Conclusion

This paper targets the problem of removing highlights from a single image. We have proposed a novel attention-auxiliary GAN. Different from existing networks, our proposed network feeds an accurate attention map into the map-auxiliary module, which helps the autoencoder generate a clean image. The attention-auxiliary module makes the GAN accurately locate the region of interest and then remove the highlights in the image more accurately.

In the future, we can expand our training set to various types of highlights, such as highlights caused by street lights, car headlights, or camera flash. We therefore believe that our deep learning-based image highlights-removal method has a wide range of application scenarios and is meaningful in real life.