ZHA Jianhong (查剑宏), YAN Cairong (燕彩蓉)*, ZHANG Yanting (张艳婷)*, WANG Jun (王俊)
1. College of Computer Science and Technology, Donghua University, Shanghai 201620, China; 2. College of Fashion and Design, Donghua University, Shanghai 200051, China
Abstract: The demand for image retrieval with text manipulation exists in many fields, such as e-commerce and Internet search. Most researchers use deep metric learning methods to calculate the similarity between the query and the candidate image by fusing the global feature of the query image with the text feature. However, the text usually corresponds to the local feature of the query image rather than the global feature. Therefore, in this paper, we propose a framework of image retrieval with text manipulation by local feature modification (LFM-IR), which can focus on the related image regions and attributes and perform modification. A spatial attention module and a channel attention module are designed to realize the semantic mapping between image and text. We achieve excellent performance on three benchmark datasets, namely Color-Shape-Size (CSS), Massachusetts Institute of Technology (MIT) States and Fashion200K (+8.3%, +0.7% and +4.6% in R@1, respectively).
Key words: image retrieval; text manipulation; attention; local feature modification
Image retrieval is a research hotspot in computer vision. Current image retrieval systems usually take a text or an image as input, namely text-to-image[1] and image-to-image retrieval[2]. However, in actual scenarios, a single image or plain text usually cannot accurately express the user's intention. Therefore, many researchers now integrate other types of data into the query to improve retrieval performance, such as attributes[3-4], spatial layout[5], or modification text[6-8]. In this paper, we focus on the task of image retrieval with text manipulation. This retrieval mode is common in many fields. For example, in fashion retrieval[9], users can find the fashion products they want more accurately by inputting both an image and a description.
The challenges of image retrieval tasks with text manipulation are mainly reflected in two aspects. Firstly, since different modal data exist in different feature spaces with different representations and distribution characteristics, the model should be able to handle the modal gap between the query and the target image. Secondly, the model not only needs to understand the text and image, but also needs to associate the visual features of the image with the semantic features of the text.
Considering the above challenges, most researchers[6-8] obtain the representation of the query through some form of feature fusion, and then compute the similarity between the query and the candidate images with deep metric learning methods[10]. For example, the text image residual gating (TIRG) model[6] modifies the image feature via a gated residual connection to make the new feature close to the target image feature. The joint visual semantic matching (JVSM) model[7] learns image-text compositional embeddings by jointly associating the visual and textual modalities in a shared discriminative embedding space via compositional losses. The common point of these models is that they obtain the representation of the query by fusing the global feature of the image with the text feature. However, the desired modification usually corresponds to the local feature of the query image rather than the global feature, so the model should learn to modify the local feature of the query image.
The attention mechanism[11-12] has become a widely used technique in recent years and is applied in many areas of deep learning, such as image retrieval[13-14] and scene segmentation[15]. In this paper, we use the attention mechanism to focus on the local feature of the query image and then modify it. Specifically, we first use a spatial attention module to locate the regions to be modified, then use a channel attention module to obtain the specific attributes to be modified, and finally modify them to obtain the modified query image feature.
The main contributions can be summarized as follows.
1) We propose an image retrieval framework based on local feature modification (LFM-IR) to deal with the task of image retrieval with text manipulation.
2) We design a spatial attention module to locate the image regions to be modified and a channel attention module to focus on the attributes to be modified.
3) We achieve excellent performance for compositional image retrieval on three benchmark datasets, Color-Shape-Size (CSS), Massachusetts Institute of Technology (MIT) States, and Fashion200K.
Given a query image x and a modification text t, LFM-IR modifies the local feature of the query image through the text feature to obtain the modified image feature f(x, t) = φ_xt ∈ R^C, where C is the number of feature channels. The modified image feature remains in the same feature space as the image features. Therefore, for the target image, we can extract its feature in the same way and then calculate the similarity between the two through a similarity function. Figure 1 shows the structure of LFM-IR. The network consists of a feature extraction module, a spatial attention module, a channel attention module and a feature modification module.
Fig.1 Structure of LFM-IR
1.1.1 Feature extraction module
For a query image x and a target image y, we use a ResNet-18 pre-trained on ImageNet to extract their features. To preserve the spatial information of the query image x, we remove the last fully connected layer when extracting the feature of the query image. Therefore, the representation of the query image is φ_x ∈ R^(W×H×C), where W is the width and H is the height. The feature of the target image y is extracted with the complete ResNet-18 and represented by φ_y ∈ R^C. For the modification text t, we use a standard long short-term memory (LSTM) network to extract its semantic information, which is represented by φ_t ∈ R^C.
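To make this description concrete, the following PyTorch-style sketch shows one possible instantiation of the feature extraction module. The class name, the feature dimension of 512, the replaced final fully connected layer of the target branch, and the tokenized text input are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class FeatureExtractor(nn.Module):
    """Sketch of the feature extraction module (hyper-parameters assumed)."""
    def __init__(self, vocab_size, embed_dim=512, feat_dim=512):
        super().__init__()
        resnet = models.resnet18(pretrained=True)
        # Query branch: drop the global pooling and fc layers to keep the W x H x C map.
        self.query_cnn = nn.Sequential(*list(resnet.children())[:-2])
        # Target branch: full ResNet-18; the fc layer is replaced so the output is C-dimensional
        # (an assumption about how phi_y in R^C is obtained).
        self.target_cnn = models.resnet18(pretrained=True)
        self.target_cnn.fc = nn.Linear(self.target_cnn.fc.in_features, feat_dim)
        # Text branch: embedding + LSTM; the last hidden state serves as the text feature.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, feat_dim, batch_first=True)

    def forward(self, query_img, target_img, text_ids):
        phi_x = self.query_cnn(query_img)      # (B, C, H, W) spatial feature map
        phi_y = self.target_cnn(target_img)    # (B, C) global target feature
        _, (h_n, _) = self.lstm(self.embedding(text_ids))
        phi_t = h_n[-1]                        # (B, C) text feature
        return phi_x, phi_y, phi_t
```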
1.1.2 Spatial attention module
Since the modification text is usually related to some specific regions of the query image, we only need to modify these related regions, i.e., the image regions that the text refers to. Because the related regions are not fixed, we propose a spatial attention module to adaptively capture them. Given a query image x and a modification text t, we need to obtain the spatial attention weight α_s ∈ R^(W×H). Specifically, we transform the text feature φ_t to make it the same dimension as the query image feature φ_x. We obtain the transformed text feature φ′_t ∈ R^(W×H×C) through spatial duplication, and then the spatial attention weight α_s is computed as
α_s = σ(L_c(BN(φ_x ⊙ φ′_t))),    (1)
where BN represents the batch normalization layer, ⊙ represents element-wise multiplication, L_c indicates a convolutional layer with a 3×3 convolution kernel, and σ is the Sigmoid function. We mark the regions of the query image that need to be modified as R1, and the regions that do not need to be modified as R2. With the adaptive attention weights, the feature of R1 can be computed as
φ_R1 = L_1(F_avg(α_s φ_x)),    (2)
and the feature of R2 can be computed as
φ_R2 = L_1(F_avg((1 − α_s) φ_x)),    (3)
where φ_R1 ∈ R^C and φ_R2 ∈ R^C are the features of R1 and R2, F_avg represents a global average pooling layer, and L_1 is a fully connected layer.
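A minimal PyTorch sketch of Eqs. (1)-(3) could look like the following. Sharing the fully connected layer L_1 between the two regions and using padding 1 for the 3×3 convolution are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of the spatial attention module, Eqs. (1)-(3)."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.bn = nn.BatchNorm2d(feat_dim)
        self.conv = nn.Conv2d(feat_dim, 1, kernel_size=3, padding=1)  # L_c
        self.fc = nn.Linear(feat_dim, feat_dim)                       # L_1 (assumed shared)
        self.pool = nn.AdaptiveAvgPool2d(1)                           # F_avg

    def forward(self, phi_x, phi_t):
        # phi_x: (B, C, H, W), phi_t: (B, C)
        phi_t_sp = phi_t[:, :, None, None].expand_as(phi_x)            # spatial duplication
        alpha_s = torch.sigmoid(self.conv(self.bn(phi_x * phi_t_sp)))  # (B, 1, H, W), Eq. (1)
        phi_r1 = self.fc(self.pool(alpha_s * phi_x).flatten(1))        # feature of R1, Eq. (2)
        phi_r2 = self.fc(self.pool((1 - alpha_s) * phi_x).flatten(1))  # feature of R2, Eq. (3)
        return phi_r1, phi_r2, alpha_s
```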
1.1.3 Channel attention module
Although the spatial attention module can focus on the related regions of the query image, these regions contain many attributes, while the modification text is usually associated with only some of them. For example, we may only modify the size, color or shape of a certain entity in the regions. Therefore, after focusing on the related regions, we need to focus on the related attributes, i.e., the attributes that need to be modified. Specifically, we first fuse the text feature φ_t and the feature of R1 through a simple concatenation, and then feed the result into a fully connected layer to obtain the channel attention weight α_c ∈ R^C. Formally, the attention weight α_c is computed as
α_c = σ(L_2(δ(BN([φ_R1, φ_t])))),    (4)
where L_2 is a fully connected layer, [·, ·] represents the concatenation operation, and δ is the ReLU function. With the adaptive attention weights, the related attributes φ_A1 ∈ R^C can be represented as
φ_A1 = α_c ⊙ φ_R1,    (5)
and the unrelated attributes φ_A2 ∈ R^C can be represented as
φ_A2 = (1 − α_c) ⊙ φ_R1.    (6)
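In the same spirit, a sketch of Eqs. (4)-(6) is shown below. Applying batch normalization directly to the concatenated vector follows one reading of the formula and is an assumption.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of the channel attention module, Eqs. (4)-(6)."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.bn = nn.BatchNorm1d(2 * feat_dim)
        self.fc = nn.Linear(2 * feat_dim, feat_dim)      # L_2

    def forward(self, phi_r1, phi_t):
        # phi_r1, phi_t: (B, C)
        fused = torch.relu(self.bn(torch.cat([phi_r1, phi_t], dim=1)))
        alpha_c = torch.sigmoid(self.fc(fused))          # (B, C), Eq. (4)
        phi_a1 = alpha_c * phi_r1                        # related attributes, Eq. (5)
        phi_a2 = (1 - alpha_c) * phi_r1                  # unrelated attributes, Eq. (6)
        return phi_a1, phi_a2
```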
1.1.4 Feature modification module
After determining the related regions and attributes, we need to modify the related attributes. Specifically, we first fuse the text feature φ_t and the related attributes φ_A1 in R1 through a simple concatenation, and then feed the result into a multilayer perceptron (MLP) to obtain the modified attribute representation φ′_A1 ∈ R^C. Formally, φ′_A1 is calculated as
φ′_A1 = L_M([φ_A1, φ_t]),    (7)
where L_M is the MLP. Then we merge φ′_A1 with the unrelated attributes φ_A2 to obtain the feature of the modified R1, which is denoted as φ′_R1 ∈ R^C. Formally, φ′_R1 is calculated as
φ′_R1 = w_1 φ′_A1 + w_2 φ_A2,    (8)
where w_1 and w_2 are learnable weights. Finally, we fuse the features of the two regions to obtain the modified query image feature φ_xt:
φ_xt = w_3 φ′_R1 + w_4 φ_R2,    (9)
where w_3 and w_4 are learnable weights.
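The following sketch ties Eqs. (7)-(9) together. The two-layer MLP used for L_M and the initialization of the scalar weights w_1 to w_4 are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FeatureModification(nn.Module):
    """Sketch of the feature modification module, Eqs. (7)-(9)."""
    def __init__(self, feat_dim=512):
        super().__init__()
        # L_M: a small MLP over the concatenated [phi_A1, phi_t] (depth and hidden size assumed).
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )
        # Learnable scalar fusion weights w_1..w_4 (initial values assumed).
        self.w = nn.Parameter(torch.ones(4))

    def forward(self, phi_a1, phi_a2, phi_r2, phi_t):
        phi_a1_mod = self.mlp(torch.cat([phi_a1, phi_t], dim=1))   # Eq. (7)
        phi_r1_mod = self.w[0] * phi_a1_mod + self.w[1] * phi_a2   # Eq. (8)
        phi_xt = self.w[2] * phi_r1_mod + self.w[3] * phi_r2       # Eq. (9)
        return phi_xt
```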
Following TIRG[6], we employ two different loss functions, namely the soft triplet loss and the batch classification loss. The soft triplet loss is defined as
L_ST = (1/|B|) Σ_{i∈B} log{1 + exp[s(φ_xt^i, φ_y^j) − s(φ_xt^i, φ_y^i)]},    (10)
where s(·) represents the similarity function, B is the batch, φ_xt^i and φ_y^i are the modified query feature and the target image feature of the i-th query, and φ_y^j (j ≠ i) is the feature of a negative target image from the same batch. Following TIRG[6], we use the dot product as the similarity function. The batch classification loss is defined as
L_BC = −(1/|B|) Σ_{i∈B} log{exp[s(φ_xt^i, φ_y^i)] / Σ_{j∈B} exp[s(φ_xt^i, φ_y^j)]}.    (11)
Following previous work[6], we employ the soft triplet loss for the CSS[6] and MIT States[16] datasets, and the batch classification loss for the Fashion200K[17] dataset.
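A possible implementation of the two losses over a batch, using the dot product as the similarity s(·), is sketched below. Treating the other target images in the batch as negatives is one plausible reading of the TIRG-style formulation; the paper does not spell out the negative sampling details, so this is an assumption.

```python
import torch
import torch.nn.functional as F

def soft_triplet_loss(phi_xt, phi_y):
    """Soft triplet loss over a batch (one reading of Eq. (10)):
    the other targets in the batch serve as negatives."""
    scores = phi_xt @ phi_y.t()                        # s(., .) as dot product, (B, B)
    pos = scores.diag().unsqueeze(1)                   # s(phi_xt^i, phi_y^i)
    mask = ~torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    neg = scores[mask].view(scores.size(0), -1)        # s(phi_xt^i, phi_y^j), j != i
    return F.softplus(neg - pos).mean()                # log(1 + exp(neg - pos))

def batch_classification_loss(phi_xt, phi_y):
    """Batch-based classification loss (Eq. (11)): softmax cross-entropy
    where the i-th target is the correct class for the i-th query."""
    scores = phi_xt @ phi_y.t()
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)
```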
In this section, we conduct extensive experiments to evaluate LFM-IR. The experiments answer the following three questions.
1) RQ1: How does LFM-IR perform on these datasets?
2) RQ2: Can the spatial attention module accurately focus on the regions that need to be modified?
3) RQ3: How do different attention modules improve model performance?
To evaluate LFM-IR, we conduct extensive experiments on three datasets: CSS[6], MIT States[16] and Fashion200K[17]. A short description of each dataset is as follows.
CSS[6] is a synthetic dataset containing complex modification texts. It contains about 16 000 training queries and 16 000 test queries; each query consists of a query image, a modification text and a target image. Each image contains geometric objects of different shapes, sizes and colors, and there are three types of modification texts: adding, removing or changing object attributes.
The MIT States[16] dataset contains approximately 60 000 images. Each image has a noun label and an adjective label, which respectively represent the object and the state of the object in the image. The query image and the target image have the same noun label but different adjective labels, and the modification text is the desired state of the object. The training set contains about 43 000 queries and the test set contains about 10 000 queries.
Fashion200K[17] is a dataset in the fashion field. It contains about 200 000 images of fashion products, and each image has a corresponding multi-word fashion label. The labels of the query image and the target image differ by only one word, and the modification text is a description of the label difference. The training set contains about 172 000 queries, and the test set contains about 31 000 queries.
We choose the following models as baselines: Image only[6], Text only[6], Concatenation[6], Multimodal Residual Networks (MRN)[19], Relationship[18], Feature-wise Linear Modulation (FiLM)[20], Show and Tell[21] and TIRG[6]. In addition, we compare LFM-IR with Joint Attribute Manipulation and Modality Alignment Learning (JAMMAL)[8]. The retrieval metric is recall at rank K (R@K, K = {1, 5, 10, 50}), which represents the proportion of queries for which the correctly labeled image appears in the top K retrieved images. Each experiment is repeated five times for stable results, and the mean and standard deviation are reported.
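As an illustration of the metric, a minimal sketch of computing R@K with dot-product similarity is given below. The function name and the assumption that each query's correct target is identified by an index into the gallery are hypothetical.

```python
import torch

def recall_at_k(query_feats, gallery_feats, target_ids, ks=(1, 5, 10, 50)):
    """Sketch of R@K: the fraction of queries whose correct target image
    appears among the top-K retrieved gallery images (dot-product similarity)."""
    scores = query_feats @ gallery_feats.t()            # (num_queries, num_gallery)
    ranked = scores.argsort(dim=1, descending=True)     # gallery indices sorted by similarity
    targets = torch.as_tensor(target_ids, device=ranked.device).unsqueeze(1)
    hits = ranked == targets                            # True where the correct target appears
    return {k: hits[:, :k].any(dim=1).float().mean().item() for k in ks}
```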
We evaluate the retrieval performance of existing models and LFM-IR on three benchmark datasets: CSS, MIT States and Fashion200K. The results are shown in Table 1, where the best number is shown in bold. Besides, Fig. 2 shows some qualitative examples of LFM-IR: each row shows a text, a query image, and the retrieved images. The examples are from the CSS, MIT States and Fashion200K datasets, respectively, and the green rectangle indicates the correct target image.
Fig.2 Some qualitative examples of LFM-IR: (a) CSS; (b) MIT States; (c) Fashion200K
Column 2 in Table 1 shows the R@K performance on the CSS dataset, for which we use the 3D version (both query and target images are 3D). We only report R@1 on this dataset, since R@1 is sufficient to distinguish the models. The results show that LFM-IR performs the best, gaining an 8.3% performance boost in terms of R@1 compared with JAMMAL.
Columns 3-5 in Table 1 show the results of LFM-IR and other models on the MIT States dataset, where we report R@K (K = {1, 5, 10}). The results show that LFM-IR is superior to most models and is comparable to the state-of-the-art model. In particular, compared with JAMMAL, LFM-IR achieves a 0.7% performance improvement in terms of R@1. However, in terms of R@5 and R@10, JAMMAL performs better than LFM-IR. We believe that this is because the text information of this dataset is too simple, causing the attention mechanism to be unable to accurately focus on the regions that need to be modified. In addition, all models except JAMMAL use ResNet-18 as the image encoder, while JAMMAL uses ResNet-101, so the image features extracted by JAMMAL are more expressive.
Columns 6-8 in Table 1 show the results on the Fashion200K dataset, where we report R@K (K = {1, 10, 50}). The results show that although the modification text of this dataset is relatively simple, LFM-IR is still better than the state-of-the-art model. Compared with JAMMAL, LFM-IR gains a 4.6% performance boost in terms of R@1.
Table 1 Performance on CSS, MIT States and Fashion200K datasets
To further examine what the model learns, we visualize the spatial attention module of LFM-IR on the CSS dataset, because its text information is richer than that of the other datasets. As shown in Fig. 3, the spatial attention module pays more attention to the regions that need to be modified, while paying less attention to the regions that do not. The experiment shows that the spatial attention module can accurately focus on the regions that need to be modified.
Fig.3 Visualization of spatial attention module:(a) examples of removing object; (b) examples of changing attributes; (c) examples of adding object
To explore the effects of the spatial attention and channel attention modules on the proposed method, we conduct ablation studies on the three benchmark datasets and choose R@1 as the evaluation metric. After removing both the spatial attention and channel attention modules, the LFM-IR model degenerates into the concatenation model. The results are shown in Table 2.
Table 2 Retrieval performance of ablation studies
The results show that, compared with the concatenation model, adding either the spatial attention module or the channel attention module significantly improves the performance, and adding both attention modules yields better retrieval performance than adding either one alone.
In this paper, we propose a novel method based on local feature modification for image retrieval with text manipulation. Through a unified feature space, texts are mapped to different image regions so that the model can find where to modify and what to modify. The semantic mapping between text and image is established by the proposed spatial attention module and channel attention module. Extensive experiments on three benchmark datasets show the superiority of LFM-IR. In the future, we will try to use more complex feature extraction networks to enhance the expressiveness of the features and further improve the performance of LFM-IR.