Yan Ouyang, Xin-qing Wang, Rui-zhe Hu, Hong-hui Xu
Department of Mechanical Engineering, College of Field Engineering, Army Engineering University of PLA, Nanjing, 210007, China
Keywords: Few-shot learning; Object detection; Sample augmentation; Attention mechanism
ABSTRACT Traditional deep-learning object detectors rely on large numbers of labeled samples, which are expensive to obtain. Few-shot object detection (FSOD) attempts to solve this problem by learning to detect objects from a few labeled samples, but its performance is often unsatisfactory because samples are scarce. We believe that two factors mainly restrict the performance of few-shot detectors: (1) positive samples are scarce, and (2) the quality of positive samples is low. Therefore, we put forward a novel few-shot object detector based on YOLOv4 that improves both the quantity and the quality of positive samples. First, we design a hybrid multivariate positive sample augmentation (HMPSA) module to increase the number and diversity of positive samples while suppressing negative samples. Then, we design a selective non-local fusion attention (SNFA) module to help the detector better learn target features and improve the feature quality of positive samples. Finally, we optimize the loss function to make it more suitable for FSOD. Experimental results on PASCAL VOC and MS COCO demonstrate that our few-shot object detector is competitive with other state-of-the-art detectors.
Object detection is one of the core tasks of computer vision; its goal is to classify and locate objects in images. Over the past few years, thanks to extensive research on deep learning, object detection has achieved unprecedented breakthroughs. Various deep learning based object detection methods [1-5] have been proposed and have gradually replaced traditional methods as the mainstream in this field. However, the success of these methods largely relies on plentiful annotated training data. Most of this data comes from manual annotation, which requires considerable time and human resources and is therefore costly [6]. Moreover, in some special application scenarios, such as the detection of rare species or military targets, even the raw image data cannot be collected in large amounts. When the detector cannot obtain enough annotated training data, the detection model tends to overfit severely, resulting in degraded performance and poor generalization. By comparison, humans can quickly learn new objects from a few examples [7]: for example, a child can learn to recognize an object never seen in reality (such as a giraffe) from just one or a few pictures. Inspired by human learning [8], the concept of few-shot learning [9,10] was proposed and has received extensive attention.
Affected by complex environments (such as lighting, occlusion, etc.), objects in the real world appear in different shapes, which makes few-shot learning a challenging task [11]. In recent years, although various few-shot learning methods [12-19] have emerged and achieved good results, most of them focus on image classification, and few address few-shot object detection (FSOD). Object detection requires not only predicting the category but also accurately locating the object, which is harder than plain image classification. At present, most FSOD methods inherit the ideas of few-shot classification [7,11,20,21]. In contrast to previous methods, we believe that approaching FSOD from the perspective of samples, that is, improving detector performance by improving the quantity and quality of positive samples, is a new idea worth studying.
Our paper puts forward a new general FSOD model that improves positive samples. The main contributions of our paper are as follows:
(1) We design a hybrid multivariate positive sample augmentation (HMPSA) module, which effectively reduces the risk of overfitting and improves the generalization performance of the model by expanding the annotated sample capacity and improving the sample distribution in a targeted manner.
(2) We design a selective non-local fusion attention (SNFA) module to enhance the target feature expression and suppress other interfering features, thereby improving the feature extraction quality of samples for better detection performance.
(3) We optimize the loss function of the detector. By introducing an orthogonal loss and modifying the original classification loss term, we make the model more suitable for the task of FSOD.
(4) Experimental results on the PASCAL VOC and MS COCO data sets show that our FSOD model has good detection performance and is superior to other FSOD models when samples are scarce, so it adapts better to the FSOD task.
The rest of this paper is organized as follows: Section 2 reviews related work on FSOD. Section 3 details our proposed model. Section 4 introduces the implementation details and analyzes the experimental results. Finally, Section 5 concludes the paper.
Object detection has always been a research hotspot in the field of computer vision. Especially with the rise of deep learning [22], CNN-based object detection has attracted the attention of many researchers. Current mainstream object detection models fall into two types: two-stage detectors based on proposals and one-stage detectors without proposals. The R-CNN series [1-3] is representative of two-stage detectors, which produce several ROI proposals that may contain objects in the first stage, and then apply bounding box regression and a classifier to the proposed regions in the second stage. One-stage detectors, such as the YOLO series [4,23-25] and SSD [5], do not need to generate ROI proposals and classify and locate objects directly. Generally speaking, one-stage object detectors are much faster than two-stage object detectors but have relatively lower detection accuracy, which is determined by their structure. Although these detectors can achieve excellent detection performance, they all depend on plentiful annotated training samples, and they have difficulty learning to detect novel classes from a few samples.
The task of few-shot learning (FSL) is to learn novel classes from a small amount of training examples. So far, a great deal of FSL work has investigated how to generalize to novel classes with scarce samples. At present, the main FSL methods fall into three types: transfer-learning based, metric-learning based and meta-learning based. Transfer learning solves new problems by exploiting previously learned knowledge [26]. Early transfer-learning based FSL methods [10,27,28] used a Bayesian framework to transfer prior knowledge of previously learned categories to novel categories. Hariharan et al. [29] proposed a few-shot transfer learning method combined with deep neural networks. Blaes et al. [30] proposed a learning process based on a global feature layer (Global Prototype Learning, GPL) for few-shot learning. Bauer et al. [31] proposed DLPM to transfer representative knowledge and classifier weights to FSL. Metric learning achieves classification by learning the distribution of distances between samples. The triplet network [32] induces similarity with respect to triplet images through a deep ranking neural network. Vinyals et al. [15] proposed a matching network based on memory and an attention mechanism, which can identify unlabeled data using the matching metric of the data in the attention embedding. The prototype network [12] learns a metric space for classification by calculating the distance to the prototype representation of each class. Meta-learning elevates learning from the data level to the task level, and can generalize well to new tasks by training in the labeled task space. Ravi et al. [13] proposed a meta-learner based on LSTM for few-shot learning. Finn et al. [16] proposed a model-agnostic meta-learning algorithm (MAML) to reduce training steps by learning better initial parameters. Li et al. [18] proposed Meta-SGD, a meta-learner based on stochastic gradient descent, which overcame the difficulty of training Meta-LSTM. Lee et al. [33] analyzed implicit differentiation of the optimality conditions for the convex problem of linear classifiers, allowing the network to embed high-dimensional features for fast generalization. However, most of these works target few-shot classification and cannot be used directly for FSOD.
Over the past few years, a number of FSOD methods have emerged, but most of them are inherited from the few-shot classification task. Chen et al. [20] combined the characteristics of Faster R-CNN [3] and SSD [5] on classification and bounding box regression tasks, and put forward a low-shot transfer detector (LSTD) based on knowledge transfer regularization and background suppression regularization, which uses knowledge of the source and target domains respectively to strengthen the model's generalization to few-shot data. However, it relies heavily on source domain knowledge and is less robust. RepMet [21] introduces prototype clustering, constructs a prototype space to extract image embedding features, and uses the Euclidean distance between embedding features and the support set class representation. However, this network ignores object location and its performance is not outstanding. Kang et al. [7] used a meta-weighted model to fuse the support set features into the query set features by channel-wise multiplication, which requires an additional mask branch and increases the computational complexity. Fan et al. [11] combined an attention-based region proposal network (RPN) with a contrastive training strategy, using the idea of a matching network [15] to calculate the similarity between query instances and support instances. Pérez-Rúa et al. [34] proposed a meta-learning method for center point prediction based on the structure and ideas of CenterNet [35]. This model supports incremental learning, that is, it does not need to access base class data after new classes are added. Hu et al. [36] designed a Gabor convolutional kernel-based method for few-shot object detection, which achieved better accuracy and recall on few-shot datasets. MPSR [37] and SAR [38] studied the multi-scale problem in FSOD, and designed a feature pyramid module to generate multi-scale features and refine them at different scales. Zhang et al. [39] combined the recently popular Transformer [40] with meta-learning, and proposed an image-level meta-learning FSOD model. Different from the above work, we focus on the small number of given positive samples instead of relying on relations with external data, which could be unreliable.
To frame the discussion in this paper, we restate the protocol of the FSOD task as follows. The classes in the training data are separated into base classes and novel classes, and the two sets do not overlap. The base classes have plentiful labeled training data, while the novel classes have only a small amount of labeled training data available. The training of FSOD is generally divided into two phases. In training phase 1, a large amount of data containing only the base classes is used for basic training of the detection network. Then, in training phase 2, the detection network is fine-tuned on a small, balanced training set containing both novel and base classes. The final trained detection network can be used to detect novel classes. The flow chart of FSOD is displayed in Fig. 1.
Detecting objects of novel classes as accurately as possible from a small number of samples is a challenging task. In this paper, we design an FSOD model based on YOLOv4 [25] to address this challenge. The overall network structure of our detection model is displayed in Fig. 2. Our model is divided into two parts: the basic detection branch and the reinforcement branch. To balance detection accuracy and detection speed, we adopt the one-stage detector YOLOv4 as the baseline. The basic detection branch is mainly implemented through the network structure of YOLOv4. The reinforcement branch is mainly implemented through a parallel Siamese network structure that shares weights with the basic branch. The advantage of the Siamese network structure is that it keeps training and inference consistent. As shown in Fig. 2, in the inference process we only need the basic detection branch at the bottom, while the reinforcement branch at the top can be removed. Specifically, we first preprocess the novel class images using our HMPSA module to increase the diversity of the novel class samples. The augmented sample images are then sent to the reinforcement branch for feature extraction and prediction. During feature extraction, we optimize the feature representation of the target by using an SNFA module, so that the network can extract better novel class target features. As shown at the bottom of Fig. 2, our SNFA includes two parallel branches, namely the spatial attention branch and the channel attention branch. The attention features generated by the two branches are concatenated and then sent to a 1×1 convolution layer to reshape the channel size. Finally, the result obtained by the reinforcement branch is sent to the optimized loss function together with the result of the basic branch.
Following the basic setup of transfer-learning based FSOD models, our training process is separated into two stages. In the first stage, only the base classes are trained on the basic detection branch, following the training process of YOLOv4, while the reinforcement branch is frozen. In the second stage, the network is fine-tuned: it is initialized with the weights obtained from the first stage to transfer base class knowledge, and all weights are frozen except for the detection head used for bounding box regression and classification. Fine-tuning updates only the last layer, using a balanced dataset containing all novel class samples and a comparable number of base class samples as training data, which avoids overfitting and poor generalization. During inference, the reinforcement branch is removed, and unlabeled novel class data is passed through the basic detection branch to obtain detection results.
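The freezing rule of the two training stages can be illustrated with a minimal sketch. The parameter names and the `head.` prefix below are hypothetical placeholders, not identifiers from the YOLOv4 code base, and the sketch only selects which parameters would receive gradient updates.

```python
def trainable_parameters(names, stage, head_prefix="head."):
    """Return the parameter names that are updated in a given training stage.

    Assumption: parameters of the detection head share the (hypothetical)
    prefix `head_prefix`; everything else belongs to the backbone or neck.
    """
    if stage == 1:
        # Stage 1: the whole basic detection branch is trained on base classes.
        return list(names)
    if stage == 2:
        # Stage 2: everything except the detection head stays frozen.
        return [name for name in names if name.startswith(head_prefix)]
    raise ValueError("stage must be 1 or 2")


# Hypothetical parameter names, for illustration only.
example_names = ["backbone.conv1.weight", "neck.pan.weight", "head.cls.weight"]
stage_two = trainable_parameters(example_names, stage=2)
```

Because the reinforcement branch shares weights with the basic branch, a single list of parameter names is enough for both branches in this sketch.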
Fig.1.Flow chart of few-shot object detection.
Fig.2.Network structure of our proposed YOLOv4-based few-shot object detection model.SAM: spatial attention module, CAM: channel attention module.
We chose YOLOv4 as the base detector to balance detection speed and accuracy. As shown at the bottom of Fig. 2, YOLOv4 consists of three components: backbone, neck and detection head. The backbone adopts CSPDarknet for feature extraction. In our detection model, we use SNFA to optimize the backbone for better feature extraction performance. The neck uses a modified path aggregation network (Modified-PAN) for feature fusion. The top-down and bottom-up feature fusion mechanism can augment the semantic information of features and obtain more accurate location information. After fusion, the three feature layers with different resolutions are sent to the detection head for classification and localization.
In the task of FSOD, labeled positive samples are very scarce, so the detector cannot be trained as thoroughly as in traditional object detection; this is the key factor limiting few-shot detector performance. The scarcity of samples makes it impossible for the network model to learn comprehensive target information. When the target environment changes greatly (such as scale changes, illumination changes, and occlusions), the detector often produces false detections or even missed detections. Sample augmentation methods can help with this problem. However, traditional sample augmentation operates on the entire image, so a large amount of background information is introduced when the positive sample is augmented. These backgrounds often contain much interfering information (such as negative samples irrelevant to the target). Adding more negative samples while increasing the number of positive samples worsens the imbalance between positive and negative samples, which is unwise, especially in few-shot detection tasks.
In response to the above problems, we design a hybrid multivariate positive sample augmentation (HMPSA) module to increase the number of positive samples while suppressing the interference of negative samples. Without additional labeling cost, our HMPSA module works in the data domain: it increases the capacity of labeled positive samples in a targeted manner and enriches the training samples, thereby improving the performance of the detection model.
As displayed in Fig. 3, our HMPSA module consists of two major steps: background sparsity and sample augmentation. First, to suppress the interference of negative samples, we sparsify the background of the image and extract the positive sample region from the image. The positive sample region takes the ground truth box as the main body and is expanded outward by 20 pixels to simulate common scenes. The rest of the image is treated as background and its pixel values are set to 0. Then, to increase the diversity of positive samples, we perform hybrid multivariate augmentation on the positive sample regions. To fully simulate the environment of the sample and enhance the robustness of the detection model, we employ five augmentation sub-modules. Specifically, for the different lighting environments (such as cloudy days, sunny days, dusk, etc.) in which the target may be located, we use HSV color space augmentation. For the different geographic environments in which the target may be located, we use rotation augmentation. For different scales and positions of the target, we use scaling and translation augmentation. To simulate blurring during image acquisition, we use blur augmentation. To simulate various occlusion situations in the actual environment, we adopt object-aware random erasure augmentation [41]. Fig. 3 shows the workflow of our positive sample augmentation module. First, the annotated positive sample regions undergo HSV color space augmentation, rotation augmentation, and scaling and translation augmentation, respectively. Then, the data augmented by these sub-modules is augmented a second time by blur augmentation and object-aware random erasure augmentation [41]. Finally, the augmented data of the first three sub-modules and the augmented data of the last two sub-modules are output together. The specific augmentation operations of each sub-module are as follows:
(1) HSV color space augmentation: The original image is represented in RGB color space, so we first convert it from RGB space to HSV space
where $(h, s, v)$ represent the Hue, Saturation and Value components of HSV space, respectively, $(r, g, b)$ represent the Red, Green and Blue components of RGB space, respectively, and $\max = \max(r, g, b)$, $\min = \min(r, g, b)$.
Then, we add different degrees of random disturbance to the three channels $(h, s, v)$ for tone transformation, so as to expand the samples. Specifically, we divide the random disturbance weight $W$ into four ranges $(w_1, w_2, w_3, w_4)$ to simulate four different lighting environments, where $w_1 \in [-1, -0.5)$, $w_2 \in [-0.5, 0)$, $w_3 \in [0, 0.5)$ and $w_4 \in [0.5, 1]$. The HSV channel components $(h_g, s_g, v_g)$ of the sample after adding the disturbance are
where $\lambda_h, \lambda_s, \lambda_v$ are the disturbance coefficients corresponding to the three channels $(h, s, v)$, respectively. The random disturbance weight $W$ is randomly selected within each of the four defined sub-ranges, and each sub-weight corresponds to one augmented sample.
(2) Rotation augmentation: In different geographical environments, the poses of the objects in the image are diverse. To meet the pose diversity requirement of the sample object, we design 6 poses through rotation augmentation: 18°, 45° and 72° clockwise rotation, and 18°, 45° and 72° counterclockwise rotation.
Fig.3.Hybrid multivariate positive sample augmentation module: (a) background sparsity; (b) HSV color space augmentation; (c) rotation augmentation; (d) scaling and translation augmentation; (e) blur augmentation; (f) object-aware random erasure augmentation.The red box represents the ground truth box, and the blue box represents the expanded positive sample region.
(3) Scaling and translation augmentation: To cover the diversity of scales and locations of objects, we augment the samples by scaling and translation. First, multi-scale scaling is performed on the input image, with the scaling range limited so that the positive sample area remains smaller than the original input image, and $n$ samples of different scales are obtained. Then, we randomly crop these $n$ samples to the same size as the original input image, and the cropped region does not contain the object. Each sample is randomly cropped $m$ times, so that a total of $n \times m$ samples with different scales and location distributions are obtained.
(4) Blur augmentation: We simulate the blurring of objects in image samples in real environments by blurring the images.We use a Gaussian distribution to calculate the value of each pixel in the region, with a blur radius of 5 and a variance of 1.5.
(5) Object-aware random erasure augmentation [41]: We select a random erase area for the object bounding box to achieve sample amplification for occlusion situations, and the specific operation is borrowed from Ref.[41].
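The background sparsity step and the HSV disturbance of sub-module (1) can be sketched in plain Python. This is only an illustration under stated assumptions: the disturbance is taken to be additive (channel plus $W$ times its coefficient), the coefficient values in `lambdas` are hypothetical, and images are nested lists of RGB tuples rather than arrays. The standard-library `colorsys` module performs the RGB/HSV conversion.

```python
import colorsys
import random

MARGIN = 20  # pixels added around the ground-truth box, as stated in the text
# The four disturbance sub-ranges w1..w4 given in the text.
W_RANGES = [(-1.0, -0.5), (-0.5, 0.0), (0.0, 0.5), (0.5, 1.0)]


def expand_box(box, width, height, margin=MARGIN):
    """Grow (x1, y1, x2, y2) by `margin` pixels, clipped to the image."""
    x1, y1, x2, y2 = box
    return (max(0, x1 - margin), max(0, y1 - margin),
            min(width, x2 + margin), min(height, y2 + margin))


def sparsify_background(image, box):
    """Set every pixel outside the expanded positive-sample region to 0.

    `image` is a list of rows; each row is a list of (r, g, b) in 0..255.
    """
    height, width = len(image), len(image[0])
    x1, y1, x2, y2 = expand_box(box, width, height)
    return [[px if (x1 <= x < x2 and y1 <= y < y2) else (0, 0, 0)
             for x, px in enumerate(row)]
            for y, row in enumerate(image)]


def perturb_pixel(rgb, w, lambdas=(0.1, 0.3, 0.3)):
    """Disturb one pixel in HSV space (assumed additive form, hypothetical lambdas)."""
    r, g, b = (c / 255.0 for c in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    lam_h, lam_s, lam_v = lambdas
    h = (h + w * lam_h) % 1.0                  # hue is cyclic
    s = min(1.0, max(0.0, s + w * lam_s))      # saturation is clipped to [0, 1]
    v = min(1.0, max(0.0, v + w * lam_v))      # value is clipped to [0, 1]
    r, g, b = colorsys.hsv_to_rgb(h, s, v)
    return tuple(round(c * 255) for c in (r, g, b))


def hsv_augment(image, box, rng=random):
    """Return four variants, one per sub-range; the background is left untouched."""
    height, width = len(image), len(image[0])
    x1, y1, x2, y2 = expand_box(box, width, height)
    variants = []
    for low, high in W_RANGES:
        w = rng.uniform(low, high)
        variants.append([[perturb_pixel(px, w)
                          if (x1 <= x < x2 and y1 <= y < y2) else px
                          for x, px in enumerate(row)]
                         for y, row in enumerate(image)])
    return variants
```

Restricting the disturbance to the positive sample region keeps the zeroed background at 0, which matches the stated goal of suppressing negative samples while diversifying positive ones.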
Traditional object detection methods learn the feature distribution of objects by training on abundant positive samples, and distinguish different classes according to this distribution. In FSOD, positive samples are scarce, and it is difficult to capture object features through traditional training. In response to this problem, we design a selective non-local fusion attention (SNFA) module that enhances the feature expression of the object and suppresses other interfering features; starting from feature quality, it improves the performance of the object detector by learning high-quality positive sample features. Compared with traditional attention mechanisms, our SNFA considers both local feature saliency and global feature correlation, which allocates attention more reasonably, highlights the key features of the object, and excludes the interference of background features at the same time.
Our SNFA contains two sub-modules, a spatial attention sub-module and a channel attention sub-module, which tell the neural network "where" and "what" to focus on [42]. Fig. 4 visualizes the workflow of the two attention sub-modules.
where $m^s_{ij}$ represents the influence of the $j$-th position on the $i$-th position. At the same time, we create a new feature $C \in \mathbb{R}^{C \times H \times W}$ by feeding the feature $X$ into a convolutional layer. Likewise, we reshape $C$ into $C \in \mathbb{R}^{C \times M}$. We then multiply $C$ by the transpose of $M_s$ to get the correlation feature $D \in \mathbb{R}^{C \times M}$, and reshape $D$ into $D \in \mathbb{R}^{C \times H \times W}$. In the bottom branch, we first perform max-pooling and average-pooling operations [42] on the input feature $X \in \mathbb{R}^{C \times H \times W}$ along the channel axis, concatenate the results, and send them to a 7×7 convolution layer and a sigmoid to obtain the spatial importance weight $W_s \in \mathbb{R}^{1 \times H \times W}$. We then multiply $W_s$ element-wise with the spatial pixels of $X$ to obtain the importance feature $E \in \mathbb{R}^{C \times H \times W}$
where $\Theta$ represents element-wise multiplication, $\sigma$ represents the sigmoid function, and $c^{7 \times 7}$ represents the 7×7 convolution operation. Finally, we weight the features of the two branches to get the final output feature $Y_s \in \mathbb{R}^{C \times H \times W}$:
Fig.4.Schematic diagram of selective non-local fusion attention.Two attention sub-modules are included: (a) spatial attention module; (b) channel attention module.
where $\alpha$ and $1-\alpha$ are the scale coefficients of $D$ and $E$, respectively; $\alpha$ is initialized to 0.5 and is gradually updated through learning.
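The two branches of the spatial attention module and their weighted fusion can be sketched in plain Python as follows. Several points are assumptions rather than details confirmed by the text: the position correlation matrix $M_s$ is taken as a row-wise softmax of $X^{T}X$ with identity projections in place of the learned convolutions, the fusion is taken as $Y_s = \alpha D + (1-\alpha) E$, and the 7×7 convolution uses zero padding with a caller-supplied kernel of shape 2×7×7.

```python
import math


def softmax_rows(m):
    """Row-wise softmax of a matrix given as a list of lists."""
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def correlation_branch(x_flat):
    """Top branch: D = C * Ms^T, with Ms an M x M position correlation matrix."""
    # Assumption: identity projections, so the feature C equals X here.
    ms = softmax_rows(matmul(transpose(x_flat), x_flat))   # M x M
    return matmul(x_flat, transpose(ms))                    # C x M


def importance_branch(x, kernel):
    """Bottom branch: E = sigmoid(conv7x7([max; avg])) * X, zero padded."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    pooled = [
        [[max(x[k][i][j] for k in range(c)) for j in range(w)] for i in range(h)],
        [[sum(x[k][i][j] for k in range(c)) / c for j in range(w)] for i in range(h)],
    ]
    pad = len(kernel[0]) // 2
    ws = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for ch in range(2):
                for di in range(-pad, pad + 1):
                    for dj in range(-pad, pad + 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < h and 0 <= jj < w:
                            acc += kernel[ch][di + pad][dj + pad] * pooled[ch][ii][jj]
            ws[i][j] = sigmoid(acc)
    return [[[ws[i][j] * x[k][i][j] for j in range(w)] for i in range(h)]
            for k in range(c)]


def spatial_attention(x, kernel, alpha=0.5):
    """Fuse both branches: Ys = alpha * D + (1 - alpha) * E (assumed form)."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    x_flat = [[v for row in ch for v in row] for ch in x]   # C x M, M = H * W
    d_flat = correlation_branch(x_flat)
    e = importance_branch(x, kernel)
    return [[[alpha * d_flat[k][i * w + j] + (1 - alpha) * e[k][i][j]
              for j in range(w)] for i in range(h)] for k in range(c)]
```

In a trained network `alpha` and the kernel would be learnable parameters; they are plain arguments here so the data flow of the two branches stays visible.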
(2) Channel attention module: Likewise, our channel attention module contains two branches. As shown in Fig. 4(b), in the top branch, for the input feature $X \in \mathbb{R}^{C \times H \times W}$, we reshape $X$ into $X \in \mathbb{R}^{C \times M}$, multiply $X$ by its transpose, and obtain the channel correlation matrix $M_c \in \mathbb{R}^{C \times C}$ through a softmax layer
where $m^c_{ij}$ represents the influence of the $j$-th channel on the $i$-th channel. At the same time, we multiply the transpose of $M_c$ by $X$ to obtain the correlation feature $F \in \mathbb{R}^{C \times M}$, and then reshape $F$ into $F \in \mathbb{R}^{C \times H \times W}$. In the bottom branch, we first perform max-pooling and average-pooling operations [42] on the input feature $X \in \mathbb{R}^{C \times H \times W}$ along the spatial axes, and send the results into two fully connected layers. We then sum the outputs of the fully connected layers element-wise and send the sum to a sigmoid to get the channel importance weight $W_c \in \mathbb{R}^{C \times 1 \times 1}$. Finally, we multiply $W_c$ element-wise with the channels of the input feature $X$ to obtain the importance feature $G \in \mathbb{R}^{C \times H \times W}$.
where $fc_1$ and $fc_2$ represent the two fully connected layers, and $\rho$ represents the ReLU function. Finally, the features of the two branches are also weighted and fused to get the final output feature $Y_c \in \mathbb{R}^{C \times H \times W}$:
where $\beta$ and $1-\beta$ are the scale coefficients of $F$ and $G$, respectively; $\beta$ is initialized to 0.5 and is gradually updated through learning.
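The channel attention module can be sketched along the same lines. The assumptions here are that the two pooled vectors pass through the same pair of fully connected layers, $fc_2(\rho(fc_1(\cdot)))$, as in the module of Ref. [42], that the softmax is applied row-wise to $XX^{T}$, and that the fusion is $Y_c = \beta F + (1-\beta) G$. The weight matrices `fc1` and `fc2` are caller-supplied placeholders without biases.

```python
import math


def softmax_rows(m):
    """Row-wise softmax of a matrix given as a list of lists."""
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def mlp(vec, fc1, fc2):
    """fc2(ReLU(fc1(vec))) with weight matrices stored as lists of rows."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in fc1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in fc2]


def channel_attention(x, fc1, fc2, beta=0.5):
    """Fuse both branches: Yc = beta * F + (1 - beta) * G (assumed form)."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    x_flat = [[v for row in ch for v in row] for ch in x]        # C x M

    # Top branch: channel correlation matrix Mc (C x C), then F = Mc^T * X.
    mc = softmax_rows(matmul(x_flat, transpose(x_flat)))
    f_flat = matmul(transpose(mc), x_flat)

    # Bottom branch: spatial max/avg pooling, shared MLP, sum, sigmoid.
    max_vec = [max(row) for row in x_flat]
    avg_vec = [sum(row) / len(row) for row in x_flat]
    wc = [sigmoid(a + b)
          for a, b in zip(mlp(max_vec, fc1, fc2), mlp(avg_vec, fc1, fc2))]
    g_flat = [[wc[k] * v for v in x_flat[k]] for k in range(c)]

    return [[[beta * f_flat[k][i * w + j] + (1 - beta) * g_flat[k][i * w + j]
              for j in range(w)] for i in range(h)] for k in range(c)]
```

For a `C`-channel input, `fc1` has shape `hidden × C` and `fc2` has shape `C × hidden`; the hidden width is a free choice in this sketch.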
The loss function is the objective that the detection network optimizes during learning. A good loss function can often effectively boost the performance of the model. In this paper, the training of the network is separated into two processes: the pre-training process based on the base classes and the fine-tuning process based on the novel classes (with an appropriate number of base class samples added to balance the positive and negative samples). The pre-training process is no different from traditional object detection, so in this stage our loss function follows the original design of YOLOv4. However, during fine-tuning, the scarcity of annotated samples makes it difficult for the detection head to learn feature representations as reliable as in the general detection task. In this case, the classifier easily confuses novel class targets with similar base classes, resulting in a large number of false detections and hurting the performance of the model. The research in Ref. [45] shows that in the FSOD task, bounding box regression can achieve good results after being fully pre-trained on the base class data and then fine-tuned on the novel classes. Likewise, the confidence loss measures the reliability of the predicted bounding box, independent of the number of positive samples. Therefore, for the fine-tuning process of few-shot detection, we only optimize the classification loss on the original basis. Specific improvements are described below.
First, considering the characteristics of FSOD tasks, we introduce an orthogonal loss $L_o$ [46]. As shown in Fig. 5, by applying orthogonality to the feature space before classification, the features of different classes are separated and the features of the same class are aggregated, thereby ensuring that the similarity between features can be well measured with few samples. Let the input and output pairs of the network be $\{x_i, y_i\}$, where $x_i$ is the input image and $y_i$ is the class label, and let the features before the classification layer be $F = \{f_1, f_2, \ldots, f_n\}$. The orthogonal loss $L_o$ is defined as follows:
where $\varphi$ is the hyperparameter that controls the weight, $|\cdot|$ represents the absolute value operation, and $\|\cdot\|_2$ represents the $\ell_2$ norm operation. In the orthogonal loss $L_o$, the clustering of samples of the same class is promoted by constraining $1-p$ to approach 0, and the separation of samples of different classes is promoted by constraining $q$ to approach 0.
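The behavior described for $L_o$ can be illustrated with a small sketch. The exact formula is not reproduced here, so the form below is an assumption consistent with the description: $p$ is the mean cosine similarity over same-class feature pairs, $q$ is the mean cosine similarity over different-class pairs, and $L_o = (1 - p) + \varphi\,|q|$. The default value of `phi` is hypothetical.

```python
import math


def l2_normalize(feature):
    """Scale a feature vector to unit l2 norm (zero vectors are left as is)."""
    norm = math.sqrt(sum(v * v for v in feature)) or 1.0
    return [v / norm for v in feature]


def orthogonal_loss(features, labels, phi=1.0):
    """Assumed form: L_o = (1 - p) + phi * |q|.

    p: mean cosine similarity over pairs sharing a label.
    q: mean cosine similarity over pairs with different labels.
    """
    unit = [l2_normalize(f) for f in features]
    same, diff = [], []
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            cos = sum(a * b for a, b in zip(unit[i], unit[j]))
            (same if labels[i] == labels[j] else diff).append(cos)
    p = sum(same) / len(same) if same else 1.0   # no same-class pair: no penalty
    q = sum(diff) / len(diff) if diff else 0.0   # no cross-class pair: no penalty
    return (1.0 - p) + phi * abs(q)
```

The loss reaches 0 when same-class features are parallel and different-class features are orthogonal, which is the geometry sketched in Fig. 5.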
Then, we improve the original classification loss and suppress the confusing targets of other classes by highlighting the weight of similar classes.The modified classification loss is expressed as follows:
where $i$ indexes the grid cells, of which there are $K \times K$ in total, and $j$ indexes the anchors, of which each grid cell has $S$. $I^{obj}_{ij}$ indicates whether the corresponding anchor contains the object, and $\hat{p}_i(c)$ and $p_i(c)$ indicate the true value and the predicted value of the class, respectively. The value of $\hat{p}_i(c)$ is 0 or 1. Different from the original classification loss function, we add a weight term $(1 + p_i(c))^{\mu}$ to the second term to suppress hard negative samples, where $\mu$ is a hyperparameter greater than 1 that controls the degree of suppression according to the magnitude of the predicted value. When $\hat{p}_i(c) = 0$, the sample is a negative sample. The larger $p_i(c)$ is, the more likely the sample is to be wrongly classified as positive, that is, the more likely it is to be a hard negative sample that the network struggles to identify. The weight term we add adaptively increases the loss proportion of hard samples according to the value of $p_i(c)$ and effectively suppresses hard negative samples. To sum up, our optimized classification loss is
Fig.5.Schematic diagram of the action process of the orthogonal loss.The same shapes and colors in the figure represent the same class(such as orange stars),and different colors and shapes belong to different classes (such as green triangles and blue squares).
where θ is the hyperparameter controlling the orthogonal loss ratio.
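A minimal sketch of the modified classification loss follows. It assumes the original classification term is the usual binary cross-entropy over anchors that contain an object, with the weight $(1 + p_i(c))^{\mu}$ applied to the negative-class term, and that the optimized loss adds the orthogonal loss scaled by $\theta$. The default values of `mu` and `theta` are hypothetical.

```python
import math

EPS = 1e-7  # keeps the logarithms finite


def weighted_bce(p_true, p_pred, mu=2.0):
    """Cross-entropy for one class score, with the negative term re-weighted."""
    p = min(max(p_pred, EPS), 1.0 - EPS)
    positive = p_true * math.log(p)
    negative = (1.0 + p) ** mu * (1.0 - p_true) * math.log(1.0 - p)
    return -(positive + negative)


def classification_loss(targets, predictions, has_object, mu=2.0):
    """Sum the weighted loss over anchors that contain an object.

    targets / predictions: one list of per-class values per anchor.
    has_object: one boolean per anchor (the indicator I_ij^obj).
    """
    total = 0.0
    for t_vec, p_vec, obj in zip(targets, predictions, has_object):
        if obj:
            total += sum(weighted_bce(t, p, mu) for t, p in zip(t_vec, p_vec))
    return total


def optimized_classification_loss(cls_loss, ortho_loss, theta=0.5):
    """Assumed combination: classification loss plus theta times L_o."""
    return cls_loss + theta * ortho_loss
```

For a negative with a confident wrong score of 0.9, the weight $(1.9)^2 \approx 3.6$ multiplies the plain cross-entropy, whereas positives are unaffected by `mu`.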
In this section, we evaluate the performance of our FSOD detector through experiments on two general object detection benchmarks.The main contents include: dataset setup, experimental details, experimental results and ablation experiments.
We mainly conduct experiments on two widely used object detection benchmarks, PASCAL VOC 2007 [47] + 2012 [48] and MS COCO [49]. For the PASCAL VOC experiments, following convention [3,23], we use the training set of VOC2007 + 2012 and the validation set of VOC2012 to train the detector, and use the test set of VOC2007 for evaluation. PASCAL VOC contains a total of 20 classes. Following the settings of Refs. [7,37,49], we use three different partitions of these 20 classes, each of which contains 15 base classes and 5 novel classes. The three splits are as follows (the first five are novel classes, and "rest" denotes all the remaining classes, which serve as base classes):
◆Split1:{bird,cow,bus,motorbike,sofa/rest}
◆Split2:{horse,cow,bottle,aeroplane,sofa/rest}
◆Split3:{cat,sheep,boat,motorbike,sofa/rest}
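The three splits above can be written down as a small configuration from which the base classes are derived. The class names follow the standard PASCAL VOC class list; the variable and function names are illustrative only.

```python
# The 20 PASCAL VOC object classes.
VOC_CLASSES = [
    "aeroplane", "bicycle", "bird", "boat", "bottle",
    "bus", "car", "cat", "chair", "cow",
    "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor",
]

# Novel classes of each split, as listed in the text.
NOVEL_SPLITS = {
    1: ["bird", "cow", "bus", "motorbike", "sofa"],
    2: ["horse", "cow", "bottle", "aeroplane", "sofa"],
    3: ["cat", "sheep", "boat", "motorbike", "sofa"],
}


def base_classes(split):
    """Return the 15 base classes ("rest") of a split."""
    novel = set(NOVEL_SPLITS[split])
    return [name for name in VOC_CLASSES if name not in novel]
```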
In the first stage of training, all training samples are from the base classes. In the second stage of training, only novel class samples containing K labeled objects and base class samples containing K labeled objects are used, and K in the experiments is set to 1, 2, 3, 5 and 10. MS COCO is a larger-scale object detection dataset consisting of 80 classes, with a total of 80 k training images, 40 k validation images and 20 k testing images. For experiments on MS COCO, following the setting of Ref. [50], we use all training set images and 35 k validation set images to train the detector and use the remaining 5 k validation set images for evaluation. The 20 classes shared with PASCAL VOC are treated as novel classes, and the remaining 60 classes are treated as base classes. In the fine-tuning stage, we set K to 10 and 30. To better evaluate the cross-domain generalization ability of the model [51,52], we train the model on the 60 MS COCO classes that differ from PASCAL VOC as base classes, and detect the 20 novel classes on PASCAL VOC. For evaluation metrics, we use $AP_{50}$ to evaluate model performance on PASCAL VOC, and $AP_{50:95}$, $AP_{50}$ and $AP_{75}$ to evaluate model performance on MS COCO.
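Building the balanced K-shot fine-tuning set can be sketched as follows. This is a simplification under stated assumptions: each annotation is a hypothetical `(object_id, class_name)` pair, objects are sampled independently of the images they come from, and the seed is fixed only to make the sketch reproducible.

```python
import random


def sample_k_shot(annotations, k, seed=0):
    """Pick at most K labeled objects per class.

    annotations: list of (object_id, class_name) pairs (hypothetical format).
    Returns a dict mapping each class name to its sorted sampled object ids.
    """
    rng = random.Random(seed)
    by_class = {}
    for object_id, class_name in annotations:
        by_class.setdefault(class_name, []).append(object_id)
    return {name: sorted(rng.sample(ids, min(k, len(ids))))
            for name, ids in by_class.items()}


# Toy annotations, for illustration only.
toy = [("a", "cat"), ("b", "cat"), ("c", "cat"), ("d", "dog")]
shots = sample_k_shot(toy, k=2)
```

Applying the same K to base and novel classes yields the balanced set used in the second training stage.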
In all training stages, we use SGD as the optimizer with a momentum of 0.9, a mini-batch size of 16 and an L2 weight decay of 0.0005. In the first stage of training, the batch size is set to 64 and the basic detection branch is trained for 160 epochs in total. The initial learning rate is set to 10⁻⁴ and a stepwise learning rate schedule is used: the learning rate becomes 10⁻³ at epoch 10, 10⁻⁴ at epoch 80 and 10⁻⁵ at epoch 120. For the second-stage fine-tuning, because the CutMix [53] and Mosaic [25] strategies of YOLOv4 can damage the small number of novel samples, we turn off the original data augmentation of YOLOv4 and use only our HMPSA module to process samples. Meanwhile, only the last fully connected layer of the detection head is trained, and the other layers are frozen. The parameters of the reinforcement branch are consistent with those of the basic detection branch. The model is fine-tuned for a total of 500 epochs with the learning rate set to 10⁻³. Other hyperparameter settings are listed in Table 1.
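The first-stage learning rate schedule can be expressed as a simple piecewise function. This is a sketch under two assumptions that the text does not state: epochs are counted from 0, and each change takes effect at the start of the listed epoch. The names `OPTIMIZER_SETTINGS` and `first_stage_learning_rate` are hypothetical.

```python
# Optimizer settings quoted in the text (shared by all training stages).
OPTIMIZER_SETTINGS = {
    "optimizer": "SGD",
    "momentum": 0.9,
    "weight_decay": 0.0005,  # L2 weight decay
}

FIRST_STAGE_EPOCHS = 160
FINE_TUNE_EPOCHS = 500
FINE_TUNE_LEARNING_RATE = 1e-3  # constant during second-stage fine-tuning


def first_stage_learning_rate(epoch):
    """Learning rate for a 0-indexed epoch of first-stage training."""
    if not 0 <= epoch < FIRST_STAGE_EPOCHS:
        raise ValueError("epoch outside the first training stage")
    if epoch < 10:
        return 1e-4  # initial learning rate
    if epoch < 80:
        return 1e-3  # raised at epoch 10
    if epoch < 120:
        return 1e-4  # lowered at epoch 80
    return 1e-5      # lowered again at epoch 120


schedule = [first_stage_learning_rate(e) for e in range(FIRST_STAGE_EPOCHS)]
```

The low rate during the first 10 epochs acts like a short warm-up before the main training rate of 10⁻³ is applied.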
In this section, we verify the performance of our proposed few-shot object detector through three groups of experiments: experiments on PASCAL VOC, experiments on MS COCO, and cross-domain experiments from MS COCO to PASCAL VOC. Specifically, we compare our detector with several state-of-the-art few-shot object detectors: Meta R-CNN [50], TFA [54], MPSR [37], Retentive-RCNN [55], DCNet [56], LSTD(YOLO)-full [20] and Meta YOLO [7]. We also use the unmodified YOLOv4 as a few-shot detector for comparison. Its training is likewise divided into two phases. In the first phase, the same training strategy as for our detector is adopted. In the second phase, the network is fine-tuned on the novel class data until it fully converges. We refer to this method as YOLOv4-ft-full for short.
4.3.1. Results on PASCAL VOC
Table 2 compares the novel class detection performance of our model with that of other detectors on PASCAL VOC. First, the results indicate that our proposed optimization strategy for FSOD effectively boosts detection performance over the YOLOv4 baseline detector. The gain is particularly pronounced when samples are extremely scarce (K = 1, 2, 3). In split 1 with K = 1, the performance improves by as much as 24.9%. As the number of samples increases (K = 5, 10), the performance of the YOLOv4 baseline improves greatly, but our method still surpasses it, with average improvements of 5.2% and 5.8% at K = 5 and K = 10, respectively.
Furthermore, our model also performs strongly compared with other FSOD detectors. Over the LSTD and Meta YOLO models, whose baselines also belong to the YOLO series, our model has a significant performance advantage. This benefits partly from the stronger baseline, but relies more on our proposed model improvements for FSOD. Our model is also competitive with models built on Faster R-CNN. As shown in Table 2, our model achieves the best or second-best performance when samples are scarce (K = 1, 2, 3). Even as the number of samples gradually increases, our model remains at a relatively advanced level.
Table 1. Hyperparameter settings.
Table 2. Performance (mAP50) comparison of various FSOD methods on PASCAL VOC. The novel class detection performance for the three different partitions is reported separately. “FRCN R-101” is an abbreviation for Faster R-CNN ResNet-101, “-” indicates that no results were reported in the original paper, and the best performance is shown in bold.
4.3.2. Results on MS COCO
Compared with PASCAL VOC, MS COCO is a more challenging detection benchmark with more categories and more complex backgrounds. Table 3 reports our detection results for the 20 novel classes in the 10-shot and 30-shot cases. Likewise, our model has a clear performance advantage over the YOLOv4 baseline detector and other state-of-the-art FSOD models, achieving leading performance in both the 10-shot and 30-shot settings. This further supports the effectiveness of our proposed improvement strategy for FSOD.
4.3.3. Results on MS COCO to PASCAL VOC
To evaluate the cross-domain generalization ability of our model, we combine PASCAL VOC and MS COCO for testing. We train the model with 10 annotated objects per novel class on MS COCO and evaluate it on the VOC2007 test set. We adopt Meta R-CNN, MPSR, LSTD(YOLO)-full, Meta YOLO and YOLOv4-ft-full as the comparison baselines for cross-domain performance. Fig. 6 shows the experimental results, from which it is clear that our method achieves the best performance. We also find that cross-domain novel class detection performance is worse than the intra-domain results on PASCAL VOC. On the one hand, this is related to the increase in the number of novel classes for evaluation (from 5 to 20); on the other hand, it shows that cross-domain detection suffers from performance degradation, which needs further research.
Fig. 6. 10-shot cross-domain detection performance (mAP50) comparison for 20 novel classes under COCO to PASCAL.
To evaluate the influence of different factors on the results of our model, we performed an ablation study on split 1 of PASCAL VOC; Table 4 reports the results. HMPSA, SNFA and OCL (including OL and IMP) denote our proposed components, namely the hybrid multivariate positive sample augmentation module, the selective non-local fusion attention module and the optimized classification loss, respectively. We combine these components in various ways to observe their specific contributions to the results.
4.4.1. Hybrid multivariate positive sample augmentation module
The first two rows of Table 4 indicate that the HMPSA module substantially boosts the model's performance. For K = 1, 2, 3, 5 and 10, mAP improves by 9.1%, 7.6%, 8.3%, 4.0% and 3.4%, respectively. Moreover, it helps more when samples are scarce, which is exactly what few-shot detection needs. HMPSA also contributes more to the performance improvement than the other two factors. HMPSA optimizes the distribution of novel class positive samples, increases sample diversity, and enables the model to better learn sample features in different environments from only a few positive samples, which effectively avoids the overfitting and poor generalization caused by sample scarcity. At the same time, HMPSA suppresses the interference of negative samples, which further improves model performance.
Table 3. Performance comparison of various FSOD methods on MS COCO (using AP50:95, AP50 and AP75). Results under the 10-shot and 30-shot settings are reported separately. “FRCN R-101” is an abbreviation for Faster R-CNN ResNet-101, “-” indicates that no results were reported in the original paper, and the best performance is shown in bold.
Furthermore, to demonstrate the superiority of our proposed sample augmentation method, we compare HMPSA with several other sample augmentation methods on the same benchmark. These include the baseline with HMPSA removed (denoted by ‘*’), SNIP [57], SNIPER [58] and Object Pyramid [37]. SNIP and its improved version SNIPER enhance specific areas rather than all pixels. Object Pyramid expands the scale of positive samples by sampling. We ran comparative experiments with K = 1, 5 and 10 on split 1 of PASCAL VOC; the results are shown in Fig. 7. Fig. 7 shows that our method is superior to the other methods. In SNIP and SNIPER, the number of negative samples is also amplified, so their performance is even lower than the baseline. Although Object Pyramid eliminates the interference of negative samples and helps few-shot detection, it expands the samples only in terms of scale and cannot meet actual detection needs. In contrast, our method not only suppresses the interference of negative samples, but also amplifies positive samples according to the actual situation and increases sample diversity.
Fig. 7. Performance comparison of our HMPSA with several other methods on PASCAL VOC (taking split 1 as an example). K = 1, 5 and 10 are compared. *: Baseline + SNFA + OCL.
4.4.2. Selective non-local fusion attention module
Comparing the first and third rows of Table 4, we can see that SNFA also helps considerably in improving model performance. For K = 1, 2, 3, 5 and 10, mAP increases by 6.8%, 2.5%, 3.4%, 1.4% and 1.5%, respectively, which verifies the effectiveness of SNFA for improving detection performance. To show the contribution of SNFA more intuitively, we visualize the attention for some novel classes on the two datasets. Fig. 8 compares the attention of the baseline and SNFA. For each dataset, the top image is the attention of the baseline with the SNFA module removed, and the bottom image is the attention with the SNFA module added. As can be seen from Fig. 8, SNFA provides better attention regions for novel class detection, i.e., it better expresses novel class features. For a simple single object in the image, the attention of SNFA and the baseline does not differ much, but SNFA focuses more on the object itself than on the environment, as shown in Fig. 8(a), Fig. 8(c), Fig. 8(f) and Fig. 8(g). SNFA also provides more attention when objects are occluded, such as the bird in Fig. 8(b). When there are other salient interfering objects in the image, such as the motorbike in Fig. 8(d) and the dog in Fig. 8(i), SNFA allocates attention reasonably and avoids confusing the detector. When the image contains multiple objects and small objects, the area SNFA focuses on is also more reasonable, such as the horse in Fig. 8(e) and the boat in Fig. 8(j). Overall, SNFA optimizes the feature representation of positive samples, making it easier for the network to learn pure novel class features and thereby improving detection accuracy.
4.4.3. Optimized loss function
We split the optimized loss function into an orthogonal loss and an improved classification loss and evaluate them separately. The fourth and fifth rows of Table 4 report the results. For K = 1, 2, 3, 5 and 10, the orthogonal loss improves mAP by 4.7%, 1.7%, 1.9%, 0.9% and 0.9%, respectively, and the improved classification loss improves mAP by 3.8%, 1.4%, 1.5%, 0.7% and 0.6%, respectively. In general, when samples are scarce (K = 1, 2, 3), the optimized loss function brings a performance improvement of at least 3%, further boosting the few-shot detector's performance.
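The per-component gains quoted in the ablation text (for HMPSA, SNFA, the orthogonal loss and the improved classification loss) can be tabulated to check the stated comparisons. The sketch below uses only the numbers given in the text. Treating the gain of the whole optimized loss as the sum of its two parts is an assumption made here for illustration, since the two parts are evaluated separately; the variable names are hypothetical.

```python
K_SHOTS = [1, 2, 3, 5, 10]

# mAP gains (percentage points) over the baseline, as quoted in the text.
GAINS = {
    "HMPSA": [9.1, 7.6, 8.3, 4.0, 3.4],
    "SNFA": [6.8, 2.5, 3.4, 1.4, 1.5],
    "OL": [4.7, 1.7, 1.9, 0.9, 0.9],   # orthogonal loss
    "IMP": [3.8, 1.4, 1.5, 0.7, 0.6],  # improved classification loss
}

# Assumption: the two loss terms contribute additively.
loss_total = [round(a + b, 1) for a, b in zip(GAINS["OL"], GAINS["IMP"])]

# Indices of the scarce-sample settings K = 1, 2, 3.
scarce = [i for i, k in enumerate(K_SHOTS) if k <= 3]

min_scarce_loss_gain = min(loss_total[i] for i in scarce)
hmpsa_largest = all(
    GAINS["HMPSA"][i] > max(GAINS["SNFA"][i], loss_total[i])
    for i in range(len(K_SHOTS))
)

print("combined loss gain per K:", dict(zip(K_SHOTS, loss_total)))
print("minimum combined loss gain for K <= 3:", min_scarce_loss_gain)
print("HMPSA gives the largest gain at every K:", hmpsa_largest)
```

Under the additive assumption, the smallest combined loss gain in the scarce-sample settings is just above 3 points, and HMPSA remains the largest single contributor at every K.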
In addition, we analyze three hyperparameters of the optimized loss function experimentally. Table 5 shows the results. The experiments are conducted on split 2 of PASCAL VOC, and results are reported for K = 1, 5 and 10. The first row of Table 5 gives the best hyperparameters, which we found through repeated experiments. First, we fix the hyperparameter μ of the improved classification loss and vary the two hyperparameters φ and θ related to the orthogonal loss. When θ is held constant and φ is varied, model performance first increases and then decreases as φ grows. This shows that, although high weights in the orthogonal loss that promote the separation of samples from different classes are better suited to few-shot detection, the weights cannot be increased excessively, or model performance will suffer. A similar trend exists for θ: the data suggest that a higher weight on the orthogonal loss, relative to the improved classification loss term, leads to better model performance. Furthermore, changing θ has a larger impact on performance (larger fluctuations) than changing φ. Then, we fix φ and θ and vary μ to see how it affects the results. The same rise-then-fall trend appears, which indicates that, although increasing the weight of hard negative samples can improve the model's performance, hard negative samples should not be given excessive weight.
Fig. 8. Visualization of attention on some novel classes on the two datasets, Baseline vs. SNFA: (a) Bus; (b) Bird; (c) Cow; (d) Motorbike; (e) Horse; (f) Train; (g) Aeroplane; (h) Cow; (i) Dog; (j) Boat.
Table 5. Comparison of the performance impact of different hyperparameter settings in the optimized classification loss function (taking VOC split 2 as an example).
Finally, we combine the three components in pairs and observe their impact on performance. The sixth, seventh and eighth rows of Table 4 show the results. The data show that the components do not interfere with one another when stacked, and that combining them achieves better performance.
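The pairwise combinations can be enumerated mechanically. The short sketch below lists them with `itertools.combinations`; the order in which the pairs appear in Table 4 is not stated in the text, so the order produced here is only illustrative.

```python
from itertools import combinations

# The three proposed components evaluated in the ablation study.
COMPONENTS = ("HMPSA", "SNFA", "OCL")

# Every unordered pair of components: three configurations in total.
pairs = list(combinations(COMPONENTS, 2))

for first, second in pairs:
    print(f"Baseline + {first} + {second}")
```

Three components give exactly three pairs, which matches the three rows (sixth to eighth) reserved for them in Table 4.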
We visualize our detection results for novel classes on both datasets in Fig. 9, selecting typical success and failure cases to help analyze our model's performance. The success cases show that our model performs well when the object is relatively simple and isolated. When the object is in a complex environment, for example with occlusion, small object scale or other salient interfering objects in the scene, our model still performs well. Even so, our model still produces some false detections, as shown in the rightmost column of Fig. 9. These failure cases are mainly caused by the object lacking key features or being too similar to other classes, which suggests that distinguishing easily confused objects when samples are scarce is a worthwhile research direction.
This article proposes a new FSOD model based on YOLOv4. Our model focuses on improving the quantity and quality of positive samples, which effectively boosts performance on FSOD tasks. Specifically, we first propose an HMPSA module, which amplifies the number of positive samples according to the actual situation and increases sample diversity. Second, we design an SNFA module, which achieves better feature representation and enables the network to learn higher-quality positive sample features. Finally, we optimize the loss function of the baseline detector to better suit few-shot detection. Experiments on two benchmark datasets indicate that our model outperforms other FSOD methods when samples are scarce and adapts better to the FSOD task.
Fig. 9. Visualization of the detection results of our model for novel classes on PASCAL VOC and MS COCO, including success cases (green boxes) and failure cases (red boxes). Gray boxes are detections of irrelevant classes.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
The authors would like to acknowledge the China National Key Research and Development Program (Grant No. 2016YFC0802904), the National Natural Science Foundation of China (Grant No. 61671470) and the 62nd batch of funded projects of the China Postdoctoral Science Foundation (Grant No. 2017M623423) for providing funding for the experiments.