Feature Matching via Topology-Aware Graph Interaction Model

2024-01-27 06:47YifanLuJiayiMaXiaoguangMeiJunHuangandXiaoPingZhang
IEEE/CAA Journal of Automatica Sinica 2024年1期

Yifan Lu , Jiayi Ma ,,, Xiaoguang Mei , Jun Huang , and Xiao-Ping Zhang ,,

Abstract—Feature matching plays a key role in computer vision.However, due to the limitations of the descriptors, the putative matches are inevitably contaminated by massive outliers.This paper attempts to tackle the outlier filtering problem from two aspects.First, a robust and efficient graph interaction model,is proposed, with the assumption that matches are correlated with each other rather than independently distributed.To this end, we construct a graph based on the local relationships of matches and formulate the outlier filtering task as a binary labeling energy minimization problem, where the pairwise term encodes the interaction between matches.We further show that this formulation can be solved globally by graph cut algorithm.Our new formulation always improves the performance of previous localitybased method without noticeable deterioration in processing time,adding a few milliseconds.Second, to construct a better graph structure, a robust and geometrically meaningful topology-aware relationship is developed to capture the topology relationship between matches.The two components in sum lead to topology interaction matching (TIM), an effective and efficient method for outlier filtering.Extensive experiments on several large and diverse datasets for multiple vision tasks including general feature matching, as well as relative pose estimation, homography and fundamental matrix estimation, loop-closure detection, and multi-modal image matching, demonstrate that our TIM is more competitive than current state-of-the-art methods, in terms of generality, efficiency, and effectiveness.The source code is publicly available at http://github.com/YifanLu2000/TIM.

I.INTRODUCTION

FEATURE matching, which aims to construct reliable correspondences between two feature sets, is a fundamental problem in computer vision, and acts as a premise to tackle many important tasks including structure from motion (SfM),simultaneous localization and mapping (SLAM), image registration and fusion [1], [2].Nevertheless, the matching problem possesses an extremely high computational complexity due to the combinatorial nature.To reduce the searching space, a two-step pipeline is typically selected.First, a set of putative correspondences is constructed based on the similarity of feature descriptors which are robust to a set of transformations (e.g., SIFT [3] and SuperPoint [4]); then, a correspondence pruning algorithm is adopted to remove the false matches (a.k.a.outliers) while keeping the true matches (a.k.a.inliers) to a maximum by imposing geometric constraints [1],[5]–[7].This paper focuses on outlier filtering from a given set of putative matches, which is at the core of the two-step pipeline.

In this context, a variety of methods from handcrafted to deep learning [1] have been investigated to address the outlier filtering problem in the past few decades.Nevertheless,designing a general, effective and efficient algorithm is still a huge challenge, which mainly includes three aspects.First,due to the limitations in the local descriptors, the set of nearest neighbor matches generally contains a great majority of false matches, especially when the images suffer from lowquality, occlusion or repetitive patterns.Second, in real-world scenarios, the transformation models between two images vary with different image types, which can be either parametric (e.g., homography and epipolar geometry), non-parametric(e.g., non-rigid deformation), or even multiple visual patterns(e.g., multiple objects undergoing independent movements).The complexity of real-world scenarios makes it hard to devise a general algorithm.Third, the high computational complexity is another obstacle, especially when outliers dominate.

Recently, several efforts [8]–[11] have tried to tackle the above-mentioned challenges from a locality preserving perspective.The philosophy behind this is that the local neighborhood structure among feature points may not change freely due to the physical constraints in a small region around an inlier, while the same does not hold for randomly occurring outliers.By measuring the preservation of neighborhood structure of each correspondence, the inliers and outliers can be easily identified.The locality preserving strategy is very efficient and effective in most simple cases, but the problem arises when the majority of matches are outliers (e.g., 90%outlier ratio, which often happens in real-world scenarios).For one thing, the large number of outliers affects the local topology of inliers.If an inlier is surrounded by outliers, then this inlier will be identified as outlier incorrectly.For another, the randomly occurring outliers will also happen to conform to the topology consistency, making various gross outliers preserved.

This paper addresses the above limitations in two folds.First, a novel graph interaction model is proposed.In previous locality preserving model, each pair of correspondences is treated independently and judged separately; however, correspondences are correlated and there should be interaction between correspondences.For example, a correspondence around and consensus with an inlier is more likely to be an inlier.To this end, we cast the outlier filtering task as a binary labeling energy minimization problem to include pairwise interaction between correspondences.The derived energy function is shown to be regular and can be solved globally and exactly by graph cut technique in polynomial time.In our experiment, adopting our graph interaction model as a replacement for locality preserving-based methods always improves the performance without noticeable deterioration in processing time, adding generally a few milliseconds.

Second, a robust and geometrically meaningful topologyaware relationship is designed.In previous efforts, the topology metrics were modeled implicitly, which is very relaxed,e.g., element and motion consensus [8], and ranking similarity [9].The relaxed metrics can accommodate various complex scenarios, but also tend to include many outliers.We therefore introduce the topology-aware relationship which explicitly models topology consistency and, is scale, rotation,and translation invariant.The introduced relationship is further extended to affine invariant by PCAW normalization technique.We term the combination of the graph interaction model and topology-aware relationship as topology interaction matching (TIM).The qualitative and quantitative experiments on several vision tasks demonstrate that TIM can achieve comparable or better performance compared to stateof-the-art methods.

The contributions of this paper are summarized as follows:

1) We propose a robust yet efficient graph interaction model for outlier filtering, where correspondences are considered to be correlated rather than independent.We also show that the derived model can be solved globally and efficiently by graph-cut technique.

2) We design a topology-aware relationship between correspondences.It can capture the correct geometric relationship between correspondences without being affected by outliers,thus building a more robust graph structure.

3) We combine the graph interaction model and topologyaware relationship to obtain TIM, and apply it to several vision tasks.We demonstrate that our method has a strong generalization ability and can achieve comparable or better performance compared to state-of-the-art methods.

II.RELATED WORKS

This section briefly reviews the outlier filtering in the feature matching process.Outlier filtering is a long-standing problem which has been studied intensively, yielding a variety of approaches with different objectives.

Resampling-based methods are the most prevalent paradigm for this outlier filtering task and were first developed by the classic random sample consensus (RANSAC) [5] in 1981.It follows a hypothesize-and-verify strategy that repeatedly samples a minimal subset from the data to fit a predefined parametric model as a hypothesis, e.g., 2 for a line, 4 for a homography, and 7 for a fundamental matrix.Then, the model quality, i.e., the inlier number, is evaluated.The process is iterated until a termination criterion that provides a probability guarantee of hitting an all-inlier subset is met.In the end, the model with the highest quality, polished by a least squares fitting, is returned.Various variants such as MLESAC [12],PROSAC [13], LO-RANSAC [14], and DegenSAC [15] have been proposed to improve the performance.These greatest improvements are integrated in USAC [16], giving the stateof-the-art RANSAC variant.Recently, GC-RANSAC [17]applies graph-cut technique to determine the inlier set instead of simply adopting a threshold.The user-defined inlier-outlier threshold is further eliminated by marginalizing over noise level in MAGSAC [18] and MAGSAC++ [19].Though achieving promising results, they still have several limitations.On the one hand, they require a predefined geometric parametric model, which is often not available in advance.On the other hand, their performances degrade rapidly or even fail when the outliers are dominant in the putative set.It is important to note that GC-RANSAC [17] introduces pairwise term to local optimization step in the RANSAC pipeline, which is similar to ours.However, a geometric model needs to be given in advance, which is obtained by random sampling.Moreover,its graph structure is quite simple.

Non-parametric interpolation-based methods can handle more general feature matching problems, such as deformable image matching, where there is no predefined geometric model.Representatives include identifying correspondence function (ICF) [20], vector field consensus (VFC) [21], and coherence based decision boundaries (CODE) [22].Typically,they interpolate a non-parametric vector field function based on a smoothness prior, in which the motion field is slow and smooth, and then eliminate the outliers by judging whether it is consistent with the motion field.Nevertheless, these methods generally possess O(N3) complexities that greatly limit their applicability in real-time tasks.For efficiency consideration, a sparse random basis technique proposed in SparseVFC[23] is used to reduce the space and time complexity to O(N).This idea is further extended in CRC [24], where a fixed and compact Fourier basis is adopted to construct the vector field function.Particularly, Gaussian process matching (GPM) [25]uses a Gaussian process to characterize the motion field and approximates the posterior using the inducing points.With the help of stochastic variational inference (SVI), the complexity is further reduced to O (K) whereKis the mini-batch size.The problem with these non-parametric interpolation-based methods is that they may fail when the smoothness prior does not hold, such as depth discontinuity, wide-baseline or independent motion in the scene.

Graph-based methods provide a novel perspective for solving the matching problem, which generally construct a weighted graph by associating each feature point to a node and specifying edges with geometric constraints.These methods offer convenience for exploiting the intrinsic structure among feature points.Instead of following two-step pipeline,many graph-based methods attempt to directly establish correspondences between two feature point sets [26]–[28].The graph-based approach is quite flexible, but it has the drawback of high computational complexity and is not prominent due to the simple geometric constraints.Our method also belongs to this category, but our objective can be solved globally by graph cut in polynomial time.In addition, our proposed topology-aware relationship leads to a more robust graph structure.

Locality preserving-based methods filter outliers based on the observation that the spatial local neighborhood structure of inliers is generally well-preserved due to physical constraints,while the random outliers are normally inconsistent with their neighbors.Representatives include locality preserving matching (LPM) [8] and grid-based motion statistics (GMS) [29].These methods generally have great efficiency, but the results are coarse and depend on high inlier ratio.Recently, AdaLAM[30] integrates a series of handcrafted ideas that combine local affine consistency with multiple locally RANSACs.Thanks to modern GPU parallel acceleration, AdaLAM is surprisingly fast with satisfactory matching results.Unlike locality preserving based methods only paying attention to locality, our method additionally models the interaction between correspondences and is therefore more robust to outliers.

Learning-based methods have made dramatic progress on various vision tasks due to powerful representation capabilities.In the context of outlier filtering, PointCN [6] is the first attempt to apply deep learning for this task.Inspired by Point-Net [31], a multi layer perceptrons (MLPs) based deep neural network is trained, where a simple context normalization (CN)operation is introduced to capture the global context information.In addition, the network is trained by solving a classification (inlier or outlier) and a regression (essential matrix) problem simultaneously.Follow-up efforts improve the network architecture by inserting more local context information, such as local and global clusters [32], [33], attention mechanism[34], and mining neighborhood information [35].Recently,SuperGlue [36] opened up another line of feature matching,which integrates feature matching and outlier rejection into a single fully connected graph neural network (GNN).The self/cross attention mechanism of transformer [37] is introduced to aggregate information, which exhibits surprising results.Despite achieving promising performance, the generalizability of these methods, as other data-driven methods, still remains unclear.

III.METHODOLOGY

A. Locality Preserving Model Revisited

The core idea of locality preserving model is that the distribution of neighboring point pairs after transformation should be preserved.It is conceptually simple and effective in most scenarios.To filter outliers, it defines the following cost function by preserving the local structure around matches:

To design a locality measurement function that is robust to various transformations and outliers, different forms ofD(Nxi,Nx′i;ci)have been developed, such as the neighborhood element consensus and motion consensus defined in LPM [8], and ranking similarity in mTopKRP [9], which are translation, rotation, and scale invariant.For sparse feature correspondences,D(Nxi,Nx′i;ci) typically measures the locality differences by computing the consensus of all neighborhood correspondences withci.It has the following form:

Due to the simplicity of the objective function (1), it has a closed-form solutionp∗, where

The solution is easy to understand.It suggests that the better the locality is preserved, i.e., the lowerD(Nxi,Nx′i;ci), the more likely it is to be an inlier, and vice versa.The above formulation treats each pair of correspondences independently and judges them separately, as shown in Fig.1(a); however,correspondences are correlated and there should be interaction between correspondences.For example, a correspondence around and consensus with an inlier is more likely to be an inlier, while the randomly distributed outliers share no such interaction.Using a simple hard threshold solution and neglecting the interaction between correspondences, some outliers will be identified as inliers by chance, while some inliers will be identified as outliers due to the interference of gross outliers.Next, we will show the connection of locality preserving model and graph model, so that additional interaction between correspondences can be modeled naturally.

Fig.1.Framework of our graph interaction model and comparison with locality preserving model.(a) Locality preserving model calculates a score for each match based on locality information and judges each match independently by hard thresholding.However, the presence of massive outliers degrades its performance as it relies only on the locality; (b) Our graph interaction model utilizes the same locality information to construct a graph.Instead of thresholding separately, we consider the matches to be relevant and model their interactions, which can then be solved by graph-cut technique.This approach leads to better matching results.In the image matching plots, blue: true match; red: false match.Best viewed in color.

B. Graph Interaction Model

1)Graph Structure From Locality Preserving Model: The above-mentioned locality preserving model measures the locality disparity for each correspondence by calculating the consensus with the surrounding correspondences.This consensus relationship between correspondences can be represented by a graph, which provides more flexibility to outlier filtering task.Consider a weighted directed graphG=(V,E)withNnodes where each node represents a correspondence in C.To avoid ambiguity, in the following, we usecto denote the nodes in V.To encode the consensus relationship into the graph, we define the directed edgeei→jlinking nodecjtocias follows:

Remark 1:The criterion for judging inlier-outlier of locality preserving model in (4) is equivalent to measuring the outdegree of each node in the graph

The conclusion of Remark 1 is quite intuitive.The constructed graph represents the locality relationship between the correspondences.The more edges a node has connected to other nodes, the better the local structure is preserved, the more likely it is to be an inlier, and vice versa.We show a real-world example in the middle of Fig.1(b), where the graph structure is derived directly from the LPM [8].We find a more dense connection between inliers, as they are consensus with the surrounding inliers, while the outliers are the opposite.

2)Problem Formulation: With the well-constructed robust graph structure where edges encode the local consensus relationship between nodes, we formulate the outlier filtering task as a binary labeling (L∈{0,1}N, 0-outlier, 1-inlier, which plays the same role as the indicator vectorpin (1)) energy minimization problem containing unary and pairwise terms as follows:

whereαis a trade-off balancing two terms.The unary term measures the local structure preservation of each correspondence, as locality preserving model does, and the pairwise term models the interaction between correspondences.Note that when α=0, it will degenerate to locality preserving model as indicated by Remark 1.

a)Unary term: The original locality preserving criterion using a step function can be reformulated as

where

λ˜=1-λ for simplicity;Li∈Lis the label assigned toci.Since quantization using step function leads to loss of information about the confidence that a correspondence is inlier or outlier, we use a Gaussian kernel for continuity

where ϵ2=λ˜2/2ln2.The unary term encourages those matches with highdeg+(ci) to be labeled as inliers, and vice versa.

b)Pairwise term: The graph structure encodes relationship between matches, and hence the interaction between two connected nodes is naturally modeled in the pairwise energy term

whereVi,j(Li,Lj) imposes smoothness constraint evaluating the cost of assigning the labelsLiandLjto nodesciandcj.In the field of image segmentation, stereo matching, etc., a commonly used model is the Potts model, which encourages the neighboring nodes to have the same label

In our setting, the two labels correspond to the inlier and outlier with different distributions.Since only the inliers are consensus with each other, edges usually exist between the inliers, while few between the outliers, as is shown in Fig.1.Consequently, when two nodes are connected by an edge, we tend to penalize two cases, i.e., those are with different labels and those are both outliers.Taking the edge weights into consideration, the pairwise function turns to

3)Solution: The energy function (8) can be justified on Bayesian grounds using the well-known Markov random fields [38], which is quite difficult to optimize using general optimization methods such as simulated annealing because of the highly non-convex and high-dimensional nature of the objective function.In addition, such optimization techniques usually require exponential computational complexity, which is extremely slow in practice.In the past two decades, many algorithms have tackled this problem efficiently, e.g., belief propagation [39], graph cut [40], and dual decomposition [41].

Since our pairwise term satisfies the inequality

the energy function (8) is regular [40], which allows it to be minimized globally using graph-cut.By introducing two special nodes (terminals) sourcesand sinkt, and constructing the problem graph, the minimization problem is equivalent to minimum s-t-cut problem, which is solvable exactly in polynomial time.For a detailed discussion to this conclusion, we refer to the seminal work of [40], [42].

C. Topology-Aware Relationship

From the above discussion, it is clear that the graph structure encoding the relationship between correspondences, is at the core of the graph interaction model, which controls the performance of outlier filtering.That is to say, how to define the consensus relationshipd˜(cj;ci) is crucial.Previously developed relationship [8], [9] is invariant to translation, rotation, and scale.However, the problem arises when outliers dominate.Since the topology is not explicitly modeled, the metrics of these methods are rather relaxed and simple, e.g.,measuring only motion vector or ranking consistency, which result in a large number of outliers.

To solve this dilemma, a topology-aware relationship is introduced, which models the topology consistency explicitly.Some real-world examples are shown in Fig.2, where the right three columns present the graph structures and matching results of LPM [8], mTopKRP [9], and ours, respectively.Clearly, our topology-aware relationship is more robust to outliers and geometrically meaningful, and therefore resulting in better matching performance.The topology-aware relationship is based on the observation that, in real-world scenarios,the local topology structure among a correct feature correspondence does not change freely due to the physical constraint, while the same is not true for outliers [8], as shown in the left of Fig.3.By explicitly modeling the topology consistency, we will show that the topology-aware relationship is more robust to outliers, even if the outlier has a similar motion vector to the nearby inliers.

We begin with a simple case, where the local transformation of feature points approximately follows a similarity transformation:

wheresis a scalar representing the isotropic scaling,R∈S O(2)denotes the rotation matrix,tis a translation 2Dvector, and N denotes a local neighborhood.For a putative feature correspondenceciwith its neighborhoodcj∈Nci,s(i,j)andR(i,j)encoding the topology information can be directly obtained by solving the following equation:

The solution is simple, e.g., the rotation matrix can be calcu-

Fig.2.Different graph structures derived from different methods.Our graph structures constructed from topology-aware relationship is robust to outliers and geometrically meaningful.Best viewed in color.

Fig.3.Topology consistency.The topology structure of inliers is consistent, while different for the outliers, even though they both have similar motion vectors (left).The topology consistency is also extended to affine invariant by PCAW normalization (right).Best viewed in color.

Affine Invariant: In real-world scenarios, however, the topology consistency priori characterized by simple similarity transformation is barely the case.By comparison, the affine transformation is more reasonable and commonly used to characterize the local topology structure [30], [43], [44].In this case, the two local point sets satisfy

tively.Then the normalized data after PCAW can be expressed as

whereD=diag{λ1,λ2} is a diagonal matrix.

Lemma 1:If two affine related point setsXandX′are principal component analysis whitening (PCAW) normalized,then they will be rotationally equivalent

whereR(θ)∈S O(2) denotes a rotation matrix.

Proof: The two affine related correspondences sets satisfy

After the first step, we get the following equation:

which indicates thatXˆ′andXˆ are only related linearly by a 2×2 matrix.Hence, their covariance matricesC′and C meet the following equations:

information, which also works very well.

Now, with the PCAW normalization technique, the local affine case can be simply converted to the local similarity case, which has been discussed in detail previously.

D. Topology Interaction Matching

The topology-aware relationship can be directly embedded into the graph interaction model and solved globally by graphcut technique.We term the combination of these two novel parts as topology interaction matching (TIM).For the sake of effectiveness and efficiency, we additionally introduce two operations, namely, distance weighting and seed selection.

1)Distance Weighting: Assuming that the closer the distance between two nodes, the stronger the interaction is, and vice versa, the topology relationship between two nodes is rewritten as

where //ci-cj// calculates the Euclidean distance betweenciandcjin 6D space, and ε2is a hyperparameter controlling the scale of distance weight.

2)Seed Selection: In the image feature matching task, there are typically thousands of correspondences.Thus, it would be a huge computational burden if the topology-aware relationship is computed for each correspondence and its neighborhood.In addition, due to the presence of large number of outliers, most of the computations are meaningless.To this end, a straightforward strategy is to select correspondences with high inlier confidence as seed nodes [28], [30], e.g., by ratio test.Then, we only calculate the topology-aware relationship between the seed nodes and other nodes when constructing the graph structure.By using this seeded strategy, we find that the efficiency of our algorithm can be greatly improved without loss of matching performance.The whole algorithm of TIM is outlined in Algorithm 1.

Algorithm 1 Topology Interaction Matching C={ci |ci=(xi,x′i)}N i=1 ˜λ ε2 Input: putative set with ratios R, parameters, α, K, τ, ;I∗Output: inlier set ;G 1 Initialize graph with N nodes;S 2 Select seed nodes using ratio test;3 Construct K-nearest neighborhood of seed nodes using in 6D space;ci S{Nci|ci ∈S}C 4 foreach node in do{xj|c j ∈Nci} {x′j|c j ∈Nci}5 Normalize and by PCAW, respectively;6 foreach node do cj ∈Nci 7 Add edge according to (5) and (32);8 end 9 end ci ∈C ei→j 10 foreach node do 11 Construct unary term according to (11);12 end ei→j ∈E 13 foreach edge do 14 Construct pairwise term according to (14);15 end L∗16 Obtain the optimal label by graph-cut algorithm;I∗={ci|L∗i =1, L∗i ∈L∗}17 Determine the inlier set.

E. Computational Complexity

F. Implementation Details

In particular, to keep the feature points of the image pair at the same scale, the feature point coordinates are first normalized by the image size.For seed selection, we adopt ratio test with 0.8 threshold, which achieves the best balance between matching performance and efficiency.Since the graph structure of our method is very sparse, and we find that some nodes do not have any edges to other nodes, these nodes can be labeled as outliers in advance.This operation can speed up our algorithm.In addition, applying the iterative strategy to our method is straightforward for better matching performance.However, we found no significant performance improvement but a noticeable longer running time.Hence, we do not adopt the iterative strategy in our experiments.

There are five parameters in our method: λ˜ ,α,K,τ, and ε2.The first two parameters are used in the graph interaction model.λ˜ controls the penalty of topology preservation in unary term.αis a trade-off parameter balancing unary term and pairwise term.The remaining three parameters are used in the topology-aware relationship.Krepresents the size of the nearest neighborhood.τis a threshold for quantifying cost in(20).ε2controls the scale of the distance weight.In our experiments, we empirically set the default values as λ˜=0.5,α=0.95,K=14, τ =0.8, and ε2=0.2.

IV.EXPERIMENTS

In order to comprehensively evaluate the performance,extensive experiments are conducted, including general purpose feature matching, relative pose estimation, homography and fundamental matrix estimation, loop-closure detection,and multi-modal image matching.In the end, we provide ablation studies for better understanding each component of our method.All the experiments are conducted on a desktop with 2.9 GHz Intel Core i7-10700 CPU, 16 GB memory with MATLAB code, except deep learning methods and one particular method AdaLAM [30] that requires GPU parallel acceleration.We run them on a different machine with RTX3090 GPU and Python code.

A. Results on Feature Matching

We first focus on establishing feature correspondences between real image pairs for general purpose, i.e., it should not have a predefined transformation to cover more scenarios.To this end, we select five publicly available datasets involving different types of image transformation as follows:

1) RS [47].The dataset captured by satellites is a remote sensing dataset, including color-infrared, SAR, and panchromatic photographs.It contains 161 image pairs which generally undergo a homography transformation.

2) VGG [43].The dataset is either a planar scene or the camera position is fixed.Thus, the image pairs in this dataset always obey homography.It contains 40 image pairs where ground-truth homographies are supplied.

3) DTU [48].The dataset is originally designed for multiple view stereo evaluation, which involves a lot of different scenarios with a wide range of objects.Each scenario has been taken from 49 or 64 positions, where the high accuracy ground-truth camera positions and internal camera parameters have been provided.We choose two scenarios from the dataset (i.e., Frustum and House) and create 131 image pairs in total for evaluation.

4) AdelaideRMF [49].The dataset containing 38 image pairs is divided into two parts.The first 18 image pairs including different buildings are mainly related to epipolar geometry.The other 18 image pairs are related to multiple motions,i.e., multiple objects exist and undergo independent movements between two images, which is an immense challenge for many methods.

5) Nonrigid.The dataset is a combination of the 720 Yun[50] and FE [51] datasets.Specifically, 720 Yun is a cloud dataset containing 20 image pairs including terrain, roads,buildings, terraces, etc.FE is taken by a fish-eye camera from two scenes such as University and Urban with 32 image pairs.A total of 52 images from both datasets undergo nonrigid transformation, which makes matching more difficult.

The correctness of each correspondence in a putative set is determined based on the ground truth information provided by the datasets and then checked manually.

Fig.4.Qualitative feature matching results of our TIM on 9 representative image pairs.For each group of the results, the left plot is the matching result, and the right motion field provides the decision correctness of each correspondence in the putative set, where the head and tail of each arrow correspond to the positions of feature points in two images.In addition, the statistical results of each image pair are reported in the top left corner of the images.Blue: true positive;Black: true negative; Green: false negative; Red: false positive.For visibility, in the left plot, at most 200 randomly selected matches are presented, and the true negatives are not shown.Best viewed in color.

Twelve state-of-the-art methods from handcrafted to deep learning are chosen for comparison, namely, RANSAC [5],MAGSAC++ [19], LPM [8], GMS [29], CRC [24], MCDM[52], GLOF [53], AdaLAM [30], PointCN [6], OANet [32],CLNet [33], and MSANet [54].Briefly, RANSAC and MAGSAC++ are resampling-based methods, LPM focuses on preserving locality, GMS is a grid-based motion statistics method, CRC interpolates a smooth vector field using Fourier bases, MCDM encodes motion consistency into a graph and then solves a probabilistic graphical model, GLOF removes outliers based on local outlier factor, AdaLAM filters outliers by local affine motion verification with sample-adaptive threshold, PointCN adopts a PointNet-like architecture with a simple context normalization operation to encode global context information, OANet clusters unordered correspondences to capture local context, CLNet designs local-to-global dynamic graphs to estimate consensus scores progressively,and MSANet proposes a multi-scale attention block to enhance the robustness.All methods are implemented based on their publicly released codes, and we have tried our best to tune the parameters and then fix them throughout the feature matching experiments.It should be mentioned that RANSAC and MAGSAC++ use the homography model for RS dataset and fundamental matrix model for others since the homography model characterizes the remote sensing data more appropriately.In addition, we find that the downsampling rate in CLNet has a significant impact on the performance of feature matching especially when the inlier ratio is high.Thus, we change the default downsampling rate from 0.5 to 1 in this experiment.

The precision (P), recall (R) and F-score (F) are selected as the evaluation metrics, where precision is defined as the ratio of the true inliers among those preserved “inliers” by a matching algorithm, recall is defined as the percentage of preserved true inliers among the whole inliers contained in the original putative set, and F-score is defined as the ratio of 2 × P × R and P + R.

1)Qualitative Illustration: To provide a qualitative illustration of the feature matching performance of our TIM, we select 9 representative image pairs undergoing different types of transformations (e.g., epipolar geometry, homography,independent motions, and nonrigid deformation) and present the visual and statistical results in Fig.4.For the results of each group, the left plot demonstrates the matching result schematically, and the right motion field provides the decision correctness of each correspondence in the putative set.Due to the generality and robustness of our method, the results shown in Fig.4 demonstrate promising matching performance despite facing various challenging scenarios.

2)Quantitative Comparisons on Datasets: We next conduct a quantitative evaluation of our method on five publicly available feature matching datasets and compare it to nine state-of-the-art methods mentioned above.The quantitative results are comprehensively reported and summarized in Fig.5, where each column provides the cumulative distribution of precision, recall, F-score, and runtime on a certain dataset.In addition, the average statistics of each metric are given in the legend.

From the results, it can be seen that PointCN performs the worst on almost all five datasets.This is because PointCN [6]relies solely on global context information and, as a datadriven deep learning method, it does not generalize well to unseen scenes.The follow-up deep learning methods such as OANet [32], CLNet [33], and MSANet [54] show some improvement as more complex operations are added to capture local context information.However, their performances still fall behind our method.It is important to note that these deep learning methods have surprising performances on the specific tasks and datasets they are trained on, such as relative pose estimation, as presented later.GMS [29] is the most efficient method, thus only a rough matching result can be given.Due to adopting a fixed number of basis functions, CRC [24]is also very efficient without loss of matching performance.However, its matching performance will degenerate when the smooth prior does not hold, e.g., multi-motion in AdelaideRMF and wide-baseline in DTU.The resampling-based methods such as RANSAC [5] and its variant MAGSAC++[19] always achieve high precision but lower recall.The locality-based LPM [8], MCDM [52], and GLOF [53] perform well in simple cases.Yet, their performances will degrade when facing massive outliers.The recently proposed locally affine based AdaLAM [30] almost always has a remarkably high accuracy, but the strict constraint also limits its recall.Owing to the flexibility of the graph interaction model and the robustness of the topology-aware relationship, our method is well adapted to various complex scenarios.In general, our proposed TIM achieves the best balance of matching performance with respect to the precision and recall.Note that it is hard to directly compare the efficiency between these methods since GPU implementation is significantly faster than the MATLAB version on CPU.

Fig.5.Quantitative comparisons of our TIM and other methods on five image datasets.From left to right: RS VGG, DTU, AdelaideRMF, and Nonrigid.From top to bottom: the cumulative distribution of precision, recall, F-score, and runtime.For the cumulative distribution function (CDF), a point on the curve with coordinate ( x,y ) indicates that there are (1 00×x)% of image pairs have the performance value (e.g., precision, recall, F-score, and runtime) no more than y.The average performance of each method is shown in the legend, where red indicates the best, blue the second and green the third.

B. Results on Relative Pose Estimation

For the image feature matching task, one of the most important applications is towards the recovery of the scene’s information, e.g., 3D structure.In the pipeline of this process, the feature matching methods are used in conjunction with robust estimators, e.g., RANSAC, to estimate the essential matrix and thereby recover the relative camera pose.A robust feature matching algorithm will significantly boost the accuracy of the estimation, which is beneficial for subsequent operations.

The evaluation is conducted on YFCC100M [55] outdoor scene dataset and SUN3D [56] indoor scene dataset as [32].The Yahoo’s YFCC100M dataset [55] collected 100 million photos from the internet and was later organized into 72 scenes reconstructed with the Structure from Motion software VisualSfM, providing bundle adjusted camera poses, intrinsics and triangulated point clouds.Following [32], we use 4 sequences (i.e., Buckingham palace, Sacre coeur, Reichstag,and Notre dame front facade) as our test set, which contains 4000 image pairs in total.Note that we use [57] to recover the camera poses and generate ground truth.The SUN3D dataset[56] is an RGB-D video dataset with camera poses computed by generalized bundle adjustment, which is used as indoor scenes in our evaluation.We sub-sample the videos every 10 frames and select the same 15 sequences for evaluation as[32], which creates in total 14 872 image pairs.

In our evaluation, the OpenCV SIFT [3] detector and descriptor are adopted with nearest neighbor (NN) matching to establish putative correspondences.In each image, the maximum number of keypoints is limited to 4000.The additional information of SIFT detector such as orientation and scale is also provided to the required feature matching algorithm, e.g.,GMS and AdaLAM.After performing outlier filtering step,we adopt OpenCV USAC to estimate essential matrix robustly.As a variant of RANSAC, USAC [16] is faster and more accurate than the vanilla RANSAC according to our experiment.The estimated essential matrix is then decomposed into rotation and translation.We measure the rotation and translation errors in degree and take the maximum of two,and report the exact area under the curve (AUC) with thresholds of 5 and 10 degrees.In addition, we also present the Fscore of the feature matching algorithm before USAC with respect to the ground-truth inliers.The ratio test of 0.9 is added to the comparator set.Since MAGSAC++ itself is a robust estimator, we combine the ratio test with MAGSAC++to replace the USAC to estimate the essential matrix.

Qualitative illustration of our method compared with ratio test [3], LPM [8], AdaLAM [30], and OANet [32] is presented in Fig.6.It shows that our method can accurately and robustly identify the inliers and outliers even if all other methods fail.

Fig.6.Qualitative illustration of relative pose estimation.The top two rows are outdoor scenes from YFCC and the bottom two rows are indoor scenes from SUN3D.Correspondences consistent with ground-truth epipolar geometry are shown in blue, otherwise in red.Best viewed in color with 200% zoom in.

Quantitative comparisons on outdoor and indoor scenes are reported in Table I.The results show that many regular handcrafted outlier filtering methods do not perform well, e.g.,USAC with ratio test, LPM, GMS, and CRC.This is because massive outliers will severely degrade their performances.In comparison, both our method and the recent AdaLAM have shown satisfactory results.Deep learning-based methods all perform well on these datasets, especially the recently proposed CLNet, as they are trained in such setting.Our method achieves comparable performance to these state-of-the-art deep learning methods, even outperforming OANet on the YFCC and SUN3D datasets.

C. Results on Homography and Fundamental Matrix Estimation

Homography and fundamental matrix describe the intrinsic geometric relationships of the two-view geometry.Estimating them accurately from sparse correspondences is a critical part in computer vision.To comprehensively investigate the performance of our proposed method for this task, we collect six publicly available real-world datasets, i.e., EVD (15 pairs)[58] and HPatches benchmark (580 pairs) [59] for homography, and TUM (1000 pairs), KITTI (1000 pairs), T&T (1000 pairs), and CPC (1000 pairs) [60] for fundamental matrix.The detailed description of the datasets is as follows.

The HPatches benchmark contains 116 scenes where each scene has one reference image and five target images with ground-truth homography provided for each target image.The first 57 scenes undergo different illumination and the rest 59 scenes are captured with viewpoint changes.As in [59], we use SIFT [3] for detecting feature points and HardNet [61] for generating descriptors.The number of keypoints is limited to 4000.The EVD dataset undergoes extreme view changes,namely wide baseline or extreme zoom, where the putative correspondence set is given.The TUM, KITTI, tanks and temples (T&T), and community photo collection (CPC) datasets come from a recently developed benchmark1https://github.com/JiawangBian/FM-Bench[60].The detailsare as follows: i) The TUM SLAM dataset [62] consists of short-baseline image pairs of indoor scenes in the resolution of 4 80×640.ii) The KITTI odometry dataset [63] is collected by a camera on a moving vehicle.The image pairs are shortbaseline as well in the resolution of 3 70×1226.iii) The Tanks and Temples (T&T) dataset [64] provides various objects or scenes for image-based reconstruction, and thus contains wide-baseline image pairs for evaluation.The image resolution is 1 080×2048 or 1 080×1920.iv) The community photo collection (CPC) dataset [65] includes unstructured images of landmarks collected from Flicker, where the image pairs are wide-baseline with different resolutions.The benchmark randomly selects 1000 image pairs from each dataset and establishes correspondences with SIFT [3] and nearest neighbor(NN) matching.The maximum keypoint number is also set to 4000.Note that the ground-truth model, i.e., homography or fundamental matrix, is available for all datasets.

TABLE I QUANTITATIVE COMPARISON FOR RELATIVE POSE ESTIMATION.RED INDICATES THE BEST, BLUE RANKS THE SECOND, AND GREEN THE THIRD

Fig.7.Qualitative illustration of homography and fundamental matrix estimation.The top row is homography estimation from EVD and the bottom three rows are fundamental matrix estimation from FM-Bench.Correspondences consistent with ground-truth homography or fundamental matrix are shown in blue,otherwise in red.Best viewed in color with 200% zoom in.

As mentioned before, all feature matching algorithms need to be combined with a robust geometric estimator to obtain the model.Vanilla RANSAC is used for simplicity.We evaluate the performance by comparing the deviation of the estimated model with the ground-truth model.For homography estimation, we adopt homography error defined in SuperPoint [4] as the geometrically meaningful metric.As for fundamental matrix, we follow the suggestion of the benchmark [60] and use normalized symmetric geometry distance (NSGD) as the metric, which is computed as the SGD (in pixels) divided by the length of image diagonals.It should be noted that due to the different data distributions (e.g., short and wide baseline),we set different RANSAC inlier-outlier thresholds for different datasets to achieve the best performance: 4 pixels for homography datasets (EVD and HPatches), 2 pixels for widebaseline fundamental matrix datasets (T&T and CPC), and 0.5 pixel for short-baseline fundamental matrix datasets (TUM and KITTI).We keep all settings the same for different methods.For competitors, the threshold of ratio test is set to 0.9.MAGSAC++ with ratio test is also added and denoted by the suffix “*”.

Some qualitative examples are given in Fig.7 with comparison to ratio test [3], LPM [8], AdaLAM [30], and OANet [32].The cumulative distribution functionsw.r.t.the model estimation errors are shown in Fig.8.For a more intuitive metric, we also provide the accuracy (Acc.), i.e., the ratio of accurate estimates to all estimates.An estimate is classified as accurate or not by a threshold, which is set to 4 pixels for the homography error and 0.05 for the normalized SGD as in [60].The statistics are reported in the Table II.The precision (P), recall(R), and F-score (F) are also included in the table for a more comprehensive comparison.From the results, it can be seen that our method performs well on both homography and fundamental matrix estimation tasks, especially on wide baseline datasets, such as EVD, T, and CPC.Fig.8.Quantitative comparisons of homography and fundamental matrix estimation.The top row is the cumulative distribution functions of the homography error on two homography datasets (EVD and HPatches), and bottom two rows are the cumulative distribution functions of the normalized SGD error on four fundamental matrix datasets (TUM, KITTI, T&T, and CPC).For the cumulative distribution, a point on the curve with coordinate (x,y) indicates that there are (1 00×y)% of image pairs have the performance value(e.g., homography error and normalized SGD error) no more thanx.The more accurate the method is, the closer its cure is to the top.

D. Results on Loop-Closure Detection

Loop-closure detection (LCD), aiming to tackle the drift emerging when robots travel around the route, is a fundamen-tal problem in Visual SLAM systems.This section considers the application of feature matching methods to appearancebased loop-closure detection.In this pipeline, the bag-ofwords (BoW) [66] method is first applied to select loop closure candidates to reduce the computational burden.Then outlier filtering methods are adopted to construct reliable image correspondences to detect the loop closure event.Three publicly available datasets, namely CityCentre (CC) [67], KITTI 00 (K00) [63], and KITTI 02 (K02) [63], are selected for our evaluation.Specifically, CityCentre dataset has variation in speed with resolution of 640 × 480, which contains 2474 images in total.KITTI 00 and KITTI 02 are two sequences from KITTI, which include 4541 and 4661 images respectively with resolution of 1241 × 376.All three datasets are of outdoor environment with ground truth generated by GPS logs.

TABLE II THE STATISTICS RESULTS ON HOMOGRAPHY AND FUNDAMENTAL MATRIX ESTIMATION.RED INDICATES THE BEST, BLUE RANKS THE SECOND, AND GREEN THE THIRD

For loop-closure detection task, the most important thing is to avoid false positive detections since they can significantly deteriorate the performance of a whole SLAM system and result in a completely inaccurate map.To this end, we choose the maximum recall rate when the precision is 100% as the evaluation metric.

Some qualitative results of our method embedded in LCD are shown in Fig.9 where the trajectories of robots on each dataset are drawn in black lines and the loop closure pairs are denoted by red hollow dots with blue lines connecting them.The red and green solid points are examples of false negative and true positive detection.The statistical results of quantitative comparison are reported in Table III.From the results, we can see that our method obtains the highest recall rates for 100% precision on two of the three datasets, which means our method is the most robust.In addition, none of the deep learning methods perform up to expectations on this task, especially MSANet [54] fails in two datasets, which shows that they are not able to generalize well to unseen scenes and untrained tasks.

Fig.9.Qualitative illustration of the loop-closure detection task.From top to bottom: CityCentre, KITTI00, and KITTI02.From left to right: the false negative examples, the robot’s trajectory, and the true positive examples.In the middle of each row, the black trajectory is obtained from GPS logs.The loop closure pairs are labeled as red hollow points while connecting them using a blue line.Best viewed in color.

E. Results on Multi-Modal Image Matching

Fig.10.Qualitative illustration of the multi-modal image matching task.The left columns are remote sensing multi-modal image pairs, the middle two columns are medical multi-modal image pairs, and the right two columns are multi-modal image pairs in the computer vision research area.In addition, the statistical results of each image pair are reported in the top left corner.Correspondences consistent with ground-truth homography are shown in blue, otherwise in red.

TABLE III THE STATISTIC RESULTS ON LOOP-CLOSURE DETECTION.RED INDICATES THE BEST, BLUE RANKS THE SECOND, AND GREEN THE THIRD

Multi-modal image matching, which aims to establish correspondences of same or similar objects from two images with different modalities, is a fundamental and important task for many applications, such as remote sensing, medical diagnosis,and computer vision [68], [69].In this section, we evaluate the performance of our method on multi-modal image matching task.We choose the dataset provided by [68] for evaluation,and select six different multi-modal scenes, i.e., IR-Optical,SAR-Optical, Retina, PD-T1, VIS-IR, and Day-Night, where the first two scenes are from remote sensing, the middle two is from medical diagnosis, and the last two are from computer vision.FAST detector [70] and RIFT descriptor [71] are used for feature extraction.Then nearest-neighbor matching is applied to construct the putative correspondence set.The qualitative illustration results of our proposed method and LPM are shown in Fig.10, where the statistical results are also reported in the left top corner of each image pair.From the results, we can see that our method copes well with the multimodal image matching task, which demonstrates its generalization ability and its potential for applications in multi-modal image processing.

F. Model Analysis and Limitation

1)Parameters Studies: Our TIM contains 5 parameters: λ˜,α,K,τ, and ε2.To test the performance of our method to the parameters change, we randomly selected 100 image pairs from the YFCC dataset as the test set.The results are shown in Fig.11, from which we can see that our method is not sensitive to parameters change.

Fig.11.Parameters studies of our TIM.We report the changes of various metrics with respect to parameters λ˜ , α, K, τ, and ε2.

In addition, becauseαcontrols the weights of the unary and binary terms in (8), its value is very important for the algorithm.To better understand the effect of the two terms in the proposed algorithm, we also visualize the performance of the algorithm under differentαin Fig.12.From the figure, it can be seen that as theαincreases, the larger the contribution of pairwise term, the more correspondences are found by the interactions between correspondences, making the recall increase.Conversely, as the interaction increases, the probability of finding wrong correspondences also increases.

Fig.12.Visualization of the TIM result with the parameter α change.From left to right, different columns indicate the output of TIM under different α.The statistical results of each image pair are reported in the top left corner.Correspondences consistent with ground-truth epipolar geometry are shown in blue, otherwise in red.

2)Running-Time and Scalability: To test the running time of our TIM and its scalability against different problem sizes,i.e., the input putative set size, we evaluate the running time of TIM and its two stages, namely, graph construction and graph-cut.We select a scenario from the CPC dataset and construct putative set using SIFT and nearest-neighbor matching,which results in 8845 pairs of correspondences.After that, we select different numbers of correspondences from the putative set randomly as the input to TIM and calculate the running time.The results are shown in Fig.13, where the area in different colors indicates the percentage of running time for different steps.As we can see, the running time of our TIM grows approximately linearly with the number of input correspondences, which indicates that our method has well scalability.In addition, since the graph we construct is highly sparse, the percentage of time spent in graph-cut is very small.

Fig.13.The run-time and scalability of TIM.We report the run-time of our TIM and its two steps, namely, graph construction and graph-cut, with respect to different problem sizes.

3)Ablation Studies: This section aims at understanding the contribution of each component we proposed in our method,which mainly includes three optional parts, i.e., graph interaction model, PCAW normalization for topology-aware relationship, and seed selection.For comparability with other methods, we run the same experiments on EVD and CPC in the same setting as described in Section IV-C.The results of the ablation study are reported in Table IV, where the average runtime is also given to investigate the efficiency.

a)Graph interaction model: To verify the effectiveness and generality of our proposed graph interaction model, we first choose three locality preserving based methods, namely LPM[8], mTopKRP [9], and LGSC [10].Then we replace the original model with our proposed model, where their metrics for topology and multi-scale neighborhood strategy remain unchanged.The modified versions are indicated by the suffix“GIM”, which means graph interaction model.It should be noted that all these methods adopt an iterative strategy with different iterations.For a fair comparison and to show the improvement of our graph interaction model directly, in ablation study, the number of iterations of all methods is set to 1.For our TIM, we use “no pairwise” to denote that the pro-posed graph interaction model is not used.We can observe that the graph interaction model always improves the matching performance with negligible deterioration in the processing time.Especially on the CPC dataset, the graph interaction model brings a huge performance boost for our TIM.

TABLE IV ABLATION STUDIES WITH EACH COMPONENT OF OUR TIM.BOLD INDICATES THE BEST RESULTS

b)Affine invariant by PCAW: PCAW normalization technique makes our proposed topology-aware relationship affine invariant.It provides limited performance gains in the CPC dataset, while delivering a huge improvement in EVD dataset where almost all image pairs undergo an extreme view change.

c)Robust transfer error: Compared with (18) that directly measures Euclidean distance as transfer error, the robust transfer error defined in (19) is more stable for scaling changes and feature points density.As can be seen from the “No robust”results in Table IV, robust transfer error brings some performance improvements.

d)Seed selection: As evident from the “No seed” results,the seed selection step is critical to the efficiency of our method since we only compute the neighborhoods of the seed nodes instead of the entire putative set to construct the graph.Usually, the seed nodes size is much smaller than that of the putative set and thus the computation time will be greatly reduced.In addition, using seed nodes yields a cleaner graph structure, as seed nodes tend to have a higher inlier ratio.After graph-cut, a cleaner graph structure will give better results, thus improving the matching performance simultaneously.A visualization of why seed points can improve efficiency and performance of TIM can be found in Fig.14.

Fig.14.Visualization of the impact of seeds on TIM.The left figures show the graph structure constructed by TIM, where the seed nodes and regular nodes are labeled with different markers.The figures on the right display the output of the algorithm.The graph constructed from seed nodes is cleaner and more efficient.

4)Limitations: This study has some potential limitations in three main aspects.First, since we construct the graph structure based on local neighborhoods, if the outlier ratio is too high, such that almost every correspondence is surrounded by outliers, then the resulting graph structure cannot represent the correct geometric relationship, leading to failure.Second, as a local-based approach, TIM is difficult to handle with widely repeating regular structures.In such cases, one or several independent groups of erroneous matches may appear,exhibiting distinct non-random patterns within a local scope.Global methods such as CLNet are able to identify these false correspondences.Third, compared to the current state-of-theart data-driven deep learning algorithms, the performance of TIM is slightly worse on some tasks, e.g., relative pose estimation.To give a more objective evaluation of the proposed TIM, we report two failure cases in Fig.15 on the YFCC dataset.

V.CONCLUSION

In this paper, we introduce TIM, a novel method to filter outliers from putative correspondences.Unlike previous work that discriminates each correspondence independently based only on locality, the proposed graph interaction model formulates the outlier filtering task as a binary labeling energy minimization problem and models the pairwise interaction between correspondences.It always boosts the matching performance of locality preserving based methods.We further design a topology-aware relationship that explicitly models topology consistency and is affine invariant.With the topology-aware relationship, the constructed graph structure is robust and geometrically meaningful.Extensive experiments on several large and diverse datasets for multiple vision tasks show that, integrating these two novel components, the proposed TIM is more competitive than current state-of-the-art methods, in terms of generality, efficiency, and effectiveness.

Fig.15.Failure cases of our TIM.Correspondences consistent with groundtruth epipolar geometry are shown in blue, otherwise in red.The main failure cases of our TIM are the extremely high outlier ratio and the wide repeated structures along the scene.