Chuyun SHEN, Wenhao LI, Qisen XU, Bin HU, Bo JIN, Haibin CAI, Fengping ZHU, Yuxin LI, Xiangfeng WANG‡
1 School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
2 Huashan Hospital, Fudan University, Shanghai 200040, China
3 Software Engineering Institute, East China Normal University, Shanghai 200062, China
Abstract: Interactive medical image segmentation based on human-in-the-loop machine learning is a novel paradigm that draws on human expert knowledge to assist medical image segmentation. However, existing methods often fall into what we call interactive misunderstanding, the essence of which is the dilemma in trading off short- and long-term interaction information. To better use the interaction information at various timescales, we propose an interactive segmentation framework, called interactive MEdical image segmentation with self-adaptive Confidence CAlibration (MECCA), which combines action-based confidence learning and multi-agent reinforcement learning. A novel confidence network is learned by predicting the alignment level of the action with short-term interaction information. A confidence-based reward-shaping mechanism is then proposed to explicitly incorporate confidence in the policy gradient calculation, thus directly correcting the model's interactive misunderstanding. MECCA also enables user-friendly interactions by reducing the interaction intensity and difficulty via label generation and interaction guidance, respectively. Numerical experiments on different segmentation tasks show that MECCA can significantly improve short- and long-term interaction information utilization efficiency with remarkably fewer labeled samples. The demo video is available at https://bit.ly/mecca-demo-video.
Key words: Medical image segmentation; Interactive segmentation; Multi-agent reinforcement learning; Confidence learning; Semi-supervised learning
Medical image segmentation is a crucial task in computer-assisted medical diagnosis. However, traditional convolutional neural networks (CNNs) (Wang et al., 2018, 2019; Liao et al., 2020) used in segmentation algorithms often struggle to meet clinical standards due to the variability of pathological conditions, the presence of dark lesion areas, and the inconsistent quality of training data. Factors such as discrepancies in imaging scanners, operators, and annotators make the task of medical image segmentation more challenging. To improve accuracy, interactive image segmentation algorithms (Lin D et al., 2016; Xu et al., 2016; Rajchl et al., 2017; Bredell et al., 2018; Wang et al., 2018, 2019; Liao et al., 2020; Ma et al., 2021) have been developed using interactive correction information, such as clicks, scribbles, or bounding boxes, to refine the results.
Fig. 1 illustrates the general interactive segmentation process, which contains two modules, i.e., an interactive module and a utilization module. In the interactive module, users or experts provide interactive correction information, such as clicks and scribbles. The utilization module then uses this interactive correction information to efficiently refine the previous result. Interactive segmentation algorithms have the potential to outperform traditional algorithms by incorporating additional interactive correction information. The interaction process can be viewed as a continuous collaboration between the model and a human expert. Therefore, an effective interactive segmentation model should be able to incorporate and respond to expert feedback to improve its performance through collaboration.
Fig. 1 Interactive segmentation process. In the interactive module, the expert observes the current (or initial) segmentation and provides correction information (the red hint); in the utilization module, a new segmentation is refined based on the correction information. References to color refer to the online version of this figure
Methods such as BIFSeg (Wang et al., 2018) and DeepIGeoS (Wang et al., 2019) model user interactions as hard constraints through conditional random fields. However, these models focus more on one-step interaction, and interaction information after the first step cannot be used efficiently (Liao et al., 2020; Ma et al., 2021). Thus, these models cannot use long-term interaction information efficiently. The follow-up InterCNN (Bredell et al., 2018) and other methods (Liao et al., 2020; Ma et al., 2021) improve DeepIGeoS and BIFSeg by modeling the problem as an iterative interaction problem, which focuses more on multi-step interaction. These methods can therefore cover the data distributions corresponding to different interaction levels in the training phase and effectively use long-term interactions, but they ignore the stochasticity or uncertainty of the model, which makes it difficult for them to use short-term interaction information effectively.
Existing interactive methods cannot effectively use short- and long-term interaction information simultaneously, which leads to interactive misunderstanding. Fig. 2 presents this phenomenon when implementing the popular interactive segmentation algorithm InterCNN (Bredell et al., 2018) on the BraTS2015 dataset. The algorithm ignores the expert's correction information (Fig. 2a), and is even adversely affected by it (Fig. 2b). These inconsistencies between hint information and the refinement results indicate that existing algorithms still face the critical challenge of inefficient utilization of interactive correction information. The reason for the segmentation refining failure is that, at the end of the interaction process, the total loss of the model will guarantee the main area's priority and ignore some small but challenging areas, such as edges. The regions that are insensitive to interactive correction information can be considered hard-to-segment regions (Shrivastava et al., 2016; Nie et al., 2019), leading to suboptimal refinement of the segmentation process. If these regions are not given additional attention during the training phase, the segmentation model may prioritize easy-to-segment areas at the expense of these more challenging regions. This problem is more severe for medical images, where the hard-to-segment regions are usually tumor boundaries, which are very important for clinical diagnosis and surgery. Therefore, it has become urgent to improve the utilization of correction information, especially for the hard-to-segment regions.
Fig. 2 Segmentation refining failure after long-term interactions on BraTS2015 with InterCNN: (a) the model cannot fully understand the hint information or simply ignores it; (b) the model may misunderstand the correction, which results in a worse result
Another significant challenge in the development of interactive segmentation algorithms is the need for a large number of labeled images. Accurate annotation of medical images can be time-consuming and costly, making it difficult to obtain a sufficient quantity of high-quality data. On the other hand, it is much more cost-effective to obtain a larger number of unlabeled images, but existing interactive segmentation methods do not use these low-cost resources. As a result, it is important to explore how to use the benefits of unlabeled data to reduce the dependence on expert-annotated images.
In this paper, we propose a novel interactive segmentation algorithm for three-dimensional (3D) medical images, called interactive MEdical image segmentation with self-adaptive Confidence CAlibration (MECCA). MECCA combines an action-based confidence learning network with the multi-agent reinforcement learning (MARL) framework (Zhang KQ et al., 2021). The action-based confidence network can evaluate the corrective action quality by directly calculating the confidence. Unlike learning the confidence of the overall segmentation result, the confidence of actions can better evaluate whether the segmentation model has correctly used the experts' interaction information. We formulate the iterative interaction process as a Markov decision process (MDP) to model the dynamic process and further introduce the MARL technique (Lee and Song, 2018). Further, instead of setting each image or each patient (a series of images) as the agent, we consider each voxel (the smallest unit in the 3D space) as an agent, and each agent aims to learn the segmentation policy and make its own decision (Furuta et al., 2020). Specifically, after receiving the interactive correction information, each agent will modify its label by changing (increasing or decreasing) the category probability. The novel action-based confidence network will directly evaluate each agent's corrective action obtained from the MARL corrective policy. Using this action-based confidence network, two additional techniques are further proposed to improve the utilization efficiency of the interactive correction information: (1) compared with former manual rewards, a self-adaptive reward function of each action is constructed, which provides more meticulous feedback under a flexible framework; (2) a simulated label generation mechanism is established using the interactive correction information as the weakly supervisory signal. Combined with the confidence network, the simulated label generation mechanism can approximately generate labels for unsupervised images and reduce over-reliance on labeled data.
The main contributions of this work are summarized as follows:
1. A novel framework is proposed for interactive medical image segmentation by combining the action-based confidence learning network with MARL.
2. A self-adaptive feedback mechanism (self-adaptive reward) is constructed with the action-based confidence network to alleviate interactive misunderstanding during the interaction process.
3. Simulated supervisory signals can be generated based on the confidence learning network and actions; hence, much fewer labeled data (ground truth data) are needed to achieve the same performance.
Xu et al. (2016) segmented images interactively based on CNNs. DeepCut (Rajchl et al., 2017) and ScribbleSup (Lin D et al., 2016) both employ weak supervision to establish interactive image segmentation methods. DeepIGeoS (Wang et al., 2019) employs the geodesic distance metric to construct a hint map. The interactive segmentation process can be considered a sequential process, which is natural to model by reinforcement learning (RL). Polygon-RNN (Castrejón et al., 2017) fundamentally segments each target as a polygon and iteratively chooses the polygon vertexes through a recurrent neural network (RNN). Polygon-RNN++ (Acuna et al., 2018) employs almost the same idea as Polygon-RNN but learns to choose vertexes by RL. SeedNet (Lee and Song, 2018) trains an expert interaction generation RL model that obtains new simulated interaction information at each interaction step. IteR-MRL (Liao et al., 2020) and BS-IRIS (Ma et al., 2021) both model the dynamic interaction process as an MDP and employ MARL models to segment images. Some researchers have aimed to reduce the annotation cost of interactive image segmentation. IFSL (Feng et al., 2021) introduces interactive learning into the few-shot learning strategy and addresses the annotation burden of medical image segmentation models. IOG (Zhang SY et al., 2020) uses a practical inside-outside guidance approach to minimize the labeling cost. These interactive methods can hardly use experts' short- and long-term interaction information simultaneously, thus producing erroneous correction operations.
Uncertainty estimation is helpful in the context of deployed machine learning systems because it can detect when a neural network is likely to make an incorrect prediction or when the input may be out of distribution. Recent methods generally adopt a neural network to learn what the uncertainty should be for any given input, i.e., learning-based uncertainty estimation or confidence learning, as demonstrated in Kendall and Gal (2017), Moeskops et al. (2017), DeVries and Taylor (2018b), Hung et al. (2018), Robinson et al. (2018), Jungo and Reyes (2019), and Nie et al. (2019). These methods commonly consist of a segmentation network and a confidence network, and are more computationally efficient than other techniques. Thus, they are better suited when computational resources are limited or when real-time inference is required, such as the interactive segmentation scenario considered in this paper. Specifically, Kendall and Gal (2017) introduced a confidence network to predict the aleatoric and epistemic uncertainties by imitating classic Bayesian tools. The segmentation network of DeVries and Taylor (2018a, 2018b) produces two separate outputs: prediction probabilities and a confidence estimate. The confidence estimate is motivated by interpolation between the predicted probability distribution and the target distribution during training, where the degree of interpolation is proportional to the confidence estimate. A series of works (Moeskops et al., 2017; Hung et al., 2018; Nie et al., 2019) focus on incorporating uncertainty estimation in the adversarial learning process, where the segmentation network corresponds to the generator, and the confidence network is accordingly the discriminator. Moeskops et al. (2017) first employed generative adversarial networks (GANs) to improve the CNN-based brain magnetic resonance imaging (MRI) segmentation method. The semi-supervised learning technique was used in Hung et al. (2018) to predict trustworthy regions in unlabeled images. Nie et al. (2019) proposed a difficulty-aware attention mechanism to handle difficult samples or challenging regions. Different from the previous work on learning uncertainty through imitation, joint training, or adversarial learning, a simple but powerful alternative is to introduce an auxiliary task, such as predicting the overlap between a proposed segmentation and its ground truth (Robinson et al., 2018) or predicting the voxel-wise false positives and false negatives (Jungo and Reyes, 2019).
In our proposed algorithm, the confidence network should evaluate the confidence of calibrating actions instead of the segmentation results, which is the most significant difference between MECCA and previous uncertainty estimation methods. Therefore, we design a novel action-oriented auxiliary task to predict whether the direction of each voxel-wise action is consistent with the ground truth.
This section introduces the proposed interactive segmentation algorithm, MECCA, which can iteratively evaluate the refinement actions and feed the evaluation back to the segmentation model. The algorithm framework follows the MARL structure (Sections 3.1 and 3.2), and an action-based confidence learning module (Section 3.3) is introduced to evaluate the confidence of corrective actions. This module is used to establish the self-adaptive reward scheme and the simulated label generation mechanism to use the interactive correction information efficiently. The architecture overview of MECCA is depicted in Fig. 3a. The model's state information includes the original 3D image, the previous segmentation probability, and the hint map generated from the interaction and confidence maps. Based on the current state information, the segmentation module gives suggested actions to refine previous segmentation results by adjusting the segmentation probability of each voxel (agent). Further, the state information is used to evaluate the confidence of the obtained actions through the confidence network, with a confidence map as the output. The self-adaptive reward is designed through a self-adaptive weighting scheme based on the action confidence evaluation (Section 3.4). The self-adaptive reward map can be considered a value map with the same size as the original input image (each agent has its own self-adaptive reward), and it reflects the performance of the corresponding agent's action. In addition, MECCA suggests some low-confidence regions for the user to segment during the interaction (Section 3.5). The confidence map can also be used to generate the simulated label by comparison with actions, which is described in Section 3.6. The newly obtained hint map, the adjusted segmentation probability result, and the original 3D image form a new state. The process described above is repeated until the segmentation result meets the requirements. To emphasize, during the testing stage in Fig. 3b, there is no need to calculate the self-adaptive rewards; only the obtained actions and suggested interaction areas are needed.
Fig. 3 The architecture (a) and testing stage (b) of MECCA. In (a), the segmentation module outputs actions to change the segmentation probability of each voxel (agent) at each interaction step. Meanwhile, the confidence network estimates the confidence of the actions, which is used to generate the self-adaptive reward and the simulated label. The confidence map can provide the advice regions of the next interaction step to experts. In (b), all voxels in the segmentation probability map are initialized at a fixed value (set to 0.5 in the experiment). Users or experts randomly mark hint points according to the initialized segmentation probability map
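To make the state composition above concrete, the following is a minimal PyTorch sketch of assembling the state tensor; the channel ordering and the helper name `build_state` are our assumptions rather than the paper's exact implementation.

```python
import torch

def build_state(image, prev_prob, hint_map):
    """Stack the original volume, the previous segmentation probability,
    and the hint map into a single (C, D, H, W) state tensor."""
    return torch.stack([image, prev_prob, hint_map], dim=0)

image = torch.randn(55, 55, 30)          # normalized input volume (size from Section 5)
prob = torch.full_like(image, 0.5)       # probabilities initialized to 0.5 (Fig. 3b)
hints = torch.zeros_like(image)          # empty hint map before any interaction
state = build_state(image, prob, hints)  # shape: (3, 55, 55, 30)
batch = state.unsqueeze(0)               # add a batch dimension for Conv3d-based networks
```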
We employ the MARL structure to formulate the interactive segmentation process and continuously give error-corrective actions at each interaction step. Let $x = (x_1, x_2, \cdots, x_N)$ denote the input image, where $x_i$ $(i = 1, 2, \cdots, N)$ denotes the $i$th voxel of the image. In the MARL setting, every voxel $x_i$ is treated as an agent with its own refinement policy $\pi_i$. At time step $t$, agent $x_i$ obtains action $a_i^{(t)}$ from the segmentation network according to its current state $s_i^{(t)}$. After taking the action, the agent will receive a reward $r_i^{(t)}$ according to the segmentation result.
The segmentation network adopts P-Net (Wang et al., 2019) as the backbone. It has two heads: a policy head and a value head. The two heads share the first three 3D convolutional blocks, which extract low-level features. Each of the blocks has two convolution layers, and the size of the convolution kernel is fixed at 3×3 in all these convolution layers. Dilated convolution is employed in all the convolution kernels, which reduces the loss of resolution. Both heads have another two 3D convolutional blocks to extract specific high-level features. Fig. 4 shows the detailed architecture. Specifically, the policy head outputs the policy, which is the distribution of action probabilities under the current state. By taking actions at different scales, the probabilities are dynamically adjusted. The value head estimates the value of the current state, reflecting how good the current state is, i.e., the expected return:
Fig. 4 Architecture of the segmentation network. The confidence network is the same as the value branch of the segmentation network, but the parameters are not shared
$$R^{(t)} = \sum_{k=t}^{T} \gamma^{k-t} r^{(k)},$$
where $T$ is the terminal time step in the interaction process, and $\gamma$ is the discount factor. $\theta_v$ denotes the parameters of the value head, and the gradient with respect to $\theta_v$ is
$$\nabla_{\theta_v} \big(R^{(t)} - V_{\theta_v}(s^{(t)})\big)^2.$$
The policy head's goal is to maximize the expected return by selecting proper actions in state $s^{(t)}$. We use $\theta_p$ to denote the parameters of the policy head, and the gradient for $\theta_p$ is
$$\nabla_{\theta_p} \log \pi_{\theta_p}\big(a^{(t)} \mid s^{(t)}\big) \big(R^{(t)} - V_{\theta_v}(s^{(t)})\big).$$
Usually, the policy head is updated more slowly.
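A minimal PyTorch sketch of the two-head design and the updates above follows; the channel widths, the discrete action set, and the block details are our assumptions rather than the exact P-Net configuration.

```python
import torch
import torch.nn as nn

class SegActorCritic(nn.Module):
    """Shared dilated trunk with a policy head (voxel-wise action distribution)
    and a value head (scalar state value per voxel)."""
    def __init__(self, in_ch=3, n_actions=5):
        super().__init__()
        self.trunk = nn.Sequential(          # shared dilated blocks (low-level features)
            nn.Conv3d(in_ch, 16, 3, padding=1, dilation=1), nn.ReLU(),
            nn.Conv3d(16, 16, 3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv3d(16, 16, 3, padding=4, dilation=4), nn.ReLU())
        self.policy_head = nn.Sequential(    # policy-specific blocks
            nn.Conv3d(16, 16, 3, padding=1), nn.ReLU(),
            nn.Conv3d(16, n_actions, 3, padding=1))
        self.value_head = nn.Sequential(     # value-specific blocks
            nn.Conv3d(16, 16, 3, padding=1), nn.ReLU(),
            nn.Conv3d(16, 1, 3, padding=1))

    def forward(self, state):
        h = self.trunk(state)
        return self.policy_head(h).softmax(dim=1), self.value_head(h)

def actor_critic_losses(log_pi_a, value, ret):
    """Losses matching the gradients above: the advantage R^(t) - V(s^(t))
    weights the log-probability of the taken action; the value head
    regresses the discounted return."""
    advantage = (ret - value).detach()
    policy_loss = -(log_pi_a * advantage).mean()
    value_loss = (ret - value).pow(2).mean()
    return policy_loss, value_loss
```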
As we mentioned in Section 1, there will be some situations where the segmentation model misunderstands or ignores the hint information. To some extent, these samples (or regions) exhibiting the interactive misunderstanding phenomenon are hard samples (or regions). Although they may account for a small portion of the dataset, they are critical for improving generalization and robustness. The most important thing is finding a professional easy-or-hard representer (Nie et al., 2019) to identify them during the interaction. Focal loss (Lin TY et al., 2017) evaluates the easy-or-hard samples through a predicted probability. Nie et al. (2019) applied adversarial learning to train the easy-or-hard representer. Each representer has its advantages, but all evaluate easy-or-hard samples based on the final segmentation result. As such, these methods cannot be directly applied to interactive segmentation. For example, if the model predicts the category probability of a voxel to be 0.8 and then takes an action $a_i^{(t)} = -0.1$, the next prediction will be 0.7 after the interaction. If $y_i = 1$, both results are correctly predicted because 0.8 and 0.7 are both larger than 0.5, whereas for interactive segmentation, the probability is changing in the wrong direction. This change is what we call the interactive misunderstanding phenomenon, and the formal definition is as follows:
Definition 1 (Interactive misunderstanding) For a binary classification problem, the sign of the foreground label $y = 1$ is denoted as positive, and the sign of the background label $y = 0$ is denoted as negative. In an interactive medical image segmentation task (i.e., a voxel-wise binary classification problem), for any voxel $i$, if the sign of the change of the segmentation probability outputted by the algorithm over two consecutive interaction steps, $\operatorname{sign}(\Delta p_i)$, is not equal to $\operatorname{sign}(y_i)$, then this phenomenon is defined as interactive misunderstanding.
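Definition 1 can be checked mechanically; here is a minimal NumPy sketch (the function name is ours). For the example above, `misunderstanding_mask(0.8, 0.7, 1)` flags the voxel even though both predictions exceed 0.5.

```python
import numpy as np

def misunderstanding_mask(p_prev, p_next, y):
    """True where the probability change over two consecutive interaction
    steps moves against the label sign (Definition 1)."""
    delta = p_next - p_prev
    label_sign = np.where(y == 1, 1.0, -1.0)  # foreground positive, background negative
    return delta * label_sign < 0
```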
The confidence network learns the confidence of the given actions to avoid misunderstanding of hint information and to take accurate actions. We argue that the confidence information can be used to regularize action choices and to suggest more efficient interactions. The confidence network also uses P-Net as the backbone and contains six 3D convolutional blocks. Each of the blocks has two convolution layers, and the size of the convolution kernel is fixed at 3×3 in all these convolution layers. The detailed architecture is also shown in Fig. 4.
The confidence network is trained using the previous state and action as input and a confidence map as output. It is optimized by minimizing the summation of the binary cross-entropy loss over actions at each time step $t$, as shown in Eq. (6). Here we use $C$ to denote the confidence network, $w_C$ the confidence network parameters, and $L_{\mathrm{BCE}}$ the binary cross-entropy loss:
$$\min_{w_C} \sum_{i=1}^{N} L_{\mathrm{BCE}}\Big(C\big(s^{(t)}, a^{(t)}; w_C\big)_i,\, g_i^{(t)}\Big), \tag{6}$$
where $g^{(t)}$ indicates whether the direction of the action is consistent with the label. For $a \oplus b$, the statement is true only if either $a > 0$ or $b > 0$.
One potential issue when training the confidence network is the imbalance of samples. Early in RL training, the error rate of actions outputted by the segmentation network is high, whereas most actions are correct when the network gradually converges. Inspired by discriminator training in GANs, this study introduces symmetric samples into Eq. (6) to speed up training. An obvious advantage is that it improves sample utilization efficiency, because the confidence network can know what a "bad sample" is and obtain a corresponding "good sample."
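A sketch of the confidence targets and the symmetric-sample trick follows, under our assumed action encoding (a positive action pushes the probability toward the foreground); the interface of `conf_net` is also an assumption.

```python
import torch
import torch.nn.functional as F

def confidence_targets(action, y):
    """g^(t): 1 where the action direction agrees with the label, else 0."""
    label_sign = 2.0 * y.float() - 1.0        # maps {0, 1} labels to {-1, +1}
    return (action * label_sign > 0).float()

def confidence_loss(conf_net, state, action, y):
    """BCE over the voxel-wise confidence map (Eq. (6)), plus the symmetric
    sample: the sign-flipped action paired with the flipped target, so every
    'good sample' yields a matching 'bad sample'."""
    g = confidence_targets(action, y)
    loss = F.binary_cross_entropy(conf_net(state, action), g)
    loss_sym = F.binary_cross_entropy(conf_net(state, -action), 1.0 - g)
    return loss + loss_sym
```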
Essentially, no matter whether focal loss or adversarial learning is applied to train the segmentation model, the objective is to weight these hard samples to prevent the model from being dominated by easy samples. However, the interactive segmentation task differs from fully automatic segmentation tasks because the interactive segmentation model needs to cooperate with the user and understand the user's hint information. The action to refine the segmentation result shows how the segmentation model understands the hint information. Therefore, it is necessary to ensure that the hint information is correctly understood and that the correct action is taken.
Specifically, the previously described action-confidence learning can provide the segmentation model with a confidence map to alleviate interactive misunderstanding. Using this confidence map, easy-or-hard samples can be better recognized, because the confidence values for these "hard regions" are lower than those for other regions. We formulate this action-aware learning as the self-adaptive reward function $r^{(t)}$, shown in Eq. (8), which weights each agent's primary reward by its action confidence to adapt this mechanism to MARL training.
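Eq. (8) is referenced above but its exact form is not reproduced here; the following sketch only illustrates the idea of confidence-based reward weighting. The linear weighting form and the roles of α and β (listed among the hyperparameters in Section 5) are our assumptions.

```python
import torch

def self_adaptive_reward(base_reward, confidence, alpha=0.8, beta=1.0):
    """Weight the primary per-voxel reward by action confidence:
    low-confidence (likely misunderstood) voxels receive amplified feedback.
    The linear form and the use of alpha/beta here are assumptions."""
    return (alpha + beta * (1.0 - confidence)) * base_reward
```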
Another challenge for interactive image segmentation is that users usually need to decide where to interact in many slices, which is very time-consuming, especially for 3D images. Our framework provides users with an interaction guide mechanism to save user interaction time. After refinement, our framework suggests some possible areas for users to segment during the interaction. Specifically, it filters out areas with low action confidence and provides them to users (Fig. 5), as sketched below. First, the original 3D image is segmented into super voxels; each super voxel can be regarded as a group of voxels that share common characteristics. We use the simple linear iterative clustering (SLIC) (Achanta et al., 2012) technique with spacing = [2, 2, 2] and compactness = 0.1 to generate super voxels; the number of super voxels equals 100 at the beginning and gradually declines during the refinement iterations for training and testing. Second, the proposed algorithm computes the mean action confidence in each super voxel and ranks the super voxels. Finally, the five super voxels with the lowest confidence are marked and recommended to users. Users need to select the best interaction positions from these super voxels.
Fig. 5 Interaction guide mechanism: (a) probability map; (b) action map; (c) confidence map; (d) advice region; (e) interaction point. The areas surrounded by the green lines are advice regions provided to users, and the red point in (e) is the real hint information selected from the advice regions. The brighter the color (closer to yellow), the larger the positive value; conversely, the darker the color (closer to black), the smaller the negative value. References to color refer to the online version of this figure
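A sketch of the guide mechanism using scikit-image's SLIC; the function name and the top-k selection details are ours.

```python
import numpy as np
from skimage.segmentation import slic

def advise_regions(volume, action_confidence, n_segments=100, top_k=5):
    """Return the IDs of the super voxels with the lowest mean action
    confidence, as candidate regions for the next interaction."""
    labels = slic(volume, n_segments=n_segments, compactness=0.1,
                  spacing=(2, 2, 2), channel_axis=None)  # 3D gray-scale SLIC
    ids = np.unique(labels)
    mean_conf = np.array([action_confidence[labels == i].mean() for i in ids])
    return ids[np.argsort(mean_conf)][:top_k]  # lowest-confidence regions first
```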
In medical imaging, there are many more unlabeled data than labeled data due to the difficulty of labeling medical images. To address this lack of annotations, the proposed algorithm leverages the action confidence not only to improve the utilization efficiency of hint information but also to generate a simulated label (Fig. 6) for unlabeled data. We define the simulated label as the direction of the action calibrated by the confidence map (Fig. 6), and train on unlabeled data with a masked, advantage-weighted policy gradient,
Fig. 6 Illustration of simulated label generation. The simulated label generation mechanism uses the confidence map and the action map to generate the simulated label. The confidence map is used to calibrate the action, and the direction of the calibrated action is the simulated label of each voxel
where $M = \mathbb{I}\big(\max(c^{(t)}, 1 - c^{(t)}) > \delta\big)$, and $A^{(t)}$ is the advantage (defined in Mnih et al. (2016)) at time step $t$ of taking $a^{(t)}$ in the condition of state $s^{(t)}$, indicating the actual accumulated reward without being affected by the state and reducing the gradient variance. The mask $M$ is used to constrain the training on unlabeled data: backward gradients of unlabeled data occur only when the action confidence exceeds the threshold $\delta$ (which is gradually increased during the training process). Unlike traditional pseudo-label training, the supervised signal does not come from the segmentation network but from the confidence network. These filtered data with hint information are more valuable and provide more accurate supervised signals. Generally, the training processes with labeled data and simulated labeled data are carried out simultaneously. The pseudocode for the simulated label generation mechanism is shown in Algorithm 1.
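Following Fig. 6 (the confidence map calibrates the action, and the calibrated direction becomes the voxel's simulated label), here is a minimal sketch with our assumed encodings for the action sign and confidence scale.

```python
import torch

def simulated_labels(action, confidence, delta):
    """Generate voxel-wise pseudo-labels from calibrated action directions.
    Only voxels where the confidence network is decisive (max(c, 1-c) > delta)
    contribute gradients, per the mask M in the text."""
    mask = torch.maximum(confidence, 1.0 - confidence) > delta
    calibrated = torch.where(confidence >= 0.5, action, -action)  # flip distrusted actions
    label = (calibrated > 0).float()  # positive calibrated direction -> foreground
    return label, mask
```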
To comprehensively evaluate our proposed method, we apply our algorithm to four 3D medical image datasets. All datasets are divided into two parts: $D_{\mathrm{train}}/D_{\mathrm{test}}$. The details of these datasets are as follows: (1) BraTS2015: Brain Tumor Segmentation Challenge 2015 (Menze et al., 2015) contains 274 (234/40) multi-parametric MRI images (Flair, T1, T1C, T2) from brain tumor patients. In our task, we use only the Flair images and segment the whole brain tumor. (2) BraTS2020: Brain Tumor Segmentation Challenge 2020 (Menze et al., 2015) contains 285 (235/50) multi-parametric MRI images (Flair, T1, T1C, T2) from brain tumor patients. In our task, we use only the Flair images and segment the whole brain tumor. (3) MM-WHS: Multi-Modality Whole Heart Segmentation (Zhuang and Shen, 2016) contains 24 (20/4) multi-modality whole-heart images covering the whole heart substructures. In our task, we choose to segment the left atrium blood cavity. (4) Medical Segmentation Decathlon: This is a generalizable 3D semantic segmentation dataset containing different organ segmentation tasks (Simpson et al., 2019). We use the spleen and liver datasets, which provide 61 (41/20) and 106 (96/10) CT images, respectively.
We implement our method with PyTorch (Paszke et al., 2019). The demo video of MECCA is available at https://bit.ly/mecca-demo-video. The segmentation and confidence networks are both initialized using the Xavier method (Glorot and Bengio, 2010), and the learning rate is initialized to 1e-4. Other parameters are set as follows: T = 5, γ = 0.95, α = 0.8, β = 1. The threshold δ for the mask M ranges from 0.85 to 0.99 and increases by 0.00025 at every epoch. Adam (Kingma and Ba, 2015) is adopted as the optimizer. The original image is cropped by the bounding box based on the ground truth with a random extension in the range of 1-11 voxels. Each image is then resized to 55×55×30 and normalized. The data are augmented by random flips and rotations. The proposed algorithm's training time with one Nvidia 2080 Ti GPU varies from 5 to 13 h for different datasets.
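A sketch of the preprocessing pipeline described above (crop with a random 1-11 voxel extension, resize to 55×55×30, normalize); the axis ordering, the interpolation order, and the z-score normalization are our assumptions.

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess(volume, bbox, target=(55, 55, 30), rng=np.random):
    """Crop by the ground-truth bounding box with a random margin,
    then resize and normalize the volume."""
    (x0, y0, z0), (x1, y1, z1) = bbox
    m = rng.randint(1, 12, size=6)                 # random 1-11 voxel extension per side
    crop = volume[max(x0 - m[0], 0):x1 + m[1],
                  max(y0 - m[2], 0):y1 + m[3],
                  max(z0 - m[4], 0):z1 + m[5]]
    factors = [t / s for t, s in zip(target, crop.shape)]
    crop = zoom(crop, factors, order=1)            # linear interpolation resize
    return (crop - crop.mean()) / (crop.std() + 1e-8)  # z-score normalization
```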
During the interaction process, we adopt edge points as the hint information for two reasons. On one hand, the click operation saves more time than scribbles or other interactive methods. On the other hand, edge points can provide more information about object edges, because the bottleneck of medical image segmentation is usually the inaccurate segmentation of object edges. Specifically, we provide each method with 45 points during the whole interaction process. At every step, users click some edge points that are not correctly predicted. We generate a 3D Gaussian (with a kernel size of eight voxels) centered on each of the edge points as the hint map to let the networks receive hint information. The hint map is then inputted to the segmentation network as part of the state. We use the Dice score and the average symmetric surface distance (ASSD) to evaluate the segmentation results. According to these evaluation metrics, doctors can judge the patient's condition:
$$\mathrm{Dice} = \frac{2|S_p \cap S_g|}{|S_p| + |S_g|},$$
where $S_p$ and $S_g$ denote the prediction of an algorithm and the ground truth, respectively;
$$\mathrm{ASSD} = \frac{1}{|S_a| + |S_b|}\Big(\sum_{i \in S_a} d(i, S_b) + \sum_{i \in S_b} d(i, S_a)\Big),$$
where $S_a$ and $S_b$ represent the sets of surface points of the segmentation results predicted by the algorithm and the ground truth, respectively, and $d(i, S_b)$ is the shortest Euclidean distance between $i$ and $S_b$. The Dice and ASSD values in all tables are the averages of five algorithm test runs.
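Both metrics can be computed directly from binary volumes; a sketch follows (assuming boolean arrays, with voxel spacing passed in where anisotropic).

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def dice_score(pred, gt):
    """Dice overlap between binary segmentations."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum())

def assd(pred, gt, spacing=(1.0, 1.0, 1.0)):
    """Average symmetric surface distance between binary segmentations."""
    surf_p = pred ^ binary_erosion(pred)  # surface voxels of the prediction
    surf_g = gt ^ binary_erosion(gt)      # surface voxels of the ground truth
    dist_to_g = distance_transform_edt(~surf_g, sampling=spacing)
    dist_to_p = distance_transform_edt(~surf_p, sampling=spacing)
    return (dist_to_g[surf_p].sum() + dist_to_p[surf_g].sum()) / (
        surf_p.sum() + surf_g.sum())
```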
We compare MECCA with four state-of-the-art interactive segmentation methods: DeepIGeoS (Wang et al., 2019), InterCNN (Bredell et al., 2018), IteR-MRL (Liao et al., 2020), and BS-IRIS (Ma et al., 2021). InterCNN is the multi-step version of DeepIGeoS. We also introduce a state-of-the-art medical image segmentation method, U-Net (Ronneberger et al., 2015), as a baseline. Table 1 shows the quantitative comparison of the seven segmentation methods on different datasets. For a fair comparison, all CNN-based methods except U-Net adopt the same network structure (P-Net), which was proposed in Wang et al. (2019). In particular, U-Net and P-Net are methods without hint information. We can see that MECCA performs better than or on par with the state-of-the-art methods on three datasets. We visualize the results in Fig. 7, which shows that our method achieves more accurate results in edge segmentation.
Table 1 Quantitative comparison of different methods
Fig. 7 Visualization of segmentation results by different methods. The green lines represent the boundaries of the ground truth, and the yellow lines represent the predicted boundaries. References to color refer to the online version of this figure
To demonstrate that MECCA can take advantage of hint information more efficiently, we compare the relative improvement of different methods at each interaction step. Due to the different interaction processes of different methods, we set up a unified interaction process: all these methods have five interaction steps and receive 25 identical points at the first step to generate the initial segmentation result. After that, users click five points at each of the following four interaction steps (a total of 25 + 5×4 = 45 points), and these methods iteratively refine the previous segmentation results. The DeepIGeoS method does not model the interaction sequence, and simply combines current and previous hint information to refine the previous segmentation. The results are shown in Table 2 and Fig. 8.
Table 2 Dice scores at each interaction step by different methods
Fig. 8 Performance improvement of different interactive medical segmentation methods at different interaction steps. All these testing results were obtained on the BraTS2015 dataset
The results show that MECCA performs better under the same amount of hint information and improves the Dice score more after the five steps. Compared with CNN-based methods, the main advantage of RL-based methods is that they can always maintain notable improvement. There are two reasons for this result. The first is that RL-based methods model the whole interaction process to avoid interaction conflict. The second is the relative entropy-based reward, which encourages the model to keep refining results. However, we should realize that the RL-based methods still cannot guarantee the high confidence of the corrective actions. As we can see in Fig. 8, the performance of IteR-MRL is not as good as those of the other methods at the beginning, which is caused by numerous incorrect actions. However, after learning the confidence of actions and applying the self-adaptive reward to update the segmentation model, our proposed MECCA can perform well at each step and continue significantly refining the result.
One main contribution of our work is the self-adaptive reward mechanism built on the confidence map. This mechanism makes different actions receive different levels of feedback, so that the segmentation network can identify wrong actions as much as possible. To measure the impact of different means of reward weighting on the segmentation result, we compare MECCA with the original IteR-MRL without a weighting reward, Focal-Reward IteR-MRL (Focal-IteR-MRL), and the one with a weighted reward in segmentation error regions (Err-IteR-MRL). Focal-IteR-MRL adopts the idea of focal loss (Lin TY et al., 2017) and weights the primary reward via Eq. (12) by a focal factor that is larger for voxels with a low predicted probability of the true class. The reward function of Err-IteR-MRL is a linear scaling of the IteR-MRL reward, $\lambda_i \hat{r}_i^{(t)}$, applied in segmentation error regions, where $\lambda_i$ is a positive real-number hyperparameter. The performance of the different weighting rewards is shown in Fig. 9. It can be seen that the action-based reward weighting method is more suitable for interactive segmentation. This study also visualizes the different masks that weight the primary reward in Fig. 10; illustrative forms of these weightings are sketched below. As we can see, the focal mask pays more attention to regions with low prediction probabilities, and the error region mask contains all regions segmented incorrectly in the previous result. Considering only the probability of prediction is not enough for the focal mask to obtain more structured information. As such, the performance of Focal-IteR-MRL is erratic, and is even the poorest on the left atrium dataset. The performance of Err-IteR-MRL is more stable, but the improvement is not notable. This is because the weighting process of Err-IteR-MRL is based on the segmentation result, while the refinement process of interactive image segmentation is based on actions.
Fig. 9 Average Dice scores of methods with different weighting rewards. The reward function of IteR-MRL is not weighted; the reward function of Focal-IteR-MRL is weighted via Eq. (12); the reward function of Err-IteR-MRL is weighted in segmentation error regions; and the reward function of MECCA is weighted through action confidence
Fig. 10 Visualization of the original image (a), ground truth (b), action map (c), self-adaptive mask (d), error region mask (e), and focal mask (f). All masks are obtained at the first interaction step. The brighter the color (closer to yellow), the larger the positive value; conversely, the darker the color (closer to black), the smaller the negative value. References to color refer to the online version of this figure
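For reference, illustrative forms of the two baseline weightings follow (the focal exponent and the λ value are assumed, not taken from the cited papers); the confidence-based weighting used by MECCA is sketched in the Section 3.4 snippet above.

```python
import numpy as np

def focal_mask(prob, y, exponent=2.0):
    """Focal-style weight: large where the predicted probability of the
    true class is low (the exponent value is an assumption)."""
    p_true = np.where(y == 1, prob, 1.0 - prob)
    return (1.0 - p_true) ** exponent

def error_region_mask(prob, y, lam=2.0):
    """Err-IteR-MRL-style weight: up-weight voxels mis-segmented in the
    previous result by a positive factor lambda (value assumed)."""
    wrong = (prob > 0.5) != (y == 1)
    return np.where(wrong, lam, 1.0)
```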
As mentioned in Section 3.6, MECCA can reduce the dependence on voxel-level annotations of images using the simulated label generated from the action confidence map. This study validates MECCA on the BraTS2015 dataset by randomly selecting different proportions of samples as fully labeled data and using the rest of the training images as unlabeled data, which provide only hint information during the interaction. This study compares MECCA against P-Net (without hint), DeepIGeoS, and IteR-MRL (Liao et al., 2020). However, only MECCA uses unlabeled data, and the three baselines use only a fixed proportion of labeled data, which does not fully demonstrate the performance of MECCA on semi-supervised problems. For this reason, we introduce a state-of-the-art semi-supervised method, UA-MT (Yu et al., 2019). Table 3 shows the results of the different methods. MECCA achieves a Dice score of 87.14% with only 12.5% labeled data, and it performs better than the other methods with 25% labeled data.
Table 3 Quantitative comparison between MECCA and other methods on the BraTS2015 dataset with different portions of labeled data
Because UA-MT is not an interactive segmentation algorithm, its absolute performance is worse than those of the interactive segmentation baselines but better than that of the non-interactive method P-Net. In addition, it can be seen from the last column of Table 3 that UA-MT has the smallest performance loss across different proportions of labeled data. After MECCA introduces the simulated label generation mechanism, its performance loss is similar to that of UA-MT. Interestingly, DeepIGeoS can maintain a low performance loss without using a semi-supervised learning mechanism. Considering the performance loss of IteR-MRL, we can conclude that in the semi-supervised interactive segmentation task, the interactive misunderstanding phenomenon exacerbates the performance loss caused by missing labels. DeepIGeoS and MECCA alleviate the interactive misunderstanding phenomenon through hard constraints and self-adaptive confidence calibration, respectively, so both of them can achieve better results in semi-supervised settings. However, note that DeepIGeoS does not consider multi-step interactions or the relationship between consecutive interactions. It cannot fully use long-term interaction information, so there is a large gap between it and MECCA in absolute performance.
We also evaluate MECCA with different numbers of interactions and percentages of labeled data. Based on the experimental results (Table 4), we conclude the following: (1) if there are enough labeled data, MECCA can achieve good results after a few interactions; (2) the number of interactions required by MECCA to achieve the same performance is roughly inversely proportional to the portion of labeled data; (3) although MECCA requires more interactions (about 2-3 times) when only part of the labeled data is available, it can eventually approach the performance of the algorithm trained with all data labeled.
Table 4 Dice scores with different numbers of interactions and percentages of labeled data
Table 5 shows the impact of different mechanisms on MECCA. It is clear that the simulated label generation mechanism and the self-adaptive reward mechanism significantly improve the algorithm's performance. The simulated label generation mechanism achieves a 1.70% and 2.26% gain over the initial algorithm without any mechanism on the BraTS2015 dataset and the Liver dataset in Medical Segmentation Decathlon, respectively. The main reason for this significant improvement is the utilization of the unlabeled data.
Table 5 Ablation study of our proposed algorithm on BraTS2015 and Liver datasets with 25% labeled data
Table 6 shows the comparison of interactive segmentation methods in computational and interaction time. The results show that MECCA takes only about half the time of the other methods when performing an interaction. The main contributor to the reduction of interaction time is the interaction guide mechanism mentioned in Section 3.5. Furthermore, to explore the efficiency of MECCA in authentic tasks, we designed an interactive software platform and asked oncologists to use it for natural interactive segmentation; the results are shown in Table 7. It can be seen from the results that the MECCA algorithm also takes only about half the time of the baselines due to interaction guidance. In addition, as the segmentation results become more and more accurate, the required interaction time gradually decreases.
Table 6 Comparison of computational and interaction time among different interactive segmentation methods
Table 7 Quantitative comparison of interaction time among different interactive segmentation methods with different numbers of interactions on a realistic segmentation platform operated by experts
We present a novel action-based confidence learning method for interactive 3D image segmentation. Specifically, we propose a method for learning the confidence of the actions that continuously refine the segmentation result during the interaction process, so that hint information can be used more effectively. Based on this, a self-adaptive reward is proposed for the segmentation module, which can prevent the misunderstanding phenomenon during the interaction and help reduce the time cost of interaction by providing users with the advice regions with which they should interact next. In addition, the confidence map can replace the ground truth to generate feedback for unlabeled samples for the segmentation module. These samples, without voxel-level annotations, can thus also be used for model training. By integrating these components, MECCA is demonstrated to improve medical segmentation efficiently using fewer annotated samples.
Although our method can greatly enhance the utilization of short- and long-term interaction information with significantly fewer labeled samples, there are certain limitations that we need to address. One of the primary challenges in medical image segmentation is the high dimensionality and resolution of medical images, which leads to a requirement for numerous interactive clicks. Specifically, Wang et al. (2018) provided the actual use time collected from real physicians. Thus, there is a need for more efficient interactive methods, and Aljabri et al. (2022) introduced the currently available annotation tools for medical imaging. In future work, we may incorporate more efficient annotation methods to minimize the cost of physician interactions.
Further, RL from human feedback (RLHF), which is used in ChatGPT (OpenAI, 2022), is increasingly popular. This approach collects comparison data and trains a reward model, where the data are human rankings of the model's outputs from best to worst. RLHF is another application that benefits from human feedback, and in future work, we may consider another interactive method in which radiographers rank the different outputs as reward signals. Another limitation in the field of medical image segmentation is the presence of multiple centers, where data are collected from different devices, resulting in varying distributions. With such non-independent and identically distributed (non-IID) data, the generalization ability of deep-learning-based models is limited (Li et al., 2021; Ye et al., 2022). To overcome this, our proposed MECCA method may require additional design; e.g., Aljabri et al. (2022) introduced a novel robust weakly supervised learning paradigm.
Finally, as technology continues to advance, changes in patient distribution and the introduction of new medical equipment may necessitate the development of online learning methods to ensure that the algorithm remains robust. Popular methods such as iCaRL (Rebuffi et al., 2017) and GDumb (Prabhu et al., 2020) maintain a buffer to store the most important samples for the model. Specifically, in the RL area, we can borrow ideas and theoretical justifications from lifelong RL, in which setting agents must solve a series of related tasks drawn from a task distribution rather than a single, isolated task (Abel et al., 2018). Abel et al. (2018) explored which knowledge should be transferred in lifelong RL through two simple families of results. Xie et al. (2020) leveraged latent variable models to learn a representation of the environment from experience and performed off-policy RL with this representation.
Contributors
Chuyun SHEN, Wenhao LI, and Qisen XU designed the research and conducted the experiments. Bin HU, Fengping ZHU, and Yuxin LI ensured the validity of the experiments. Bo JIN, Haibin CAI, and Xiangfeng WANG offered support across various experimental aspects. Chuyun SHEN drafted the paper. All the authors revised and finalized the paper.
Compliance with ethics guidelines
Chuyun SHEN, Wenhao LI, Qisen XU, Bin HU, Bo JIN, Haibin CAI, Fengping ZHU, Yuxin LI, and Xiangfeng WANG declare that they have no conflict of interest.
Data availability
The demo video is available at https://bit.ly/mecca-demo-video. The other data that support the findings of this study are available from the corresponding author upon reasonable request.
List of supplementary materials
1 More related works
2 More visualizations
3 Robustness of MECCA
4 Comparison of baseline responses to the same user interaction
Fig. S1 MECCA segmentation process
Fig. S2 Qualitative segmentation results of MECCA for the BraTS2015 validation set
Figs. S3-S5 Results of different methods' responses to the same user interactions according to the same initial segmentation on different testing instances and different channels for the Liver dataset in Medical Segmentation Decathlon
Table S1 Dice of our method which varies with the number of interactions under different cases
Table S2 MECCA’s tolerance to inaccurate interaction points