High interpretable machine learning classifier for early glaucoma diagnosis

2021-03-18

Carlos Salvador Fernandez Escamez, Elena Martin Giral, Susana Perucho Martinez, Nicolas Toledano Fernandez

1Ophthalmology Department, Hospital de Fuenlabrada, Madrid 28942, Spain

2Doctorate Program in Health Sciences, Universidad Rey Juan Carlos, Alcorcon 28922, Madrid, Spain

Abstract

INTRODUCTION

Glaucoma is one of the principal causes of preventable irreversible visual loss. Its diagnosis and early treatment are critically important in order to preserve visual function[1]. However, detection of glaucoma in its early stages can be challenging due to the wide variation in the normal appearance of the optic nerve[2].

The newest optical coherence tomography (OCT) devices have enabled the detection of subtle defects in the retinal nerve fiber layer (RNFL) that were previously undetectable through clinical examination[3]. OCT yields a large amount of numerical data, whose combined analysis can be performed using machine learning methods. Some of these methods are interpretable, which allows the results to be compared with prior knowledge and reveals how the algorithm arrives at its result[4].

The goal of the present study was to develop and validate an interpretable classifier able to distinguish between healthy eyes and early glaucoma by means of the peripapillary RNFL thickness measured with OCT. The newest tree gradient boosting (GB) methods were used, together with a GB explainer extension that allowed the model to reach a high level of interpretability.

SUBJECTS AND METHODS

Ethical Approval

The study was approved by the Institutional Review Board of the hospital (approval number APR 18/16) and adhered to the principles of the Declaration of Helsinki. Informed consent was obtained from all subjects. All participants underwent a comprehensive ophthalmic examination.

Study Design and Participants

This was an observational cross-sectional study of consecutive glaucoma patients at Fuenlabrada Hospital (Madrid, Spain) during the years 2018 to 2019. Normal subjects were recruited from the population who came to the clinic for regular eye examination. Glaucoma patients were selected from those who came to the clinic for a first diagnosis or for follow-up examination and showed only mild visual field (VF) damage.

Spectral domain OCT examination was performed on all subjects using commercially available equipment (Topcon 3D OCT 2000 MarkII, FastMap v. 3.40, Topcon Corp., Japan). Tomographic images of the peripapillary area were obtained with the three-dimensional disc scan on a measurement circle 3.46 mm in diameter. Scans were obtained after pupil dilation with 1% tropicamide. Automated perimetry with the 24-2 Swedish Interactive Threshold Algorithm of the Humphrey Visual Field Analyzer (Carl Zeiss Meditec Inc.) was conducted on all subjects.

All tests were completed within a week in order to avoid changes in glaucoma stage while collecting the data.

Subjects were examined using the E-Snellen visual acuity test and conventional slit-lamp and funduscopic techniques to rule out anomalies of the anterior or posterior segment. Intraocular pressure (IOP) was measured using Goldmann contact tonometry. Subjects accepted for research had a best corrected visual acuity of 20/40 or better, spherical refraction within ±5.0 diopters and cylinder correction of less than 3.0 diopters. They had no previous ocular surgery, except uncomplicated cataract extraction with intraocular lens implantation, and no other anterior or posterior segment disease.

Eligibility Criteria

One eye was randomly selected if both eyes were eligible for the study. Inclusion criteria for the patients with early stage glaucoma were a basal IOP ≥21 mm Hg, a gonioscopically open angle, and a reproducible VF defect in the absence of any other neurological or ocular defect that may cause an abnormal VF test result. The VF test was conducted at least three times. Based on the Hodapp-Parrish-Anderson criteria, early VF damage was defined as an arc-like perimetric defect with mean deviation ≥ -6.00 dB, a glaucoma hemifield test result outside normal limits, or a corrected pattern standard deviation (CPSD)/pattern standard deviation (PSD) significant at P<0.05[5]. The superior hemifield, the inferior hemifield or both were affected. A VF was considered reliable with a false negative rate <15%, a false positive rate <10% and fixation losses <10%. Exclusion criteria for the early glaucoma group were a closed or anomalous angle, IOP<21 mm Hg, unreliable VFs or an anomalous disc appearance such as tilted discs.
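As a minimal sketch, the visual field reliability limits and the mean deviation threshold stated above can be written as simple checks. Python is used here only for illustration (the study itself used R). The `VisualField` record, its field names and the function names are hypothetical, and rates are assumed to be stored as fractions between 0 and 1. Only the mean deviation part of the early-damage definition is encoded; the glaucoma hemifield test and CPSD/PSD conditions are left out.

```python
from dataclasses import dataclass


@dataclass
class VisualField:
    """Hypothetical record for one visual field test (rates are fractions, 0 to 1)."""
    mean_deviation_db: float
    false_negative_rate: float
    false_positive_rate: float
    fixation_loss_rate: float


def is_reliable(vf: VisualField) -> bool:
    # Reliability limits quoted in the text: FN <15%, FP <10%, fixation losses <10%.
    return (
        vf.false_negative_rate < 0.15
        and vf.false_positive_rate < 0.10
        and vf.fixation_loss_rate < 0.10
    )


def within_early_md_limit(vf: VisualField) -> bool:
    # Early damage requires a mean deviation no worse than -6.00 dB.
    return vf.mean_deviation_db >= -6.0
```

A test with a false negative rate of exactly 15% fails the check, because the limits are strict inequalities.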

Inclusion criteria for controls were no history of eye disease, no family history of glaucoma, IOP<21 mm Hg, no sign of glaucomatous optic disc damage, and a normal result on three consecutive reliable VF tests. Eyes with VF defects, IOP>21 mm Hg or any anomalous appearance of the optic nerve were excluded from the control group.

OCT scans with segmentation errors or signal-to-noise ratio index lower than 45 were rejected. All examinations were reviewed by an experienced ophthalmologist and discarded if there was any suspicion of being decentered or not being completely reliable.

Analysis Algorithms

All statistical and machine learning algorithms were run in R, a free software environment for statistical computing (version 3.6.2 for Windows).

Unsupervised k-medoid cluster analysis was performed on the unlabeled dataset, using all the peripapillary RNFL measurements obtained with OCT. Within each cluster, points lying farther from the cluster center than the mean point-to-center distance plus two standard deviations were considered outliers and excluded from the dataset for further analysis[6]. Tree GB methods were applied to the processed data in order to produce a prediction model in the form of an ensemble of decision trees. These algorithms improve the accuracy of predictions at the expense of a loss in interpretability of the model[7]. The relative importance of the parameters in the GB trees was calculated, and those with the greatest importance were selected. The GB explainer extension was used to interpret the generated GB model, exploring the correlations between the numeric value of each parameter and its impact on classification.

In order to minimize overfitting, only the parameters with the highest importance in the GB model were used as input data to build a pruned decision tree. A decision tree was chosen as the preferred classification algorithm due to its high level of interpretability, allowing the results obtained in our model to be checked against previous knowledge regarding RNFL damage in glaucoma. In order to minimize the number of early glaucoma eyes misclassified as normal eyes, a weight matrix was introduced in the model, and a fitted tree classifier was constructed.
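The outlier rule described above (distance to the cluster center greater than the cluster mean distance plus two standard deviations) can be sketched as follows. This is an illustration in Python, not the authors' R code. It assumes Euclidean distance and the sample standard deviation, neither of which is stated in the text, and it takes the cluster labels and centers as given instead of running the k-medoid step. The function name `flag_outliers` is hypothetical.

```python
import math
import statistics


def flag_outliers(points, labels, centers):
    """Return sorted indices of points farther from their cluster center
    than (mean distance + 2 * standard deviation) within that cluster.

    points  : list of coordinate tuples
    labels  : cluster index of each point
    centers : list of center (medoid) coordinates, indexed by cluster
    """
    # Assumption: Euclidean distance from each point to its own cluster center.
    distances = [math.dist(p, centers[c]) for p, c in zip(points, labels)]
    outliers = []
    for cluster in set(labels):
        members = [i for i, c in enumerate(labels) if c == cluster]
        member_d = [distances[i] for i in members]
        if len(member_d) < 2:
            continue  # a standard deviation needs at least two points
        # Assumption: sample standard deviation (n - 1 denominator).
        threshold = statistics.mean(member_d) + 2 * statistics.stdev(member_d)
        outliers.extend(i for i in members if distances[i] > threshold)
    return sorted(outliers)
```

Because the threshold is computed per cluster, a point is judged only against the spread of its own cluster and not against the whole dataset.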

The outcome of the study was a decision tree classifier. The parameters with the greatest relative importance were identified. The ability to distinguish between glaucoma and control eyes was evaluated using ten-fold cross validation, estimating the accuracy, the area under the receiver operating characteristic curve (AUC), and the sensitivity and specificity.
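For reference, the evaluation quantities named above can be computed as in the following sketch. It is written in Python for illustration and is not the authors' R code. Labels are assumed to be 1 for glaucoma and 0 for control, both classes are assumed to be present, and the AUC uses the pairwise-ranking (Mann-Whitney) formulation. All function names are hypothetical, and the fold assignment is a simple unshuffled interleaving chosen for brevity.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity from binary labels (1 = glaucoma)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # glaucoma eyes correctly detected
        "specificity": tn / (tn + fp),  # control eyes correctly cleared
    }


def auc(y_true, scores):
    """Probability that a random glaucoma eye scores above a random control eye."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_samples, k=10):
    """Split sample indices into k disjoint test folds (interleaved, no shuffling)."""
    return [list(range(i, n_samples, k)) for i in range(k)]
```

In a ten-fold scheme each fold serves once as the test set while the model is fitted on the remaining nine, and the metrics are then pooled or averaged across folds.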

RESULTS

We recruited 90 eyes with early stage glaucoma and 85 control eyes. All participants were Caucasian.

Baseline characteristics of the study population are summarized in Table 1. The difference in optic disc size between the groups was not significant.

Table 1 Clinical characteristics of the control and early glaucoma groups

The Kolmogorov-Smirnov one-sample test showed a Gaussian distribution of the variables. Unpaired t-test analysis demonstrated a significant difference for all RNFL values except the 9 clock-hour RNFL thickness.

Unsupervised clustering identified four eyes in the early glaucoma group as outliers, which were excluded from the dataset for further analysis.

GB analysis displayed average thickness as the parameter with the highest importance in the classification model, followed by 7 clock-hour thickness (Figure 1).

A scatter plot relating the thickness values of each RNFL parameter to their impact on classification, expressed as the log-odds of classification, showed a stair-like pattern for average thickness, with plateaus at values <80 µm, between 80 and 90 µm, and >90 µm (Figure 2). Similar plots for the remaining RNFL values exhibited a flat pattern, with no step depending on the thickness value.
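Since the impact values are reported on the log-odds scale, it may help to recall how log-odds map to a probability. The sketch below is in Python for illustration. It assumes that per-parameter contributions add up on the log-odds scale together with a baseline term, which is how tree-boosting explainers commonly present their breakdown; the text does not describe the internals of the explainer, so this is an assumption. The function names and the example numbers are hypothetical.

```python
import math


def log_odds_to_probability(log_odds):
    """Logistic transform: convert log-odds to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def predicted_probability(baseline_log_odds, contributions):
    """Sum a baseline and per-parameter log-odds contributions (assumed additive),
    then convert the total to the probability of the positive class."""
    return log_odds_to_probability(baseline_log_odds + sum(contributions))
```

Under this reading, a parameter with a positive contribution pushes the eye toward the early glaucoma class and a negative one pushes it toward the healthy class, while a total of zero corresponds to a probability of 0.5.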

The pruned decision tree built after excluding all the parameters that GB showed to have a low impact on classification is shown in Figure 3. Eyes with an average thickness below 82 µm were glaucomatous in 98% of cases. Eyes with an average thickness >90 µm and 7 clock-hour thickness >99 µm had a 95% predicted probability of being healthy. The overall classification accuracy evaluated through ten-fold cross validation was 89%. The estimated false negative rate was 13% and the AUC of the model was 0.953 (95%CI: 0.903-0.998). Sensitivity reached 0.89 with a specificity of 0.885.
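The two decision paths quoted above can be written as explicit rules, which is what makes a tree of this kind easy to audit. The sketch below, in Python for illustration, encodes only those two paths. The remaining branches of the tree appear in Figure 3 and are not reproduced in the text, so the function returns `None` for them instead of guessing. The function and argument names are hypothetical.

```python
def classify_from_stated_rules(average_um, clock_hour_7_um):
    """Apply only the two tree paths stated in the text.

    Returns "glaucoma", "healthy", or None when the eye falls on a branch
    whose outcome is not described in the text.
    """
    if average_um < 82:
        return "glaucoma"  # stated: glaucomatous in 98% of cases
    if average_um > 90 and clock_hour_7_um > 99:
        return "healthy"   # stated: 95% predicted probability of being healthy
    return None            # branch not specified in the text (see Figure 3)
```

For example, an eye with an average thickness of 75 µm is labeled as glaucoma regardless of its 7 clock-hour value, because the first split is on average thickness alone.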

A modified tree, constructed after adding a weight matrix in order to minimize misclassification of glaucoma eyes, included average and inferior quadrant thickness as the splitting parameters, with cut-off values different from those in the non-weighted tree (Figure 4). However, the classification accuracy of the model dropped to 84% as estimated with ten-fold cross validation. The AUC of the model was 0.85 (95%CI: 0.752-0.958) and sensitivity was 0.745 with a specificity of 0.956.

Figure 1 Relative importance of average, quadrant, and clock-hour segments of peripapillary retinal nerve fiber layer thickness in the tree gradient boosting model. CH: Clock-hour; IQ: Inferior quadrant; SQ: Superior quadrant; TQ: Temporal quadrant; NQ: Nasal quadrant.

Figure 2 Impact on classification plotted against average RNFL thickness. Impact on classification is expressed as the log-odds of classification. A positive impact value is associated with a greater probability of being classified as early glaucoma. RNFL: Retinal nerve fiber layer.

Figure 3 Pruned classification tree constructed after excluding outliers and including the features with the greatest relative importance in the gradient boosting model. Connecting lines include the splitting value for the parameter. Leaf nodes show the class of the node, the predicted probability of being healthy and the percentage of observations in the node. Average: Average RNFL thickness; CH: Clock-hour thickness.

Figure 4 Modified classification tree built by adding a loss matrix to minimize misclassification of glaucoma eyes as healthy. Connecting lines include the splitting value for the parameter. Leaf nodes show the label of the node, the predicted probability of being healthy and the percentage of observations in the node. Average: Average RNFL thickness; CH: Clock-hour thickness.

DISCUSSION

Early diagnosis of glaucoma is one of the biggest challenges in ophthalmology. Mansberger justifies the importance of early detection on three grounds: glaucoma is a significant public health problem, it is treatable, and it can be detected in its early stages[8].

Clinical evaluation has been, until recent times, the gold standard for early detection[2]. However, VF defects are often difficult to evaluate because of their high variability and low reproducibility, especially in elderly patients with cognitive impairment[9]. OCT technology provides an objective and reproducible analysis of the optic nerve and peripapillary RNFL. Many studies have shown that structural damage often precedes the detection of functional changes, which supports the validity of RNFL analysis with OCT for the early diagnosis of glaucoma[1].

Several studies have reported the correlation between RNFL defects and VF damage[10-14]. Some authors have described macular ganglion cell complex (GCC) thinning in glaucoma. However, numerous studies support that RNFL parameters are superior to GCC for glaucoma diagnosis with OCT[15-17]. Vascular changes at the optic disc might provide useful information for the diagnosis of glaucoma. Bojikian et al[18] have shown a significant correlation between peripapillary blood flow and VF parameters and structural biometrics in primary open-angle glaucoma. In any case, OCT provides a large amount of information that is not easily interpreted with traditional statistical techniques.

Machine learning offers tools to deal with datasets that are large and highly multidimensional. Using these techniques, we can find hidden patterns that would go undetected with classical statistical analysis. The number of these methods increases day by day, making it difficult to decide which one is best to apply. Classification and regression trees, linear discriminant analysis, Naïve Bayes, artificial neural networks, deep learning algorithms, random forest and other machine learning methods have been widely used to predict, classify or detect ocular diseases[19-25]. These methods obtain a high level of accuracy and can achieve a nearly perfect classification rate. However, they often lack one of the fundamental aspects of science: interpretability.

Interpretability is defined as the capacity to gain insight into how a data mining method gets to the result[4]. It provides the possibility to link the results obtained using data mining techniques with previous knowledge. Tree based methods are highly interpretable and straightforward for visual understanding. However, they suffer from a high variance depending on the dataset used to train the model[26].

Gradient boosting is a new algorithm based on the construction of an ensemble of models[7]. It builds multiple models and combines their results to increase performance. GB also provides an importance score for each of the parameters that make up the input. These scores have demonstrated high local stability and consistency, in contrast to the inconsistent scores of other techniques such as random forest and other machine learning methods[27]. Moreover, GB has an explainer extension containing a method to clarify why each example is assigned to each group.

GB analysis applied to our data revealed that average RNFL thickness is the critical parameter to discriminate between early glaucoma and normal eyes, with a relative importance three times higher than that of the next most important parameter. Average thickness values produce a stair-like plot when plotted against their impact on the classification model, with steps at about 80 and 90 µm, indicating that within the segments <80 µm, 80 to 90 µm and >90 µm the influence on the classification model remains stable.

RNFL thickness at the 7 clock-hour position is the second most important parameter in our classifier. Plotting it against its degree of impact on the model yields a flat horizontal graph with no significant step, representing a linear relation between 7 clock-hour thickness and impact on the result of the model.

Using a decision tree method, Huang found that the inferior quadrant was the RNFL parameter with the best classification importance in glaucoma eyes. Using univariate analysis, the same article identified average, inferior quadrant, and 6 and 5 clock-hour thicknesses as the parameters with the best AUC[28]. In Medeiros et al[29], the highest discrimination power among individual OCT parameters was that of the inferior quadrant thickness. Baskaran et al[30] report that average and inferior quadrant RNFL thicknesses discriminate glaucoma better than the other parameters. Some authors describe that combining OCT parameters by means of a multivariable predictive model outperforms univariable models in terms of AUC and classification rates[31-32]. Multiple studies combine several OCT features into a single parameter that contains information from its components, in order to obtain better classification models[33-34]. Nonetheless, this combination puts interpretation of the model out of reach. All these univariable and multivariable studies report disparate results regarding the parameter with the greatest discrimination power, though the general consensus is that average thickness and inferior hemisphere segments have the best AUC, similar to the importance ranking we describe in our model.

Our classification tree, constructed with the best performing discriminating parameters, reveals that average thickness remains the key parameter. Eyes with an average thickness <82 µm are mostly predicted as glaucoma (probability: 98%). If average thickness is >90 µm and RNFL thickness at the 7 clock-hour position is >99 µm, the eye is classified as normal (predicted probability: 95%). These cut-off points for average thickness are closely related to those found in our GB model.

Using ten-fold cross validation, the calculated overall accuracy of the model was 89%, with an AUC of 0.953, which is better than the AUC of any univariable model. Specificity of the model is 0.89 for a sensitivity of 0.88.

Classifying a glaucoma eye as normal is a much more serious error than labeling a normal eye as glaucoma, since undiagnosed glaucoma can lead to irreversible visual loss. This consideration was included in our tree through a loss matrix, in order to minimize the false negative rate. This modified tree revealed average thickness as the principal parameter for classification. Eyes with an average thickness <80 µm are all glaucoma eyes. Average thickness ≥80 µm with inferior quadrant thickness ≥92 µm suggests a highly probable normal eye (predicted probability: 76%).
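The effect of such a loss matrix on the label assigned at a leaf can be illustrated with a minimum expected cost rule. The sketch below is in Python for illustration and is not the authors' R code. The cost values are hypothetical, since the text does not report the weights that were used, and the sketch covers only leaf labeling; in some tree implementations the weights also influence which splits are chosen. The function names are hypothetical.

```python
def min_cost_label(p_glaucoma, cost_false_negative, cost_false_positive):
    """Pick the label with the lower expected misclassification cost.

    p_glaucoma          : estimated probability that the eye is glaucomatous
    cost_false_negative : cost of calling a glaucoma eye healthy (hypothetical value)
    cost_false_positive : cost of calling a healthy eye glaucoma (hypothetical value)
    """
    expected_cost_if_healthy = p_glaucoma * cost_false_negative
    expected_cost_if_glaucoma = (1.0 - p_glaucoma) * cost_false_positive
    if expected_cost_if_healthy > expected_cost_if_glaucoma:
        return "glaucoma"
    return "healthy"


def decision_threshold(cost_false_negative, cost_false_positive):
    """Probability of glaucoma above which the glaucoma label costs less."""
    return cost_false_positive / (cost_false_negative + cost_false_positive)
```

With equal costs the threshold is 0.5. Making a missed glaucoma three times as costly as a false alarm lowers it to 0.25, so more borderline eyes are labeled as glaucoma, trading false positives for fewer false negatives.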

The estimated accuracy of the weighted model was 84% of eyes correctly classified, which is considered a good classifier. Moreover, the lower accuracy is balanced by a minimal false negative rate. The AUC of the model was 0.85, which is lower than the AUC of the non-weighted model, but specificity increased to 0.956.

Limitations of our study include the lack of a test group. Instead, we used cross validation as a verified resampling method to evaluate the performance of our model. Overfitting was controlled by means of outlier detection, selection of the parameters with the best relative importance in the GB model, and pruning of the tree. We included only eyes with early damage and IOP>21 mm Hg. Therefore, our results may not be applicable to more advanced glaucoma eyes or to normal-tension glaucoma.

In conclusion, GB methods based on tree structures selected average RNFL thickness as the parameter with the greatest power to distinguish early glaucoma eyes. Thickness at the 7 clock-hour position is the second most important parameter for classification. The fact that our classifier depends only on RNFL thickness favors its use in patients with unreliable visual field tests, expanding the number of cases where our model can be valuable.

Although the number of subjects included in the present study is limited, we strongly believe that the development of an interconnected digital data warehouse coming from the electronic devices of multiple ophthalmology institutions will offer the foundations for future decision and classification models if adequate interpretable machine learning methods are applied.

ACKNOWLEDGEMENTS

Conflicts of Interest: Fernandez Escamez CS, None; Martin Giral E, None; Perucho Martinez S, None; Toledano Fernandez N, None.