Paris Charilaou, Robert Battat
Abstract: Machine learning models may outperform traditional statistical regression algorithms for predicting clinical outcomes. Proper validation when building such models and tuning their underlying algorithms is necessary to avoid over-fitting and poor generalizability, to which smaller datasets are more prone. In an effort to educate readers interested in artificial intelligence and model-building based on machine-learning algorithms, we outline important details on cross-validation techniques that can enhance the performance and generalizability of such models.
Key Words: Machine learning; Over-fitting; Cross-validation; Hyper-parameter tuning
Con et al[1] explore artificial intelligence (AI) in a classification problem: predicting biochemical remission of Crohn’s disease at 12 months post-induction with infliximab or adalimumab. They illustrate that, after applying appropriate machine learning (ML) methodologies, ML methods outperform conventional multivariable logistic regression (a statistical learning algorithm). The area under the curve (AUC) was the chosen performance metric for comparison, and cross-validation was performed.
Their study elucidates a few important points regarding the utilization of ML. First, the authors used repeated k-fold cross-validation, which is primarily utilized to prevent over-fitting of the models. This technique, while common in ML, has not traditionally been used with conventional regression models in the literature. Especially in small datasets, such as in their study (n = 146), linear (and non-linear, in the case of neural networks) relationships risk being “learned” by chance, leading to poor generalization of the models when applied to previously “unseen” or future data points. It was evident from their analysis that the “naïve” AUCs (obtained by training and evaluating the model on all the data) were significantly higher than the mean cross-validated AUCs in all 3 models, suggestive of “over-fitting” when one does not cross-validate. Smaller datasets tend to be more susceptible to over-fitting as they are less likely to accurately represent the population in question.
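The gap between a “naïve” AUC and a cross-validated AUC can be illustrated with a small simulation. The sketch below is not the authors' analysis: it assumes purely synthetic noise predictors, a 1-nearest-neighbour classifier chosen only because it memorizes its training data, and 5 repeats of 5-fold cross-validation (the number of repeats is an assumption). Only n = 146 and the roughly 93 positive outcomes are taken from the study.

```python
import random

def auc(scores, labels):
    """Rank-based AUC: chance that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def one_nn_score(train_x, train_y, x):
    """Score a point with the label of its nearest training point."""
    nearest = min(range(len(train_x)),
                  key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)))
    return train_y[nearest]

def repeated_kfold(n, k, repeats, rng):
    """Yield (train, test) index lists; the data are reshuffled on every repeat."""
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        for f in range(k):
            test = idx[f::k]
            held_out = set(test)
            yield [i for i in idx if i not in held_out], test

rng = random.Random(0)
n, n_features = 146, 6
# Synthetic predictors that carry no information about the outcome (assumption).
X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n)]
y = [1] * 93 + [0] * 53

# Naive AUC: fit and evaluate on the same 146 observations.
naive_auc = auc([one_nn_score(X, y, x) for x in X], y)

# Repeated k-fold AUC: every prediction is made on held-out observations.
fold_aucs = []
for train, test in repeated_kfold(n, k=5, repeats=5, rng=rng):
    train_x = [X[i] for i in train]
    train_y = [y[i] for i in train]
    scores = [one_nn_score(train_x, train_y, X[i]) for i in test]
    fold_aucs.append(auc(scores, [y[i] for i in test]))
cv_auc = sum(fold_aucs) / len(fold_aucs)

print(f"naive AUC = {naive_auc:.2f}, mean cross-validated AUC = {cv_auc:.2f}")
```

Because the predictors are noise, the naïve AUC is perfect only through memorization, while the cross-validated AUC should sit near chance level; the difference between the two is the optimism that cross-validation is meant to expose.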
Second, the authors utilized “hyper-parameter tuning” for their neural network models, where the otherwise arbitrarily selected “settings” (or hyper-parameters, such as the number of inner neuron layers and number of neurons per layer) of the neural network are chosen based on performance. Hyper-parameters cannot be “learned” or “optimized” by simply fitting the model (as happens with predictor coefficients), and the only way to discover the best values is by fitting the model with various combinations and assessing its performance. The combinations can be evaluated stochastically (randomly or via a Bayes-based approach) or using a grid approach (e.g., for 3 hyper-parameters that each take 5 potential values, there are 5 × 5 × 5 = 5³ = 125 combinations to evaluate), with each combination fitted k times. One may ask: if one were to fit a model 125 × k times on 146 observations, is there not a risk of over-fitting the “optimal” hyper-parameter values? To avoid such a problem, nested k-fold cross-validation must be performed: within each repeated k-fold training data subset, a sub-k-fold “inner” training/validation must be done to evaluate each hyper-parameter combination. In this way, we overcome the optimistic bias in model performance that can occur when we use the same cross-validation procedure and dataset both to tune the hyper-parameters and to evaluate the model’s performance metrics (e.g., AUC)[2]. The authors did not elaborate on how the hyper-parameter tuning was performed.
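The structure of a grid search inside nested cross-validation can be sketched as follows. This is a minimal illustration, not the authors' procedure: the hyper-parameter names and candidate values are hypothetical, `fit_and_score` is a placeholder callback standing in for "fit a model on the training indices and return its AUC on the held-out indices", and k = 5 is assumed for both the outer and inner loops.

```python
import itertools
import random

# Hypothetical grid: 3 hyper-parameters with 5 candidate values each.
grid = {
    "hidden_layers": [1, 2, 3, 4, 5],
    "neurons_per_layer": [2, 4, 8, 16, 32],
    "learning_rate": [0.001, 0.003, 0.01, 0.03, 0.1],
}
combos = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]

def kfold(indices, k, rng):
    """Yield (train, held_out) index lists for one shuffled k-fold split."""
    idx = list(indices)
    rng.shuffle(idx)
    for f in range(k):
        held_out = idx[f::k]
        excluded = set(held_out)
        yield [i for i in idx if i not in excluded], held_out

def nested_cv(n, outer_k, inner_k, combos, fit_and_score, rng):
    """Tune on inner folds only; report performance on untouched outer folds."""
    outer_scores = []
    for outer_train, outer_test in kfold(range(n), outer_k, rng):
        # Inner folds are drawn from the outer training data alone.
        inner_splits = list(kfold(outer_train, inner_k, rng))

        def inner_mean(params):
            scores = [fit_and_score(tr, va, params) for tr, va in inner_splits]
            return sum(scores) / len(scores)

        best = max(combos, key=inner_mean)
        # Refit with the selected combination and score on the outer test fold.
        outer_scores.append(fit_and_score(outer_train, outer_test, best))
    return outer_scores

rng = random.Random(0)
fit_counter = {"fits": 0}

def dummy_fit_and_score(train_idx, test_idx, params):
    """Placeholder: counts model fits and returns a random 'AUC'."""
    fit_counter["fits"] += 1
    return rng.random()

outer_scores = nested_cv(146, outer_k=5, inner_k=5, combos=combos,
                         fit_and_score=dummy_fit_and_score, rng=rng)
print(len(combos), "combinations;", fit_counter["fits"], "model fits in total")
```

Under these assumptions, each outer fold requires 125 × 5 inner fits plus one refit, which makes the computational cost of nesting explicit; the outer test observations never influence which combination is selected.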
Another point to consider in k-fold cross-validation in small datasets is the number of folds (k) used, specifically in classification problems (i.e., yes/no binary outcomes). In this study[1], the outcome prevalence was 64% (n ≈ 93). With a chosen k = 5, each training set would comprise 80% of the data, leading to approximately 74 positive cases of biochemical remission. The number of positive outcomes in each training set must be considered, especially in logistic regression, where the rule of thumb recommends at least ten positive events per independent predictor to minimize over-fitting[3]. In this study[1], six predictors were eventually used in the multivariable model, making over-fitting less likely from a model-specification standpoint. Finally, k-folds are recommended to be stratified by the outcome, so that the outcome prevalence is equal among the training and testing folds. This becomes crucial when the prevalence of the outcome of interest is < 10%-20% (imbalanced classification problem). While imbalanced classification is not an issue in this study[1], the authors did not mention whether they used outcome-stratified k-folds.
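The arithmetic above, together with outcome-stratified fold assignment, can be checked with a short sketch. The figures (n = 146, 64% prevalence, k = 5, six predictors) come from the text; the `stratified_kfold` helper is a hypothetical round-robin implementation written for illustration, not the method used in the study.

```python
import random

n, prevalence, k, n_predictors = 146, 0.64, 5, 6

n_pos = round(n * prevalence)                     # about 93 positive outcomes
train_pos = n_pos * (k - 1) / k                   # about 74 positives per training set
events_per_predictor = train_pos / n_predictors   # compare with the rule of thumb (>= 10)

def stratified_kfold(labels, k, rng):
    """Assign indices to k folds class by class, so each fold keeps the prevalence."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for position, i in enumerate(members):
            folds[position % k].append(i)
    return folds

labels = [1] * n_pos + [0] * (n - n_pos)
folds = stratified_kfold(labels, k, random.Random(0))

print(f"positives per training set ~ {train_pos:.0f}, "
      f"events per predictor ~ {events_per_predictor:.1f}")
for number, fold in enumerate(folds, start=1):
    fold_pos = sum(labels[i] for i in fold)
    print(f"fold {number}: n = {len(fold)}, prevalence = {fold_pos / len(fold):.2f}")
```

With stratification, every test fold holds 18 or 19 positive cases, so each training set retains 74 or 75 and the prevalence stays close to 64% throughout.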
Lastly, the endpoint utilized, CRP normalization, has poor specificity for endoscopic inflammation in Crohn’s disease[4]. More robust endpoints would include endoscopic inflammation and/or deep remission using validated disease activity indices[5].
We congratulate the authors on their effort, which serves both as a proof-of-concept for using ML to improve outcome prediction in IBD and as an illustration of methodologies that reduce over-fitting. In general, with the advent of AI and specifically ML-based models in IBD[6], it is important to recognize that, while we now have the tools to construct more accurate models and enhance precision medicine, most ML-based models, such as artificial neural networks, are not intuitively interpretable (i.e., they are “black boxes”). Efforts in “explainable AI” are under way[7], hopefully eliminating the “black-box” concept in future clinical decision tools. Applying these to validated disease activity assessments will be essential for prediction models in future studies.
World Journal of Gastroenterology, 2022, Issue 5