Junxiong Yin, Cheng Yu, Lixia Wei, Chuanyong Yu, Hongxing Liu, Mingyang Du, Feng Sun, Chongjun Wang, Xiaoshan Wang*
1Department of Neurology, Brain Hospital Affiliated to Nanjing Medical University, Nanjing 210029, China
2Department of Computer Science and Technology, Nanjing University, Nanjing 210093, China
Key words: high-risk population; stroke; asymptomatic carotid stenosis; risk factors; machine learning
Objective Asymptomatic carotid stenosis (ACS) is closely associated with the incidence of severe cerebrovascular diseases. Early identification of individuals with ACS and of its associated risk factors could benefit the primary prevention of stroke. This study aimed to investigate a machine-learning algorithm for detecting ACS among the population at high risk of stroke, based on the associated risk factors.
Methods A novel machine-learning model was used to screen the predictors of ACS from 30 potential risk factors. The algorithm adopted a random forest pattern; it was built on the training data and then verified using the testing data. All original data were retrieved from the China National Stroke Screening and Prevention Project (CNSSPP), including demographic, clinical, and laboratory characteristics. Individuals at high risk of stroke were enrolled and randomly divided into a training group and a testing group at a ratio of 4:1. Identification of carotid stenosis by carotid artery duplex scan was set as the gold standard. The receiver operating characteristic (ROC) curve and the area under the curve (AUC) were used to evaluate the efficacy of the model in detecting ACS.
Results Of the 2841 high-risk individuals for stroke enrolled, 326 (11.5%) were diagnosed with ACS by ultrasonography. The top five risk factors contributing to ACS in this model were family history of dyslipidemia, high level of low-density lipoprotein cholesterol (LDL-c), low level of high-density lipoprotein cholesterol (HDL-c), aging, and low body mass index (BMI). Their weights were 11.8%, 7.6%, 7.1%, 6.1%, and 6.1%, respectively. The total weight of the top 15 risk factors was 85.5%. The AUC values of the model for detecting ACS with the training dataset and the testing dataset were 0.927 and 0.888, respectively.
Conclusions This study demonstrated that the machine-learning algorithm could be used to identify the risk factors for ACS among the population at high risk of stroke. Family history of dyslipidemia may be the most important risk factor for ACS. This model could be a suitable tool to optimize the clinical approach to the primary prevention of stroke.
Asymptomatic carotid stenosis (ACS) is defined as a significant (≥50%) narrowing of the carotid artery in patients without ipsilateral neurological deficit symptoms during the past 6 months.[1] ACS is often neglected clinically. Although asymptomatic, patients with ACS can rapidly develop cerebral infarction with serious consequences. Several studies have confirmed that ACS is an independent risk factor for cerebral infarction.[2,3] Identifying ACS as early as possible would require extensive screening of the apparently “normal” population. However, on cost-effectiveness grounds, neither the U.S. Preventive Services Task Force (USPSTF) nor the American Society of Neuroimaging recommends screening for ACS in the general adult population.[4-6] Meanwhile, other researchers have suggested targeted screening of people with multiple risk factors for atherosclerosis.[7] Most previous studies therefore chose individuals at increased risk of stroke as screening subjects, but how to identify high-risk subjects effectively and conveniently remains uncertain.
Machine-learning models were initially designed to analyze large, complex datasets, and they have advantages over traditional statistical methods in dealing with such data. Some studies[8-10] showed that machine-learning prediction models performed better than general statistical methods. Machine learning is now widely used to learn complex relationships from data in order to predict cardio-cerebrovascular disease and to generate medical imaging reports.[8,9] Recently, machine-learning techniques for predicting carotid plaque progression have been developed successfully, mainly by analyzing vulnerable plaque progression in ApoE-/- mice.[10] However, that study was based on an animal model and overlooked some traditional risk factors, such as hypertension and diabetes. Several clinical studies[11,12] demonstrated that older age, smoking, high blood pressure, and diabetes were independent risk factors for carotid stenosis (>50% stenosis) and generated predictive models for ACS. Nevertheless, the analytic approaches in those studies were mainly traditional frequentist biostatistical methods.
Machine learning comprises a variety of algorithms, such as support vector machines, decision trees, deep learning, ensemble learning, and neural networks. Random forest is a machine-learning algorithm that combines ensemble learning with decision trees.[13] The outputs of several weak classifiers are combined by voting or averaging, which gives the whole model higher accuracy and better generalization performance.
The aim of this study was to evaluate the effectiveness of a random forest model in differentiating ACS patients from high-risk individuals for stroke, using data from the China National Stroke Screening and Prevention Project (CNSSPP).
This cross-sectional study was approved by the institutional review board of Nanjing Brain Hospital (2017-kyy119-01). All participants received information on the study and provided written informed consent. The data came from Nanjing Brain Hospital, one of the study centers participating in the CNSSPP, a project of the China National Project Office of Stroke Prevention and Control. During 2012-2016, more than 34000 residents participated in the CNSSPP. Risk-factor screening identified 5250 participants as high-risk individuals for stroke. Among them, 2309 were excluded because of a history of stroke or transient ischemic attack (TIA), and 100 were excluded because of incomplete data. Thus, a total of 2841 high-risk individuals for stroke were enrolled in the current study.
The screening work complied with the CNSSPP protocol.[14] A high-risk individual for stroke was defined as a person ≥40 years old with at least three of the following risk factors: hypertension, atrial fibrillation, current smoking, dyslipidemia, diabetes mellitus (DM), physical inactivity, overweight (BMI ≥26 kg/m²),[8] and family history of stroke.
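The high-risk definition above can be read as a simple counting rule. The following is a minimal sketch of that rule, not the CNSSPP's own software; the function name and the dictionary keys are hypothetical labels chosen here for illustration.

```python
# Hypothetical keys mirroring the eight risk factors listed in the text.
RISK_FACTORS = (
    "hypertension",
    "atrial_fibrillation",
    "current_smoking",
    "dyslipidemia",
    "diabetes_mellitus",
    "physical_inactivity",
    "overweight",  # defined in the text as BMI >= 26 kg/m^2
    "family_history_of_stroke",
)


def is_high_risk(age, factors):
    """Return True if age >= 40 and at least three listed risk factors are present."""
    present = sum(1 for name in RISK_FACTORS if factors.get(name, False))
    return age >= 40 and present >= 3


example = {"hypertension": True, "current_smoking": True, "overweight": True}
print(is_high_risk(55, example))
```

Any factor missing from the dictionary is treated as absent, which is an assumption of this sketch rather than a rule stated in the protocol.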
The high-risk individuals underwent carotid artery duplex scans following the Chinese stroke vascular ultrasound examination guidelines,[15] which consisted of ultrasound imaging of the distal common carotid artery, the bulb, and the proximal internal and external carotid arteries, with Doppler signal evaluation for 3 to 5 beats at each location on both sides. Plaque was defined as an intima-media thickness (IMT) greater than 1.5 mm on Doppler-derived measurements.[16] According to previous studies,[12,17] ACS was identified if the diameter of the carotid lumen was reduced by ≥50%, together with at least one of the following indicators: a peak systolic velocity (PSV) ≥125 cm/s, an end diastolic velocity ≥40 cm/s, or a ratio of internal carotid artery (ICA) PSV to common carotid artery (CCA) PSV >2.0. The carotid duplex scans were performed by four experienced registered vascular technicians who were blinded to the clinical information.
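The ultrasound criterion for ACS combines one anatomical condition with any one of three hemodynamic indicators. A minimal sketch of that logic follows; the function and parameter names are hypothetical, and the thresholds are the ones quoted in the paragraph above.

```python
def meets_acs_criteria(diameter_reduction_pct, psv_cm_s, edv_cm_s, ica_cca_psv_ratio):
    """Apply the stated rule: >=50% lumen reduction plus at least one velocity indicator."""
    anatomical = diameter_reduction_pct >= 50
    hemodynamic = (
        psv_cm_s >= 125            # peak systolic velocity
        or edv_cm_s >= 40          # end diastolic velocity
        or ica_cca_psv_ratio > 2.0  # ICA PSV / CCA PSV
    )
    return anatomical and hemodynamic


print(meets_acs_criteria(60, 130, 30, 1.5))
```

Note that the ratio threshold is strict (greater than 2.0), whereas the two velocity thresholds are inclusive, as written in the text.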
Each participant completed a standardized questionnaire designed by CNSSPP via face-to-face interviews conducted by well-trained staff before the carotid artery duplex scan. Information was collected on 53 risk factors, including demographic characteristics, lifestyle factors, physical examination results, medical history, and family history of stroke. Subjects also took fasting blood tests for laboratory indexes such as fasting plasma glucose (FPG), homocysteine (Hcy), total cholesterol (TC), high-density lipoprotein cholesterol (HDL-c), low-density lipoprotein cholesterol (LDL-c), hemoglobin A1C (HbA1c), triglyceride (TG), etc.
Some recorded risk factors were removed from the analyses because of missing, duplicated, or invalid values. Consequently, 30 risk factors of 2841 subjects were retained for the random forest analyses (Supplementary Table S1). These included atrial fibrillation, diabetes mellitus (DM), hypertension, dyslipidemia, overweight, physical activity, family history of stroke/coronary heart disease (CHD)/DM/dyslipidemia, smoking, alcohol drinking, diet habits (positive/negative/blend), occupation, PayStyle categories, etc.
We trained a machine-learning algorithm using random forest to distinguish patients with ACS based on the CNSSPP data. A subset of the individuals was held out and used to test the algorithm and evaluate its accuracy. The random forest was implemented by researchers from the Department of Computer Science and Technology, Nanjing University.
The good performance of random forest depends mainly on its two components, “random” and “forest”: “random” makes it resistant to over-fitting, and “forest” makes it more accurate, as shown in Figure 1. The weak classifier used in the random forest was the CART tree, also called the classification and regression tree. A decision tree is a predictive model that represents a mapping between object attributes and object values. Each internal node in the tree represents a test on an attribute, each branch represents a possible value of that attribute, and each leaf node corresponds to the value of the objects that follow the path from the root node to that leaf. A decision tree has only a single output. A risk classification for unilateral carotid stenosis is presented in Figure 2 for illustration.
The random forest was implemented in the TensorFlow code style.[18] The running environment of the random forest graph algorithm program was as follows: operating system, Ubuntu 16.04; Python version, 3.6; CPU, Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz; memory, 320 GB. We fed all the data into the edited random forest algorithm, setting the diagnosis of ACS on carotid artery color Doppler ultrasound as the target label. The modeling and verification processes are presented in Figure 3.
First, we randomly sampled the original dataset to generate a new dataset. To grow a decision tree, at each node we selected the feature whose split caused the largest change in the Gini index. Splitting stopped when the samples in each node belonged to the same category, which yielded a trained decision tree. These steps were repeated until a specified number of decision trees had been generated to form a random forest.
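The two building blocks of this procedure, resampling the data and choosing the split with the largest change in the Gini index, can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the study's TensorFlow-style implementation: sampling is assumed to be with replacement (bootstrap), splits are assumed to be binary thresholds of the form value <= t, and all function names are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini index: chance that two samples drawn from the node differ in class."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def bootstrap(rows, labels, rng):
    """Draw len(rows) records with replacement to form a new training set."""
    idx = [rng.randrange(len(rows)) for _ in rows]
    return [rows[i] for i in idx], [labels[i] for i in idx]


def best_split(rows, labels, feature_ids):
    """Return (feature, threshold, decrease) for the largest Gini decrease."""
    parent = gini(labels)
    n = len(labels)
    best = (None, None, 0.0)
    for f in feature_ids:
        for t in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            if not left or not right:
                continue  # a split must send samples to both children
            child = (len(left) * gini(left) + len(right) * gini(right)) / n
            if parent - child > best[2]:
                best = (f, t, parent - child)
    return best


rows = [[0, 5], [0, 6], [1, 5], [1, 6]]
labels = [0, 0, 1, 1]
print(best_split(rows, labels, [0, 1]))
sample_rows, sample_labels = bootstrap(rows, labels, random.Random(0))
```

In the toy data, feature 0 separates the two classes perfectly, so it is the split chosen; a full tree would apply `best_split` recursively to each child until the nodes are pure or a stopping parameter is reached.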
Each sample in the test dataset was judged by every decision tree in the trained random forest model, giving one prediction per tree. A majority vote over these predictions produced the model's final prediction label for the sample. The final predictions were compared with the true label of the sample (diagnosis by ultrasound) for receiver operating characteristic (ROC) evaluation. The area under the curve (AUC) was used as the measure of model performance.
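A short sketch may make the voting and the AUC computation concrete. It assumes binary labels (1 for ACS, 0 for non-ACS) and, as an assumption of this sketch, uses the fraction of trees voting for ACS as the score from which the ROC curve is derived; the function names are hypothetical.

```python
def majority_vote(tree_predictions):
    """Final label chosen by most trees; a tie is resolved as 0 (non-ACS)."""
    return 1 if sum(tree_predictions) * 2 > len(tree_predictions) else 0


def vote_fraction(tree_predictions):
    """Share of trees voting for class 1, usable as a ranking score."""
    return sum(tree_predictions) / len(tree_predictions)


def auc(scores, labels):
    """AUC as the chance that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


print(majority_vote([1, 1, 0]))
print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

The pairwise formulation of the AUC is equivalent to the area under the empirical ROC curve and avoids building the curve point by point.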
The data preprocessing procedures of modeling were explained briefly as follows:
Firstly, we removed similar, duplicated, or irrelevant variables. We then converted text variables to numbers using a one-to-one mapping and used -1 to fill in missing values. The dataset was then split randomly into a training set (80%) and a testing set (20%).
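The encoding, missing-value filling, and splitting described in this step can be sketched as follows. This is a minimal illustration using only the standard library; the helper names, the use of `None` to mark a missing value, and the fixed random seed are assumptions made here.

```python
import random

MISSING = -1  # the text states that missing values were filled with -1


def encode_column(values):
    """Map each distinct text value to an integer code; None becomes -1."""
    codes = {}
    encoded = []
    for value in values:
        if value is None:
            encoded.append(MISSING)
        else:
            encoded.append(codes.setdefault(value, len(codes)))
    return encoded, codes


def split_train_test(n_records, test_fraction=0.2, seed=0):
    """Shuffle record indices and split them into training and testing sets."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    n_test = round(n_records * test_fraction)
    return indices[n_test:], indices[:n_test]


encoded, codes = encode_column(["yes", "no", None, "yes"])
print(encoded, codes)
train_idx, test_idx = split_train_test(2841)
print(len(train_idx), len(test_idx))
```

With 2841 records and a 20% testing fraction, this split yields 2273 training and 568 testing records; the study's exact group sizes are not stated in the text.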
In the second step, we set the values of several model hyperparameters. Samples were then drawn from the training set to build a new sample set, from which a new decision tree was generated. We repeated the process to generate a specified number of decision trees to form a forest.[19]
The third step was to adjust the hyperparameters of the model and then repeat the second step to generate a large number of different random forests. The testing set was used to evaluate the performance of each detection model based on accuracy, true positive rate, and false positive rate. The effectiveness of the model was evaluated by the area under the receiver operating characteristic curve.
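The three metrics named in this step follow directly from the confusion counts on the testing set. A minimal sketch, assuming binary labels with 1 for ACS and 0 for non-ACS; the function names are hypothetical.

```python
def confusion_counts(predicted, actual):
    """Count true/false positives and negatives for binary labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)
    tn = sum(1 for p, a in pairs if p == 0 and a == 0)
    fn = sum(1 for p, a in pairs if p == 0 and a == 1)
    return tp, fp, tn, fn


def rate(numerator, denominator):
    """Safe ratio that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def evaluate(predicted, actual):
    """Return accuracy, true positive rate, and false positive rate."""
    tp, fp, tn, fn = confusion_counts(predicted, actual)
    return {
        "accuracy": rate(tp + tn, tp + fp + tn + fn),
        "true_positive_rate": rate(tp, tp + fn),
        "false_positive_rate": rate(fp, fp + tn),
    }


print(evaluate([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

Because only about one in nine subjects had ACS, accuracy alone can look high for a model that rarely predicts ACS, which is one reason the true and false positive rates are reported alongside it.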
The fourth step involved calculation of variable importance. After identifying the best-performing model among the large number of models generated, we calculated the Gini index of each node in each decision tree, which is the probability that two samples randomly selected from node m have inconsistent class labels. The change in the Gini index caused by the branching of the node was then calculated. Finally, the importance of each feature, also called its “weight”, was calculated from the branchings in which the feature participated.
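The quantities in this step can be written out explicitly. The formulation below is one common convention for Gini-based importance and is offered as an assumption, since the text does not give the exact aggregation used: p_{mk} is the proportion of class k in node m, n_m is the number of samples in node m, l and r are its children, n is the size of the training sample of the tree, and S_t(j) is the set of nodes in tree t that split on feature j.

```latex
\begin{align}
  \mathrm{Gini}(m) &= 1 - \sum_{k} p_{mk}^{2} \\
  \Delta\mathrm{Gini}(m) &= \mathrm{Gini}(m)
      - \frac{n_l}{n_m}\,\mathrm{Gini}(l)
      - \frac{n_r}{n_m}\,\mathrm{Gini}(r) \\
  I_j &= \sum_{t} \sum_{m \in S_t(j)} \frac{n_m}{n}\,\Delta\mathrm{Gini}(m),
  \qquad
  w_j = \frac{I_j}{\sum_{j'} I_{j'}}
\end{align}
```

Under this normalization the weights w_j sum to 1 over all features, which matches the way the weights are reported as percentages.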
To characterize the random forest algorithm, the following important parameters were used in this study:
1) N_estimators: the maximum number of iterations of a weak learner, or the largest number of weak learners. This referred to the number of decision trees in a random forest.
2) Max_features: the maximum number of features considered when partitioning nodes. The value was used to flexibly control the generation time of a decision tree.
3) Max_depth: the maximum depth of a decision tree. It could limit the number of node splits.
4) Min_samples_split: the minimum number of samples required for internal node splitting, which limited the conditions for continued subtree splitting.
5) Min_samples_leaf: the minimum number of samples for a leaf node. If the number of samples in a leaf node was less than the specified number, it would be pruned together with its sibling nodes.
The demographic, clinical, and laboratory results of the ACS group and the non-ACS group were compared using the t-test for continuous variables and the Chi-square test for categorical variables. Continuous variables were described as mean and standard deviation (SD), and categorical variables were described as percentages. All statistical analyses were performed using SPSS (Version 20.0, SPSS Inc., Chicago, IL, USA). A P value <0.05 was considered statistically significant.
Of the 2841 individuals enrolled in the study, ultrasound found carotid plaque in 1000 (35.2%) and ACS in 326 (11.5%). Comparison of the characteristics of ACS and non-ACS participants (Table 1) revealed significant differences between the two groups in sex, age, marriage, BMI, waistline, alcohol drinking (lsDrink), overweight, concurrent atrial fibrillation, DM, and hypertension, family history of stroke, CHD, DM, and dyslipidemia, and the levels of FPG, HDL-c, and LDL-c.
Through calculation and optimization, the top five risk factors for ACS in the population at high risk of stroke and their corresponding weights were as follows: positive family history of dyslipidemia, 11.8%; high level of LDL-c, 7.6%; low level of HDL-c, 7.1%; older age, 6.1%; and low BMI, 6.1%. The accumulated weight of the top 15 risk factors was 85.5%, as presented in Table 2.
The parameters of this model for detecting ACS based on machine learning were as follows:
1) N_estimators: 10
2) Min_samples_split: 100
3) Max_features: 4
4) Min_samples_leaf: 21
5) Max_depth: 19
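For readers who wish to experiment with comparable settings, the reported values can be collected in a configuration dictionary. The lower-case keys follow the argument names of scikit-learn's `RandomForestClassifier`, which is an assumption made here for convenience; the study itself reports a TensorFlow-style implementation. The helper that bounds the number of leaves per tree, and the training-set size of 2273 (about 80% of 2841), are likewise illustrative assumptions.

```python
# Hyperparameter values as reported in the list above.
REPORTED_PARAMS = {
    "n_estimators": 10,
    "min_samples_split": 100,
    "max_features": 4,
    "min_samples_leaf": 21,
    "max_depth": 19,
}


def max_leaves_upper_bound(n_train, min_samples_leaf):
    """Each leaf holds at least min_samples_leaf samples, which caps the leaf count."""
    return n_train // min_samples_leaf


# Assumed training size: roughly 80% of the 2841 enrolled subjects.
print(max_leaves_upper_bound(2273, REPORTED_PARAMS["min_samples_leaf"]))
```

The bound shows that the leaf-size constraint, rather than the depth limit of 19, is likely to be the setting that restricts tree size at this sample size.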
Based on the above parameters, we input the 30 risk factors into the random forest model. The AUC of the model for detecting ACS was 0.927 in the training dataset and 0.888 in the testing dataset. The ROC curves are shown in Figure 4.
The novel finding of our study was that a machine-learning method could be applied to detect potential ACS patients among high-risk individuals for stroke. Our results further demonstrated the potential utility of the random forest approach in detecting ACS with our datasets.
The area under the ROC curve of our random forest algorithm was 0.888 in the testing set, which was higher than that reported in previous studies.[11,12,20] Greco et al. created a model for predicting the risk of carotid artery disease based on carotid duplex scan data from 2,885,257 individuals in the Life Line Screening between 2003 and 2008.[12] The independent risk factors for ACS were advanced age, smoking, peripheral arterial disease, high blood pressure, coronary artery disease, diabetes, cholesterol, etc. A predictive scoring system was created from these independent risk factors, and its AUC was 0.753. This was a remarkable study that predicted ACS in a national cohort across the population. The AUC of our present study was higher than that of the Life Line Screening cohort, probably because: 1) the population in our study consisted mainly of selected high-risk individuals for stroke, whereas the latter was the general population;[12] and 2) the predictive models of the two studies were different.
There have also been other studies on predictive models. Jacobowitz et al.[11] created a model to predict carotid artery stenosis in a population older than 60 years with at least one of the following risk factors: history of CHD, previously diagnosed hypertension, current smoking, and family history of stroke in first-degree relatives. The study was performed on 394 individuals, of whom 9.6% had either unilateral or bilateral carotid artery stenosis. The risk factors entered into the logistic regression analysis were hypertension, cardiac disease, current smoking, and hypercholesterolemia. The model showed that the prevalence of carotid stenosis was 1.8% with no risk factor, 5.8% with one risk factor, 13.5% with two risk factors, 16.7% with three risk factors, and 66.7% with four risk factors. Qureshi et al.[20] created a simple scoring system to identify high-risk individuals for ACS based on routine data collected in a community health screening. They evaluated 1331 unselected volunteers who had no previous history of stroke, TIA, or carotid artery surgery, and found four risk factors that were significantly associated with ACS: age older than 65, current smoking, coronary artery disease, and hypercholesterolemia. The AUC of this system was 0.706.
Table 1. Comparison of demographic, clinical and laboratory characteristics between ACS and non-ACS subjects (n=2841)
In recent years, many studies using various machine-learning models to predict medical events in healthcare have been published. These studies involve clinical care workflow, cancer survival, and cardiovascular diseases.[21-23] Some studies applied machine learning to predict carotid atherosclerosis. Hu et al.[24] used machine learning to predict rapid progression of carotid atherosclerosis in 382 patients with impaired glucose tolerance; they found that the best machine-learning method was Naïve Bayes (AUC 0.797) and that machine learning could be applicable to a relatively small number of subjects. Li et al.[10] predicted carotid plaque progression in ApoE-/- mice using support vector machine and decision tree, and concluded that the method could be suitable for identifying vulnerable plaque progression in mice. Yet there has been no clinical study on detecting ACS in the population at high risk of cerebral infarction using machine learning.
Table 2. Risk factors and their weights for predicting moderate to severe carotid stenosis by the algorithm model based on random forest machine learning
The results of our study showed that family history of dyslipidemia was the most important risk factor, which has not been reported previously. The reported risk factors for ACS[20,25] include advanced age, current smoking, peripheral arterial disease, hypercholesterolemia, hypertension, diabetes mellitus, and coronary artery disease. Many studies have paid close attention to family history of stroke/hypertension/diabetes, whereas few have looked at family history of dyslipidemia.[26] Familial hypercholesterolaemia is a common genetic cause of premature coronary heart disease[27] and can lead to cardiovascular disease (CVD). Most cases are caused by autosomal dominant mutations in the low density lipoprotein receptor (LDLR) gene.[28] Family history of dyslipidemia, as a risk factor for ACS, did not gain attention in the earlier study using logistic regression.[29] The result of the current machine-learning study suggests that future studies on carotid atherosclerosis should give more attention to family history of hyperlipidemia, especially in populations with LDLR gene mutations.
The current study has several limitations. The data we retrieved from the CNSSPP formed a single-center dataset, and the number of subjects was limited. Additionally, the model is applicable only to detecting ACS in high-risk individuals for stroke, and it did not consider the influence of drug use, e.g., statins, which is an important aspect of ACS. Furthermore, the detection model provides the weight and ranking of the risk factors, but how to apply them conveniently in a practical setting needs further study.
Conflict of interests disclosed
None.
Supplementary materials
Table S1:30 risk factors for random forest analyses.
Available online at http://cmsj.cams.cn/EN/10.24920/003703.
Chinese Medical Sciences Journal, 2020, Issue 4