Similarity measurement of Chinese medicine ingredients for cold-hot nature identification

2019-11-01 03:01 Guo-Hui Wei, Xian-Jun Fu, Zhen-Guo Wang
TMR Modern Herbal Medicine, 2019, Issue 4

Guo-Hui Wei, Xian-Jun Fu, Zhen-Guo Wang*

1 Key Laboratory of Theory of TCM, Ministry of Education of China, Shandong University of Traditional Chinese Medicine, Jinan, China.

Abstract

Keywords: Traditional Chinese medicine, Chinese medicine ingredients, Ultraviolet spectrum, Similarity measurement, Cold-hot nature

Background

As one of the core elements of Traditional Chinese Medicine (TCM), the nature theory of TCM has attracted the attention of scholars and research institutions for many years. The nature of Chinese medicine (CM) comprises four types, i.e., cool, cold, hot, and warm, among which the cold and hot natures are an important part of TCM nature theory [1,2]. The principle of "treating the hot syndrome with cold nature medicine and treating the cold syndrome with hot nature medicine" indicates that the cold or hot property in medicine nature theory is an important basis for TCM treatment in regulating the balance between Yin and Yang of the human body, and the application of cold-hot medicine nature leads to effective treatment in TCM clinical medicine [3].

Numerous researchers have studied the cold-hot nature of TCM from different perspectives. Jin et al. [4] proposed a 'three-element' mathematical analysis model to study the biological character of TCM on the basis of cold-hot medicine nature. Zhao et al. [5] explored a cold/hot plate method for differentiating the cold-hot nature of Mahuang and Maxingshigan decoctions. Wan et al. [6] studied the effect of TCM with different properties on thermoregulation and on temperature-sensitive transient receptor potential ion channel proteins of rats with yeast-induced fever. Liang et al. [7] analyzed the cold and hot properties of Chinese medicinal herbs with molecular network and chemical fragment methods. Wang et al. [8] identified 59 Chinese herbal medicines (CHMs) with typical cold/hot properties using a self-organizing map. Fu et al. [9,10] investigated the anticancer activity associated with the cold-hot nature of traditional Chinese marine medicine using phylogenetic tree analysis, and explored an in silico mode-of-action method to explain the cold, hot, and neutral nature of CMs.

Generally, the discrimination of the cold-hot nature of CMs consists of two parts: feature representation and nature classification. Feature representation uses the original effects of a CM, fingerprint technology, or metabolomics methods to extract the characteristics of the CM. Nature classification uses classical machine learning classifiers or purpose-built classifiers to discriminate the cold-hot nature of CMs. The original effects of a Chinese medicine are an effective characteristic expression. Xue's research group [11-13] explored the original efficacy features of CMs in "Chinese Herbal Medicine (CHM)" and used classical classifiers (such as artificial neural networks) to classify the unknown nature of CMs. Metabolomics methods are also used to represent CMs. Nie et al. [14] studied metabonomic features of CMs and constructed a random forest model to discriminate the unknown nature of CMs. Chemical technology is an important method for analyzing the cold-hot nature. Long et al. [15] analyzed the chemical components of 284 CMs with clear medicine nature, and explored a combination system for predicting the cold-hot nature of other CMs. Other methods, such as proton nuclear magnetic resonance spectroscopy (1H-NMR), are used to investigate the features of CMs. Li et al. [16] studied the characteristics of CMs with 1H-NMR and applied pattern recognition techniques to analyze the unknown nature of CMs.

Besides classical classifiers, the retrieval scheme is a popular and effective classification approach, which has been widely applied to distinguishing benign from malignant cases in mammography and pulmonary nodules [17-20]. Compared with traditional classifiers, a retrieval scheme can provide the most similar cases for analysis and reference. The similarity measure therefore plays an important role in retrieval-based classification. Our group has done extensive research on the similarity measurement of pulmonary nodule images [20,21], in which we quantify the similarity of pulmonary nodule images with a distance metric. Although similarity measurement has been widely studied for medical images, it is rarely used in the nature identification of CMs.

Current research on medicine nature focuses on revealing the connection between CM nature and the material composition of CMs. For example, chemical fingerprinting techniques and CM nature discriminant models are applied to analyze CM material composition. The chemical fingerprint data of a CM can reflect the whole composition of its ingredients. Bioactivity is determined by material composition, and the bioactivities of CMs are the core of identifying medicine nature [2]. Thus, material composition indirectly determines the nature of CMs. Studies have found that CM ingredients are the material basis for the production of medicine natures [10]. Therefore, it is speculated that CMs with similar composition of substances should have similar medicinal nature.

To verify the hypothesis proposed above, in this work we explore the relationship between CM ingredients and cold-hot medicinal nature. Firstly, we construct a CM ingredient database by using ultraviolet (UV) spectrum technology to represent 61 CMs that have a clear cold-hot nature (30 'cold' CMs and 31 'hot' CMs). Secondly, we study how to quantify the similarity of CM ingredients with a distance metric: a Mahalanobis distance is learned to measure the similarity of the UV fingerprints of CMs. Finally, a retrieval scheme is proposed to build an identification model that predicts the cold-hot nature of CMs.

Materials and Methods

TCM Dataset

In this study, 61 representative CMs are analyzed, of which 30 are 'cold' medicines and the other 31 are 'hot' medicines. All 61 CMs are recorded in the classical 'Chinese Materia Medica' and 'Shen Nong's Herbal Classic'. Table 1 shows the 61 representative CMs and their natures (in brackets).

UV fingerprint technology is used to test the 61 CMs. The main instrument is a UV-3010 UV Spectrophotometer (Hitachi, Japan). Our group recorded the absorbance of all 61 CMs over the ultraviolet wavelength range of 190-400 nm with four different solvents (chloroform, distilled water, absolute ethanol, petroleum ether). The detailed method for obtaining the UV fingerprints is described in [25]. As an example of a 'hot' medicine, Mustard Seeds is recorded in the classical 'Chinese Materia Medica' and 'Shen Nong's Herbal Classic'. Figure 1 shows the UV absorption curves of Jiezhi (Mustard Seeds) and GeGen (Puerariae Lobatae Radix) with petroleum ether solvent.

Figure 1. UV absorption curve of Mustard Seeds (A) and Puerariae Lobatae Radix (B) with petroleum ether solvent

Table 1. The experimental 61 representative CMs

UV Fingerprint Similarity

In this study, we investigate the relationship between the cold-hot nature and the material composition of CMs. To verify the hypothesis that CMs with similar composition of substances should have similar medicinal nature, we want to quantify the similarity of CM ingredients and explore a method for identifying CM nature. A UV fingerprint reflects the material composition of a CM, so we want to reveal the cold-hot nature based on UV fingerprints. If the ingredients of two CMs are similar, we can expect their medicinal properties to be similar. Hence, CMs with similar UV fingerprints should have the same medicinal nature.

The similarity measure is defined as semantic relevance, which we have previously used to measure the similarity of lung nodule images [21]. If two CMs are both 'cold' medicines, they are semantically similar. The Mahalanobis distance is used to measure the similarity of the UV fingerprints of CMs. The smaller the Mahalanobis distance, the higher the similarity of the UV fingerprints.

Distance Metric Learning

Denote the sample dataset as $X=\{x_i\}_{i=1}^{n}$, with $x_i$ being the $i$th sample in the input space and $n$ being the total number of samples. For better presentation, we also denote a distance metric $d_M(x_i,x_j)$ as a Mahalanobis distance between $x_i$ and $x_j$, which is defined as:

$$d_M(x_i,x_j)=\sqrt{(x_i-x_j)^T M (x_i-x_j)} \qquad (1)$$

In Eq. (1), $T$ denotes the transpose of a vector or a matrix, and $M$ is a positive semi-definite matrix. If $M=I$, $d_M$ corresponds to the Euclidean distance. If $M$ is restricted to be a diagonal matrix, $d_M$ represents a distance metric in which the different axes are given different weights. More generally, $M$ parameterizes a family of Mahalanobis distances. Because $M$ is a positive semi-definite matrix, it can be decomposed into $M=AA^T$. Hence, Eq. (1) can be rewritten as:

$$d_M(x_i,x_j)=\sqrt{(x_i-x_j)^T AA^T (x_i-x_j)}=\left\|A^T(x_i-x_j)\right\|_2 \qquad (2)$$

Therefore, learning such a distance metric is equivalent to finding a linear transformation $A$ of the samples in the original high-dimensional space, after which the ordinary Euclidean distance is applied. In recent years, a variety of techniques [22] have been proposed to learn such an optimal Mahalanobis distance metric from training data that are given in the form of side information. We want to obtain $A$ from the semantic relevance.
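The equivalence between the Mahalanobis distance with $M=AA^T$ and the Euclidean distance after projection by $A^T$ can be checked numerically. The following is a minimal sketch, assuming NumPy; the function names, the random data, and the 5-to-3 dimensional projection are hypothetical and chosen only for illustration.

```python
import numpy as np

def mahalanobis(x_i, x_j, M):
    """Mahalanobis distance sqrt((x_i - x_j)^T M (x_i - x_j)) for a PSD matrix M."""
    diff = x_i - x_j
    # Clip at zero to guard against tiny negative values from rounding.
    return float(np.sqrt(max(diff @ M @ diff, 0.0)))

def projected_euclidean(x_i, x_j, A):
    """Euclidean distance between the two samples after projecting with A^T."""
    return float(np.linalg.norm(A.T @ (x_i - x_j)))

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))      # hypothetical: 5-D inputs, 3-D projected space
M = A @ A.T                      # positive semi-definite by construction
x_i, x_j = rng.normal(size=5), rng.normal(size=5)

d_m = mahalanobis(x_i, x_j, M)           # distance computed from M
d_e = projected_euclidean(x_i, x_j, A)   # distance computed from A; should match d_m
```

Working with $A$ rather than $M$ means each sample only needs to be projected once, after which any standard Euclidean nearest-neighbour search can be reused.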

Similarity Metric

We define similarity measurement as semantic relevance. Semantic relevance can be expressed through side information: if two CMs have the same nature (cold or hot), they are semantically relevant. Therefore, we learn the transformation matrix $A$ according to semantic relevance.

Semantic relevance describes class separability: the separability measure should increase when the between-class scatter grows or the within-class scatter shrinks. This can be described by the Differential Scatter Discriminant Criterion (DSDC) model [23], which is defined as:

The variation is defined as:

In (4), $S_W$ is the within-class scatter matrix and $S_B$ is the between-class scatter matrix. $\rho$ is a nonnegative tuning parameter that balances minimizing the within-class scatter against maximizing the between-class scatter. The learned matrix $A$ is the transformation matrix. With matrix $A$, we can calculate the Mahalanobis distance between the UV fingerprints of CMs.

Finally, the optimal projection $A^*$ of the optimization problem in (4) can be obtained by applying eigenvalue decomposition to the matrix $S=S_W-\rho S_B$; the projection matrix $A^*$ is constructed from the eigenvectors of $S$ corresponding to the $k$ smallest eigenvalues.
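The displayed formulas of the criterion are not reproduced in this text. One form that is consistent with the surrounding description (an eigenvalue decomposition of $S_W-\rho S_B$, keeping the $k$ smallest eigenvalues) is written out below. The orthogonality constraint and the scatter-matrix definitions are the standard ones and are assumptions here, not taken from the source; $\mu_c$, $\mu$, and $n_c$ denote the class mean, the overall mean, and the class size.

```latex
% Assumed form of the minimization referred to as (4); the constraint A^T A = I is an assumption.
\begin{align*}
A^{*} &= \arg\min_{A^{T}A = I}\ \operatorname{tr}\!\left(A^{T}\,(S_W - \rho\, S_B)\,A\right),\\
% Standard scatter-matrix definitions (assumed).
S_W &= \sum_{c}\ \sum_{x_i \in \text{class } c} (x_i - \mu_c)(x_i - \mu_c)^{T},\qquad
S_B = \sum_{c} n_c\,(\mu_c - \mu)(\mu_c - \mu)^{T}.
\end{align*}
% For symmetric S = S_W - \rho S_B, the trace is minimized over orthonormal A
% by the k eigenvectors of S with the smallest eigenvalues.
```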

The Retrieval Algorithm

1. Solve (4) by eigenvalue decomposition and obtain the transformation matrix $A^*$ from the $k$ eigenvectors corresponding to the $k$ smallest eigenvalues.

2. Calculate the Mahalanobis distance between samples $x_i$ and $x_j$ based on (2).

3. Sort the Mahalanobis distances in increasing order; the retrieval results are the samples with the smallest distances.
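The three steps above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the scatter matrices use the standard definitions (an assumption), the synthetic two-cluster data stand in for UV fingerprints, and all function names are hypothetical.

```python
import numpy as np

def scatter_matrices(X, y):
    """Standard within-class (S_W) and between-class (S_B) scatter matrices."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_W, S_B = np.zeros((d, d)), np.zeros((d, d))
    for label in np.unique(y):
        Xc = X[y == label]
        mu_c = Xc.mean(axis=0)
        centered = Xc - mu_c
        S_W += centered.T @ centered
        diff = (mu_c - mu)[:, None]
        S_B += len(Xc) * (diff @ diff.T)
    return S_W, S_B

def learn_projection(X, y, rho=5.0, k=2):
    """Step 1: eigenvectors of S = S_W - rho * S_B for the k smallest eigenvalues."""
    S_W, S_B = scatter_matrices(X, y)
    S = S_W - rho * S_B
    eigvals, eigvecs = np.linalg.eigh(S)   # eigh returns eigenvalues in ascending order
    return eigvecs[:, :k]

def retrieve(A, X_ref, x_query, K=7):
    """Steps 2-3: distances ||A^T (x_ref - x_query)||, sorted in increasing order."""
    dists = np.linalg.norm((X_ref - x_query) @ A, axis=1)
    order = np.argsort(dists)[:K]
    return order, dists[order]

# Synthetic stand-in data: two well-separated clusters in 4 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(20, 4)),
               rng.normal(3.0, 1.0, size=(20, 4))])
y = np.array([0] * 20 + [1] * 20)          # hypothetical labels: 0 = 'cold', 1 = 'hot'

A = learn_projection(X, y, rho=5.0, k=2)
order, dists = retrieve(A, X, np.zeros(4), K=5)   # query placed at the first cluster centre
```

Because `np.linalg.eigh` returns orthonormal eigenvectors, the learned `A` satisfies $A^TA=I$, so the retrieval distance is an ordinary Euclidean distance in the projected space.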

A Retrieval Scheme for Identification

With the learned Mahalanobis distance, a retrieval scheme based on the similarity metric is proposed to predict the cold-hot medicine nature. For a TCM with unknown nature, we first measure the absorbance of its UV spectrum, and then compute the similarity between its UV spectrum and those of the TCMs with known nature in the dataset. The calculated Mahalanobis distances are ranked in increasing order to retrieve the 'most similar' reference TCMs. The K 'most similar' TCMs are the reference TCMs with the smallest Mahalanobis distances to the query TCM. Each retrieved TCM is given a weight value as its similarity factor. The weighting factor is defined as

$d_k$ is the Mahalanobis distance between the query TCM and the $k$th retrieved TCM. Finally, a cold nature probability $p$ is computed to indicate the degree of coldness of this TCM, which is the quotient of the sum of the Mahalanobis distances of the retrieved cold nature medicines and the sum of the Mahalanobis distances of the K 'most similar' TCMs. The formula is defined as (C is the number of 'cold' nature medicines and H is the number of 'hot' nature medicines):

Given a threshold of $P_T=0.5$, if $p$ is above $P_T$, the queried TCM is considered 'cold'; otherwise, it is 'hot'.
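The weighting formula and the probability formula are not reproduced in this text, so the sketch below fills them with assumptions: each retrieved TCM gets an inverse-distance weight $w_k = 1/d_k$, and the cold nature probability is the share of the total weight carried by the retrieved 'cold' medicines. Both choices, and the function names, are hypothetical; only the threshold $P_T=0.5$ comes from the text.

```python
import numpy as np

def cold_probability(distances, natures, eps=1e-12):
    """Weighted share of 'cold' medicines among the K retrieved references.

    ASSUMPTION: weights are inverse distances (closer reference = larger weight);
    the source's actual weighting formula is not available here.
    """
    distances = np.asarray(distances, dtype=float)
    is_cold = np.asarray(natures) == "cold"
    weights = 1.0 / (distances + eps)      # eps avoids division by zero
    return float(weights[is_cold].sum() / weights.sum())

def predict_nature(p, threshold=0.5):
    """'cold' if the probability is above the threshold P_T, otherwise 'hot'."""
    return "cold" if p > threshold else "hot"

# Hypothetical retrieval results for K = 7 references.
p_all_hot = cold_probability([0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4], ["hot"] * 7)
p_mostly_cold = cold_probability([1, 1, 1, 1, 1, 1, 2], ["cold"] * 6 + ["hot"])
```

Under these assumptions a query whose retrieved references are all 'hot' gets probability 0, and a distant 'hot' reference among six close 'cold' ones lowers the probability only slightly.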

Performance Assessment

In this subsection, to verify the feasibility of the proposed retrieval scheme for the identification of cold-hot nature, extensive experiments are conducted to assess its performance. We compare the performance of our scheme with that of state-of-the-art classification models, including the extreme learning machine (ELM) [24], artificial neural network (ANN), and support vector machine (SVM). All experimental evaluations are based on the existing TCM dataset. The application helps to determine the unknown nature of a CM by retrieving similar UV spectra of CMs with a clear cold-hot nature. In this study, we first compared the nature identification performance of UV spectra with different solvents, and selected the solvent corresponding to the optimal identification performance. Secondly, we designed experiments to evaluate the performance of the proposed scheme, called stability evaluation. Thirdly, we illustrated the retrieval scheme with examples. Finally, an independent dataset is used to test the robustness of the proposed algorithm.

In our experiments, stability evaluation is used to analyze the performance of the proposed prediction model. Stability evaluation is carried out with the leave-one-CM-out method [20] on the whole dataset. Each time, one CM was selected as the query CM and the remaining 60 CMs served as the reference database. Because every CM was selected as the query CM once, this process was performed 61 times. In this retrieval scheme, we retrieved the K 'most similar' CMs and then obtained a 'cold' nature probability, so that 61 probabilities were calculated in total. By varying the threshold on the 'cold' nature probability, a Receiver Operating Characteristic (ROC) curve is generated. The area under the ROC curve (AUC) and the prediction accuracy (ACC) are used to evaluate the performance of our scheme. The larger the area, the more stable the model. The ACC value is the probability of correctly classifying the cold-hot nature of CMs. The formula of ACC is as follows:

The AUC and ACC values were used for the stability evaluation.
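The leave-one-CM-out protocol, ACC, and AUC can be sketched as follows. This is a simplified illustration under stated assumptions: the scorer uses plain Euclidean distance and inverse-distance weights as a stand-in for the learned metric, ACC is taken as the fraction of correctly classified samples, and the synthetic data only mimic the 30/31 class split. All names are hypothetical.

```python
import numpy as np

def knn_cold_probability(X_ref, y_ref, x_query, K=7):
    """Stand-in scorer: Euclidean distance and ASSUMED inverse-distance weights."""
    d = np.linalg.norm(X_ref - x_query, axis=1)
    idx = np.argsort(d)[:K]
    w = 1.0 / (d[idx] + 1e-12)
    return float(w[y_ref[idx] == 1].sum() / w.sum())

def leave_one_out(X, y, K=7):
    """Each sample is the query once; the remaining samples form the reference set."""
    n = len(y)
    probs = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        probs[i] = knn_cold_probability(X[mask], y[mask], X[i], K)
    return probs

def accuracy(probs, y, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    return float(((probs > threshold).astype(int) == y).mean())

def auc(probs, y):
    """AUC as the chance that a random 'cold' sample outscores a random 'hot' one."""
    pos, neg = probs[y == 1], probs[y == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))

# Synthetic stand-in for the 61 fingerprints: 30 'cold' (label 1) and 31 'hot' (label 0).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(30, 4)),
               rng.normal(3.0, 1.0, size=(31, 4))])
y = np.array([1] * 30 + [0] * 31)

probs = leave_one_out(X, y, K=7)   # 61 'cold' probabilities, one per query
acc = accuracy(probs, y)
area = auc(probs, y)
```

The pairwise AUC used here equals the area under the ROC curve obtained by sweeping the threshold, which avoids constructing the curve explicitly for a dataset this small.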

Results

Performance Evaluation with Different Solvents

The chemical fingerprints of CMs reflect their material composition. The UV spectrum is one such fingerprint, which can be applied to discriminate the cold-hot nature of CMs. In this study, we design experiments to quantitatively analyze the relationship between material composition and the nature of CMs by means of ultraviolet spectroscopy.

In this work, the classification performance of the UV spectra with different solvents (distilled water, chloroform, petroleum ether, absolute ethanol) was analyzed to select the solvent whose UV data give the best recognition performance. The leave-one-CM-out method is used to evaluate the parameters of our scheme. Figure 2 displays the ACC value curves for the medicine nature classification of the UV spectra with different solvents. The ACC value is computed as a function of the number of retrieved reference CMs (K) to obtain a more comprehensive picture of the predictive performance of the model. In Figure 2, the ACC curve under petroleum ether solvent is better than those under the other solvents, which means that the UV spectra obtained with petroleum ether have the best discriminant performance for cold-hot nature. When K is set to 7, the ACC curve under petroleum ether solvent peaks, and the identification performance reaches its maximal value of 0.803. According to the ACC curve under absolute ethanol, the UV fingerprint with absolute ethanol has the lowest predictive performance. CM nature identification with distilled water and chloroform is inferior to that with petroleum ether, but outperforms that with absolute ethanol. According to the figure, the maximum ACC values of distilled water and chloroform are both 0.656. Therefore, these two solvents are poor for predicting medicine nature.

Figure 2. The curves of ACC value for the medicine nature classification. K is the number of retrieved reference CMs

In this study, the effect of parameter $\rho$ in Eq. (4) under petroleum ether solvent is investigated to evaluate the predictive performance for cold-hot nature. The value of parameter $\rho$ is chosen from the set $\{10^{-3}, 10^{-2}, 10^{-1}, 1, 5, 10, 10^{2}, 10^{3}\}$. Figure 3A displays the ACC value curve for different $\rho$. The performance curve shows only small fluctuations, and the ACC value reaches its maximum when $\rho$ is set to 5.

The number of eigenvectors $k$ in the proposed retrieval algorithm is analyzed over the values $\{50, 100, 150, 200, 210\}$. As Figure 3B shows, a higher ACC value is achieved as $k$ increases. The maximum classification performance (ACC value) corresponds to the maximum number of eigenvectors, $k=210$.

Figure 3. The curve of ACC value under petroleum ether solvent with different $\rho$ and $k$

Model Performance Assessment

To demonstrate the feasibility and stability of our proposed retrieval scheme for identifying the cold-hot nature of CMs, this study compares the classification performance of our scheme (the retrieval scheme, denoted as "RS") with that of some classical classifiers (i.e., ANN, SVM, ELM) or classifiers used in CM nature identification. All comparative algorithms use their optimal parameters on the dataset. According to the results of the previous section, the UV spectral data under petroleum ether solvent are used to study cold-hot nature prediction. Table 2 shows the comparison of stability assessment between RS and the other algorithms. The Pearson correlation coefficient (PCC) is used as a comparative reference for measuring the similarity of UV spectra. From the prediction results of cold-hot nature, we can conclude the following. Firstly, our scheme RS performs best in the identification of cold-hot nature. In particular, RS and PCC have better identification accuracy than the classical algorithms used for comparison. This illustrates that Chinese medicines with similar ultraviolet spectra have similar medicine nature. Secondly, ANN and ELM with UV spectral data are poor at identifying medicine nature. Thirdly, the identification accuracy of SVM is better than that of ANN and ELM, but worse than that of our scheme. Finally, the stability assessment of our scheme is the best.

Table 2. Comparison of stability evaluation

Prediction Examples

The leave-one-out method is used to obtain prediction examples. Two retrieval cases returned by RS are listed in Table 3. The query Chinese medicine (first row) and its top K=7 retrieved reference CMs are shown in the table. The retrieved reference CMs are computed by RS and ranked by increasing Mahalanobis distance. A cold medicine (DiFuZhi (Kochiae Fructus)) and a hot medicine (BiBa (Piperis Longi Fructus)) serve as examples to illustrate the principle of cold-hot medicine identification. In the first column, the query medicine is Piperis Longi Fructus. Its retrieved reference medicines are all of hot nature. The calculated cold nature probability is 0, which indicates that the query medicine is likely of hot nature. In the second column, the query medicine is Kochiae Fructus. The retrieved results contain six cold nature medicines and one hot nature medicine. Its cold nature probability is 0.9464, indicating that the query medicine is more likely to be of cold nature. The prediction examples demonstrate that similar UV fingerprints can characterize the same medicine nature.

Table 3. Prediction examples based on the proposed RS

Overall Prediction Performance

In this study, we perform a holistic assessment of the proposed RS method. Table 4 shows the prediction confusion matrix of the 61 CMs. The total prediction accuracy is 80.3% (49/61). The identification accuracy for cold nature medicines is 86.7% (26/30), while the prediction accuracy for hot nature medicines is 74.2% (23/31). It can be seen that this scheme has a good prediction rate for medicines with cold nature. The recall, precision, and F-score of the 61 CM identification are listed in Table 5. Generally, our scheme has a good identification rate.

Table 4. Confusion matrix of 61 CM identification

Table 5. The recall, precision and F-score of 61 CM identification
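The counts stated in the text (26 of 30 'cold' and 23 of 31 'hot' medicines identified correctly) are enough to recompute the summary metrics. The sketch below uses the usual binary-classification definitions and treats 'cold' as the positive class, which is an assumption; the values in Table 5 may be organized differently, for example per class.

```python
def binary_metrics(tp, fn, tn, fp):
    """Accuracy, recall, precision and F-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f_score

# Counts derived from the text, with 'cold' ASSUMED to be the positive class:
# 26 of 30 cold medicines correct -> TP = 26, FN = 4
# 23 of 31 hot medicines correct  -> TN = 23, FP = 8
accuracy, recall, precision, f_score = binary_metrics(tp=26, fn=4, tn=23, fp=8)
```

With these counts the accuracy is 49/61 and the recall for 'cold' is 26/30, matching the 80.3% and 86.7% reported above.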

Robustness of the Proposed Method

An independent dataset is used to test the robustness of the proposed algorithm. In this dataset, molecular descriptors are calculated to represent the CMs, including Molweight, H.Acceptors, H.Donors, Polar.Surface.Area, Rotatable.Bonds, Sp3.Atoms, Symmetric.atoms and Amines. The detailed process has been described in [10]. In the dataset, there are 534 hot medicines and 724 cold medicines. Table 6 shows the identification confusion matrix of the 1258 CMs (534+724 medicines). The total identification accuracy is 81.1% (1020/1258). The prediction accuracy of cold nature medicines is 83.0% (443/534), while the identification accuracy of hot nature medicines is 79.7% (577/724). The experimental results demonstrate that the proposed method has good robustness. Generally, our scheme has a good prediction rate.

Table 6. Confusion matrix of 1258 CM identification

Discussion

In this study, we have explored the feasibility of classifying CM nature with a retrieval scheme based on the similarity of UV spectral data. The experimental results have demonstrated that calculating the similarity of UV spectra is an effective method for identifying unknown CM nature. Meanwhile, the experimental results support the proposed hypothesis that CMs with similar composition of substances should have similar medicinal nature.

In summary, the advantages of our research are as follows. First, to realize CM nature identification, a dataset of UV spectra of 61 reference CMs is assembled, in which each CM has a clearly cold or hot nature. Thus, it is effective and feasible for CM nature determination.

Second, cold-hot nature plays a critical role in TCM nature theory. In this study, we investigate the interrelationship between the material composition of CMs and cold-hot nature. Material composition is represented by UV spectra. The experimental evaluations have illustrated that there is a correlation between material composition and cold-hot medicine nature, which can be applied to cold-hot nature classification. Furthermore, the results support the view that material composition underlies the CM cold-hot nature.

Third, in light of the UV spectral characteristics of CMs, we investigate a retrieval scheme to identify CM nature. A distance metric is learned to measure the similarity of UV spectra. The experimental results show that our scheme performs best among the compared methods. A potential explanation is that our scheme makes fuller use of the relationship between material composition and CM cold-hot nature.

Fourth, another property demonstrated in this study is the robustness of the proposed retrieval scheme, which matters for future clinical applications. For an intelligent discriminant model, our goal is to assist researchers in reading ultraviolet spectra and identifying cold-hot nature. The model is not feasible if its robustness is too low on an independent TCM dataset. We have demonstrated in the experiments that our model has high robustness. More CM fingerprint data will be extracted to confirm the robustness of our model in the future.

However, our research still has some limitations. First, this study only used UV spectra to represent the CMs; other fingerprint techniques are not analyzed. CMs are complex mixtures of compounds, and it is impossible to reflect the whole composition of CM compounds with only one fingerprint technique. In the future, we want to use multiple fingerprints to analyze CM cold-hot nature. Second, we investigate the similarity of UV spectra with a distance metric. The fingerprint data are high-dimensional with a small sample size. Designing a forecasting model suited to such characteristics is a focus of future work. Third, our study focuses on exploring a retrieval scheme for cold-hot nature classification, and the UV spectrum features have not been thoroughly analyzed. Subsequently, we will integrate more effective fingerprint data to improve medicine nature classification performance.

Our study offers not only a method for nature identification, but also a new scheme for finding nature markers of Chinese medicines. The nature marker is a novel concept, denoting the ingredients of Chinese medicines that are closely related to medicine nature. With our nature identification scheme, we want to look for Chinese medicines with similar ingredients under the same nature restriction conditions. The ingredients shared by several such Chinese medicines can be considered the nature markers of these medicines.

Conclusion

In this study, a retrieval scheme is proposed to predict cold-hot medicine nature. Based on the characteristics of CMs, this scheme achieves better classification performance than the classical classifiers compared. The experiments demonstrate that cold-hot medicine nature and UV spectral fingerprint data are related.