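Zhu Zhichen, Wang Qiang, Jia Qingzhu, Tang Hongmei, Ma Peisheng
(1 School of Science, Tianjin Institute of Urban Construction, Tianjin 300384; 2 School of Material Science and Chemical Engineering, Tianjin University of Science and Technology, Tianjin 300457; 3 School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072)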
Critical micelle concentration (CMC) is an important defining property of a surfactant. At the CMC, many physicochemical properties of the surfactant solution, such as surface tension, interfacial tension, conductivity, osmotic pressure, detergency, emulsification, and foaming, change dramatically. Hence, the CMC can be regarded as one of the most useful parameters for characterizing surfactants, and it can be correlated with many industrially important properties [1-7]. It is therefore valuable to be able to predict the CMC directly from molecular structure.
The CMC depends on molecular structure. On the basis of experimental data, many empirical equations relating the CMC to the various structural units of surfactants have therefore been obtained [8,9]. These empirical relationships are simple to evaluate, but they are not suitable for general estimation of the CMC of compounds other than linear alkyl or alkylphenol ethoxylates.
With increased computational power and the development of modern quantitative structure-activity/property relationship (QSAR/QSPR) approaches, useful methods for the prediction of CMC have become available [10-21]. For instance, using the program CODESSA (comprehensive descriptors for structural and statistical analysis), Huibers et al. [10] proposed a three-parameter QSPR model for a set of 77 non-ionic surfactants using only topological descriptors. Li et al. [12] reported a three-parameter regression equation with good statistical characteristics, which involves variables such as the maximum atomic charge on the carbon atom and the dipole moment. In another study, Katritzky et al. [13] proposed a general QSPR model for a wide range of sodium salts, potassium alkanecarboxylates, and p-isooctylphenol ethoxylated phosphates. In David's research [15], starting from about 233 topological descriptors for the Molconn model and a separate set of 175 descriptors for the ADAPT model, a final ADAPT model using 5 terms and a Molconn model using 7 terms were developed for lg(CMC) prediction of 175 surfactants of diverse chemical structure. More recently, using a heuristic algorithm and multiple linear regression (MLR), Guo et al. [16] developed a linear model for CMC prediction for a panel of 120 structurally diverse gemini surfactants. However, no single topological index has yet been found that can be used universally in optimal correlations; that is, the uniform applicability of topological indices to compounds of wide structural diversity still presents many difficulties. For instance, no rigorous method has been found to calculate topological indices for compounds containing heteroatoms. Topological indices also fail to account for hydrogen bonding and long-range effects in condensed media. It is therefore worth examining whether a single set of descriptors or a single topological index can be used to build a universal model that gives good predictions for a range of properties.
Recently, we proposed a universal positional distributive group contribution (PDGC) theory for the prediction of various properties (critical temperature, melting point, vaporization enthalpy, and so on) of a diverse set of organic compounds [22-27]. Compared with the prediction methods currently in use, the PDGC method offers significant improvements in both accuracy and stability. These previous works [22-27] suggested that the same universal framework could be used to predict both the critical properties and the thermodynamic properties of organic compounds containing various functional groups. More recently, our research group proposed a topological index based on atomic characteristics (e.g., atomic radius and electronegativity), which has been used successfully to predict properties such as the decomposition temperature and toxicity of ionic liquids [28-30].
The major objectives of this study are therefore: (i) to propose a new topological index; (ii) to establish a more general QSPR model for predicting the CMC of structurally diverse molecules; and (iii) to compare the performance of our model with that of other methods.
The CMC data used in this work cover 175 anionic surfactants and are provided in Table S1 (Supporting Information). Following the method described by Huibers et al. [11], the CMC values used were either measured at 40 °C in pure water or measured at 25 °C and adjusted to 40 °C. lg(CMC) was used as the dependent variable in the modeling work. To keep this method consistent with the comparison methods [15], 153 surfactants were chosen as the training set and 22 surfactants as the test set.
Structure building and the entire modeling work were carried out using HyperChem 7.0. The energy of each molecule was minimized using Allinger's MM force field, followed by the semiempirical PM3 method. The most stable structure of each compound was generated and used to calculate various physicochemical descriptors, including thermodynamic, steric, and electronic descriptors.
A new topological index based on chemical graphs was introduced. The index is an extended distance matrix, from which the extended adjacency matrix, the extended interval matrix, and the extended interval jump matrix are further deduced. It is designed to capture detailed structural information about a molecule through atomic parameters: the electronegativity fraction, the van der Waals radius, the minimum bond length to adjacent non-hydrogen atoms, the number of adjacent hydrogen atoms, and the number of adjacent non-hydrogen atoms. All of these parameters are important ingredients for modeling the physicochemical properties of molecules.
The matrices considered in this work are as follows.
The distance matrix is $M_d = (a_{ij})$, where $a_{ij} = n$ if the path length between atoms $i$ and $j$ is $n$.
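The distance matrix defined above can be computed from a bond list with a breadth-first search. The following is a minimal sketch, not the authors' implementation: the function name `distance_matrix`, the atom numbering, and the hydrogen-suppressed 2-methylbutane skeleton are all assumptions made for illustration, and the text does not state whether hydrogen atoms are included in the graph.

```python
from collections import deque

import numpy as np


def distance_matrix(n_atoms, bonds):
    """Return the topological distance matrix of a connected molecular graph.

    Entry (i, j) is the number of bonds on the shortest path between
    atoms i and j. `bonds` is a list of (i, j) index pairs.
    """
    neighbors = [[] for _ in range(n_atoms)]
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    dist = np.full((n_atoms, n_atoms), -1, dtype=int)  # -1 marks "not reached yet"
    for start in range(n_atoms):
        dist[start, start] = 0
        queue = deque([start])
        while queue:  # breadth-first search from `start`
            atom = queue.popleft()
            for nxt in neighbors[atom]:
                if dist[start, nxt] < 0:
                    dist[start, nxt] = dist[start, atom] + 1
                    queue.append(nxt)
    return dist


# Hypothetical example: carbon skeleton of 2-methylbutane, C0-C1(-C4)-C2-C3.
isopentane_bonds = [(0, 1), (1, 2), (2, 3), (1, 4)]
M_d = distance_matrix(5, isopentane_bonds)
print(M_d)
```

Breadth-first search is sufficient here because every bond counts as one step; a weighted graph would need a different shortest-path algorithm.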
The following shows the constituents of the extended matrix, $M_e$.
Thus, an extended distance matrix $M_D$, which is the new topological index of this work, is defined as below.
The extended adjacency matrix $M_A$, the extended interval matrix $M_B$, and the extended interval jump matrix $M_C$ are then deduced from $M_D$.
The analysis consists of multiple linear regression between the molecular descriptors and the logarithm of the CMC. With the new topological index, the regression is carried out by ordinary least squares (OLS).
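As an illustration of the OLS step, the sketch below fits a linear model with an intercept using `numpy.linalg.lstsq`. The descriptor matrix `X`, the coefficients, and the response `y` are made-up placeholder numbers, not data or parameters from this work; the response is generated without noise so that the fit can be checked against the known coefficients.

```python
import numpy as np

# Placeholder descriptor matrix: 6 hypothetical compounds x 3 hypothetical descriptors.
X = np.array([
    [1.0, 0.5, 2.0],
    [2.0, 1.5, 1.0],
    [3.0, 0.2, 0.5],
    [4.0, 2.5, 3.0],
    [5.0, 1.0, 1.5],
    [6.0, 3.0, 0.1],
])
true_coeffs = np.array([0.3, -0.8, 0.1])  # made-up slopes
true_intercept = -1.2                     # made-up intercept
y = true_intercept + X @ true_coeffs      # noise-free synthetic response

# OLS: prepend a column of ones for the intercept, then minimize ||A beta - y||_2.
A = np.column_stack([np.ones(len(X)), X])
beta, _, rank, _ = np.linalg.lstsq(A, y, rcond=None)
intercept, coeffs = beta[0], beta[1:]
print("intercept:", intercept)
print("coefficients:", coeffs)
```

With real data the residuals would not vanish, and the quality of the fit would be judged with statistics such as $R^2$.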
Using the new topological index, the QSPR model for lg(CMC) prediction is expressed as Eq. (1).
Here, norm($M_D$,1) is the largest column sum of matrix $M_D$, norm($M_D$,2) is the largest singular value of $M_D$, norm($M_D$,fro) is the Frobenius norm of $M_D$, tanh is the hyperbolic tangent function, $N_0$ is the total number of non-hydrogen atoms, $N$ is the total number of atoms, $M_W$ is the molecular weight, $M_0$ is an added constant, and $a$, $b_1$, $b_2$, and $b_3$ are regression parameters, which are summarized in Table 1.
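The three matrix norms named above map directly onto `numpy.linalg.norm`. The sketch below evaluates them for the plain topological distance matrix of a three-atom chain, which is used only as a stand-in: the extended distance matrix $M_D$ of this work carries additional atomic parameters that are not reproduced here.

```python
import numpy as np

# Stand-in matrix: topological distances in a 3-atom chain (atoms 0-1-2).
M = np.array([
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
])

norm_1 = np.linalg.norm(M, 1)        # largest absolute column sum
norm_2 = np.linalg.norm(M, 2)        # largest singular value
norm_fro = np.linalg.norm(M, "fro")  # square root of the sum of squared entries

print(norm_1, norm_2, norm_fro)
```

For this symmetric matrix the largest singular value equals the largest absolute eigenvalue, $1+\sqrt{3}$, and the Frobenius norm is $\sqrt{12}$.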
The lg(CMC) values predicted with this QSPR model (Eq. (1)) are listed in Table S1 (Supporting Information), and the scatter plot of calculated versus experimental lg(CMC) for this regression is presented in Fig. 1. With this model, the average relative differences (ARDs) for lg(CMC) prediction are 8.20% for the training set and 6.76% for the test set. The quality of the model is further supported by $R^2$ values of 0.9295 for the training set and 0.9257 for the test set. Overall, the results in Table S1 and Fig. 1 indicate that the predicted lg(CMC) values agree well with the experimental results, which shows that the QSPR model based on the new topological index has good overall accuracy for predicting lg(CMC). To illustrate the application of the proposed method more clearly, a detailed procedure for the lg(CMC) estimation is given in the Supporting Information.
Table 1 Parameters for prediction of lg(CMC) based on this model for the set of 175 surfactants
This QSPR model for lg(CMC) prediction is compared with the ADAPT model and the Molconn model from David's research [15]. The comparison results for lg(CMC) prediction are also shown in Table S1, and some statistical metrics calculated using Eqs. (2-7) are listed in Table 2. PRESS stands for the predicted residual error sum of squares, RD is the relative deviation, AD is the absolute difference, and ARD is the average relative difference. The results in Table 2 indicate that the new model proposed in this work gives a higher $R^2$ (0.9295) than the ADAPT model (0.8765) and the Molconn model (0.9184).
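Because Eqs. (2-7) are not reproduced in this text, the sketch below uses common textbook definitions of these statistics; they are assumptions and may differ in detail from the authors' equations (for example, $R^2$ is computed here as a coefficient of determination rather than a squared correlation coefficient). The function name `regression_metrics` and the three example values are placeholders, not data from Table S1.

```python
import numpy as np


def regression_metrics(y_exp, y_pred):
    """Return AD, RD (%), ARD (%), sum of squared errors, and R^2.

    Assumed definitions (not taken from the paper's Eqs. (2-7)):
      AD  = |pred - exp|
      RD  = 100 * (pred - exp) / |exp|
      ARD = mean(|RD|)
      SSE = sum((pred - exp)^2); it equals PRESS when the predictions
            come from leave-one-out cross-validation
      R^2 = 1 - SSE / sum((exp - mean(exp))^2)
    """
    y_exp = np.asarray(y_exp, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_pred - y_exp
    ad = np.abs(residual)
    rd = 100.0 * residual / np.abs(y_exp)
    sse = float(np.sum(residual ** 2))
    ss_tot = float(np.sum((y_exp - y_exp.mean()) ** 2))
    return {
        "AD": ad,
        "RD": rd,
        "ARD": float(np.mean(np.abs(rd))),
        "SSE": sse,
        "R2": 1.0 - sse / ss_tot,
    }


# Placeholder values for three hypothetical surfactants.
metrics = regression_metrics([-2.0, -1.0, -4.0], [-2.2, -0.9, -4.0])
print(metrics)
```

Dividing by the absolute experimental value keeps the sign of RD meaningful when lg(CMC) is negative, which is the usual case for CMC values below 1 mol/L.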
Fig. 1 Scatter plot showing the correlation between lg(CMC) predicted by this model and experimental lg(CMC) for a diverse set of 175 surfactant structures
Table 2 Summary of regression results for prediction of lg(CMC) for this set of surfactants, based on this model, the ADAPT model, and the Molconn model, and predictive ability tested by leave-one-out cross-validation
It should be stressed that the new topological index proposed in this work is composed of two parts. The first part is the molecular distance matrix, by which the structure of a molecule can be described objectively and quantitatively. The second part is the extended matrix (also established in this work), from which the structure and composition of the molecule can be identified in further detail. Combining the molecular distance matrix and the extended matrix gives the new topological index, named the "extended distance matrix". With this index, the structure and composition of molecules can be characterized stably, accurately, and completely, and isomers can be distinguished well. In addition, to improve the predictions, the hyperbolic tangent function has been used in model development, which is consistent with our previous works [22-27].
Fig. 2 Scatter plot showing the correlation between lg(CMC) predicted by leave-one-out cross-validation and experimental lg(CMC) for surfactant structures
Fig. 3 Distributions of the relative deviation (RD) obtained with the model and with leave-one-out cross-validation
Most importantly, only 18 indices are considered in developing the model for lg(CMC) prediction in this method, 15 of which are deduced from a single topological index. In David's research [15], by contrast, about 233 topological descriptors were initially used in the Molconn model-development exercise, and a set of 175 descriptors was computed and used to develop the ADAPT model. The number of descriptors considered in our method is thus much smaller than in David's method, yet our model gives better lg(CMC) predictions. This indicates that the method based on the new topological index can improve both the accuracy and the stability of lg(CMC) prediction.
The most frequently used technique for validating the predictive ability of QSPR models is the leave-one-out algorithm. The results of leave-one-out cross-validation of this model are also listed in Table 2, and the values calculated by leave-one-out cross-validation are compared with the experimental lg(CMC) data in Fig. 2. With leave-one-out cross-validation, the $R^2$ and PRESS values are 0.9216 and 11.0512, respectively, which are acceptable and close to the results calculated with Eq. (1). In addition, the RD distributions of leave-one-out cross-validation are compared with those of Eq. (1) in Fig. 3, which shows that the two distributions are similar. The leave-one-out cross-validation therefore indicates that our QSPR model (Eq. (1)) based on the new topological index has good predictive stability and reliability for predicting lg(CMC).
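The leave-one-out procedure can be sketched as follows for a linear model fitted by OLS: each compound is removed in turn, the model is refitted on the remaining compounds, and the removed compound is predicted. The function name `loo_predictions` and the five-point data set are hypothetical, and the cross-validated $R^2$ is computed here as $1 - \mathrm{PRESS}/SS_{tot}$, which is an assumed definition.

```python
import numpy as np


def loo_predictions(X, y):
    """Leave-one-out predictions for an OLS model with intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])  # design matrix with intercept
    preds = np.empty(len(y))
    for k in range(len(y)):
        keep = np.arange(len(y)) != k          # drop sample k
        beta, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        preds[k] = A[k] @ beta                 # predict the held-out sample
    return preds


# Hypothetical one-descriptor data set: y = 1 + 2x plus small fixed deviations.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x + np.array([0.1, -0.1, 0.0, 0.1, -0.1])

y_loo = loo_predictions(x.reshape(-1, 1), y)
press = float(np.sum((y_loo - y) ** 2))                  # predicted residual sum of squares
q2 = 1.0 - press / float(np.sum((y - y.mean()) ** 2))    # cross-validated R^2
print(press, q2)
```

A cross-validated $R^2$ close to the fitted $R^2$, as reported above for this model, is the usual sign that the regression is not dominated by a few influential compounds.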
A new topological index based on chemical graphs is proposed for the lg(CMC) prediction of surfactants of diverse chemical structure. A stable and accurate QSPR model has been developed from 18 indices, 15 of which are deduced from the newly proposed topological index. The results indicate that lg(CMC) can be predicted successfully with the new model: $R^2$ is 0.9295 for the training set and 0.9257 for the test set, and the corresponding ARD values are 8.20% and 6.76%. The model gives better predictions than the ADAPT and Molconn models [15], even though fewer descriptors were considered than in those models. Leave-one-out cross-validation further demonstrates that the QSPR model based on the new topological index has good predictive stability and reliability for lg(CMC). Moreover, the prediction results show that the proposed topological index can be used to predict lg(CMC) for surfactants of diverse chemical structure with a significant degree of confidence.
Supporting Information: Experimental and predicted lg(CMC) values based on this model, the ADAPT model, and the Molconn model for the set of 175 surfactants, together with a detailed procedure for the lg(CMC) estimation, have been included. This information is available free of charge via the internet at http://www.whxb.pku.edu.cn.
(1) Katritzky, A. R.; Kuanar, M.; Slavov, S.; Hall, C. D.; Karelson, M.; Kahn, I.; Dobchev, D. A. Chem. Rev. 2010, 110, 5714. doi:10.1021/cr900238d
(2) Huang, Z. J.; Tan, C. H.; Huang, X. G. Acta Phys.-Chim. Sin. 2010, 26(5), 1271. [黄振健, 谭春华, 黄旭光. 物理化学学报, 2010, 26(5), 1271.] doi:10.3866/PKU.WHXB20100535
(3) Liu, J. X.; Guo, Y. J.; Zhu, Y. W.; Yang, H. P.; Yang, X. S.; Zhong, J. H.; Li, H. B.; Luo, P. Y. Acta Phys.-Chim. Sin. 2012, 28(7), 1757. [柳建新, 郭拥军, 祝仰文, 杨红萍, 杨雪杉, 钟金杭, 李华兵, 罗平亚. 物理化学学报, 2012, 28(7), 1757.] doi:10.3866/PKU.WHXB201204231
(4) Han, L. J.; Ye, Z. B.; Chen, H.; Luo, P. Y. Acta Phys.-Chim. Sin. 2012, 28(6), 1405. [韩利娟, 叶仲斌, 陈洪, 罗平亚. 物理化学学报, 2012, 28(6), 1405.] doi:10.3866/PKU.WHXB201203202
(5) Wu, X. N.; Zou, W. S.; Zhao, J. X. Acta Phys.-Chim. Sin. 2012, 28(5), 1213. [吴晓娜, 邹文生, 赵剑曦. 物理化学学报, 2012, 28(5), 1213.] doi:10.3866/PKU.WHXB201203053
(6) Zhao, J. M.; Li, J. Acta Phys.-Chim. Sin. 2012, 28(3), 623. [赵景茂, 李俊. 物理化学学报, 2012, 28(3), 623.] doi:10.3866/PKU.WHXB201112293
(9) Ravey, J. C.; Gherbi, A.; Stebe, M. J. Prog. Colloid Polym. Sci. 1988, 76, 234. doi:10.1007/BFb0114157
(10) Huibers, P. D. T.; Lobanov, V. S.; Katritzky, A. R.; Shah, D. O.; Karelson, M. Langmuir 1996, 12, 1462. doi:10.1021/la950581j
(11) Huibers, P. D. T.; Lobanov, V. S.; Katritzky, A. R.; Shah, D. O.; Karelson, M. J. Colloid Interface Sci. 1997, 187, 113. doi:10.1006/jcis.1996.4680
(12) Li, X. F.; Zhang, G. Y.; Dong, J. F.; Zhou, X. H.; Yan, X. C.; Luo, M. D. J. Mol. Struct. 2004, 710, 119.
(13) Katritzky, A. R.; Pacureanu, L.; Dobchev, D.; Karelson, M. J. Chem. Inf. Model. 2007, 47, 782. doi:10.1021/ci600462d
(14) Roberts, D. W. Langmuir 2002, 18, 345. doi:10.1021/la0108050
(15) David, T. S. J. Comput. Aided Mol. Des. 2008, 22, 441. doi:10.1007/s10822-008-9204-9
(16) Guo, C.; Zhou, P.; Shao, J.; Yang, X.; Shang, Z. Chemosphere 2011, 84(11), 1608. doi:10.1016/j.chemosphere.2011.05.031
(17) Anna, M.; Bozenna, R. R. J. Math. Chem. 2011, 49, 276. doi:10.1007/s10910-010-9738-7
(18) Kunal, R.; Humayun, K. Chem. Eng. Sci. 2012, 81, 169. doi:10.1016/j.ces.2012.07.008
(19) Yuan, S.; Cai, Z.; Xu, G.; Jiang, Y. J. Disper. Sci. Technol. 2002, 23, 465. doi:10.1081/DIS-120014014
(20) Kardanpour, Z.; Hemmateenejad, B.; Khayamian, T. Anal. Chim. Acta 2005, 531, 285. doi:10.1016/j.aca.2004.10.028
(21) Katritzky, A. R.; Pacureanu, L. M.; Slavov, S. H.; Dobchev, D. A.; Shah, D. O.; Karelson, M. Comput. Chem. Eng. 2009, 33, 321. doi:10.1016/j.compchemeng.2008.09.011
(22) Wang, Q.; Ma, P. S.; Jia, Q. Z.; Xia, S. Q. J. Chem. Eng. Data 2008, 53, 1103. doi:10.1021/je700641j
(23) Wang, Q.; Jia, Q. Z.; Ma, P. S. J. Chem. Eng. Data 2008, 53, 1877. doi:10.1021/je800207c
(24) Wang, Q.; Ma, P. S.; Wang, C.; Xia, S. Q. Chin. J. Chem. Eng. 2009, 17, 254. doi:10.1016/S1004-9541(08)60202-5
(25) Wang, Q.; Jia, Q. Z.; Ma, P. S. J. Chem. Eng. Data 2009, 54, 1916. doi:10.1021/je9001152
(26) Wang, Q.; Jia, Q. Z.; Ma, P. S. J. Chem. Eng. Data 2009, 54, 1916. doi:10.1021/je9001152
(27) Wang, Q.; Jia, Q. Z.; Ma, P. S. J. Chem. Eng. Data 2012, 57, 169. doi:10.1021/je200971z
(28) Yan, F.; Xia, S.; Wang, Q.; Ma, P. S. J. Chem. Eng. Data 2012, 57, 805. doi:10.1021/je201023a
(29) Yan, F.; Xia, S.; Wang, Q.; Ma, P. S. J. Chem. Eng. Data 2012, 57, 2252. doi:10.1021/je3002046
(30) Yan, F.; Xia, S.; Wang, Q.; Ma, P. S. Ind. Eng. Chem. Res. 2012. doi:10.1021/ie301764j