Research Journal of Chemical Sciences, ISSN 2231-606X
Vol. 4(7), 24-29, July (2014), Res. J. Chem. Sci.
International Science Congress Association

Comparative Evaluation of Multiple Linear Regression and Support Vector Machine aided Linear and Non-linear QSAR Models

Joshi Shobha 1, Sharma Sonal 2, Yadav Mukesh 3*
1 Govt. Shaheed Bhagirath Silawat College, Depalpur, Indore, MP, INDIA
2 Dept. of Chemistry, Govt. Holkar Science College, Indore, MP, INDIA
3 Dept. of Pharmaceutical Chemistry, Softvision College, Indore, MP, INDIA

Available online at: www.isca.in, www.isca.me
Received 16th May 2014, revised 17th June 2014, accepted 14th July 2014

Abstract
Type 2 diabetes remains a major challenge to human health management. Protein tyrosine phosphatase 1B has been continuously explored for its therapeutic potential to treat type 2 diabetes, as it is linked with negative regulation of insulin signal transduction. QSAR studies were performed on derivatives of 2-arylsulphonylaminobenzothiazoles. MLR aided linear and SVM aided linear and non-linear models were obtained and further evaluated to identify descriptors revealing the underlying structure-activity relationship. The QSAR models were validated through a series of validation techniques, namely Y-randomization and descriptor sensitivity, in addition to internal validation parameters. The information content index (IC1) of neighbourhood symmetry of order 1 was found to be a key molecular descriptor participating in and regulating the structure-activity relationship of 2-arylsulphonylaminobenzothiazole derivatives. Geary autocorrelations weighted by atomic masses and polarizability are also actively correlated with the biological response of tyrosine phosphatase 1B inhibitors.

Keywords: Type 2 diabetes, tyrosine phosphatase 1B inhibitors, Linear and non-linear QSAR models, MLR, SVM.

Introduction
Type 2 diabetes mellitus is a chronic and progressive metabolic disorder. Obesity and insulin resistance are very common risk factors for developing type 2 diabetes mellitus. Diabetes involves a high level of glucose in the blood plasma. Many people with type 2 diabetes mellitus also have hypertension and a high level of cholesterol. All of these factors can cause long-term complications such as neuropathy, retinopathy, nephropathy and cardiovascular disorders [1,2].

Protein tyrosine phosphatase 1B was isolated from human placental tissue in 1988 and crystallized in 1994. Protein tyrosine phosphatase 1B is the most promising molecular-level rational therapeutic target for the efficacious treatment of type 2 diabetes mellitus. The cytosolic non-receptor protein tyrosine phosphatase, PTP1B, is a key factor in the negative regulation of insulin signal transduction [4,5]. PTP1B inhibitors block the PTP1B mediated negative insulin signal transduction and lead to stimulation of insulin activity. Therefore, PTP1B can be considered the most attractive target for type 2 diabetes mellitus [6,7]. The inhibitors of PTP1B are classified into four categories: difluoromethylene phosphates, 2-carbomethoxybenzoic acid, 2-oxalylaminobenzoic acid and hydrophobic compounds.

QSAR methods attempt to find the relationship between an end point (biological activity) and chemical structure, which allows the prediction of the potency of a drug [8-10]. Machine learning is the field of artificial intelligence associated with the study of computer algorithms that improve on their own through experience [11]. A few machine learning approaches, such as ANN [12], SVM [13], Decision Tree [14] and Bayes Classifier [15], are used in QSAR modelling, while multiple linear regression is the most extensively used method to construct QSAR models [16].

Methodology
A dataset of twenty seven (27) derivatives of 2-arylsulphonylaminobenzothiazole was taken from the literature [17] for the QSAR study. The 3D structures of the molecules were drawn with the software Marvin Sketch 5.1.5 (developed by Chemaxon Ltd.) [18]. Structural details and experimental biological activity are reported in table-1.

Figure-1
Structure of 2-arylsulphonylaminobenzothiazole and scheme of derivatives
Table-1
Structural details and experimental biological activity for derivatives 1-27

Compound   R1        R2        Antiobesity effect (Pa/Pi)
1          CH3       H         0.892
2          CH3       CH3       0.897
3          CH3       OCH3      0.853
4          CH3       NO2       0.835
5          CH3       NHCOCH3   0.819
6          CH3       Cl        0.884
7          OCH3      H         0.889
8          OCH3      CH3       0.863
9          OCH3      OCH3      0.894
10         OCH3      NO2       0.833
11         OCH3      NHCOCH3   0.809
12         OCH3      Cl        0.882
13         OCH2CH3   H         0.880
14         OCH2CH3   CH3       0.858
15         OCH2CH3   OCH3      0.858
16         OCH2CH3   NO2       0.817
17         OCH2CH3   NHCOCH3   0.813
18         OCH2CH3   Cl        0.872
19         NO2       H         0.888
20         NO2       CH3       0.860
21         NO2       OCH3      0.845
22         NO2       NO2       0.893
23         NO2       NHCOCH3   0.809
24         NO2       Cl        0.880
25         NO2       F         0.861
26         F         NO2       0.836
27         Cl        NO2       0.881
Descriptors for each derivative were computed with the E-Dragon software [19]. A large pool of descriptors was calculated for each molecule. Highly correlated descriptors, and descriptors having constant, missing or zero values, were removed by pruning. For the QSAR study of the 2-arylsulphonylaminobenzothiazole derivatives, the molecular descriptor subset was obtained by the forward selection method [20].

Linear QSAR models were developed using multiple linear regression, the simplest and most popular method, and the support vector machine aided linear method, while non-linear QSAR models were developed using the Gaussian kernel function aided support vector machine [21]. A support vector machine classifies data by constructing the best hyperplane, applying the kernel trick to separate molecules into two classes [22]. Three types of kernel function are available: linear, Gaussian and polynomial.

The robustness of the QSAR models obtained after linear and non-linear regression was evaluated using leave-one-out internal cross validation (R²CV) and the predictive error sum of squares (PRESS). To protect against chance correlation, the Y-randomization method was performed [23,24]. Descriptor sensitivity analysis was performed to identify the most sensitive descriptor [25].

Results and Discussion
Multiple linear regression: Stepwise multiple linear regression gave three significant QSAR models. The tri-variable model was selected as the most significant model and is given in equation-1.

Pa/Pi = 2.1464 − 0.7575 (0.8735) [GATS3m] − 0.1624 (0.0093) [IC1] + 0.0056 (0.0014) [RDF105m]    (1)

N = 27, R² = 0.9393, R²A = 0.9303, F = 116.63, S.E. = 0.014, R²CV = 0.9117, SPRESS = 0.1905, RSS = 0.0196

Herein, N = number of compounds, R² = coefficient of determination, R²A = adjusted R², F = Fisher's statistic and S.E. = standard error. R²CV = leave-one-out (LOO) cross validation parameter, SPRESS = standard deviation based on the predictive error sum of squares, and RSS = residual sum of squares.
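The fit and validation statistics defined above can be illustrated with a short sketch. It is not the software used in the study: it is a minimal pure-Python least-squares fit, assuming the common conventions R² = 1 − RSS/TSS, adjusted R² = 1 − (1 − R²)(N − 1)/(N − p − 1) and R²CV = 1 − PRESS/TSS. The descriptor matrix `X_demo` and activities `y_demo` are invented placeholder numbers, not the E-Dragon descriptors of the 27 compounds, and all function names are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def ols_fit(X, y):
    """Return [intercept, b1, ..., bp] from the normal equations."""
    Z = [[1.0] + list(row) for row in X]
    p = len(Z[0])
    XtX = [[sum(z[i] * z[j] for z in Z) for j in range(p)] for i in range(p)]
    Xty = [sum(z[i] * yi for z, yi in zip(Z, y)) for i in range(p)]
    return solve(XtX, Xty)


def predict(coef, row):
    """Evaluate the fitted linear model for one descriptor row."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))


def fit_statistics(X, y):
    """R2, adjusted R2, RSS, and leave-one-out PRESS and R2CV."""
    n, p = len(y), len(X[0])
    coef = ols_fit(X, y)
    mean_y = sum(y) / n
    tss = sum((yi - mean_y) ** 2 for yi in y)
    rss = sum((yi - predict(coef, xi)) ** 2 for xi, yi in zip(X, y))
    r2 = 1.0 - rss / tss
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    press = 0.0
    for i in range(n):
        # Refit without compound i, then predict the held-out compound.
        c = ols_fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - predict(c, X[i])) ** 2
    return {"coef": coef, "r2": r2, "r2_adj": r2_adj, "rss": rss,
            "press": press, "q2": 1.0 - press / tss}


# Placeholder data (NOT from the paper): 8 "compounds", 3 "descriptors".
X_demo = [[0.95, 2.10, 4.0], [1.02, 2.35, 6.5], [0.88, 2.60, 3.2],
          [1.10, 2.05, 8.1], [0.97, 2.80, 5.0], [1.05, 2.50, 2.4],
          [0.91, 2.20, 7.3], [1.00, 2.70, 9.0]]
y_demo = [0.89, 0.86, 0.83, 0.90, 0.81, 0.84, 0.88, 0.83]
stats = fit_statistics(X_demo, y_demo)
print({k: v for k, v in stats.items() if k != "coef"})
```

Because every leave-one-out residual is at least as large in magnitude as the corresponding fitted residual, PRESS is never smaller than RSS, so the cross-validated value cannot exceed R² under these conventions.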
Among the three QSAR models, the statistically significant value of R²CV (0.9117), the lowest value of PRESS (0.1905) and the highest value of RSS (0.0196) show the tri-variable model to be the most predictive and statistically fit model.

SVM regression: We used the linear kernel function for SVM aided linear regression and the Gaussian kernel function for SVM
aided non-linear regression, at fixed values of the cost parameter C (100), the ε-insensitive loss function (0.1) and sigma (0.1). The statistical parameters R² and S.E. and the validation parameters R²CV, PRESS and RSS for MLR and SVM aided linear and non-linear regressions are summarised in table-2.

Descriptor IC1 (information content index, neighbourhood symmetry of order 1) was selected for model building by the MLR and the SVM aided linear and non-linear regression methods. IC1 belongs to the information content descriptors. Descriptor SIC3 (structural information content, neighbourhood symmetry of order 3) was selected in the SVM aided linear models. Descriptor RDF105m (radial distribution function 10.5, weighted by atomic masses) belongs to the radial distribution function descriptors and was selected in the MLR QSAR models. Descriptor GATS3m (Geary autocorrelation, lag 3, weighted by atomic masses) was selected in the MLR aided linear and SVM aided non-linear QSAR models; it belongs to the 2D autocorrelation indices. GATS5p (Geary autocorrelation, lag 5, weighted by atomic polarizability), which also belongs to the 2D autocorrelation indices, was selected in SVM aided non-linear modelling. Descriptor Mor07v (signal 07, weighted by atomic van der Waals volumes) belongs to the 3D-MoRSE descriptors and was found to play a significant role in the SVM aided linear QSAR models. All descriptors selected in the MLR and SVM aided linear and non-linear models are listed in table-3.

Validation: The biological activity (Pa/Pi) calculated by the MLR and SVM aided linear and non-linear QSAR models shows good correlation with the experimental activity, which indicates the robustness of the models (table-4 and figure-2).

Table-2
Statistical fitness parameters and validation parameters used in MLR and SVM aided linear and non-linear QSAR models
         MLR                     SVM aided linear        SVM aided non-linear
Model    1       2       3       1       2       3       1       2       3
R²       0.731   0.892   0.939   0.707   0.865   0.925   0.770   0.872   0.954
SE       0.030   0.018   0.014   0.033   0.024   0.017   0.031   0.031   0.031
RSS      0.006   0.020   0.020   0.006   0.003   0.002   0.005   0.005   0.005
R²CV     0.689   0.866   0.912   0.686   0.844   0.910   0.746   0.852   0.915
PRESS    0.192   0.191   0.191   0.192   0.192   0.192   0.192   0.192   0.192
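The SVM settings named in the text (linear and Gaussian kernels, cost C = 100, ε = 0.1, sigma = 0.1) can be made concrete with a small sketch. The parameterisation is an assumption: the Gaussian kernel is written here as exp(−‖u − v‖² / (2·sigma²)), which is one common convention and may differ from the one used by the modelling software in the study. The function names are hypothetical, and the code only evaluates the kernels, the ε-insensitive loss and the linear primal objective; it does not train a model.

```python
import math


def linear_kernel(u, v):
    """Plain dot product, as used for the SVM aided linear models."""
    return sum(a * b for a, b in zip(u, v))


def gaussian_kernel(u, v, sigma=0.1):
    """Gaussian (RBF) kernel; the 2*sigma**2 scaling is an assumed convention."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def eps_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Errors smaller than epsilon cost nothing; larger ones grow linearly."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def svr_objective(weights, bias, X, y, C=100.0, epsilon=0.1):
    """Primal objective of linear SVR: 0.5*||w||^2 + C * sum of losses."""
    reg = 0.5 * sum(w * w for w in weights)
    loss = sum(
        eps_insensitive_loss(yi, linear_kernel(weights, xi) + bias, epsilon)
        for xi, yi in zip(X, y)
    )
    return reg + C * loss


# Identical points give kernel value 1; distant points decay towards 0.
print(gaussian_kernel([0.2, 0.4], [0.2, 0.4]))
print(gaussian_kernel([0.2, 0.4], [0.5, 0.9]))
```

A larger C penalises residuals outside the ε-tube more heavily, while a smaller sigma makes the Gaussian kernel more local, which is one reason these values are usually tuned rather than fixed.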
Table-3
Descriptors used in MLR and SVM aided linear and non-linear modelling

Descriptors: MLR
IC1        Information content index (neighbourhood symmetry of order 1)
GATS3m     Geary autocorrelation, lag 3 / weighted by atomic masses
RDF105m    Radial Distribution Function 10.5 / weighted by atomic masses

Descriptors: SVM aided linear QSAR study
Mor07v     3D-MoRSE signal 07 / weighted by atomic van der Waals volumes
IC1        Information content index (neighbourhood symmetry of order 1)
SIC3       Structural information content (neighbourhood symmetry of order 3)

Descriptors: SVM aided non-linear QSAR study
IC1        Information content index (neighbourhood symmetry of order 1)
GATS3m     Geary autocorrelation, lag 3 / weighted by atomic masses
GATS5p     Geary autocorrelation, lag 5 / weighted by atomic polarizability
Table-4
Experimental activity and calculated activity of 2-arylsulphonylaminobenzothiazole derivatives for MLR and SVM aided linear and non-linear modelling

Compound   Experimental    MLR            SVM aided linear   SVM aided non-linear
           (antiobesity)   (calculated)   (calculated)       (calculated)
1          0.892           0.893          0.894              0.895
2          0.897           0.894          0.903              0.892
3          0.853           0.863          0.852              0.862
4          0.835           0.841          0.839              0.840
5          0.819           0.828          0.826              0.826
6          0.884           0.883          0.885              0.887
7          0.889           0.883          0.890              0.888
8          0.863           0.866          0.866              0.864
9          0.894           0.880          0.894              0.874
10         0.833           0.829          0.836              0.827
11         0.809           0.811          0.816              0.808
12         0.882           0.875          0.882              0.881
13         0.880           0.869          0.873              0.874
14         0.858           0.860          0.854              0.857
15         0.858           0.866          0.856              0.857
16         0.817           0.821          0.823              0.819
17         0.813           0.810          0.805              0.805
18         0.872           0.866          0.867              0.873
19         0.888           0.902          0.888              0.891
20         0.860           0.854          0.843              0.856
21         0.845           0.842          0.828              0.851
22         0.893           0.890          0.878              0.878
23         0.809           0.811          0.809              0.809
24         0.880           0.894          0.877              0.880
25         0.861           0.852          0.847              0.860
26         0.836           0.836          0.847              0.839
27         0.881           0.878          0.865              0.879
Figure-2
Correlation of experimental vs. calculated antiobesity activity of (a) MLR aided tri-variable model, (b) SVM aided linear tri-variable model, (c) SVM aided non-linear tri-variable model
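The agreement between experimental and calculated activities in table-4 can be checked directly from the tabulated numbers. The sketch below is an illustration rather than the authors' procedure: it assumes R² = 1 − RSS/TSS computed on the three-decimal values of table-4, so the results can differ slightly from the reported statistics, which were presumably obtained from unrounded predictions. The variable and function names are hypothetical.

```python
# Values transcribed from table-4 (compounds 1-27, in order).
experimental = [0.892, 0.897, 0.853, 0.835, 0.819, 0.884, 0.889, 0.863, 0.894,
                0.833, 0.809, 0.882, 0.880, 0.858, 0.858, 0.817, 0.813, 0.872,
                0.888, 0.860, 0.845, 0.893, 0.809, 0.880, 0.861, 0.836, 0.881]
calculated = {
    "MLR": [0.893, 0.894, 0.863, 0.841, 0.828, 0.883, 0.883, 0.866, 0.880,
            0.829, 0.811, 0.875, 0.869, 0.860, 0.866, 0.821, 0.810, 0.866,
            0.902, 0.854, 0.842, 0.890, 0.811, 0.894, 0.852, 0.836, 0.878],
    "SVM linear": [0.894, 0.903, 0.852, 0.839, 0.826, 0.885, 0.890, 0.866,
                   0.894, 0.836, 0.816, 0.882, 0.873, 0.854, 0.856, 0.823,
                   0.805, 0.867, 0.888, 0.843, 0.828, 0.878, 0.809, 0.877,
                   0.847, 0.847, 0.865],
    "SVM non-linear": [0.895, 0.892, 0.862, 0.840, 0.826, 0.887, 0.888, 0.864,
                       0.874, 0.827, 0.808, 0.881, 0.874, 0.857, 0.857, 0.819,
                       0.805, 0.873, 0.891, 0.856, 0.851, 0.878, 0.809, 0.880,
                       0.860, 0.839, 0.879],
}


def residual_sum_of_squares(observed, predicted):
    """Sum of squared differences between observed and predicted values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))


def coefficient_of_determination(observed, predicted):
    """R2 = 1 - RSS/TSS, with TSS taken about the mean of the observed values."""
    mean_obs = sum(observed) / len(observed)
    tss = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - residual_sum_of_squares(observed, predicted) / tss


r2 = {name: coefficient_of_determination(experimental, values)
      for name, values in calculated.items()}
for name, value in r2.items():
    rss = residual_sum_of_squares(experimental, calculated[name])
    print(f"{name}: R2 = {value:.3f}, RSS = {rss:.5f}")
```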
Figure-2a represents the linear relationship of the experimental activity with the calculated activity for the MLR aided tri-variable model. Prediction of biological activity is found to be satisfactory, with an R² value of 0.9393, while figure-2b represents the linear relationship (R² = 0.925) of experimental activity with calculated activity for the SVM aided linear tri-variable model. The non-linear relationship of experimental activity with calculated activity for the SVM aided non-linear tri-variable model is presented in figure-2c. The R² value (0.954) shows good predictive power of the model.

Y-Randomization: In order to avoid chance modelling, we performed the Y-randomization method by repeated scrambling of the biological activity. Each model was subjected to 100 replicate runs. Low values of the correlation coefficient for all 100 models derived from Y-scrambling indicate that the generated model is not obtained by chance. Graphical representations of Y-randomization for the MLR tri-variable model and the SVM aided linear and non-linear tri-variable models are reported in figure-3.

Descriptor sensitivity analysis: The most sensitive descriptor was evaluated by descriptor sensitivity analysis [25]. Descriptor GATS3m was found to be the most sensitive descriptor, with a high area under the curve (2.839), in the multiple linear regression models. In the SVM aided linear and non-linear QSAR models, descriptor IC1 was found to be the most sensitive descriptor, with areas under the curve of 7.810 and 0.249 respectively.
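The Y-randomization check described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the procedure used in the study: it uses a single descriptor and the squared Pearson correlation in place of the tri-variable MLR and SVM models, the data in `x_demo` and `y_demo` are invented placeholders, and the function names are hypothetical. Only the idea is the same: scramble the activities 100 times, refit, and compare the scrambled scores with the score of the real model.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation, i.e. R2 of a one-descriptor linear fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def y_randomization(x, y, n_runs=100, seed=0):
    """Refit the model n_runs times on randomly permuted activities."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    scores = []
    for _ in range(n_runs):
        y_scrambled = y[:]
        rng.shuffle(y_scrambled)
        scores.append(r_squared(x, y_scrambled))
    return scores


# Placeholder data (NOT from the paper): one descriptor, twelve compounds.
x_demo = [float(i) for i in range(1, 13)]
noise = [0.01, -0.02, 0.015, 0.0, -0.01, 0.02,
         -0.015, 0.005, 0.01, -0.005, 0.0, -0.01]
y_demo = [0.2 * xi + e for xi, e in zip(x_demo, noise)]

r2_real = r_squared(x_demo, y_demo)
scrambled = y_randomization(x_demo, y_demo)
print("real R2:", round(r2_real, 4))
print("mean scrambled R2:", round(sum(scrambled) / len(scrambled), 4))
print("max scrambled R2:", round(max(scrambled), 4))
```

If the scrambled models scored about as well as the real one, the original correlation would be suspect; a clear gap between the two supports a genuine structure-activity relationship.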
Conclusion
In the present study we have compared the performance of MLR and SVM aided linear and non-linear regression in QSAR models. Models obtained from SVM aided non-linear regression were found to be statistically fit and more predictive than models obtained from multiple linear regression and SVM aided linear regression. Descriptors (IC1, SIC3) and (GATS3m, GATS5p), used in MLR and SVM aided linear and non-linear regression, are from the same classes of descriptors. These descriptors contribute to the structure-activity relationship and code for the same chemical structure property, but differ in their linear and non-linear relationships. Descriptor IC1 was found to be the most sensitive descriptor, with the highest area under the curve. Descriptor sensitivity analysis illustrates the importance of molecular descriptors.

Figure-3
Y-Scrambling graphs for (a) MLR aided tri-variable model, (b) SVM aided linear tri-variable model, (c) SVM aided non-linear tri-variable model

Figure
Descriptor sensitivity (a) MLR aided tri-variable model, (b) SVM aided linear tri-variable model, (c) SVM aided non-linear tri-variable model
References
1. Singh S., The Genetics of Type 2 diabetes mellitus: A Review, J. Sci. Res., 55, 35-48 (2011)
2. Qaseem A., Humphrey L.L., Oral Pharmacologic Treatment of Type 2 Diabetes Mellitus: A Clinical Practice Guideline from the American College of Physicians, Ann Intern Med., 156, 218-231 (2012)
3. Bahare R.S., Gupta J., Malik S., Sharma N., New Emerging Targets for Type-2 Diabetes, Intl. J. Pharm Tech., 3(2), 809-818 (2011)
4. Johnson T.O., Ermolieff J., Jirousek M.R., Protein tyrosine phosphatase 1B inhibitors for diabetes, Nat. Rev. Drug Discov., 1, 696-709 (2012)
5. Koren S., Fantus I.G., Inhibition of the protein tyrosine phosphatase PTP1B: potential therapy for obesity, insulin resistance and type 2 diabetes mellitus, Best Pract Res Clin Endocrinol Metab., 21, 621-40 (2007)
6. Goldstein B.J., Protein-tyrosine phosphatase 1B (PTP1B): a novel therapeutic target for type 2 diabetes mellitus, obesity and related states of insulin resistance, Curr Drug Targets Immune Endocr Metabol Disord., 1, 265-75 (2001)
7. Rosenbloom A.L., Silverstein J.H., Amemiya S., Zeitler P., Klingensmith G.J., Type 2 diabetes in the child and adolescent, Pediatr Diabetes., 512-526 (2008)
8. Dearden J.C., In silico prediction of drug toxicity, J. Comput Aided Mol. Des., 17, 119-27 (2003)
9. Nantasenamat C., Isarankura-Na-Ayudhya C., Naenna T., Prachayasittikul V., A practical overview of quantitative structure-activity relationship, Excli J., 74-88 (2009)
10. Nantasenamat C., Isarankura-Na-Ayudhya C., Naenna T., Prachayasittikul V., Advances in computational methods to predict the biological activity of compounds, Expert Opin Drug Discov., 633-54 (2010)
11. Poole D., Mackworth A., Goebel R., Computational Intelligence: A Logical Approach, Oxford University Press, USA (1998)
12. Yegnanarayana B., Artificial neural networks, PHI Learning Pvt. Ltd. (2009)
13. Joachims T., Text categorization with support vector machines: Learning with many relevant features, Technical Report 23, LS VIII, University at Dortmund (1997)
14. Rokach L., Maimon O., Data mining with decision trees: theory and applications, World Scientific Pub Co. Inc. (2008)
15. Ramoni M., Sebastiani P., Robust Bayes classifiers, Artificial Intelligence, 125, 209-226 (2001)
16. Montgomery D.C., Peck E.A., Vining G.G., Introduction to linear regression analysis, John Wiley & Sons, 821 (2012)
17. Navarrete-Vazquez G. et al., Synthesis, in vitro and computational studies of protein tyrosine phosphatase 1B inhibition of a small library of 2-arylsulfonylaminobenzothiazoles with antihyperglycemic activity, Bioorg Med Chem., 17, 3332-3341 (2009)
18. MarvinSketch version 5.5.1, Chemaxon, http://www.chemaxon.com (2009)
19. VCCLAB, Virtual Computational Chemistry Laboratory, http://www.vcclab.org (2005)
20. Sarchitect 2.5 Designer/Miner, Strand Life Sciences Pvt. Ltd., Bangalore, India (2008)
21. Cortes C., Vapnik V., Support-Vector Networks, Mach. Learn., 20, 273-297 (1995)
22. Mangasarian O.L., Musicant D.R., Large scale kernel regression via linear programming, Mach. Learn., 46, 255-269 (2002)
23. Klopman G., Kalos A.N., Causality in Structure-Activity Studies, J. Comput. Chem., 492-506 (1985)
24. Wold S., Eriksson L., Statistical Validation of QSAR Results, In: van de Waterbeemd H. (Ed.), Chemometric Methods in Molecular Design, Weinheim, 309-318 (1995)
25. Saltelli A., Tarantola S., Campolongo F., Ratto M., Sensitivity analysis in practice: a guide to assessing scientific models, John Wiley & Sons (2004)