ISSN: 2472-1956

Journal of Informatics and Data Mining

Combining Free Access Methodologies to Build 2D QSAR Models: A Case Study for Piperazine-nitroimidazole Analogues against S. aureus

Edilson B. Alencar Filho*, Aline A. Santos

Collegiate of Pharmaceutical Sciences, Federal University of San Francisco Valley, Petrolina, Pernambuco, Brazil, 56304-917

*Corresponding Author:
Edilson B Alencar Filho
Collegiate of Pharmaceutical Sciences, Federal University of San Francisco Valley
Petrolina, Pernambuco, Brazil, 56304-917
Tel: +55 87 3216 6862
E-mail: edilson.beserra@univasf.edu.br

Received date: Sep 11, 2015, Accepted date: Oct 24, 2015, Published date: Oct 29, 2015


Abstract

Quantitative Structure-Activity Relationship (QSAR) approaches search for mathematical models that correlate a set of biological properties (or activities) of a series of analogues with their molecular features, also called “molecular descriptors”. These studies seek to understand how previously identified molecular features influence the biological response, allowing the design and future synthesis of potentially more active compounds. In this work we present the use of free and open source software for data processing in a 2D QSAR perspective, using a set of metronidazole-piperazine analogues with activity against S. aureus, a relatively common bacterial species that represents a serious public health problem because of its mechanisms of drug resistance. The combination of different methodologies led to a robust and predictive QSAR model. These strategies can be used in medicinal organic synthesis and drug design programs.

Keywords

QSAR, Free software, Piperazine-nitroimidazole analogues, S. aureus

Introduction

In the context of bioactive compounds, Quantitative Structure-Activity Relationship (QSAR) approaches search for mathematical models that correlate a set of biological properties (or activities) of a series of analogues with their molecular features, also called “molecular descriptors”. These studies seek to understand how previously identified molecular features influence the biological response, allowing the design and future synthesis of potentially more active compounds.

QSAR studies generally comprise the following steps: selection of analogous compounds with a particular activity (important for a more comprehensive structure-activity relationship); determination of molecular descriptors by experimental or computational molecular modelling methods; application of a variable selection methodology to the descriptors; construction and validation of mathematical models with statistical/chemometric tools; and interpretation of the final model.

Multivariate regression is the core of QSAR analysis. It is based on obtaining a model Y = Xb + e, where Y is a column matrix (I x 1) with the values of the molecular property or activity (dependent variable) and X is an I x J matrix with the values of J molecular descriptors for I compounds [1]. The J x 1 vector b contains the calibration regression coefficients and e is the I x 1 error vector [1]. The equation can be obtained by Multiple Linear Regression (MLR) [2], but MLR is not adequate when the number of variables is large. In that case, techniques such as Principal Component Regression (PCR) and Partial Least Squares (PLS), based on Principal Component Analysis (PCA), can be used [3-5].
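As a brief illustration of why MLR breaks down with many variables, the ordinary least-squares solution of the model can be written out explicitly. This is the standard textbook estimate, with the model taken in matrix order as Y = Xb + e; the hat notation for estimated quantities is introduced here only for illustration.

```latex
\hat{b} = \left( X^{\top} X \right)^{-1} X^{\top} Y ,
\qquad
\hat{Y} = X \hat{b}
```

The J x J matrix X^T X can be inverted only when the columns of X are linearly independent, which requires at least as many compounds as descriptors (I >= J). With far more descriptors than compounds the inverse does not exist, which is why variable selection or latent-variable methods such as PCR and PLS are needed.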

The kind of molecular descriptors used in QSAR modelling gives rise to the different designations, such as 2D, 3D and 4D QSAR, among others. The most cited are the 3D and 4D methodologies, which are based on the alignment of a set of compounds in a common spatial arrangement, defined by a common portion of the molecules. The aligned molecules are placed in a three-dimensional grid, and descriptors are generated from interaction potentials with selected probes (models of specific functional groups, charged or not). 4D QSAR differs in using a conformational sampling profile of each compound [6]. 2D methodologies, on the other hand, use general molecular or atomic properties from quantum chemical or molecular mechanics calculations, experimental parameters (such as LogP and pKa), fragment indices and other numerical values, as long as they do not represent interaction potentials with probes. In this context, the E-DRAGON online platform is a free remote version of the DRAGON software, which is dedicated to the calculation of molecular descriptors of various kinds [7-9]. These descriptors can be used to evaluate molecular structure-activity relationships, in similarity analysis and in the screening of molecule databases [7,8,10]. Initially, the molecular geometries have to be converted to an appropriate format, usually sdf, smiles or mol2. The file can then be loaded on the page, which provides more than 1,660 descriptors divided into 20 logical blocks [9,10]. Several topological and geometrical descriptors can be calculated, besides atom type, functional group and fragment counts [9,10].

Variable selection is another important step in QSAR analysis. Ordered Predictors Selection (OPS) is an algorithm developed and published in 2009 by Teófilo and coworkers; the software can be downloaded after registration on its web page [1,11]. The methodology sorts the variables according to an initial informative vector and then systematically investigates Partial Least Squares (PLS) regression models, with the aim of finding the most relevant set of variables by comparing the cross-validation parameters of the models obtained [1].
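The ordering-and-window idea behind OPS can be sketched in a few lines. This is a deliberately simplified illustration, not the published implementation: the function names are hypothetical, `score_fn` is a placeholder for whatever cross-validated PLS error the modeller uses, and it is an assumption here that the upper limit is a fraction of the total number of variables.

```python
def ops_candidate_subsets(ranking, window=10, increment=5, max_fraction=0.30):
    """Yield growing prefixes of the ranked variable list.

    Assumption: max_fraction is taken as a fraction of all variables.
    """
    limit = int(len(ranking) * max_fraction)
    size = window
    while size <= limit:
        yield ranking[:size]
        size += increment


def ops_select(informative, score_fn, **window_args):
    """Rank variables by |informative value| and keep the best-scoring subset.

    informative : one number per variable (e.g. a correlation with activity)
    score_fn    : hypothetical callback returning a cross-validation error
                  for a list of variable indices (lower is better)
    """
    ranking = sorted(range(len(informative)),
                     key=lambda j: abs(informative[j]), reverse=True)
    return min(ops_candidate_subsets(ranking, **window_args), key=score_fn)


# Toy usage: 100 variables, a fake score that is lowest for 20 variables
toy_informative = [float(j) for j in range(100)]
best = ops_select(toy_informative, score_fn=lambda subset: abs(len(subset) - 20))
print(len(best), best[:3])
```

In a real run, `score_fn` would build a PLS model on the chosen columns and return a cross-validation statistic; the sketch only shows how the candidate subsets are generated and compared.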

As part of our continuing work in cheminformatics, this paper presents a 2D QSAR study of a set of compounds with bactericidal activity, using free and open source software. The adopted methodologies can be applied to other data sets to optimize the synthesis of bioactive compounds in drug design programs.

Methodology

Because of our interest in antimicrobial and antiparasitic compounds, especially those containing the nitroimidazole moiety, the set of compounds was selected from a paper by Wang and coworkers [12] (Figure 1). It corresponds to a series of metronidazole derivatives containing a piperazine skeleton, with antibacterial activity against Staphylococcus aureus. This is a relatively common bacterium that can cause various types of infection, from simple ones (pimples, boils) to more serious health problems (meningitis, pneumonia, endocarditis). A serious aspect of this microorganism is the acquired resistance of some strains to certain antibiotics, such as vancomycin and methicillin, among others [12]. Thus, the search for new agents for the treatment of S. aureus infections is an attractive and important field of research. The original activity was reported as the Minimum Inhibitory Concentration (MIC) in μg/mL, determined by the MTT method [12]. For QSAR purposes, the values were converted to molar units and expressed as -log MIC (pMIC) (Figure 1).


Figure 1 Compounds obtained from paper of Wang and coworkers (2014).
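The conversion from MIC in μg/mL to pMIC can be illustrated with a short sketch. The function name and the example molecular weight are hypothetical and are not taken from the data set; only the unit arithmetic is intended.

```python
import math


def mic_to_pmic(mic_ug_per_ml, mol_weight_g_per_mol):
    """Convert a MIC in ug/mL to pMIC = -log10(MIC in mol/L)."""
    grams_per_litre = mic_ug_per_ml * 1e-3  # 1 ug/mL == 1 mg/L == 1e-3 g/L
    molar = grams_per_litre / mol_weight_g_per_mol
    return -math.log10(molar)


# Hypothetical compound: MIC = 1 ug/mL, molecular weight = 400 g/mol
print(round(mic_to_pmic(1.0, 400.0), 3))  # 5.602
```

Because of the negative logarithm, a lower MIC (a more potent compound) gives a higher pMIC.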

For the calculation of descriptors, the molecules were submitted to geometry optimization using the PM3 semiempirical method in the free software ArgusLab 4.0 [13]. The optimized geometries were converted to sdf format, an appropriate format for loading the structures on the E-Dragon platform, using the free software OpenBabel [14]. This platform is easy to use and quickly generates a txt file with a matrix of 1,666 2D descriptors, as described in the Introduction. These descriptors were organized in a J x I Excel worksheet and then pasted into a text file. We opted not to split the data into training and test sets [15]. All 15 molecules were used in the QSAR analysis, considering the small number of compounds with activity available in the literature, so as not to lose important information in the modelling [15]. In this way, this study represents an initial direction for the optimization of metronidazole-piperazine analogues.

Another txt file, containing a column with the pMIC values, was saved. The two text files were opened in the free software QSAR Modeling [16] to run the OPS algorithm. The software offers an option to autoscale the variables, which is important when working with different units of measurement [4]. The parameters for the OPS procedure were: a “window” of 10 initial variables, among the most correlated; and an “increment” of 5 variables, up to a maximum value of 30%. “SPRESS” was chosen as the sort parameter, with the model of lowest value defining the selected variables. SPRESS represents the standard deviation of PRESS, which is given by the following square root [1]:

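\[ \mathrm{SPRESS} = \sqrt{\frac{\sum_{i=1}^{I} \left( y_{i,\mathrm{exp}} - y_{i,\mathrm{pred}} \right)^{2}}{I - NVL - 1}} \tag{1} \]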

where I corresponds to the number of samples used and NVL represents the number of latent variables of the PLS model; y_i,pred is the predicted value and y_i,exp the experimental value of the biological activity. It is important to consider that variable selection in QSAR modelling can be carried out with more than one type of validation parameter besides SPRESS, such as the coefficient of determination of leave-one-out cross-validation, Q²_loo [17]. During this process, depending on the number of initially selected descriptors, the generated matrix may be subjected to new selections, down to a minimal set of descriptors that meets the model validation parameters and makes physical sense in terms of structure-activity relationships. Thus, the development of a QSAR model also depends on some patience and experience on the part of the modeller to identify descriptors that meet the criteria of robustness and predictability, as well as enabling a consistent interpretation of the model.
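A small numerical sketch may help to fix the definition of SPRESS. It assumes the commonly used form SPRESS = sqrt(PRESS / (I - NVL - 1)), with PRESS being the sum of squared differences between experimental and cross-validated predicted values; the function name and the toy numbers are hypothetical.

```python
import math


def spress(y_exp, y_pred, n_latent):
    """Standard deviation of PRESS, assuming the form
    sqrt(PRESS / (I - NVL - 1))."""
    n_samples = len(y_exp)
    press = sum((e - p) ** 2 for e, p in zip(y_exp, y_pred))
    return math.sqrt(press / (n_samples - n_latent - 1))


# Toy values: four predictions are off by one unit, two are exact
y_exp_toy = [4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
y_pred_toy = [5.0, 4.0, 7.0, 6.0, 8.0, 9.0]
print(spress(y_exp_toy, y_pred_toy, n_latent=1))  # PRESS = 4, I - NVL - 1 = 4 -> 1.0
```

The denominator penalizes models with more latent variables, so for the same PRESS a more complex model receives a higher SPRESS.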

If the ratio between the number of samples and the number of descriptors is <5, a multivariate regression method such as PCR or PLS has to be used to generate the coefficients of the final regression model. Both are implemented in the QSAR Modeling software. If the ratio is >5, the MLR technique can be used, with the advantage that it preserves the original variables in the final model, which is then easier to interpret.

The two principal quality parameters are the coefficient of determination of calibration (R²) and the coefficient of determination of leave-one-out cross-validation (Q²_loo). In the R² evaluation, all samples are used to construct the model before prediction (Equation 2). In leave-one-out, one sample is removed at a time and a regression model is constructed with the remaining samples, which is then used to predict the activity of the removed sample. This is repeated until every compound has been left out once. The values predicted for all samples are used to calculate the quality parameter Q²_loo (Equation 3). The literature reports the following acceptable values for these parameters: Q²_loo > 0.5 and R² > 0.6 [17,20].

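\[ R^{2} = 1 - \frac{\sum_{i=1}^{I} \left( y_{i,\mathrm{exp}} - y_{i,\mathrm{cal}} \right)^{2}}{\sum_{i=1}^{I} \left( y_{i,\mathrm{exp}} - y_{\mathrm{mean}} \right)^{2}} \tag{2} \]

\[ Q^{2}_{\mathrm{loo}} = 1 - \frac{\sum_{i=1}^{I} \left( y_{i,\mathrm{exp}} - y_{i,\mathrm{pred}} \right)^{2}}{\sum_{i=1}^{I} \left( y_{i,\mathrm{exp}} - y_{\mathrm{mean}} \right)^{2}} \tag{3} \]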

In Equations 2 and 3, the subscript cal denotes the values predicted by the calibration model constructed with all samples, and mean denotes the mean of the experimental values of the dependent variable.
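The difference between R² and Q²_loo can be made concrete with a minimal sketch. It uses plain MLR solved through the normal equations on a small hypothetical data set (two descriptors, six samples); the function names and numbers are illustrative and are not the procedure of any of the programs cited above.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_mlr(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


def predict(coef, x):
    return coef[0] + sum(c * v for c, v in zip(coef[1:], x))


def r2_and_q2_loo(X, y):
    """Return (R2, Q2_loo); both use the mean of all experimental values."""
    mean_y = sum(y) / len(y)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    coef = fit_mlr(X, y)  # calibration model built with all samples
    rss = sum((yi - predict(coef, xi)) ** 2 for xi, yi in zip(X, y))
    press = 0.0
    for i in range(len(y)):  # leave one sample out, refit, predict it
        coef_i = fit_mlr(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - predict(coef_i, X[i])) ** 2
    return 1 - rss / ss_tot, 1 - press / ss_tot


# Hypothetical toy data, not taken from the paper
X_toy = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]]
y_toy = [2.1, 2.9, 5.2, 5.8, 8.1, 9.0]
r2, q2 = r2_and_q2_loo(X_toy, y_toy)
print(f"R2 = {r2:.3f}, Q2_loo = {q2:.3f}")
```

For least-squares models Q²_loo can never exceed R², because each left-out sample is predicted by a model that did not see it; a large gap between the two is a warning sign of overfitting.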

Lack of correlation between the independent variables in a QSAR model is also important. In general, correlations >0.6 are avoided [18]. The free software Build QSAR [19,20] offers a way to calculate correlation coefficients between descriptors, as well as to generate MLR or PCR models. Considering the number of variables finally selected, in this work we opted for MLR modelling using this program.
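The intercorrelation check can be sketched as follows. The example computes Pearson correlation coefficients between every pair of descriptor columns and flags pairs whose absolute value exceeds 0.6; the descriptor names and values are hypothetical, and this is only an illustration of the check, not the routine used by Build QSAR.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlated_pairs(columns, threshold=0.6):
    """Return (name_a, name_b, r) for every pair with |r| above threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Hypothetical descriptor columns: d2 is exactly 2 * d1, d3 is unrelated
toy_descriptors = {
    "d1": [1.0, 2.0, 3.0, 4.0],
    "d2": [2.0, 4.0, 6.0, 8.0],
    "d3": [1.0, -1.0, -1.0, 1.0],
}
print(correlated_pairs(toy_descriptors))  # only the (d1, d2) pair is flagged
```

When a pair is flagged, one of the two descriptors would normally be dropped before the final regression so that the coefficients remain interpretable.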

Results and Discussion

The fully optimized geometries of the compounds are presented in Figure 2. The molecular descriptors generated by the E-Dragon platform and selected using the OPS algorithm are listed in Table 1.

Compound pMIC G(N...N) GATS8e RDF100e H0e
1n 3.8 43.521 1.556 7.407 3.072
2n 3.824 43.665 1.118 7.623 3.248
3n 4.539 37.457 1.457 5.75 3.207
4n 6.236 35.839 1.493 5.976 3.138
5n 4.265 43.493 0.811 8.543 3.455
6n 5.332 43.687 1.525 12.249 3.02
7n 4.481 43.409 1.328 5.947 3.275
8n 6.45 38.807 1.156 7.404 3.131
9n 4.351 43.56 1.335 8.125 3.156
10n 5.72 38.807 1.156 7.404 3.131
11n 5.329 40.144 1.493 6.78 3.155
12n 5.773 43.701 1.397 15.948 3.08
13n 8.131 38.466 1.351 17.809 2.997
14n 5.87 40.322 1.513 8.912 3.076
15n 3.946 38.757 1.924 8.966 3.321

Table 1: Biological activity against S. aureus in pMIC and E-Dragon molecular descriptors selected after successive runs of OPS algorithm.


Figure 2 3D optimized geometries of piperazine-metronidazole derivatives.

After variable selection, an MLR model was constructed using Build QSAR. The regression equation obtained in this step is:

pMIC = -0.2749 (G(N...N)) - 1.9008 (GATS8e) + 0.1584 (RDF100e) - 4.3620 (H0e) + 31.4384 (4)
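As a quick check of how Equation 4 is applied, the sketch below evaluates it for compound 13n using the descriptor values listed in Table 1. The coefficients and descriptor values come from the text; the variable and function names are only illustrative.

```python
# Coefficients and intercept of Equation 4
COEFFS = {"G(N...N)": -0.2749, "GATS8e": -1.9008, "RDF100e": 0.1584, "H0e": -4.3620}
INTERCEPT = 31.4384


def predict_pmic(descriptors):
    """Evaluate Equation 4 for one compound given its four descriptor values."""
    return INTERCEPT + sum(COEFFS[name] * descriptors[name] for name in COEFFS)


# Descriptor values of compound 13n from Table 1
compound_13n = {"G(N...N)": 38.466, "GATS8e": 1.351, "RDF100e": 17.809, "H0e": 2.997}
print(round(predict_pmic(compound_13n), 2))  # 8.04 (experimental pMIC: 8.131)
```

The same function can be applied to the descriptors of a newly designed analogue, once they have been calculated in the same way as for the modelled compounds.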

The coefficients of determination were R² = 0.87 and Q²_loo = 0.77, indicating that a robust and predictive model was obtained. The standard error of prediction was s = 0.5 and SPRESS = 0.68; the correlation matrix is shown in Table 2. Together, these results indicate a good QSAR model for this class of new anti-S. aureus agents.

Descriptor G(N...N) GATS8e RDF100e H0e
G(N...N) 1 0.274 0.164 0.131
GATS8e 0.274 1 0.056 0.324
RDF100e 0.164 0.056 1 0.474
H0e 0.131 0.324 0.474 1

Table 2: Correlation matrix between selected descriptors.

Besides allowing the prediction of compound activity, a QSAR model requires an understanding of the molecular descriptors used: their particular characteristics are analyzed in order to rationalize their physical meaning for the design of more active molecules. However, it is important to note that the overall effect on a molecule is given by the sum of the contributions of each selected descriptor. Therefore, we can analyze the influence of each descriptor on the activity individually, but must not neglect the concomitant influence of the other descriptors.

G(N...N): This descriptor is the sum of the geometric distances between nitrogen atoms [21]. The greater its numerical value, the lower the activity of the molecules. Observing the structures, we conclude that molecules with an axial nitroimidazole, which leads to smaller N...N distances, are more active. In other words, more compact conformational geometries proved to be important for biological activity.

GATS8e: The Geary autocorrelation at a distance of 8 bonds, weighted by electronegativity [21]. A greater electronegativity difference between two atoms separated by 8 bonds leads to a greater value of the descriptor. The higher the value of this descriptor, the lower the activity. Electronegative atoms in the para (4) position of the phenyl ring lead to a greater value of the descriptor, decreasing the activity (compounds 2n, 3n and 15n, which bear halogens at this position, are among the less active).

RDF100e: The Radial Distribution Function at a radius of 10.0 Å, weighted by electronegativity [21]. It reflects the probability of finding an atom in a spherical volume of radius 10.0 Å, so a higher number of atoms in the molecular volume increases its value. According to the regression vector, a higher numerical value of this descriptor increases the activity. Molecules containing phenyl groups, and particularly more than one phenyl ring, have higher values of this descriptor. This may be related to π-π stacking interactions, which are of fundamental importance in drug-target pharmacodynamics. This can be seen in compound 13n, the most active of the series, which has two unsubstituted phenyl groups. On the other hand, compound 15n has one of the lowest activities even though it has two phenyl groups. In line with the discussion of GATS8e, this molecule has two halogens in para positions, which may explain its decreased activity.

H0e: The H autocorrelation of lag 0 (without distance), weighted by electronegativity. It denotes the influence of electronegative atoms on the molecular shape [21]. The more electronegative atoms in the molecule, the higher the value of this descriptor. The regression vector shows that the higher the value of this descriptor, the lower the activity, so molecules containing more electronegative atoms tend to be less active. Together with the information from the GATS8e descriptor, these electronegative atoms can be attributed to halogens (2n, 3n, 7n, 15n). Compounds 8n and 13n are the most active and have no electronegative chlorine or fluorine atoms.

Conclusions

According to the results obtained, the combination of free software presented here led to a robust and predictive QSAR model. These data processing strategies can be used for drug design and lead compound optimization, especially at the academic level, assisting chemical synthesis and pharmacology efforts.

For the data set explored, four descriptors were very important in explaining the biological activity of the compounds, and the equation obtained can be used to predict the activity of other piperazine-metronidazole analogues designed on the basis of these characteristics, saving synthetic effort. Molecules with a more "compact" conformational behavior, bearing phenyl groups and no electronegative substituents (halogens), especially in the para positions, should lead to more active analogues against S. aureus.

Acknowledgments

The authors are grateful to the Research Support Foundation of the State of Pernambuco (FACEPE), the National Council for Scientific and Technological Development (CNPq) and the Federal University of San Francisco Valley (UNIVASF) for financial support.

References
