Scientific paper

Quantitative Relationships Between Structure and Lipophilicity of Naturally Occurring Polyphenols

Vesna Rastija,1* Sonja Nikolić2 and Vijay H. Masand3

1 Faculty of Agriculture, University J. J. Strossmayer, P. Svačića 1d, Osijek, 31 000, Croatia
2 The Rugjer Boskovic Institute, P.O. Box 180, Zagreb, 10 002, Croatia
3 Department of Chemistry, Vidya Bharati College, Camp, Amravati, Maharashtra, 444 602, India

* Corresponding author: E-mail: vrastija@pfos.hr

Received: 13-05-2013

Abstract

The lipophilicity of polyphenols inherent in food, beverages, and medicinal plants was modelled by using 3D descriptors derived from optimized 3D molecular structures in combination with 2D descriptors. The training sets were generated by manual selection or by cluster formation, and statistically robust predictive models were obtained in both cases. The most relevant structural features for the lipophilicity of polyphenols are depicted by the statistically most significant variables: the number of donor atoms for H bonds is unfavorable for lipophilicity, and an increased number of ring secondary C atoms (sp3) also decreases lipophilicity, while increased atomic polarizability implies higher lipophilicity of polyphenols. The study also revealed the importance of the three-dimensional distribution of atomic electronegativity for the lipophilicity of molecules.

Keywords: Lipophilicity; polyphenols; quantitative structure-property relationships

1. Introduction

Specific groups of food and medicinal plants are rich sources of one or more subclasses of polyphenols.1 These compounds have been reported to possess multiple biological activities, including vasodilatory, anti-inflammatory, anti-carcinogenic, anti-bacterial, and antioxidant effects.2-6 However, the health effects of polyphenols also depend on the amount consumed and on their bioavailability.1 Bioavailability in humans differs greatly from one polyphenol to another.
Gallic acid and the isoflavones are the most readily absorbed, followed by the catechins, the flavanones, and the quercetin glucosides, while the least well absorbed are the proanthocyanidins and the anthocyanins.7 The oral bioavailability of pharmacologically active substances depends on their hydrophilicity-lipophilicity balance. Strong hydrophilicity of a compound implies good water solubility and good dissolution in gastrointestinal fluid. However, lipophilic compounds are able to diffuse passively through biological barriers owing to the lipoid nature of cell membranes.8 In addition, quantitative structure-activity relationship (QSAR) studies have revealed that the lipophilicity of polyphenols is an important factor in their activity in biological systems.9 Our recent QSAR studies have indicated that the antioxidant activity of flavonoids and the vasodilatory effect of phenolic acids are strongly related to lipophilicity.3,10

The application of quantitative structure-property relationship (QSPR) approaches in developing models to predict the physicochemical properties of polyphenols using topological indices has been reported previously.11 However, it is well known that many physical, chemical, or biological properties of compounds depend on the three-dimensional arrangement of atoms in a molecule.12,13 Here, in view of the importance of the three-dimensional shape of molecules for passive diffusion through biological membranes, we attempted to develop QSPR models that relate the experimentally determined lipophilicity of polyphenols to descriptors derived from optimized 3D molecular structures in combination with 2D descriptors. Because rigorous model validation is important, the data set must be divided into a training set and a test set. A further goal of this study was therefore to determine which of the two splitting methods (manual selection or cluster analysis) gives better results.

2.
Results and discussion

Data set I

After the classification of 51 compounds into 11 structurally different classes of polyphenols, 12 compounds (Table 1) were selected manually for the test set, and the rest of the initial data set formed the training set (n = 39, or 76% of the full data size). The best QSPR models with two and three descriptors (I = 2 and 3), generated by using MLR and the best-subset method, are given in Table 2.

Table 1. Chemical structure of polyphenols used in the current study

Comp. no.  Substituents*              Name of polyphenol           log Pexp
1          3-OCH3; 4-OH; R = OH       Vanillic acid                1.43
2          3,5-OCH3; 4-OH; R = OH     Syringic acid                1.04
3          3,4,5-OH; R = OH           Gallic acid                  0.70
4          2-OH; R = OH               Salicylic acid               2.26
5          4-OH; R = OH               p-Hydroxybenzoic acid        1.58
6          3-OH; R = OH               m-Hydroxybenzoic acid        1.50
7          3,4-OH; R = OH             Protocatechuic acid          0.86
8          2,5-OH; R = OH             Gentisic acid                1.74
9          2,5-OH; R = H              2,5-Dihydroxybenzaldehyde    0.54
10         3-OCH3; R = OH             3-Methoxybenzoic acid        2.02
11         3,4-OCH3; R = OH           Veratric acid                1.61
12         2,3-OH; R = OH             Pyrocatechuic acid           1.20
13         3,4,5-OH; R = OCH2CH3      Ethyl gallate                1.30
14         2-OH; R = OH               o-Coumaric acid              1.59
15         4-OH; R = OH               p-Coumaric acid              1.79
16         3-OCH3; 4-OH; R = OH       Ferulic acid                 1.51
17         3,4-OH; R = OH             Caffeic acid                 1.15
18         3,4-OH; R = [...]          Chlorogenic acid             0.30
19         trans-4-OH; R = OH         trans-p-Coumaric acid        1.46
20         4-OCH3; R = OH             4-Methoxycinnamate           2.68
21         -                          Mandelic acid                0.62
22         -                          Catechin                     0.51
23         3',4'-OH; R = OH           Quercetin                    1.82
24         2',4'-OH; R = OH           Morin                        1.84
25         4'-OH; R = OH              Kaempferol                   3.11
26         3',4'-OH; R = SU1          Quercetin-3-O-glucoside      0.76
27         3',4'-OH; R = SU2          Rutin                        -0.64
28         5,3'-OH; 4'-OCH3           Hesperetin                   2.60
29         3',4'-OH                   Fustin                       0.87
30         5,3',4'-OH                 Taxifolin                    0.95
31         -                          Flavanone                    3.14
32         5,7,4'-OH                  Naringenin                   2.60
33         5,7,3',4'-OH               Eriodictyol                  2.27
34         5,4'-OH; 7-SU2             Naringin                     -0.44
35         4'-OH                      4'-Hydroxyflavanone          3.20
36         5,3'-OH; 4'-OCH3; 7-SU2    Diosmin                      0.14
37         5,7,3'-OH; 4'-OCH3         Diosmetin                    3.10
38         5,7,4'-OH                  Apigenin                     2.92
39         5,7,3',4'-OH               Luteolin                     2.53
40         5,7-OH                     Chrysin                      3.52
41         -                          Flavone                      3.56
42         5-OH                       5-Hydroxyflavone             4.30
43         7-OH                       7-Hydroxyflavone             3.62
44         5,3',4'-OH                 5,3',4'-Trihydroxyflavone    3.31
45         7,4'-OH                    Equol                        3.20
46         7,4'-OH                    Daidzein                     2.51
47         5,7,4'-OH                  Genistein                    3.04
48         7,4'-OH; 6-OCH3            Glycitein                    1.97
49         5,4'-OH; 7-SU1             Genistin                     0.97
50         3,5,4'-OH                  Resveratrol                  3.32
51         -                          Eugenol                      2.27

* In these compounds, the substituent groups corresponding to the sugar moieties are abbreviated as SU followed by a number: SU1 = O-β-D-glucopyranosyl; SU2 = O-(6-deoxy-α-L-mannopyranosyl)-β-D-glucopyranosyl.

Data set II

The initial data set was classified into clusters by the tree clustering method (Fig. 1), performed on the values of 580 selected descriptors. Members of the test set (n = 12) were chosen from each cluster. The best QSPR models with two and three descriptors (I = 2 and 3), generated by using MLR and the best-subset method, are given in Table 2.

Fig. 1. Dendrogram of a cluster formation of 51 polyphenols

The correlation matrix obtained from the initial data set (Table 3) shows that the descriptors included in Eqs. 1-4 are independent. A scatter plot of log Pexp versus log Ppred values calculated by Eq. 4 for the studied polyphenols is shown in Fig. 2. In order to investigate the applicability of prediction model 4 and to detect possible outliers, the leverage of the training set compounds was plotted against the residuals (Fig. 3). As can be seen from the plot, no compounds lie outside the domain of applicability of the model, since their leverage values are not greater than the warning leverage (h* = 0.308). Moreover, none of the analysed compounds were considered outliers, because their standardized residuals did not exceed ±2.
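The applicability-domain check described above can be sketched in code. The following is a minimal illustration in Python (standard library only), not the software used in the study. The function names are hypothetical, the leverage of each compound is taken as the corresponding diagonal element of the hat matrix of a model with an intercept, and the warning leverage is computed as h* = 3p'/n with p' assumed to be the number of descriptors plus one. Under that assumption, n = 39 and three descriptors reproduce the reported h* = 0.308.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment with the identity: [A | I] is reduced to [I | A^-1].
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular; descriptors may be collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def leverages(descriptors):
    """Return h_i = x_i^T (X^T X)^-1 x_i for each training compound.

    `descriptors` holds one row of descriptor values per compound; an
    intercept column is added here (an assumption of this sketch).
    """
    rows = [[1.0] + [float(v) for v in r] for r in descriptors]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    inv = invert(xtx)
    return [sum(r[i] * inv[i][j] * r[j] for i in range(p) for j in range(p))
            for r in rows]


def warning_leverage(n_train, n_descriptors):
    """h* = 3 p' / n, assuming p' = number of descriptors + 1 (intercept)."""
    return 3.0 * (n_descriptors + 1) / n_train


def flag_compounds(lev, std_residuals, h_star, resid_limit=2.0):
    """For each compound return (outside_domain, outlier) as a pair of booleans."""
    return [(h > h_star, abs(r) > resid_limit)
            for h, r in zip(lev, std_residuals)]
```

Calling `warning_leverage(39, 3)` gives the threshold for the three-descriptor models; `flag_compounds` then marks a compound as outside the domain when its leverage exceeds h*, and as an outlier when its standardized residual exceeds two standard deviation units in absolute value.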
The F-test reflects the ratio of the variance explained by the model to the variance that is due to error in the model. High values of the F-test indicate that the model is statistically significant. The values of the Fisher ratio (F) for the QSPR models (Eqs. 1-4), ranging from 65.147 to 76.811 (Table 2), suggest that all QSPR models are statistically significant at the 95% level. However, the main disadvantage of this statistical parameter is that it is highly sensitive to the number of descriptors in an equation: an increase in the number of descriptors from 2 to 3 causes a reduction of the F value. Better indicators of the statistical significance of a QSPR model are the squared correlation coefficient (r2) and the standard deviation (s). The closer the value of r2 is to unity and the smaller the value of s, the better the QSPR model.14,15 Although all obtained models have high correlation coefficients and a low standard deviation, better results are obtained with the three-descriptor models (Eqs. 2 and 4). The stability of the models is shown by the close values of R2LOO and R2L-10-O.

The robustness of the developed models was checked by using the Y-randomization technique. After five randomizations, the resulting models had a significantly lower R2 than the original model (Table 4). This indicates that none of the developed models is a result of chance correlation. The experimental log P values and the values predicted by Eqs. 2 and 4 are given in Table 5.

Leonard and Roy16 claim that the quality of the developed models depends considerably on the algorithm used for the selection of the training and test sets. They validated QSAR models for data sets generated by three different methods of division, and the best results were obtained when the training set and the test set were selected by k-means clustering. However, our best models were obtained by Eq. 2, where the training and test sets were selected by manual division, and by Eq. 4, where the data set was divided by the tree clustering method. Both models give comparable results, with R2 values of 0.848 and 0.850 for Eqs. 2 and 4, respectively. Since the difference in the quality parameters of the two models is rather small, we cannot identify the better method for dividing the data set into the training set and the test set.

The four best models include zero-dimensional descriptors (constitutional), one-dimensional descriptors (functional group counts), two-dimensional descriptors (information and 2D autocorrelation descriptors), and three-dimensional descriptors (RDF and GETAWAY descriptors). All descriptors have been generated from optimised three-dimensional structures of molecules. The most relevant structural features for the lipophilicity in Eqs. 2 and 4 are depicted by the statistically most significant variables [...]

[...] 0.70) descriptors were discarded. The best models were selected based on the squared correlation coefficient (R2), the adjusted squared correlation coefficient (R2adj), the standard deviation of regression (Sfit), and the Fisher ratio value (F). If the overall model is significant for the prediction of the dependent variable in multiple linear regression, the statistical significance of each independent variable in the model can be tested separately by the t-test:

$t = \beta_j / S_{\beta_j}$ (1)

where $\beta_j$ is the standardized regression coefficient of the independent variable j and $S_{\beta_j}$ is the standard error of $\beta_j$. A higher t-test value means that the independent variable is more significant.
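As an illustration of the selection statistics named above (R2, R2adj, Sfit, F, and the t-values of Eq. 1), the sketch below fits an MLR model by ordinary least squares in plain Python. It is a minimal example under stated assumptions rather than the procedure used in the study: the function name `fit_statistics` and the toy data in the usage line are hypothetical, an intercept is always included, and t is computed from the unstandardized coefficients, which gives the same ratio as for standardized ones because a coefficient and its standard error scale by the same factor.

```python
import math


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular; descriptors may be collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_statistics(descriptors, y):
    """Fit y = b0 + sum(b_j * x_j) by least squares and return fit statistics."""
    rows = [[1.0] + [float(v) for v in r] for r in descriptors]
    n, p = len(rows), len(rows[0])
    k = p - 1                      # number of descriptors (intercept excluded)
    dof = n - k - 1                # residual degrees of freedom
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    inv = invert(xtx)
    coef = [sum(inv[i][j] * xty[j] for j in range(p)) for i in range(p)]
    pred = [sum(c * v for c, v in zip(coef, r)) for r in rows]
    y_mean = sum(y) / n
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))   # residual sum of squares
    sst = sum((yi - y_mean) ** 2 for yi in y)              # total sum of squares
    r2 = 1.0 - sse / sst
    s_fit = math.sqrt(sse / dof)
    return {
        "coefficients": coef,
        "r2": r2,
        "r2_adj": 1.0 - (1.0 - r2) * (n - 1) / dof,
        "s_fit": s_fit,
        "f": ((sst - sse) / k) / (sse / dof),
        # t_j = b_j / SE(b_j), with SE(b_j) = s_fit * sqrt([(X^T X)^-1]_jj)
        "t_values": [c / (s_fit * math.sqrt(inv[j][j])) for j, c in enumerate(coef)],
    }


# Hypothetical toy data: one descriptor, four compounds.
stats = fit_statistics([[0], [1], [2], [3]], [1, 3, 2, 4])
```

With a single descriptor, the squared t-value of the slope equals F, which is a convenient sanity check on the implementation.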
In all presented equations, variables are listed in order of their statistical significance according to their $\beta_j$ and t-values.14,15

3.5. Validation of Models

The generated QSPR models were validated by the classical Leave-One-Out (LOO) cross-validation technique and by a Leave-Many-Out (LMO) procedure, specifically Leave-10-Out (L-10-O) cross-validation. The statistical stability of a model is indicated by high values of the cross-validated coefficients R2LOO (Leave-One-Out) and R2L-10-O (Leave-10-Out). Additionally, the Y-randomization technique was applied to validate and check the robustness of the MLR equations.23 Cross-validation and Y-randomization were performed using the data mining software Weka (http://www.cs.waikato.ac.nz/ml/weka/). Detection of outliers was carried out by investigating the applicability domain of a prediction model.24

3.6. Identifying Outliers

The applicability domain of a prediction model was investigated by means of a leverage plot (residuals plotted against the leverage of the training compounds). Compounds with standardized residuals greater than two standard deviation units were considered outliers. The leverage of a compound is a measure of its influence on the model, and the warning leverage h* is defined as:

$h^* = 3p'/n$ (2)

where n is the number of training compounds and p' is the number of model-adjustable parameters.

4. Conclusions

Multiple linear regression was used to estimate the lipophilicity of polyphenols present in medicinal plants and food, using descriptors calculated from optimized three-dimensional molecular structures. In order to determine the best method for dividing the data set into the training and test sets, two methods were applied: manual division and cluster analysis. No great difference was observed in the quality of the models obtained from the training and test sets generated by these methods.
Since the best models that relate to the experimentally obtained log P values include three-dimensional descriptors (RDF and GETAWAY), the three-dimensional structure of polyphenols appears to be relevant for their lipophilicity. The structural lipophilicity models presented in this study revealed, besides the well-known fact that the number of donor atoms for H bonds is unfavourable for lipophilicity, that an increased number of ring secondary C atoms (sp3) also decreases lipophilicity, while increased atomic polarizability implies a higher lipophilicity of polyphenols. The study also confirmed the importance of the three-dimensional arrangement of atomic electronegativity for the lipophilicity of molecules. The above-mentioned descriptors could be used for further QSPR investigation of polyphenols, and the proposed models could potentially provide information about the lipophilicity of other biologically active polyphenols, such as the anthocyanins, the anthocyanidins, and the procyanidins. These compounds are normally commercially unavailable or expensive, and their separation from plant and food samples and their accurate identification require techniques with especially high running costs.

5. References

1. C. Manach, A. Scalbert, C. Morand, C. Rémésy and L. Jiménez, Am. J. Clin. Nutr. 2004, 79, 727-747.
2. X. Han, T. Shen and H. Lou, Int. J. Mol. Sci. 2007, 8, 950-988.
3. I. Mundic, D. Modun, V. Rastija, I. Brizic, V. Katalinic, M. Medic-Saric and M. Boban, Food Chem. 2009, 119, 1205-1210.
4. L. H. Yao, Y. M. Jiang, J. Shi, F. A. Tomás-Barberán, N. N. Datta, R. Singanusong and S. S. Chen, Plant. Food Hum. Nutr. 2004, 59, 113-122.
5. M. Brvar, A. Perdih, V. Hodnik, M. Renko, G. Anderluh, R. Jerala and T. Solmajer, Bioorg. Med. Chem. 2012, 20, 2572-2580.
6. M. Brvar, A. Perdih, V. Hodnik, M. Renko, G. Anderluh, D. Turk and T. Solmajer, J. Med. Chem. 2012, 55, 6413-6426.
7. C. Manach, G. Williamson, C. Morand, A. Scalbert and C. Remesy, Am. J. Clin. Nutr.
2005, 81, 230S-242S.
8. C. A. Lipinski, F. Lombardo, B. W. Dominy and P. J. Feeney, Adv. Drug Delivery. Rev. 2001, 46, 3-26.
9. D. Amic, D. Davidovic-Amic, D. Beslo, V. Rastija, B. Lucic and N. Trinajstic, Curr. Med. Chem. 2007, 14, 827-845.
10. V. Rastija and M. Medic-Saric, Eur. J. Med. Chem. 2009, 44, 400-408.
11. V. Rastija, S. Nikolic and M. Medic-Saric, J. Math. Chem. 2009, 46, 820-833.
12. N. P. Seeram and M. G. Nair, J. Agric. Food Chem. 2002, 50, 5308-5312.
13. V. Rastija and M. Medic-Saric, Med. Chem. Res. 2009, 18, 579-588.
14. S. Wold, Quant. Struct.-Act. Relat. 1991, 10, 191-193.
15. R. P. Verma and C. Hansch, Eur. J. Med. Chem. 2010, 45, 1470-1477.
16. J. T. Leonard and K. Roy, QSAR Comb. Sci. 2006, 25, 235-251.
17. S. A. Aherne and N. M. O'Brien, Nutrition, 2002, 18, 75-81.
18. R. Todeschini and V. Consonni, "Handbook of molecular descriptors", M. Manhold, H. Kubinyi and H. Temmerman, Eds., Wiley-VCH, Weinheim, 2000, p. 667-673.
19. J. A. Rothwell, A. J. Day and M. R. Morgan, J. Agric. Food Chem. 2005, 53, 4355-4360.
20. Z. N. Xiang and Z. X. Ning, LWT-Food Sci. Technol. 2008, 41, 1189-1203.
21. M. J. S. Dewar, E. G. Zoebisch, E. F. Healy and J. J. P. Stewart, J. Am. Chem. Soc. 1985, 3902-3909.
22. I. V. Tetko, J. Gasteiger, R. Todeschini, A. Mauri, D. Livingstone, P. Ertl, V. A. Palyulin, E. V. Radchenko, N. S. Zefirov, A. S. Makarenko, V. Y. Tanchuk and V. V. Prokopenko, J. Comput. Aided Mol. Des. 2005, 19, 453-463.
23. A. Tropsha, Mol. Inf. 2010, 29, 476-488.
24. L. Eriksson, J. Jaworska, A. P. Worth, M. T. D. Cronin, R. M. McDowell and P. Gramatica, Environ. Health Perspect. 2003, 111, 1361-1375.

Povzetek (Summary)

The lipophilicity of polyphenols present in food, beverages, and medicinal plants was modelled with 3D descriptors derived from 3D molecular structures in combination with 2D descriptors. The training set was created by manual selection or by clustering, and robust predictive models were obtained in both cases. The most important structural features for the lipophilicity of polyphenols are described by the statistically most significant variables: the number of donor atoms for H bonds is unfavourable for lipophilicity, an increased number of ring secondary C atoms (sp3) likewise lowers lipophilicity, while increased atomic polarizability indicates increased lipophilicity of polyphenols. The study also showed the importance of the three-dimensional distribution of atomic electronegativity for the lipophilicity of molecules.