Acta Chim. Slov. 1998, 45(3), pp. 239-252
(Received 23. 9. 1998)

ON CHARACTERIZATION OF MOLECULAR ATTRIBUTES

Milan Randic
Department of Mathematics, Drake University, Des Moines, Iowa, 50311 and National Institute of Chemistry, Ljubljana, Slovenia

Presented as a lecture at the X. ECSM, June 10-12, 1998, Maribor, Slovenia.

Abstract: We report here on recent developments in the characterization of molecular branching, molecular cyclicity, molecular shape, and chirality using graph theoretical invariants.

INTRODUCTION

Quantitative characterization of molecular structural features has been overlooked and neglected for too long. It is not uncommon to come across qualitative statements about molecular branching, molecular cyclicity, or molecular shape. Thus, for example, chemical textbooks occasionally state that the boiling points of alkanes decrease with the degree of molecular branching, even though branching in alkanes has not been rigorously defined. In fact, such a statement motivated Wiener [1] to develop his graph theoretical approach to structure-property correlation analysis.

It is generally accepted that most molecular properties critically depend on molecular size and molecular shape. Yet while molecular size can be characterized relatively well by a number of molecular descriptors, e.g., the number of atoms in a molecule (particularly non-hydrogen atoms) or molecular weight, the characterization of molecular shape remains elusive. The situation is similar for the characterization of molecular cyclicity, chirality, the degree of folding, the degree of planarity, molecular complexity, aromaticity, and several other molecular attributes. The same extends, to a degree, to the characterization of molecular similarity and molecular diversity, properties that relate to a class of molecules rather than to an individual chemical structure.
In this report we review recent developments in the characterization of several of these molecular attributes.

DESIDERATA

First, let us emphasize the distinction between the characterization of a molecule and the representation of molecular structure. Characterization refers to the description of molecular structures by one or more descriptors (parameters, graph theoretical or topological indices, structural indices). Representation refers to giving a molecule a unique label, name, or code that permits its full reconstruction. In the language of computer science, molecular representation corresponds to the input information on a molecule (or molecular graph), while molecular characterization corresponds to a mathematical property of a structure, which typically comes as the output of a computer manipulation of the structure.

There are hundreds of molecular descriptors. For example, the program CODESSA evaluates some 400 graph theoretical descriptors and quantum chemical parameters for molecules to be considered in a structure-property-activity study [2]. Such a large number of descriptors raises the question of how to select descriptors, besides the question of how to interpret the resulting regressions, problems that have hardly received due attention. We will briefly address this topic in this report.

The slow advance in the quantitative characterization of molecular attributes is primarily due to the lack of precise definitions of such attributes. What is molecular shape? How should molecular branching be defined? Can we measure the degree of chirality? Even questions that at first sight do not appear difficult, such as what molecular size is, how it is to be defined, and how it is to be measured, need some scrutiny. Many molecules have the same number of atoms (or heavy atoms) and could be viewed as having the same size.
But an index of limited resolution (such as the number of carbon atoms in alkanes, or the number of carbon atoms and heteroatoms in organic molecules) also has limited applicability when one is interested in minor variations in molecular properties, e.g., the variation in the properties of isomers. Molecules having the same number of atoms will in general have different molecular volumes and different molecular weights. Is molecular volume or molecular weight to be used as the preferred descriptor of molecular size? As we will see, there may be more than one way of quantitatively measuring the same molecular attribute.

When statistical methods are used to study structure-property relationships, two deciding factors ought to be considered: (1) the quality of the statistical analysis (measured by the correlation coefficient $r$, the coefficient of determination $r^2$, the standard error $s$, the Fisher ratio $F$, or other statistical parameters); and (2) the possibility of a structural interpretation of the results. Even a cursory screening of the chemical literature shows that these two important factors are too often ignored, if not fully then at least partially. This makes comparative QSAR, which should point to the best statistical models and the best molecular descriptors for specific situations, difficult if not impossible.

The emphasis in the development of molecular descriptors ought to be on the quality of the descriptors, which is measured by how well they cover the molecular structure-property space rather than by how well they discriminate among structures. Structure discrimination is an important property of a descriptor and indirectly points to the capability of a descriptor to capture diverse structural features of a molecule.
However, if those features are not at the same time critical for determining the relative manifestation of the property of interest, a descriptor with high discriminatory power will be of little interest, even if it has a relatively simple structural interpretation (which most highly discriminatory indices do not have).

In the following sections we outline recent developments in the characterization of a selection of molecular attributes. All the results outlined are relatively new, and some of them have not yet been published. For additional introductory material, and in particular for extensions of the graph theoretical approaches to the characterization of 3D structures, which for reasons of space we will not consider here, we direct readers to refs. [4-6], which review most of the recent developments in that area.

CHARACTERIZATION OF MOLECULAR BRANCHING

As early as 1973 Lovasz and Pelikan suggested the leading (first) eigenvalue of the adjacency matrix as a molecular branching index [3]. The connectivity index $\chi$ [7] was initially called the branching index and was thought to be a useful descriptor of molecular branching. Indeed, $\chi$ parallels quite well the boiling points of smaller alkanes, which in turn parallel qualitatively the degree of branching. However, the connectivity index $\chi$ can also be computed for linear structures and for cyclic structures without branches, so the name “connectivity index”, suggested by Kier [8], is a better name for this index than “branching index”.

Both branching indices have some limitations, as they do not discriminate between several structures that show apparently different branching patterns. Thus, for example, the leading eigenvalue is the same for 4-methylheptane and 2,5-dimethylhexane ($\lambda_1 = 2.00000$), while the connectivity index is the same for 3-methylheptane and 4-methylheptane ($\chi = 3.80806$).
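The coincidence of the connectivity index for 3-methylheptane and 4-methylheptane can be checked directly. The minimal sketch below assumes the usual definition of the index from ref. [7], which is not restated in this text: a sum of $1/\sqrt{d_u d_v}$ over all bonds, where $d_u$ is the number of neighbours of atom $u$ in the hydrogen-suppressed graph. The vertex numbering and the function name are illustrative choices only.

```python
from math import sqrt


def connectivity_index(edges):
    """Sum of 1/sqrt(deg(u) * deg(v)) over all bonds (u, v) of a hydrogen-suppressed graph."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return sum(1.0 / sqrt(degree[u] * degree[v]) for u, v in edges)


# Heptane chain C1..C7; vertex 8 is the methyl carbon attached to C3 or C4.
chain = [(i, i + 1) for i in range(1, 7)]
three_methylheptane = chain + [(3, 8)]
four_methylheptane = chain + [(4, 8)]

chi_3 = connectivity_index(three_methylheptane)
chi_4 = connectivity_index(four_methylheptane)
print(f"{chi_3:.5f} {chi_4:.5f}")  # prints "3.80806 3.80806"
```

Both isomers contain the same multiset of bond types (two bonds of type (1,2), two of type (2,2), two of type (2,3) and one of type (1,3)), which is why an index built only from bond types cannot separate them.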
Recently Kirby discussed the limitations of both these indices and offered some remedies that, in particular, improve the performance of the connectivity index for larger alkanes [9].

Is there another branching index that does not show the limitations of the leading eigenvalue of the adjacency matrix and of the connectivity index? Recently such a novel index was proposed [10]. It is based on the path matrix, a newly introduced matrix for graphs in which the matrix elements are the path subgraphs of the graph considered [11]. To obtain the matrix in numerical form, one selects a mathematical property of interest for the subgraphs that appear as matrix elements. When one selects the leading eigenvalues ($\lambda_1$) of the subgraphs and then takes the leading eigenvalue of the matrix so constructed, one obtains the novel branching index (designated $\lambda\lambda_1$). For example, the new branching indices for 3-methylheptane, 4-methylheptane and 2,5-dimethylhexane are 10.2359, 10.2211 and 10.1712, respectively. The $\lambda\lambda_1$ index has been reported for all alkanes having n = 10 or fewer carbon atoms [12]. It has also been reported for selected larger trees that have several coincident topological indices [13]. So far no two trees having the same $\lambda\lambda_1$ index have been found. In Table 1 below we illustrate the path matrix, and in Table 2 the path eigenvalue matrix, for 2-methylpentane.

[Figure: carbon skeleton of 2-methylpentane with numbered atoms]

Table 1

0    p1   p2   p3   p4   p2
p1   0    p1   p2   p3   p1
p2   p1   0    p1   p2   p2
p3   p2   p1   0    p1   p3
p4   p3   p2   p1   0    p4
p2   p1   p2   p3   p4   0

Table 2

0         1.00000   1.41421   1.61803   1.73205   1.41421
1.00000   0         1.00000   1.41421   1.61803   1.00000
1.41421   1.00000   0         1.00000   1.41421   1.41421
1.61803   1.41421   1.00000   0         1.00000   1.61803
1.73205   1.61803   1.41421   1.00000   0         1.73205
1.41421   1.00000   1.41421   1.61803   1.73205   0

The elements of Table 2 are the leading eigenvalues of the adjacency matrices of the paths p1-p4. The leading eigenvalue of the path matrix so constructed is the new branching index.
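The construction of Table 2 and of the resulting index can be sketched in a few lines. This is an illustration under stated assumptions, not the author's program: it uses the standard result that a path with k bonds has leading adjacency eigenvalue $2\cos(\pi/(k+2))$ (which reproduces the four values in Table 2), relies on the fact that in a tree the path between two vertices is unique, and extracts the leading eigenvalue by plain power iteration. All function names are hypothetical.

```python
from collections import deque
from math import cos, pi


def tree_distances(n, edges):
    """All-pairs bond counts in a tree with vertices 1..n, by breadth-first search."""
    adj = {v: [] for v in range(1, n + 1)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    dist = {}
    for src in adj:
        seen = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in seen:
                    seen[w] = seen[u] + 1
                    queue.append(w)
        dist[src] = seen
    return dist


def path_eigenvalue_matrix(n, edges):
    """Entry (i, j): leading eigenvalue of the path joining i and j, 2*cos(pi/(k + 2)) for k bonds."""
    dist = tree_distances(n, edges)
    return [[0.0 if i == j else 2.0 * cos(pi / (dist[i][j] + 2))
             for j in range(1, n + 1)] for i in range(1, n + 1)]


def leading_eigenvalue(matrix, iterations=500):
    """Power iteration; adequate here because every off-diagonal entry is positive."""
    n = len(matrix)
    vec = [1.0] * n
    value = 0.0
    for _ in range(iterations):
        new = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        value = max(new)          # converges to the leading eigenvalue
        vec = [x / value for x in new]
    return value


# 2-methylpentane: chain C1-C5 with the methyl carbon (vertex 6) on C2.
edges_2mp = [(1, 2), (2, 3), (3, 4), (4, 5), (2, 6)]
matrix = path_eigenvalue_matrix(6, edges_2mp)
index = leading_eigenvalue(matrix)
print(f"{index:.4f}")
```

For larger molecules a library eigenvalue routine would be the natural replacement for the hand-written power iteration; it is used here only to keep the sketch free of dependencies.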

CHARACTERIZATION OF MOLECULAR CYCLICITY

Within Chemical Graph Theory [14, 15] the characterization of molecular cyclicity has received some attention. Bonchev, Mekenyan and Trinajstic [16] were the first to propose a cyclicity index for molecules. They developed a set of rules that paralleled the relative magnitude of the Wiener index [17] for cyclic molecules. Although their approach was later improved [18], it still retains its original ad hoc character, in that human intervention is occasionally called for in order to resolve unforeseen ambiguities.

Can one design a molecular cyclicity index that is less dependent on human intervention? Recently a novel cyclicity index was proposed which shows promise. It has high discriminating power and will hopefully resolve questions of relative cyclicity among molecules with little, if any, human intervention [19]. The index is based on the so-called D/DD quotient matrix, constructed from the elements of the graph distance matrix D [20] and the graph detour matrix DD [21]. The element $d_{ij}$ of the distance matrix is the distance between vertices i and j, measured by the number of bonds between them, while the element $dd_{ij}$ of the detour matrix is the length of the longest path between vertices i and j. The (i, j) element of the quotient matrix D/DD is the ratio $d_{ij}/dd_{ij}$. The new index of molecular cyclicity is obtained as the average row sum of the matrix so constructed (suitably normalized). In Table 3 below we illustrate the D/DD matrix for the small bicyclic structure shown at the right of the table.

Table 3

0     1/4   2/4   2/4   1/4
1/4   0     1/4   2/3   1/3
2/4   1/4   0     1/4   2/3
2/4   2/3   1/4   0     1/4
1/4   1/3   2/3   1/4   0

[Figure: small bicyclic structure with vertices numbered 1-5]

A comparison of two molecules having the same number of atoms but different cyclic structures shows that the molecule with more cycles will have smaller row sums, because it allows longer detour paths for many atom pairs.
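The D/DD matrix of Table 3 can be reproduced with a short sketch. The edge list below is an assumption: it is inferred from the table itself (the six vertex pairs whose numerator is 1), which gives a five-membered cycle 1-2-3-4-5 with an additional bond 2-5. Distances come from breadth-first search and detours from an exhaustive search over simple paths, which is only practical for small graphs. The normalization mentioned in the text is not specified there, so the sketch reports the plain average row sum.

```python
from collections import deque
from fractions import Fraction

# Assumed edge list, inferred from the unit-distance entries of Table 3.
EDGES = [(1, 2), (2, 3), (3, 4), (4, 5), (1, 5), (2, 5)]
N = 5

adj = {v: set() for v in range(1, N + 1)}
for u, v in EDGES:
    adj[u].add(v)
    adj[v].add(u)


def shortest(src):
    """Bond-count distances from src to every vertex, by breadth-first search."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def longest(src):
    """Length of the longest simple path from src to every other vertex."""
    best = {}

    def walk(u, visited, length):
        if length > best.get(u, 0):
            best[u] = length
        for w in adj[u]:
            if w not in visited:
                walk(w, visited | {w}, length + 1)

    walk(src, {src}, 0)
    return best


quotient = []
for i in range(1, N + 1):
    d, dd = shortest(i), longest(i)
    quotient.append([Fraction(0) if i == j else Fraction(d[j], dd[j])
                     for j in range(1, N + 1)])

row_sums = [sum(row) for row in quotient]
average = sum(row_sums) / N
print(average)  # prints "47/30"
```

Exact fractions are used so that the entries can be compared with the table without rounding; note that `Fraction` prints 2/4 in lowest terms as 1/2.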
Importantly, the approach permits the construction of indices that apply locally, to individual rings and cycles.

CHARACTERIZATION OF MOLECULAR SHAPE

Shape, just like chirality, is not a characteristic of three-dimensional objects only, although shape and chirality are clearly very apparent properties of three-dimensional objects and molecules. Objects embedded in two-dimensional space may also have distinct shapes. Even objects of lower dimension may show different shapes, as illustrated by the shapes of fractals, the dimension of which typically varies between one and two (if embedded in a plane). Graphs, despite being widely referred to as two-dimensional objects, are strictly speaking one-dimensional mathematical objects [20]. One can nevertheless speak of the shape of a graph, even though graphs allow different pictorial representations and lack a rigid structure. Simple indices of the shapes of graphs are “girth,” “thickness,” and “eccentricity” (see ref. [20] for details), which to some degree discriminate among graphs of widely different forms.

Kier [22] was the first to propose shape indices for molecular graphs, the so-called kappa shape indices. The kappa shape indices have been widely used in QSAR [23]. A close look at the definition of Kier’s $\kappa$ indices reveals their somewhat arbitrary character. The indices are based on a comparison between selected descriptors for the extreme shapes of graphs having n vertices. For trees the extreme graphs are, on the one hand, the linear graph (or path graph) having n vertices and, on the other hand, the so-called “star” graph, in which all vertices are connected to a central vertex. The extreme graphs, the path graph and the star graph, have visibly different counts of paths. Kier used the counts of paths in these extreme cases as the reference points in his definition of the respective shape indices.
One could say that the $\kappa$ shape indices measure the departure of a molecular shape from that of the extreme cases, which set the scale for such a measurement. Hence, we may speak of the $\kappa$ indices as relative shape indices.

Can we design an absolute shape index for molecular graphs, an index that does not require other structures as reference points, as is the case with Kier’s kappa indices? Recently such new shape indices have been proposed [24]. An absolute index has the apparent advantage that it does not require a choice of reference structures. Instead of using only paths to characterize structures, including the extreme reference structures, on the “shape” scale, we combine the counts of paths of different lengths and the counts of walks of different lengths to characterize molecular shape. We illustrate the approach on the graph of 2-methylpentane, assuming the standard numbering of carbon atoms. For each atom we first find the counts of paths ($p_i$) and the counts of walks ($w_i$) of length i. The count of walks could continue indefinitely, since the length of walks is unbounded, but we stop where the paths end, because we will form the quotients $q_i = p_i/w_i$:

        Paths              Walks              Paths/Walks
Atom    p1  p2  p3  p4     w1  w2  w3  w4     q1  q2   q3   q4
1       1   2   1   1      1   3   4   11     1   2/3  1/4  1/11
2       3   1   1          3   4   11         1   1/4  1/11
3       2   3              2   5              1   3/5
4       2   1   2          2   3   7          1   1/3  2/7
5       1   1   1   2      1   2   3   7      1   1/2  1/3  2/7
6       1   2   1   1      1   3   4   11     1   2/3  1/4  1/11

Molecular shape indices are obtained by taking 1/6 of the sum of all ($p_i/w_i$) atomic contributions, giving in the case of 2-methylpentane: q1 = 1.00000, q2 = 0.50278, q3 = 0.20166, q4 = 0.07792.
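The path and walk counts for 2-methylpentane can be regenerated mechanically. The sketch below is an illustration with hypothetical function names: paths are counted by a depth-first search that never revisits a vertex, walks by repeatedly propagating counts along bonds, and each shape index is the mean of the atomic quotients $p_k/w_k$ over the six atoms. With the atom 4 quotient evaluated as $p_3/w_3 = 2/7$, the third index comes out as 559/2772, approximately 0.20166.

```python
from fractions import Fraction

EDGES = [(1, 2), (2, 3), (3, 4), (4, 5), (2, 6)]  # 2-methylpentane, methyl carbon is vertex 6
N, MAX_LEN = 6, 4

adj = {v: [] for v in range(1, N + 1)}
for u, v in EDGES:
    adj[u].append(v)
    adj[v].append(u)


def path_counts(src):
    """counts[k-1] = number of self-avoiding paths of length k starting at src."""
    counts = [0] * MAX_LEN

    def extend(u, visited, length):
        if length == MAX_LEN:
            return
        for w in adj[u]:
            if w not in visited:
                counts[length] += 1
                extend(w, visited | {w}, length + 1)

    extend(src, {src}, 0)
    return counts


def walk_counts(src):
    """counts[k-1] = number of walks of length k starting at src (vertices may repeat)."""
    ends = {v: 1 if v == src else 0 for v in adj}
    counts = []
    for _ in range(MAX_LEN):
        nxt = {v: 0 for v in adj}
        for u, n_walks in ends.items():
            for w in adj[u]:
                nxt[w] += n_walks
        ends = nxt
        counts.append(sum(ends.values()))
    return counts


shape = []
for k in range(MAX_LEN):
    total = sum(Fraction(path_counts(a)[k], walk_counts(a)[k]) for a in adj)
    shape.append(total / N)

print([f"{float(q):.5f}" for q in shape])
# prints ['1.00000', '0.50278', '0.20166', '0.07792']
```

Because the quotients are kept as exact fractions, the printed decimals are rounded only at the last step.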
The few applications of the new shape indices to selected properties of alkanes have shown that these indices lead to significantly better regressions than Kier’s kappa shape indices. In Table 4 below we show the statistical parameters for several thermodynamic properties of heptane isomers when the two shape indices p2/w2 and p3/w3 are used (the index p1/w1 is always 1 and is therefore of no interest).

Table 4

Property                     Coefficient of regression r   Standard error s   Fisher ratio F
Boiling points               0.9340                        2.33 °C            51.3
Pitzer steric factor         0.9729                        0.438              133.0
Entropy                      0.9451                        1.57               62.8
Heat of formation            0.9819                        0.078              201.4
Critical Heat                0.9609                        0.373              89.5
Critical Pressure            0.9840                        0.229              228.9
Heat of formation (vapor)    0.9705                        0.520              121.7

No other simple molecular descriptors, when restricted to the nine isomers of heptane, give such good results. It remains to be seen, however, how these indices will be generalized and extended to molecules having heteroatoms.

CHARACTERIZATION OF MOLECULAR CHIRALITY

Lord Kelvin pointed out already at the beginning of this century that chirality exists in spaces of different dimensions, besides the generally recognized cases of chirality for objects in 3D [25]. We restrict our attention here to the chirality of molecules embedded in 2D and illustrate novel molecular descriptors of chirality, the first of their kind, on smaller benzenoids.

It is generally accepted that a chiral molecule and its antipodal counterpart, if looked at in isolation, have all their mathematical and physical properties identical. Since molecular descriptors are nothing but mathematical properties of a structure, it appears impossible to design a molecular descriptor, topological index, or quantum chemical parameter that will distinguish a molecule from its enantiomer. In fact, all the reported topological indices, and there are several hundred of them, are the same for a structure and its enantiomer.
Chiral structures are discriminated when considered in a chiral environment, such as occurs when a chiral molecule approaches a receptor (which as a rule is also chiral).

So how can one design a descriptor that gives different values for a chiral molecule and its mirror image? The task appears impossible, yet very recently the first graph theoretical invariant (i.e., mathematical property) was designed that differentiates a mirror image from its parent structure [26]. We outline this construction on benzo[a]anthracene (shown below), which represents a chiral structure when embedded in a plane. Observe that if we decide to “walk” around the molecular periphery from a selected atom (such as atom 1), the two opposing directions of the walk produce different walking patterns. We may speak of the asymmetry of the atomic environment along the perimeter and can measure this asymmetry by some graph theoretical descriptor. Let us first make a list of atomic valences as we go around the molecular periphery. For example, starting from atom 1 and including it, we have:

clockwise:      2, 2, 2, 3, 2, 2, 3, 2, 3, 2, 2, 2, 2, 3, 2, 3, 3, 2
anticlockwise:  2, 2, 3, 3, 2, 3, 2, 2, 2, 2, 3, 2, 3, 2, 2, 3, 2, 2

In the next step we form the corresponding sequences of partial sums by successively adding the elements of each series:

clockwise:      2, 4, 6, 9, 11, 13, 16, 18, 21, 23, 25, 27, 29, 32, 34, 37, 40, 42
anticlockwise:  2, 4, 7, 10, 12, 15, 17, 19, 21, 23, 26, 28, 31, 33, 35, 38, 40, 42

The above sequences, both belonging to carbon atom 1, contain some information on the asymmetry of the periphery of the molecule when, starting from atom 1, we go around the molecular periphery in opposite directions. In the next step we want to extract a single descriptor from the two sequences. One can define such a descriptor by considering the difference between the anticlockwise (mathematically positive sense) and clockwise (mathematically negative sense) orientations.
For the above illustration the partial sums add up to 403 (anticlockwise) and 389 (clockwise), so we obtain the difference $D_1 = 14$, where the subscript 1 refers to the atom selected. When this is completed for all atoms one obtains the results shown below:

[Figure: benzo[a]anthracene with the $D_i$ value marked at each of the 18 peripheral atoms: +14, +2, -10, -10, +2, +26, -22, -10, +20, -16, -4, -4, +8, -4, +14, +2, -10, +2]

Clearly some atoms have a more asymmetric environment, while others show less differentiation between the two opposing directions of walking around the molecular periphery. The same calculation can be performed for the mirror image structure. The only difference will be in the sign of the individual contributions, while the corresponding atoms will maintain the same magnitudes. The calculations for a molecule and its enantiomer differ only in the exchange of the clockwise sequence for the anticlockwise one and vice versa.

We use the derived $D_i$ values to construct a chirality index. Consider first simply the sum $\sum_i D_i$. It is zero, and the same is true for the mirror image. However, if we calculate the sums of odd powers of the $D_i$ numbers we will in general obtain a nonzero result. In the case of benzanthracene, using a simple normalization based on the number of atoms in the structure (n = 18), we obtain:

Index                          benzanthracene   mirror image
$\sum_i (D_i)^3/18^3$          +2.17284         -2.17284
$\sum_i (D_i)^5/18^5$          +5.07240         -5.07240
$\sum_i (D_i)^7/18^7$          +10.97954        -10.97954
$\sum_i (D_i)^9/18^9$          +23.70935        -23.70935
$\sum_i (D_i)^{11}/18^{11}$    +51.05061        -51.05061

Alternative normalizations could be considered, but at this stage of the development of chirality indices the emphasis has been on constructing the “impossible” rather than on adjusting the derived index for a specific application. As we see, the molecule and its enantiomer have positive and negative chirality indices, respectively. The above is the first graph theoretical invariant (or, if you will, the first topological index) that discriminates between enantiomers.
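The whole construction fits in a few lines once the clockwise valence sequence is given. The sketch below is an illustration under one stated assumption: the "difference" for atom i is taken as the sum of the anticlockwise partial sums minus the sum of the clockwise partial sums, which is the reading that reproduces the quoted value of 14 for atom 1. The function names are hypothetical, and list indices start at 0, so index 0 corresponds to atom 1.

```python
from itertools import accumulate

# Valences met going clockwise around the periphery, starting at atom 1 (from the text).
CLOCKWISE = [2, 2, 2, 3, 2, 2, 3, 2, 3, 2, 2, 2, 2, 3, 2, 3, 3, 2]


def asymmetry(valences, start):
    """Summed anticlockwise partial sums minus summed clockwise partial sums, from `start`."""
    n = len(valences)
    clockwise = [valences[(start + k) % n] for k in range(n)]
    anticlockwise = [valences[(start - k) % n] for k in range(n)]
    return sum(accumulate(anticlockwise)) - sum(accumulate(clockwise))


def chirality_index(valences, power):
    """Sum of the given odd power of all atomic differences, divided by n**power."""
    n = len(valences)
    total = sum(asymmetry(valences, i) ** power for i in range(n))
    return total / n ** power


D = [asymmetry(CLOCKWISE, i) for i in range(len(CLOCKWISE))]

# The mirror image is traversed in the opposite sense from the same starting atom.
MIRROR = CLOCKWISE[:1] + CLOCKWISE[:0:-1]

print(D[0], sum(D))                                 # prints "14 0"
print(f"{chirality_index(CLOCKWISE, 3):+.5f}")      # prints "+2.17284"
print(f"{chirality_index(MIRROR, 3):+.5f}")         # prints "-2.17284"
```

Integer arithmetic is used up to the final division, so the cancellation to zero of the plain sum is exact rather than approximate.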
If a molecule is achiral, the components $(D_i)^m/n^m$ (m being odd and n the number of atoms in the structure) occur in pairs, positive and negative, so that all the sums $\sum_i (D_i)^m/n^m$ are zero.

There is an important conceptual consequence of the possibility of constructing chirality indices such as those illustrated above. Hitherto chirality has been viewed as a relative property. This means that in order to assign a relative chirality label, such as left or right (e.g., d and l sugars), a molecule M and its mirror image M* have to be compared to some standard. For most molecules one comes across this is possible, but as Ruch has pointed out, for some objects it is not necessarily possible [27]. Thus, for example, one can speak of the left and the right shoe, or the left and the right glove, because these objects can be matched to the left and the right hands, which are assumed to be the standard of reference. However, as indicated by Ruch, one can come across chiral potatoes and not be able to tell which is left and which is right. Moreover, as has been pointed out by Randic and Razinger [28], the classification of chiral molecules as analogous to left and right isomers may depend on the standard chosen. Thus for one standard a molecule M and its enantiomer M* may be classified as corresponding to standards A and A*, respectively, but for some other standard the assignment of the label (* or no *) may be reversed. This clearly points to the relative nature of the classification of molecules and their enantiomers.

In our case the situation is different. We do not need standards to determine the chiral character of a molecule M and its enantiomer. If the chirality indices are positive we have M+, and if they are negative we have M-, independent of any such assignment for any other pair of molecules.

CONCLUDING REMARKS

In this report we have limited our attention to molecules represented by molecular graphs rather than molecules represented as 3D structures.
Some of the recent developments outlined here can be readily extended to 3D structures; others, including the chirality indices, may require more effort and some imagination. Be that as it may, the purpose of this communication has been to raise awareness among practitioners of the need for meaningful, that is, structurally interpretable and conceptually simple, descriptors. Despite there being hundreds of descriptors, there is apparently still room for improvement.

The selection of descriptors to be used in structure-property-activity studies should not be delegated solely to computers, although statistical criteria will continue to be useful for the preliminary screening of descriptors taken from a large pool. Often in an automated selection of descriptors a descriptor is discarded because it is highly correlated with another descriptor already selected. But what is important is not whether two descriptors parallel one another, i.e., duplicate much of the same structural information, but whether they differ in those parts that are important for structure-property correlations. If they differ in the domain that is important for the property considered, both descriptors should be retained; if they differ in parts that are not relevant for the correlation of the property considered, then one of them can be discarded. Hence, the residual of the correlation between two descriptors should be examined and kept or discarded depending on how well it can improve the correlation based on the descriptors already selected. Alternatively, one should replace the set of descriptors used by descriptors that can be extracted from them through the orthogonalization procedure that has been introduced in regression analysis [29-33], perhaps somewhat belatedly, but which should not, because of its late start, be overlooked in the future.

REFERENCES

1. Wiener, remarks at the after dinner talk at the first International Conference on Mathematical Chemistry, Athens, GA.
2. A. R. Katritzky, V. Lobanov and M. Karelson, CODESSA (Comprehensive Descriptors for Structural and Statistical Analysis), University of Florida, Gainesville, FL, 1994.
3. L. Lovasz and J. Pelikan, On the eigenvalue of trees, Period. Math. Hung. 1973, 3, 175-182.
4. M. Randic and M. Razinger, On Characterization of 3D Molecular Structure, in: From Chemical Topology to Three-Dimensional Geometry (A. T. Balaban, Ed.), Plenum Press, New York, 1997.
5. M. Randic, On characterization of chemical structure, J. Chem. Inf. Comput. Sci. 1997, 37, 672-687.
6. M. Randic, Topological Indices, in: Encyclopaedia of Computational Chemistry (P. von Rague Schleyer, Editor-in-Chief), Wiley & Sons, London (in press).
7. M. Randic, On characterization of molecular branching, J. Am. Chem. Soc. 1975, 97, 6609-6615.
8. L. B. Kier, L. H. Hall, W. J. Murray and M. Randic, Molecular connectivity I. Relationship to non-specific local anaesthesia, J. Pharm. Sci. 1975, 64, 1971-1974.
9. E. C. Kirby, Sensitivity of topological indices to methyl group branching in octanes and azulenes, or what does a topological index index? J. Chem. Inf. Comput. Sci. 1994, 34, 1030-1035.
10. M. Randic, On molecular branching, Acta Chim. Slov. 1997, 44, 57-77.
11. M. Randic, D. Plavsic and M. Razinger, Double invariants, MATCH 1997, 35, 243-259.
12. M. Randic, On structural ordering and branching of acyclic hydrocarbons, J. Math. Chem. (submitted).
13. M. Randic, X. Guo and S. Bobst, Use of path matrices for characterization of molecular structure, Proc. Workshop on Mathematical Chemistry, Rutgers University, May 1998.
14. Chemical Applications of Graph Theory (A. T. Balaban, Ed.), Academic Press, London, 1976.
15. N. Trinajstic, Chemical Graph Theory, CRC Press, Boca Raton, FL.
16. D. Bonchev, O. Mekenyan and N. Trinajstic, Topological characterization of cyclic structures, Int. J. Quant. Chem. 1980, 17, 845-893.
17. H. Wiener, Structural determination of paraffin boiling points, J. Am. Chem. Soc. 1947, 69, 17-20.
18. D. Bonchev, A. T. Balaban, X. Liu and D. Klein, Molecular cyclicity and centrality of polycyclic graphs. I. Cyclicity based on resistance distances or reciprocal distances, Int. J. Quant. Chem. 1994, 50, 1-20.
19. M. Randic, On characterization of cyclic structures, J. Chem. Inf. Comput. Sci. 1997, 37, 1063-1071.
20. F. Harary, Graph Theory, Addison-Wesley, Reading, MA, 1969.
21. D. Amic and N. Trinajstic, On the detour matrix, Croat. Chem. Acta 1995, 68, 873-882.
22. L. B. Kier, Shape indexes of orders one and three from molecular graphs, Quant. Struct.-Act. Relat. 1986, 5, 1-7.
23. L. B. Kier, Indexes of molecular shape from chemical graphs, Med. Res. Rev. 1987, 7, 417-440.
24. M. Randic, New shape indices for molecular graphs, in preparation for publication.
25. Lord Kelvin, Baltimore Lectures on Molecular Dynamics and the Wave Theory of Light, pp. 439, 619, C. J. Clay and Sons, London, 1904.
26. M. Randic, unpublished.
27. E. Ruch, Angew. Chem. Intl. Edn. Engl. 1977, 16, 65.
28. M. Randic and M. Razinger, Molecular shapes and chirality, J. Chem. Inf. Comput. Sci. 1996, 36, 429-441.
29. M. Randic, Orthogonal molecular descriptors, New J. Chem. 1991, 15, 517-525.
30. M. Randic, Resolution of ambiguities in structure-property studies by use of orthogonal descriptors, J. Chem. Inf. Comput. Sci. 1991, 31, 311-320.
31. M. Randic, Fitting of nonlinear regressions by orthogonalized power series, J. Comput. Chem. 1993, 14, 363-370.
32. M. Randic, Curve-fitting paradox, Int. J. Quant. Chem: Quant. Biol. Symp. 1994, 21, 215-225.
33. M. Randic, Orthosimilarity, J. Chem. Inf. Comput. Sci. 1996, 36, 1092-1097.

Summary (Povzetek). In this article we report on the use and the latest developments of graph theoretical invariants for the characterization of the branching, cyclicity, external shape and chirality of molecules.