Acta Chim. Slov. 2004, 51, 231-243.

Scientific Paper

FAST ESTIMATION OF SURFACE COMPLEMENTARITY IN PROTEIN COMPLEXES

Giacomo Franzot (a) and Oliviero Carugo (b)

(a) International School for Advanced Studies, via Beirut 4, 34014 Trieste, Italy and Sincrotrone Trieste, Strada Statale 14 - km 163,5 in AREA Science Park, 34012 Basovizza, Trieste, Italy

(b) Department of General Chemistry, Pavia University, viale Taramelli 12, 27100 Pavia, Italy and TASC-INFM National Laboratory, Strada Statale 14 - km 163,5 in AREA Science Park, 34012 Basovizza, Trieste, Italy

Received 22-12-2003

Abstract

A novel measure of protein surface complementarity, sc_pride, is proposed. Each surface patch is represented by the distribution of its inter-atomic distances, and the degree of similarity between two surface patches is estimated via a contingency table analysis of their two inter-atomic distance distributions. Such a low resolution surface representation allows very fast complementarity estimations that could find applications in protein-protein interaction prediction. The performance of sc_pride is compared to that of other surface complementarity measures on a very large set of protein-protein complexes obtained with docking simulations, and the ability of sc_pride to recognize surface complementarity is tested on a non-redundant set of experimentally determined crystal structures of protein-protein complexes.

Key words: protein structure, protein-protein interaction, protein surface

Introduction

The interactions between proteins, and the interaction networks that they form, are of fundamental importance in modern molecular biology.1-4 Life in fact depends on the ability of each protein to correctly recognize its partners, which in turn recognize other proteins. Shape complementarity is a mandatory requirement in protein recognition.
Several algorithms for estimating surface complementarity have been developed so far.2 The oldest were based on very detailed stereochemical descriptions of the protein surfaces. Later, increasing attention was devoted to including the intrinsic molecular flexibility in the description of surface geometry.5,6 In other words, a low resolution portrayal of the protein surface might allow one to overcome the problem of predicting the conformational rearrangements that follow inter-molecular association.

The major drawback of the methods for estimating surface complementarity is that they are usually associated with docking simulations. These computational procedures, which aim to predict the stereochemistry of protein complexes, are very slow because of the complex task of analysing the immense conformational space that includes both the relative orientation of the interacting partners and their stereochemical flexibility.5,6 Moreover, within a docking simulation, the surface complementarity can be estimated only if the relative position of the interacting partners has been hypothesized.

Here we present a new description of the protein surface geometry based on the distribution of the inter-atomic distances, an approach reminiscent of the extremely fast fold comparison procedure implemented in PRIDE.7,8 The two surface patches to be compared are represented by the two distributions of their inter-atomic distances, which are then compared through a contingency table analysis,9 resulting in the surface complementarity score sc_pride (see Experimental section for details).
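The comparison of two distance distributions outlined above can be illustrated with a minimal sketch. It is not the sc_pride implementation: the bin width, the number of bins, the function names, and the use of the raw Pearson chi-square statistic as the output are all assumptions made for illustration, and the way the paper turns the contingency table analysis into the final score is given in its Experimental section, which is not reproduced here.

```python
import math
from itertools import combinations

def distance_histogram(coords, bin_width=2.0, n_bins=10):
    """Histogram of all pairwise inter-atomic distances of one surface patch.

    bin_width and n_bins are illustrative choices, not values from the paper.
    """
    counts = [0] * n_bins
    for a, b in combinations(coords, 2):
        d = math.dist(a, b)
        idx = min(int(d / bin_width), n_bins - 1)  # last bin collects longer distances
        counts[idx] += 1
    return counts

def chi_square_statistic(hist_a, hist_b):
    """Pearson chi-square statistic of the 2 x k contingency table of two histograms."""
    total_a, total_b = sum(hist_a), sum(hist_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("each patch needs at least two atoms")
    grand = total_a + total_b
    chi2 = 0.0
    for obs_a, obs_b in zip(hist_a, hist_b):
        col = obs_a + obs_b
        if col == 0:
            continue  # bins empty in both patches carry no information
        exp_a = total_a * col / grand
        exp_b = total_b * col / grand
        chi2 += (obs_a - exp_a) ** 2 / exp_a + (obs_b - exp_b) ** 2 / exp_b
    return chi2

# Hypothetical patches: four atoms each, the second one uniformly stretched.
patch_a = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0), (0.0, 0.0, 5.0)]
patch_b = [(2 * x, 2 * y, 2 * z) for x, y, z in patch_a]
print(chi_square_statistic(distance_histogram(patch_a), distance_histogram(patch_b)))
```

In this sketch a smaller statistic means more similar distance distributions. Each histogram depends only on the atoms of its own patch, so it can be computed once per patch and reused in every comparison.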
Such a surface representation is intrinsically a low resolution description of the surface geometry, because the atomic displacements that may accompany complex formation modify some of the inter-atomic distances but do not greatly influence the distance distribution. Moreover, in such a representation, the geometric description becomes independent of the 3D structure of the protein-protein complex. Each single monomeric protein surface can be described independently of the position of the other protein partner. Such a surface geometry description can thus be used in computational approaches that partition the protein surface into adjacent patches, such as PUZZLE.10 Given its computational simplicity and speed, and given that it does not need any assumption on the relative position of the interacting partners, such a novel procedure could therefore be of great importance in large scale virtual screening studies.

Results and discussion

Comparison between sc_pride and other measures of surface complementarity

A very large data set of protein-protein complex 3D structures was obtained by the computational docking simulations summarized in Table 1. Each of the theoretical models was assigned an sc_pride value together with the FADE, SC, and scscore values.
FADE (Fast Atomic Density Evaluator) values measure the shape complementarity for docked complexes.11 Each surface is described as a series of contiguous grooves and protruding regions through a fractal atomic density index.12 The latter is the slope of the relationship between log(N) and log(r), where N is the number of atomic centers within a sphere of radius r centered on a dot of a Connolly molecular accessible surface.13 High indices are associated with deep grooves and low values with protruding regions, which is reminiscent of alternative definitions of protrusion at the protein surface.14 The complementarity of the protrusion degree of neighboring surface patches results in a FADE value that is inversely proportional to the protein-protein surface complementarity.

SC values are an alternative measure of surface complementarity.15 They depend on the relative orientation of two unit vectors, one outwardly oriented and normal to the molecular accessible surface of one protein, and the other inwardly oriented and normal to the surface of the other protein. The first unit vector originates from any point P of the surface of the first protein. The second vector starts at the point of the surface of the second protein that is closest to P. If the two surfaces are parallel around P, the two vectors are also parallel and their scalar product reaches its maximum possible value. The 50th percentile of these scalar products, taken over all the points P of each surface, is assumed to measure the surface complementarity at the protein-protein interface. Large SC values are associated with highly complementary surface patches.

While both FADE and SC values depend on the molecular accessible surfaces, the computational docking software suite 3D-Dock16 provides an alternative definition of surface complementarity.
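As an illustration of the fractal atomic density index described above for FADE, the following sketch fits the slope of log(N) against log(r) around a single point. It is not the FADE code: the function name, the set of radii, and the ordinary least-squares fit are assumptions, and in FADE the points would be dots of the molecular surface rather than arbitrary coordinates supplied by the caller.

```python
import math

def fractal_density_index(surface_point, atom_centers, radii):
    """Least-squares slope of log(N) versus log(r) around one surface point.

    N is the number of atomic centers within distance r of surface_point.
    The radii are chosen by the caller and must be distinct (assumption of this sketch).
    """
    xs, ys = [], []
    for r in radii:
        n = sum(1 for atom in atom_centers if math.dist(surface_point, atom) <= r)
        if n > 0:  # log(0) is undefined, so empty spheres are skipped
            xs.append(math.log(r))
            ys.append(math.log(n))
    if len(xs) < 2:
        raise ValueError("need at least two radii with non-empty spheres")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Hypothetical example: a point at the end of a single row of atoms.
row_of_atoms = [(float(i), 0.0, 0.0) for i in range(1, 21)]
print(fractal_density_index((0.0, 0.0, 0.0), row_of_atoms, [2.0, 4.0, 8.0, 16.0]))
```

In this toy setting a point at the tip of an isolated row of atoms gives a slope close to 1, whereas a point surrounded by atoms in all directions gives a slope close to 3, in line with the statement that low indices mark protruding regions and high indices mark buried ones.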
One of the two proteins whose complexation is simulated is roto-translated around the other through the algorithm of Katchalski-Katzir17 and the surface complementarity is computed, after each roto-translation, by grid discretisation of the molecules. Core overlaps between grids are penalized, while surface overlaps make a positive contribution to the protein-protein recognition. The resulting scscore values are proportional to the degree of complementarity.

1,000 theoretical models were randomly selected from each of the 16 docking simulations. Each of the 16,000 protein-protein complexes was given the FADE, SC, scscore, and sc_pride values. Table 2 shows the linear correlation coefficients between

Table 1. Protein-protein complexes used in docking simulations. For each protein in each docking simulation the following information is provided: the PDB identification code (Idcode), the chain identifier (Chain), the protein name (Protein), and the biological source (Source).
Idcode  Chain  Protein                            Source
1a0o    A      CheA                               Escherichia coli
1a0o    B      CheY                               Escherichia coli
1a2k    A      Nuclear Transport Factor 2         Rattus norvegicus
1a2k    D      Ran, Gsp1P                         Canis familiaris
1a4y    A      Angiogenin                         Homo sapiens
1a4y    B      Ribonuclease inhibitor             Homo sapiens
1an1    E      Trypsin                            Sus scrofa
1an1    I      Trypsin inhibitor                  Hirudo medicinalis
1b2s    A      Barnase                            Bacillus amyloliquefaciens
1b2s    D      Barstar                            Bacillus amyloliquefaciens
1c1y    A      Ras binding                        Homo sapiens
1c1y    B      Rap-1A                             Homo sapiens
1clv    A      α-Amylase                          Tenebrio molitor
1clv    I      α-Amylase inhibitor                Amaranthus hypochondriacus
1dpj    A      Proteinase A                       Saccharomyces cerevisiae
1dpj    B      Proteinase inhibitor               Saccharomyces cerevisiae
1fc2    D      Immunoglobulin Fc                  Staphylococcus aureus
1fc2    C      Fragment B of protein A            Homo sapiens
1fle    E      Elastase                           Sus scrofa
1fle    I      Elafin                             Homo sapiens
1jat    A      Ubiquitin Conjugating enzyme E2    Saccharomyces cerevisiae
1jat    B      Ubiquitin Conjugating enzyme Mms2  Saccharomyces cerevisiae
1jhl    H      Antibody D11.15                    Mus musculus
1jhl    A      Lysozyme                           Phasianus colchicus
1mee    A      Serine proteinase                  Bacillus pumilus
1mee    I      Eglin C                            Hirudo medicinalis
1tx4    A      Rho gap                            Homo sapiens
1tx4    B      Rho a                              Homo sapiens
1ugh    E      Uracil-DNA Glycosylase             Homo sapiens
1ugh    I      Glycosylase Inhibitor              Bacteriophage PBS2
2jel    H      Jel42 Fab Fragment                 Mus musculus
2jel    P      His-Containing Protein             Escherichia coli

Table 2. Linear correlation coefficients between various surface complementarity scores. The average values, with standard deviations in parentheses, were computed on 16 sets of 1,000 theoretical models obtained with the 3D-Dock software suite.

          FADE            SC              scscore
SC        -0.249 (0.017)
scscore   -0.166 (0.018)  0.035 (0.010)
sc_pride  -0.159 (0.018)  0.066 (0.018)   0.036 (0.012)

these four measures of surface complementarity.
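The averaging behind Table 2 can be sketched as follows: one linear (Pearson) correlation coefficient is computed per docking set, and the coefficients are then summarized by their mean and standard deviation. The function names are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not state which estimator was applied.

```python
import math
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Linear (Pearson) correlation coefficient between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def summarize_correlations(score_pairs_per_set):
    """Mean and sample standard deviation of the per-set correlation coefficients.

    score_pairs_per_set holds one (scores_a, scores_b) pair per docking set,
    for example the sc_pride and SC values of the models in that set.
    """
    rs = [pearson_r(a, b) for a, b in score_pairs_per_set]
    return mean(rs), stdev(rs)
```

With 16 docking sets of 1,000 models each, `score_pairs_per_set` would contain 16 pairs of 1,000-element lists for every pair of scores compared.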
Figure 1 shows the dependence on sc_pride of the FADE, SC, and scscore values. The correlation coefficients are very small, though statistically different from zero. As expected, the FADE values are negatively correlated with the three other scores, and the latter are all positively correlated with one another. The sc_pride values correlate with the other scores as expected: they increase as the SC and scscore values increase and they decrease as the FADE values increase.

The discrepancy between the various shape scoring functions is quite surprising and has never been described or commented on previously. It must nevertheless be observed that the protein-protein complexes examined here were produced by rigid body docking simulations. Conformational rearrangements caused by the complexation are thus not considered. This might also account for the fact that the SC values computed over the 16,000 three-dimensional models are smaller (Figure 1) than those reported for real protein-protein complexes, which are around 0.6 or higher.

Dependence of sc_pride on the interface dimension

In order to compute sc_pride values, protein surface patches are described by the distributions of their inter-atomic distances. The information provided by these distributions obviously depends on the dimension of the surface patch. For example, the smallest patches, containing only one or two atoms, would be indistinguishable from any other patch of the same size. At the other extreme, a very large patch containing many atoms could be associated with uniformly distributed inter-atomic distances, and it would thus be impossible to discriminate similar from dissimilar pairs of surface moieties.

Sc_pride values were computed for the ensemble of surface patches of the proteins listed in Table 3 and shown in Figure 2. These proteins were selected because they are very different from one another.
1bz6 is a classical compact globin fold; 1cdm and 4cln are calmodulins, but while 1cdm is in the bent conformation adopted in the presence of the substrate (not shown in the figure), 4cln is in the extended conformation; 4aah is a beta-propeller; and 1qsa is a U-shaped alpha-super-helical domain.

Figure 1. Dependence on sc_pride of three different measures of surface complementarity (FADE, SC, and scscore). Standard deviations are indicated by vertical bars. The data were obtained from a set of 16,000 protein-protein complex structures simulated by computational docking.

Table 3. Protein structures used to analyze the dependence of the sc_pride values on the dimension of the surface patches that are compared. For each protein the following information is provided: the PDB identification code (Idcode), the chain identifier (Chain), the protein name (Protein), and the biological source (Source).

Idcode  Chain  Protein                 Source
1bz6    A      Myoglobin               Physeter catodon
1cdm    A      Calmodulin              Bos taurus
1qsa    A      Transglycosylase Slt70  Escherichia coli
4aah    A      Methanol dehydrogenase  Methylophilus methylotrophus JV3A1
4cln           Calmodulin              Drosophila melanogaster

For each protein, n surface patches containing the m (20