ISSN 2590-9770 The Art of Discrete and Applied Mathematics 8 (2025) #P1.08 https://doi.org/10.26493/2590-9770.1789.4e7 (Also available at http://adam-journal.eu) Advanced clique algorithms for protein product graphs* Janez Konc† , Dušanka Janežič‡ University of Primorska, Faculty of Mathematics, Natural Sciences and Information Technologies, Glagoljaška ulica 8, SI-6000 Koper, Slovenia Received 1 April 2024, accepted 13 January 2025, published online 25 February 2025 Abstract In this paper, we give a comprehensive overview of the development of clique algo- rithms and their use for drug design based on the search for cliques in protein product graphs. The maximum clique problem is a computational problem of finding largest sub- sets of vertices in a graph that are all pairwise adjacent. A related problem is the maximum weight clique problem and the highest weight k-clique problem, which both extend the al- gorithm to weighted graphs. The review covers our developed algorithms, starting with our improved branch-and-bound algorithm for finding maximum cliques in undirected graphs from 2007 up to the recent developments of algorithms for weighted graphs in 2024. We show the application of these algorithms to early stages of drug discovery, in particular to protein binding site detection based on protein similarity search in large protein databases and to protein-ligand molecular docking. Keywords: Cliques, protein product graphs, applications. Math. Subj. Class.: 05C69 1 Introduction The maximum clique problem (MCP) is a computational challenge in which the goal is to find the largest subset of vertices within a graph in which every vertex is directly con- nected to every other vertex in the subset. MCP is classified as NP-hard due to its inherent difficulty in finding an optimal solution [3]. Another version of this problem, the maximum weight clique problem (MWCP), op- erates with weighted graphs. In such graphs, each vertex is assigned a numerical weight, *Dedicated to Professor Dragan Marušič on the occasion of the 70th birthday. †Equally Contributed. ‡Corresponding author. Equally Contributed. E-mail addresses: konc@cmm.ki.si (Janez Konc), dusanka.janezic@upr.si (Dušanka Janežič) cb This work is licensed under https://creativecommons.org/licenses/by/4.0/ 2 Art Discrete Appl. Math. 8 (2025) #P1.08 usually indicating its importance, utility or cost. A maximum weight clique is a clique in the weighted graph with the highest sum of the weights of its vertices. The clique problem is useful for a wide range of applications. Its versatility goes be- yond graph theory, as it serves as a model for a variety of challenges in different disciplines. The clique algorithms have been used in many research and industry settings [12]. In this review, we first present the clique algorithms for the MCP and MWCP we have developed between years 2007 and 2024 (Figure 1). We then present their use in drug design through the use of specially constructed protein product graphs. We focus on two areas, namely protein binding sites detection and protein-ligand molecular docking. The developed tools, collectively known as Protein Binding Site (ProBiS) tools, enable highly efficient drug design [7]. Figure 1: Development of clique algorithms for protein binding sites and ligands detection. 2 Protein product graphs for protein structural comparisons For many years now, we have used the developed clique algorithms in a real drug design application. First, we represented proteins as graphs, which is a natural way to represent molecules, since they are composed of atoms and bonds, which are natural graphs. In these, so called “protein graphs”, vertices are positioned at the geometric centers of functional groups of protein surface amino acids, with each vertex representing a physicochemical property of the underlying group: acceptor, donor, π-π stacking, aliphatic, or acceptor- donor. Two vertices, ui and uj , within a protein graph G are considered adjacent if the distance between them (ui, uj) is less than 15 Å [5]. The “protein product graph” Gp of two protein graphs, G1 and G2, is defined on the vertex set V (Gp) = V (G1) × V (G2). To compare a pair of protein graphs, we identify a maximum clique within their product graph, where the maximum clique corresponds to the superimposition aligning the most vertices of the compared protein graphs. Each vertex in the protein product graph (ui, vi) consists of two component vertices: one from the first protein graph (ui ∈ V (G1)) and one from the second protein graph (vi ∈ V (G2)). Generally, a protein product graph has n1×n2 vertices if the respective protein graphs have J. Konc and D. Janežič: Advanced clique algorithms for protein product graphs 3 n1 and n2 vertices. However, we reduce its size by considering only vertices with identical component vertex colors (physicochemical properties). Additionally, component vertices must exhibit similar neighborhoods in their corresponding protein graphs. This similarity is determined based on the comparison of distance matrices representing discretized distances between all pairs of vertices in the neighborhood. Two protein product graph vertices (ui, vi) and (uj , vj) are considered adjacent (we connect them by an edge) if the distances between their respective component vertices in both protein graphs are nearly identical, i.e., they differ by less than 2 Å [5]. We constructed ten benchmark protein product graphs, each derived from a pair of pro- tein structures obtained from the Protein Data Bank (PDB). To assess the impact of protein size (i.e., the number of amino acids) on the performance of the maximum clique search, we generated product graphs from proteins ranging from approximately 50 to 2000 amino acids. Additionally, we considered protein pairs sharing sequence identities ranging from approximately 10% to 95%. These resulting protein product graphs serve as representative standard tests for evaluating newly developed maximum clique algorithms. The benchmark set is available at http://insilab.org (see MaxCliquePara software) [1]. 3 Maximum clique algorithm In a work from 2007 [4], we developed a new algorithm named MaxClique for finding a maximum clique in an undirected graph, in which we improved an approximate coloring algorithm originally used in an algorithm of Tomita et al. [17]. We used our new coloring algorithm to provide bounds to the size of the maximum clique in a basic algorithm, also developed in [6]. Additionally, we extended this basic algorithm to include dynamically varying bounds, which resulted in our MaxCliqueDyn algorithm that proved to be signifi- cantly faster than the basic algorithm that we started with. The MaxCliqueDyn algorithm, as described in [4], offers significant advances over previous such algorithms, in particular over the previous algorithm [17] through the integration of two innovative features. First, we developed an improved approximate coloring algorithm called ColorSort. This algorithm preserves the nodes within the candidate set of graph vertices in a strate- gically arranged descending order based on their degrees. This ordering is based on the insight that the assignment of nodes to color classes, which normally disturbs the order- ing, is only necessary above a certain threshold. This threshold is calculated as kmin = |Qmax| − |Q| + 1, where |Qmax| is the size of the current maximum clique and |Q| is the size of the clique discovered on the current branch of the search tree. Vertices with colors below kmin cannot be used to construct subsequent cliques on the current branch, and therefore they can retain their original order. This approach consistently reduces both the number of steps required to identify a maximal clique and the time spent doing so. Second, we designed the algorithm to adopts tighter, albeit more computationally inten- sive, upper bounds for a portion of the search space to speed up the process of identifying the maximum clique. In this context, the nodes within the set of graph vertices in the lower branches of the search tree are re-sorted based on their degrees. This sorting mechanism is supported by a counter of steps up to a certain level Slevel of the search tree and a counter of total steps taken so far Stotal and a parameter T limit that dynamically activates or deactivates the sorting during the execution of the algorithm if Slevel/Stotal < T limit. This limits the use of computationally expensive resorting to lower levels (close to the root node) of the search tree, where such resorting has the greatest effect on the search effi- 4 Art Discrete Appl. Math. 8 (2025) #P1.08 ciency. This heuristic improves the overall performance of the algorithm on a variety of DIMACS and random graphs and provides the flexibility to fine-tune the performance for specific graph types by adjusting the T limit parameter. 4 Parallel maximum clique algorithm Recognizing the importance of modern multi-core computers and the possibility they give to algorithmic efficiency, in a 2013 work [1], we introduced a new exact parallel maxi- mum clique algorithm called MaxCliquePara. This novel branch-and-bound algorithm for finding a maximum clique in undirected general and protein graphs is based on concepts from two well-established sequential algorithms. We first developed a version for a single computer core, a sequential MaxCliqueSeq algorithm. This algorithm proved to be par- ticularly fast compared to the reference algorithms for both DIMACS benchmark graphs and protein-derived product graphs used for protein structure comparisons. We then paral- lelized the MaxCliqueSeq algorithm by splitting the branch-and-bound search tree across multiple cores, resulting in the MaxCliquePara algorithm. By efficiently utilizing all avail- able cores, the new parallel MaxCliquePara algorithm significantly outperformed the other algorithms tested. On a 12-core computer system, parallelization resulted in up to two orders of magnitude faster execution on large DIMACS benchmark graphs and up to one order of magnitude faster execution on protein product graphs. The MaxCliquePara algorithm uses the multithreading techniques that are supported in most programming languages without additional libraries or other software support. This makes the algorithm transferable to other languages and operating systems. It can be ex- ecuted on most modern multicore computer architectures. The parallel algorithm behaves identically to the MaxCliqueSeq algorithm when run on a single core, but takes slightly longer to run due to the additional thread management code. However, when running on multiple cores, this extra effort is easily compensated if the maximum clique problem to be solved is not trivial, e.g. for graphs with high density. The parallel algorithm explores sev- eral branches simultaneously. This reduces the number of steps compared to the sequential algorithm and leads to a faster execution time (speedup) of the MaxCliquePara algorithm. The experiments performed show significant speedups of the MaxCliquePara algorithm for a smaller number of parallel cores on most of the graphs tested. With the maximum 12 physical cores available, and even with the additional 12 hyperthreaded cores (24 cores in total), the speedup scales strongly for larger and more computationally intensive graphs. An important advantage of MaxCliquePara is the use of shared memory parallelism, which enables low overhead and thus fast execution on a wide range of graph sizes. Superlinear speedups are observed that are consistent with expectations for an algo- rithm that traverses a tree of possible solutions to search. For graphs that are “simple" for the sequential maximum clique algorithm, i.e., with an execution time on the order of milliseconds, parallelism does not provide significant benefits. The MaxCliquePara algo- rithm is currently considered to be one of the fastest general maximum clique algorithms for almost all but the most trivial graphs. 5 Maximum clique algorithm using machine learning In another work from 2022 [14], we introduced machine learning into the MaxCliqueDyn algorithm. We extended the MaxCliqueDyn algorithm with a first phase of machine learn- ing to predict the dynamic parameter (T limit) that is best suited for each input graph, J. Konc and D. Janežič: Advanced clique algorithms for protein product graphs 5 resulting in the new MaxCliqueDyn-ML algorithm. T limit is an empirical parameter that determines the course of the algorithm as described in a previous section. Such graph type adaptability based on modern machine learning is a novel approach that has not yet been used in most graph-theoretic algorithms. We showed empirically that the resulting new algorithm MaxCliqueDyn-ML improves the search speed for certain types of graphs, in particular for molecular docking graphs used in drug discovery that determine energeti- cally favorable conformations of small molecules at a protein binding site. In such cases, the increase in speed is twofold. 6 Maximum weight clique algorithm Weighted graphs are important for drug design because they can represent binding strengths, with the maximum weighted clique in these graphs indicating the strongest bonding site. In a work from 2024 [15], we introduce a new algorithm MaxCliqueWeight for identi- fying a maximum weight clique in a weighted graph, and its variant MaxCliqueDynWeight with dynamically varying bounds. This algorithm uses an efficient branch-and-bound ap- proach with a new weighted graph coloring algorithm that efficiently determines upper weight bounds for a maximum weighted clique in a graph. The algorithm is based on our MaxClique fast branch and bound algorithm for finding a maximum clique in an undi- rected graph. We test the newly developed algorithm and its dynamic variant on random weighted graphs of up to 10,000 vertices and on standard benchmark DIMACS graph, used in different research fields and industries, available at http://insilab.org/maxcliqueweight. We show that our newly developed algorithms are up to three orders of magnitude faster than a comparable Cliquer algorithm on random graphs, with the largest speedup achieved by the MaxCliqueDynWeight algorithm on a graph with 100 vertices and an edge density of 0.95; the MaxCliqueWeight algorithm is up to six orders of magnitude faster on DIMACS graphs, with the highest speedup on the san200_0.7_2 graph with 200 nodes and an edge density of 0.7. The main difference of the MaxCliqueWeight algorithm over the MaxClique algorithm is that in the coloring algorithm, we use the assignment of vertices to color classes as a means to calculate weight upper bounds for the maximum weight of a clique that is reachable from each vertex by following down the branch from that vertex. The idea is to calculate the weight of each color class as the maximum weight of its containing vertices. The sum of color class weights is then used as the upper bound to the weighted clique in a graph. 7 Highest weight k-clique algorithm In drug design, searching for k-cliques is important because both subgraph isomorphism and docking site search involve identifying specific substructures of a given size. The k parameter represents this size, allowing researchers to match smaller graphs or docking molecules to potential binding sites accurately. In a work from 2024 [16], we introduced a new algorithm for identifying up to N highest-weight k-cliques in a vertex-weighted graph, where k and N are given as param- eters. This newly developed algorithm is a versatile graph-theoretical algorithm suitable for various types of vertex-weighted graphs and universal problem solving. To the best of our knowledge, this is the first algorithm of its kind. It is an extension of the MaxClique algorithm. The new algorithm finds up to N highest weighted k-cliques in an undirected vertex-weighted graph. A k-clique is defined as a clique with exactly k vertices, where 6 Art Discrete Appl. Math. 8 (2025) #P1.08 k can take any positive integer value. If k is less than or equal to the size of the maxi- mum clique in a given graph, then the decision problem will have at least one k-clique as a solution. This algorithm is used in molecular docking, so we have created a benchmark set of weighted docking graphs that is available at http://insilab.org/kcliqueweight for other re- searchers to test their algorithms. The algorithm allows to determine the conformation with the lowest energy of a docked ligand within a protein binding site. In this context, the different ligand conformations are represented as a docking graph. This graph consists of vertices, each denoting a ligand fragment with weights corresponding to their docking energy, and edges indicating the connectivity between the fragments. A k-clique with the highest weight in this graph, where the parameter k is the number of fragments of a ligand, corresponds to the conformation of the ligand with the lowest energy. 8 Applications of clique algorithms in drug design Based on the developed clique algorithms, we have developed new ProBiS tools - web servers, databases and algorithms to recognize structurally similar protein binding sites based on the fact that protein surface structures are conserved in binding site regions [6]. These tools search for local similarities in the physico-chemical properties of different protein surface structures, independent of sequence or folding. The proteins are modeled as protein graphs, i.e. as rigid 3D objects consisting of vertices and edges. The developed algorithms are used for the prediction of protein binding sites and ligand transposition in experimental and artificial intelligence modeled protein structures. Based on the MaxCliqueDyn algorithm, we have developed the ProBiS algorithm [5], which aligns and overlays complete protein surfaces, surface motifs or protein binding sites. It enables pairwise alignments of entire protein structures or selected binding sites as well as a fast search of hundreds of thousands of proteins for similar protein binding sites. The algorithm can find similar binding sites even in proteins with different folds and without prior knowledge of their location. The ProBiS algorithm can be used in parallel on one or more CPU platforms. We have developed the ProBiS-Dock docking algorithm [8], which enables fast dock- ing of small molecules to proteins using a newly developed fast highest weight k-clique algorithm. Small molecules and proteins are treated as fully flexible entities that allow conformational changes in both upon ligand binding. An overview of specific case studies in which these algorithms have been used shows their practical application in drug discovery. For example, ProBiS-based methods were used in the inverse docking of various plant substances. Notable studies include the docking of these compounds reported in Refs. [9, 10, 11]. In addition, ProBiS-based methods have contributed to the identification of water-binding sites, with important results published in [2, 13]. 9 Conclusions In summary, the research presented in this paper demonstrates significant advances in clique algorithms and their applications, particularly in the field of drug design. The re- view begins with an investigation of the maximum clique problem (MCP) and its weighted variant (MWCP), both of which are fundamental challenges in complexity theory. Based on insights from previous work, we developed novel algorithms to efficiently solve these prob- J. Konc and D. Janežič: Advanced clique algorithms for protein product graphs 7 lems, including MaxCliqueDyn, MaxCliquePara and MaxCliqueDyn-ML, each of which is tailored to specific computational challenges and improves solution accuracy. Our investigations went beyond traditional clique problems to include the identification of k-cliques with the highest weight in vertex-weighted graphs, a groundbreaking endeavor that is of great use in various graph-theoretic applications. These innovations culminated in the development of the ProBiS suite of tools that utilize clique algorithms to facilitate key tasks in drug discovery, including protein binding site recognition and ligand docking. By constructing protein product graphs and conducting extensive empirical evaluations, we were able to demonstrate the efficacy and scalability of our algorithms across a spectrum of graph sizes and complexities. In particular, our parallelization efforts resulted in signif- icant speedups, especially on multicore architectures, underscoring the practical relevance of using modern computational paradigms to improve algorithmic efficiency. Furthermore, our integration of machine learning techniques into clique algorithms rep- resents a breakthrough approach to algorithmic optimization, enabling adaptive parameter selection and improving solution quality for specific graph types. This fusion of traditional algorithmic principles with contemporary machine learning methods is an example of a synergistic paradigm shift in computational problem solving. The culmination of these efforts in the ProBiS suite provides researchers in the fields of molecular biology, pharmaceuticals and structural bioinformatics with unprecedented capabilities for protein structure analysis, binding site prediction and molecular docking studies. By providing accessible, powerful tools for complex biological analysis, our work paves the way for accelerated drug discovery processes and deeper insights into protein- ligand interactions. In essence, our contributions highlight the transformative potential of clique algorithms in addressing real-world challenges at the intersection of graph theory, computational biol- ogy and pharmaceutical science. As we continue to refine and expand these methods, we anticipate further advances that will catalyze innovation and support researchers in their quest to develop new therapeutics and elucidate biological mechanisms. ORCID iDs Janez Konc https://orcid.org/0000-0003-0160-3375 Dušanka Janežič https://orcid.org/0000-0003-4067-0116 References [1] M. Depolli, J. Konc, K. Rozman, R. Trobec and D. Janežič, Exact parallel maximum clique algorithm for general and protein graphs, J. Chem. Inf. Model. 53 (2013), 2217–2228, doi: 10.1021/ci4002525, https://doi.org/10.1021/ci4002525. [2] M. Jukič, J. Konc, D. Janežič and U. Bren, ProBiS H2O MD approach for identification of conserved water sites in protein structures for drug design, ACS Med. Chem. Lett. 11 (2020), 877–882, doi:10.1021/acsmedchemlett.9b00651, https://doi.org/10.1021/ acsmedchemlett.9b00651. [3] R. M. Karp, Reducibility among combinatorial problems, in: R. E. Miller, J. W. Thatcher and J. D. Bohlinger (eds.), Complexity of Computer Computations: Proceedings of a Symposium on the Complexity of Computer Computations, Springer US, Boston, MA, pp. 85–103, 1972, doi:10.1007/978-1-4684-2001-2_9, https://doi.org/10.1007/ 978-1-4684-2001-2_9. 8 Art Discrete Appl. Math. 8 (2025) #P1.08 [4] J. Konc and D. Janežič, An improved branch and bound algorithm for the maximum clique problem, MATCH Commun. Math. Comput. Chem. 58 (2007), 569–590. [5] J. Konc and D. Janežič, Probis algorithm for detection of structurally similar protein bind- ing sites by local structural alignment, Bioinformatics 26 (2010), 1160–1168, doi:10.1093/ bioinformatics/btq100, https://doi.org/10.1093/bioinformatics/btq100. [6] J. Konc and D. Janežič, Probis tools (algorithm, database, and web servers) for pre- dicting and modeling of biologically interesting proteins, Prog. Biophys. Mol. Biol. 128 (2017), 24–32, doi:10.1016/j.pbiomolbio.2017.02.005, https://doi.org/10.1016/j. pbiomolbio.2017.02.005. [7] J. Konc and D. Janežič, Protein binding sites for drug design, Biophys. Rev. 14 (2022), 1413–1421, doi:10.1007/s12551-022-01028-3, https://doi.org/10.1007/ s12551-022-01028-3. [8] J. Konc, S. Lesnik, B. Skrlj, M. Sova, M. Proj, D. Knez, S. Gobec and D. Janežič, ProBiS- Dock: a hybrid multitemplate homology flexible docking algorithm enabled by protein binding site comparison, J. Chem. Inf. Model. 62 (2022), 1573–1584, doi:10.1021/acs.jcim.1c01176, https://doi.org/10.1021/acs.jcim.1c01176. [9] K. Kores, J. Konc and U. Bren, Mechanistic insights into side effects of troglita- zone and rosiglitazone using a novel inverse molecular docking protocol, Pharmaceu- tics 13 (2021), 315, doi:10.3390/pharmaceutics13030315, https://doi.org/10.3390/ pharmaceutics13030315. [10] K. Kores, S. Lesnik, U. Bren, D. Janežič and J. Konc, Discovery of novel potential human targets of resveratrol by inverse molecular docking, J. Chem. Inf. Model. 59 (2019), 2467–2478, doi:10.1021/acs.jcim.8b00981, https://doi.org/10.1021/acs.jcim.8b00981. [11] S. Lešnik and U. Bren, Mechanistic insights into biological activities of polyphenolic com- pounds from rosemary obtained by inverse molecular docking, Foods 11 (2021), 67, doi: 10.3390/foods11010067, https://doi.org/10.3390/foods11010067. [12] R. Marino, L. Buffoni and B. Zavalnij, A short review on novel approaches for maximum clique problem: from classical algorithms to graph neural networks and quantum algorithms, 2024, arXiv:2403.09742 [math.CO]. [13] V. Ravnik, M. Jukič and U. Bren, Identifying metal binding sites in proteins using homologous structures, the MADE approach, J. Chem. Inf. Model. 63 (2023), 5204–5219, doi:10.1021/acs. jcim.3c00558, https://doi.org/10.1021/acs.jcim.3c00558. [14] K. Reba, M. Guid, K. Rozman, D. Janežič and J. Konc, Exact maximum clique algorithm for different graph types using machine learning, Mathematics 10 (2021), 97, doi:10.3390/ math10010097, https://doi.org/10.3390/math10010097. [15] K. Rozman, A. Ghysels, D. Janežič and J. Konc, An exact algorithm to find a maximum weight clique in a weighted undirected graph, Sci. Rep. 14 (2024), Id/No 9118, doi:10.1038/ s41598-024-59689-x, https://doi.org/10.1038/s41598-024-59689-x. [16] K. Rozman, A. Ghysels, B. Zavalnij, T. Kunej, U. Bren, D. Janežič and J. Konc, Enhanced molecular docking: Novel algorithm for identifying highest weight k-cliques in weighted gen- eral and protein-ligand graphs, J. Mol. Struct. 1304 (2024), 137639, doi:10.1016/j.molstruc. 2024.137639, https://doi.org/10.1016/j.molstruc.2024.137639. [17] E. Tomita and T. Seki, An efficient branch-and-bound algorithm for finding a maximum clique, in: International Conference on Discrete Mathematics and Theoretical Computer Science, Springer, 2003 pp. 278–289.