Evaluating the Dual Randomized Kaczmarz Laplacian Linear Solver

Erik G. Boman
Sandia National Laboratories, Center for Computing Research, USA
E-mail: egboman@sandia.gov

Kevin Deweese and John R. Gilbert
UC Santa Barbara, Department of Computer Science, USA
E-mail: kdeweese@cs.ucsb.edu and gilbert@cs.ucsb.edu

Sandia is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000. The UC Santa Barbara authors were supported by Contract #618442525-57661 from Intel Corp. and Contract #DE-AC02-05CH11231 from the DOE Office of Science.

Keywords: Laplacian solver, randomized Kaczmarz, cycle basis

Received: May 28, 2015

A new method for solving Laplacian linear systems proposed by Kelner et al. involves the random sampling and update of fundamental cycles in a graph. Kelner et al. proved asymptotic bounds on the complexity of this method but did not report experimental results. We seek both to evaluate the performance of this approach and to explore improvements to it in practice. We compare the performance of this method to other Laplacian solvers on a variety of real-world graphs. We consider different ways to improve the performance of this method by exploring different ways of choosing the set of cycles and the sequence of updates, with the goal of providing more flexibility and potential parallelism. We propose a parallel model of the Kelner et al. method for evaluating potential parallelism in terms of the span of edges updated at each iteration. We provide experimental results comparing the potential parallelism of the fundamental cycle basis and our extended cycle set. Our preliminary experiments show that choosing a non-fundamental set of cycles can save significant work compared to a fundamental cycle basis.

1 Introduction

1.1 Graph Laplacians

The Laplacian matrix of a weighted, undirected graph is defined as L = D − A, where D is the diagonal matrix containing the sum of incident edge weights and A is the weighted adjacency matrix. The Laplacian is symmetric and positive semidefinite. The Laplacian is also defined for directed graphs [2], but because directed Laplacians are not symmetric, most of the solvers discussed in this paper do not apply, and efficient solution techniques for them remain an open problem. Solving linear systems on the Laplacians of structured graphs, such as two- and three-dimensional meshes, has long been important in finite element analysis (with applications in electrical and thermal conductivity, and fluid flow modeling [6]) and image processing (with applications in image segmentation, inpainting, regression, and classification [12, 20, 22]). More recently, solving linear systems on the graph Laplacians of large graphs with irregular degree distributions has emerged as an important computational task in network analysis (with applications to maximum flow problems [8], graph sparsification [24], and spectral clustering [17]).

Most applied work on Laplacian solvers has been on preconditioned conjugate gradient (PCG) solvers, including support graph preconditioners [4, 5, 13], or on specialized multigrid methods [20, 21]. Several of these methods seem efficient in practice, but none have asymptotic performance guarantees based on the size of the graph.
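As a concrete reference for the definition above, the following is a minimal sketch (illustrative only, not part of any solver discussed here) of assembling L = D − A with SciPy sparse matrices:

```python
# A minimal sketch (illustrative only) of assembling the Laplacian L = D - A
# of a weighted, undirected graph with SciPy.
import numpy as np
import scipy.sparse as sp

def graph_laplacian(n, weighted_edges):
    """weighted_edges: iterable of (i, j, w) with 0 <= i, j < n and w > 0."""
    rows, cols, vals = [], [], []
    for i, j, w in weighted_edges:
        rows += [i, j]
        cols += [j, i]
        vals += [w, w]
    A = sp.coo_matrix((vals, (rows, cols)), shape=(n, n)).tocsr()   # weighted adjacency
    D = sp.diags(np.asarray(A.sum(axis=1)).ravel())                 # weighted degrees
    return D - A                                                    # symmetric, positive semidefinite
```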
The theoretical computer science community has developed several methods, which we refer to generally as combinatorial Laplacian solvers. These solvers have good complexity bounds but, in most cases, no reported experimental results. Spielman and Teng [25] first showed how to solve these problems in linear times polylogarithmic work, later improved upon by Koutis, Miller, and Peng [19], but their algorithms do not yet have a practical implementation. An algorithm proposed by Kelner et al. [16] has the potential to solve these linear systems in linear times polylogarithmic work with a simple, implementable algorithm.

1.2 The dual randomized Kaczmarz algorithm

The inspiration for the algorithm proposed by Kelner et al. [16], which we refer to as Dual Randomized Kaczmarz (DRK), is to treat graphs as electrical networks with resistors on the edges. For each edge, the weight is the inverse of the resistance. We can think of vertices as having an electrical potential and a net current at every vertex, and define vectors of these potentials and currents as v and f respectively. These vectors are related by the linear system Lv = f. Solving this system is equivalent to finding the set of voltages that satisfies the net "injected" currents. Kelner et al.'s DRK algorithm solves this problem with an optimization algorithm in the dual space, which finds the optimal currents on all of the edges subject to the constraint of zero net voltage around all cycles. They use Kaczmarz projections [15] to adjust currents on one cycle at a time, iterating until convergence.

We will also refer to the Primal Randomized Kaczmarz (PRK) method, which applies Kaczmarz projections in the primal space [26]. One sweep of PRK performs a Kaczmarz projection with every row of the matrix. Rows are taken in random order at every sweep.

DRK iterates over a set of fundamental cycles, the cycles formed by adding individual edges to a spanning tree T. The fundamental cycles are a basis for the space of all cycles in the graph [11]. For each off-tree edge e, we define the resistance R_e of the cycle C_e that is formed by adding edge e to the spanning tree as the sum of the resistances around the cycle,

    R_e = ∑_{e′ ∈ C_e} r_{e′},

which can be thought of as approximating the resistance r_e of the off-tree edge. DRK chooses cycles randomly, with probability proportional to R_e/r_e. The performance of the algorithm depends on the sum of these approximation ratios, a property of the spanning tree called the tree condition number

    τ(T) = ∑_{e ∈ E\T} R_e/r_e.

The number of iterations of DRK is proportional to the tree condition number. Kelner et al. use a particular type of spanning tree with low tree condition number, called a low stretch tree. Specifically, they use the one described by Abraham and Neiman [1] with τ = O(m log n log log n), where n refers to the number of vertices and m refers to the number of edges of the original graph. The work of one iteration is naively the cycle length, but can be reduced to O(log n) with a fast data structure, yielding O(m log² n log log n) total work.

1.3 Overview

The rest of the paper is organized as follows. In Section 2 we survey the related experimental work. Section 3 is an initial evaluation of the DRK algorithm as compared to PCG and PRK. We present our new ideas for improving the performance of DRK in Section 4; in this section we also consider how to perform cycle updates in parallel. Section 5 is an evaluation of the new ideas proposed in Section 4.
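As a concrete reference for the cycle update described in Section 1.2, the following minimal sketch (our own data layout, not the Kelner et al. implementation) performs a single Kaczmarz projection on one cycle: it adjusts the flow on every edge of the cycle so that the net voltage drop around the cycle becomes zero.

```python
# A minimal sketch (our own data layout, not the Kelner et al. code) of one DRK
# cycle update. `cycle` lists (edge_index, sign) pairs giving the edges of the
# cycle with a consistent orientation; `flow[e]` is the current on edge e and
# `r[e]` its resistance.
def update_cycle(flow, r, cycle):
    voltage_drop = sum(s * r[e] * flow[e] for e, s in cycle)  # net voltage around the cycle
    cycle_resistance = sum(r[e] for e, _ in cycle)            # R_e when the cycle is C_e
    delta = voltage_drop / cycle_resistance
    for e, s in cycle:
        flow[e] -= s * delta   # Kaczmarz projection: zero the voltage drop around this cycle
```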
2 Related experimental work

As the DRK algorithm is a recent and theoretical result, there are few existing implementations or performance results. Hoske et al. implemented the DRK algorithm in C++ and did timing comparisons against unpreconditioned CG on two sets of generated graphs [14]. They concluded that the solve time of DRK does scale nearly linearly. However, several factors make the running time too large in practice, including large tree stretch and cycle updates with unfavorable memory access patterns. They cite experimental results by Papp [23], which suggest that the theoretically low stretch tree algorithms are not significantly better than min-weight spanning trees in practice, at least on relatively small graphs.

Chen and Toledo [7] experimented with an early and somewhat different combinatorial approach to Laplacians called support graph preconditioners. They demonstrated that support graph preconditioners can outperform incomplete Cholesky on certain problems. There has also been some experimental work on implementing the local clustering phase of the Spielman and Teng algorithm [27].

3 Initial evaluation of DRK and comparison to PCG and PRK

3.1 Experimental design

Our initial study of DRK measures performance in terms of work instead of time, and uses a somewhat more diverse graph test set than Hoske et al. [14]. We implemented the algorithm in Python with Cython to see how it compared against PCG (preconditioned with Jacobi diagonal scaling) and PRK. However, we did not implement a low stretch spanning tree. Instead we use a low stretch heuristic that ranks and greedily selects edges by the sum of their incident vertex degrees (a cheap notion of centrality). In practice this works well on unweighted graphs. We also did not implement the fast data structure Kelner et al. use to update cycles in O(log n) work.

Our results do not include wall clock time, since our DRK implementation is not highly optimized. Instead we are interested in measuring the total work. For PCG the work is the number of nonzeros in the matrix for every iteration, plus the work of applying the preconditioner at every iteration (the number of vertices for Jacobi). For PRK the work is the number of nonzero entries of the matrix for every sweep, where a sweep is a Kaczmarz projection against all the rows of L. As the DRK work will depend on data structures and implementation, we consider four different costs for estimating the work of updating a single cycle, which we refer to as cost metrics. The first metric is the cost of updating every edge in a cycle, which is included because it is the naive implementation we are currently using. The second metric relies upon the data structure described by Kelner et al., which can update the fundamental cycles in O(log n) work; this may be an overestimate when the cycle length is actually less than log n. The third metric considers a hypothetical log-of-cycle-length update method which we do not know to exist, but is included as a hopeful estimate of a potentially better update data structure. The last metric costs O(1) work per cycle, which is included because we surely cannot do better than this.

Metric 1. cycle length (naive)
Metric 2. log n (using the fast update data structure)
Metric 3. log(cycle length) (optimistic)
Metric 4. 1 (lower bound)
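As a small concrete sketch of this accounting (a hypothetical helper, not part of our implementation; the logarithm base is not significant and base 2 is used here), the work charged for one cycle update under each metric is:

```python
# A hypothetical helper (not from the paper's code) charging the work of a single
# cycle update under the four cost metrics above.
import math

def cycle_update_cost(cycle_length, n, metric):
    if metric == 1:
        return cycle_length               # naive: touch every edge of the cycle
    if metric == 2:
        return math.log2(n)               # fast data structure of Kelner et al.
    if metric == 3:
        return math.log2(cycle_length)    # optimistic log-of-cycle-length update
    return 1                              # metric 4: unit cost lower bound
```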
We ran experiments on all the mesh-like graphs and irregular graphs shown in Appendix Table 1. Mesh-like graphs come from more traditional applications such as model reduction and structure simulation, and have a fairly regular degree distribution. Irregular graphs come from electrical, road, and social networks, and have a more irregular, sometimes exponential, degree distribution. Most of these graphs are in the University of Florida (UF) sparse matrix collection [10]. We added a few 2D and 3D grids along with a few graphs generated with the BTER generator [18]. We removed weights and in a few cases symmetrized the matrices by adding the transpose. We pruned the graphs to the largest connected component of their 2-core, by successively removing all degree-1 vertices, since DRK operates on the cycle space of the graph. The difference between the original graph and the 2-core consists of trees that are pendant on the original graph. These can be solved in linear time, so we disregard them in order to compare solvers on just the structurally interesting part of the graph.

We solve to a relative residual tolerance of 10^-3. Accuracy in the solution is sacrificed in order to run more experiments on larger graphs. The Laplacian matrix is singular with a nullspace of dimension one (because the pruned graph is connected). For DRK and PRK this is not a problem, but for PCG we must handle the non-uniqueness of the solution. Our choice for handling singularity is to remove the last row and column of the matrix. One could also choose to orthogonalize the solution against the nullspace inside the algorithm, but in our experience the performance results are similar.

We also ran a set of PCG vs. DRK experiments where the convergence criterion is an actual error within 10^-3. The actual error can be calculated by knowing the solution in advance. One of the interesting results for the DRK algorithm is that, unlike PCG and PRK, its convergence does not depend on the condition number of the matrix, but instead just on the tree condition number. Since a higher condition number can make small residuals less trustworthy, we wondered whether convergence in the actual error yields different results.

3.2 Experimental results

We compare DRK to the other solvers by examining the ratio of DRK work to the work of the other solvers. The ratio of DRK work to PRK work is plotted in Figure 1, separated by graph type. Each vertical set of four points gives the results for a single graph, sorted on the x axis by graph size. The four points represent the ratio of DRK work to PRK work under the four cost metrics. Points above the line indicate DRK performed more work, while points below the line indicate DRK performed less work. Similar results for the PCG comparison are shown in Figure 2. Another set of PCG comparisons, converged to the actual error, is shown in Figure 3.

An example of the convergence behavior on the USpowerGrid graph is shown in Figure 4. This plot indicates how both the actual error and the relative residual error behave during the solve for both PCG and DRK. A steeper slope indicates faster convergence. Note this only shows metric 1 work for DRK.

3.3 Experimental analysis

In the comparison to PRK (shown in Figure 1), DRK is often better with cost metrics 3 and 4. On a few graphs, mostly in the irregular category, DRK outperforms PRK under all cost metrics (all the points are below the line). In the comparison to PCG (shown in Figure 2), DRK fares slightly better for the irregular graphs, but on both graph sets these results are somewhat less than promising.
PCG often does better (most of the points are above the line); even if we assume unit cost for cycle updates, PCG outperforms DRK. The performance ratios also seem to get worse as graphs get larger.

The results concerning the actual error (shown in Figure 3) are very interesting, as they are quite different from those with the residual tolerance. For all of the mesh graphs, considering the actual error makes DRK look more promising: the relative performance under cost metrics 3 and 4 is now typically better for DRK than for PCG. However, PCG is still consistently better with cost metrics 1 and 2. For some of the irregular graphs the convergence behavior is similar, but for others things look much better when considering actual error. Informally, the number of edges updated by DRK did not change much when switching convergence criteria, but PCG work often increased. The USpowerGrid example (shown in Figure 4) gives a sense of this: the residual error and actual error decrease similarly for DRK, but for PCG the actual error decreases much more slowly than the residual.

Figure 1: DRK vs. PRK: Relative work of DRK to PRK (total DRK work over total PRK work) is plotted against graph size in edges under the four cost metrics, for (a) mesh-like graphs and (b) irregular graphs (PRK is better than DRK at points above the line).

Figure 2: DRK vs. PCG: Relative work of DRK to PCG (total DRK work over total PCG work) is plotted against graph size in edges under the four cost metrics, for (a) mesh-like graphs and (b) irregular graphs (PCG is better than DRK at points above the line).

Figure 3: DRK vs. PCG Converged to Actual Error: Relative work of DRK to PCG work under the four cost metrics is shown for (a) mesh-like graphs and (b) irregular graphs; the convergence tolerance is a norm of the actual error within 10^-3.

Figure 4: DRK and PCG Convergence Behavior on USpowerGrid: Relative residual error and actual error are shown for both solvers over the iterations required for convergence.

4 New algorithmic ideas

We consider ways in which DRK could be improved by altering the choice of cycles and their updates. Our goals are both to reduce total work and to identify potential parallelism in DRK. To this end we are interested in measuring the number of parallel steps, the longest number of steps a single thread would have to perform before convergence, maximized over all threads. Parallel steps are measured in terms of the four cost metrics described in Section 3.1. We will also define the span [9], or critical path length, which is the number of parallel steps with unbounded threads.
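A minimal sketch of this bookkeeping (ours, and only one literal reading of the verbal definitions above): if costs[i][t] is the work, under some cost metric, charged to thread t at iteration i (zero when the thread idles), then

```python
# Hypothetical bookkeeping (ours, not the paper's code) for the measures above,
# where costs[i][t] is the work charged to thread t at iteration i (0 if idle).
def total_work(costs):
    return sum(sum(round_costs) for round_costs in costs)

def parallel_steps(costs):
    # Longest number of steps any single thread performs before convergence.
    return max(sum(col) for col in zip(*costs))

def span(costs):
    # Critical path length: the per-iteration maximum over threads, summed over
    # all iterations (parallel steps with unbounded threads).
    return sum(max(round_costs) for round_costs in costs)
```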
4.1 Expanding the set of cycles

Sampling fundamental cycles with respect to a tree may require updating several long cycles which will not be edge-disjoint. It would be preferable to update edge-disjoint cycles, as these updates could be done in parallel. The cycle set we use does not need to be a basis, but it does need to span the cycle space. In addition to using a cycle basis from a spanning tree, we will use several small, edge-disjoint cycles. We expect that having threads update these small cycles is preferable to having them stand idle.

4.1.1 2D grid example

A simple example of a different cycle basis is the 2D grid graph, shown in Figure 5. In the original DRK, cycles are selected by adding off-tree edges to the spanning tree as in Figure 5(a). As the 2D grid graph is planar, the faces of the grid are the regions bounded by edges, and we refer to the cycles that enclose these regions as facial cycles. We consider using these cycles to perform the updates of DRK, as the facial cycles span the cycle space of a planar graph [11]. Half of these cycles can be updated at one iteration and then the other half can be updated during the next iteration, in a checkerboard fashion, as in Figures 5(b) and 5(c). Furthermore, to speed up convergence, smaller cycles can be added together to form larger cycles (in a multilevel fashion) as in Figure 5(d).

Figure 5: Grid Cycles: (a) Fundamental cycles are formed by adding edges to the spanning tree. (b-c) First-level facial cycles are shown, grouped into edge-disjoint sets. (d) Second-level facial cycles are formed by adding smaller facial cycles.

We implemented such a cycle update scheme using the grid facial cycles, and performed experiments to see how the facial cycles affected the total work, measured both in the number of cycles updated (metric 4) and in the edges updated (metric 1). With the facial cycles, the span per iteration is the cost of updating two cycles at each level. We ran experiments with and without the hierarchical combination of the facial cycles against the original set of fundamental cycles. In the case of the fundamental cycles we use H trees [3], which have optimal stretch O(log n). Solutions were calculated to a residual tolerance of 10^-6. The accuracy here is slightly better than in the rest of the experiments since these experiments were faster.

The results shown in Figure 6 indicate that the facial cycles improve both the work and the span. Using a hierarchical update scheme reduces the total number of edges updated. However, as this requires updating larger cycles, it has a worse span than simply using the lowest level of cycles.

Figure 6: Grid Cycle Performance: Work and span of DRK (edges updated, plotted against grid size in edges) using fundamental cycles, facial cycles without hierarchy, and facial cycles with hierarchy, for two-dimensional grids of various sizes.

4.1.2 Extension to general graphs

We refer to the small cycles we add to the basis as local greedy cycles. Pseudocode for finding these cycles is shown in Algorithm 1.

Algorithm 1 Local Greedy Finder
    function LOCAL-GREEDY(G)
        for e_ij ∈ E do
            if e_ij unmarked then
                p_ij = Truncated-BFS(G \ e_ij, i, j, max_edges)
                Add p_ij + e_ij to cycle set
                Mark all edges in p_ij + e_ij
            end if
        end for
    end function
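For concreteness, here is a runnable sketch of Algorithm 1 under our own assumptions (a simple undirected graph exposing G.edges() and G.neighbors(), e.g. a NetworkX graph; the helper and parameter names are ours, not the paper's). Each still-unmarked edge triggers a breadth-first search, truncated after max_edges edge explorations, for an alternative path between its endpoints; the edge plus that path forms a local greedy cycle whose edges are then marked.

```python
# A runnable sketch (our assumptions, not the authors' code) of the local greedy
# cycle finder: truncated BFS from one endpoint of each unmarked edge.
from collections import deque

def truncated_bfs(G, src, dst, max_edges):
    """Return a path from src to dst that avoids the edge (src, dst), or None."""
    parent, queue, explored = {src: None}, deque([src]), 0
    while queue:
        u = queue.popleft()
        for v in G.neighbors(u):
            if {u, v} == {src, dst}:
                continue                      # do not reuse the edge we want a cycle for
            explored += 1
            if explored > max_edges:
                return None                   # search budget exhausted
            if v not in parent:
                parent[v] = u
                if v == dst:                  # rebuild the path back to src
                    path = [v]
                    while parent[path[-1]] is not None:
                        path.append(parent[path[-1]])
                    return list(reversed(path))
                queue.append(v)
    return None

def local_greedy_cycles(G, max_edges=20):
    marked, cycles = set(), []
    for u, v in G.edges():
        if frozenset((u, v)) in marked:
            continue
        path = truncated_bfs(G, u, v, max_edges)
        if path is None:
            continue                          # no small cycle found for this edge
        cycle = list(zip(path, path[1:])) + [(path[-1], path[0])]
        cycles.append(cycle)
        marked.update(frozenset(e) for e in cycle)
    return cycles
```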
We construct this cycle set by attempting to find a small cycle containing each edge using a truncated breadth-first search (BFS). Starting with all edges unmarked, the algorithm selects an unmarked edge and attempts to find a path between its endpoints. This search is truncated by bounding the number of edges searched, so that each search is constant work and constructing the entire set is O(m) work. If a path is found, the path plus the edge forms a cycle, which is added to the new cycle set, and all edges used are marked. Appendix Table 1 shows the number of local greedy cycles found for all the test graphs when the truncated BFS was allowed to search 20 edges. Greedy cycles were found in all the graphs except tube1, all of whose vertices had such high degree that searching 20 edges was not enough to find a cycle.

Figure 7: Local Greedy Cycles: An edge is selected on the left and a local greedy search is performed to find the cycle on the right.

Adding additional cycles to the cycle basis means we need new probabilities with which to sample all the cycles. Since, in the unweighted case, the stretch of a cycle is just its total length, it seems natural to update cycles with probability proportional to their length.

4.2 Cycle sampling and updating in parallel

In the original DRK algorithm, cycles are chosen one at a time with probability proportional to stretch. We propose a parallel update scheme in which multiple threads each select a cycle, at every iteration, with probability proportional to cycle length. If two threads select cycles that share an edge, one of the threads goes idle for that iteration. In Figure 8, threads 1, 2, and 4 select edge-disjoint cycles. However, thread 3 selects a cycle which contains edge 3, which is already in use by the cycle on thread 1. Thread 3 sits this iteration out while the other threads update their cycles.

Figure 8: Example of Processors Selecting Cycles: Threads 1, 2, and 4 select edge-disjoint cycles, but thread 3 selects a cycle with edge 3 already in use. Thread 3 will go idle for an iteration.

We compute several measures of parallel performance. The first is simply the number of iterations. The second is the total work across all threads at every iteration. Lastly we measure the span, or critical path length: the maximum of the work over all threads, summed over all the iterations.

We envision threads working in a shared memory environment on a graph that fits in memory. This might not be realistic in practice, as there must be some communication of which edges have already been used, which might be too expensive relative to the cost of a cycle update. However, we are simply interested in measuring the potential parallelism, so we ignore any communication cost.

The parallel selection scheme conditions the probabilities with which cycles are selected on their edges being available,

    p(C_e) = (1/τ) (R_e/r_e) · p(all edges e′ ∈ C_e are available).

This scheme creates a bias towards smaller cycles with fewer conflicting edges as more threads are added, which can increase total work.
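One round of this selection model can be sketched as follows (hypothetical structure and names, ours; the simulation used in Section 5 follows this pattern): each thread draws a cycle with probability proportional to its length, and a thread whose cycle shares an edge with an earlier pick idles for the round.

```python
# A minimal sketch (hypothetical, not our actual driver) of one round of the
# parallel selection model: length-weighted sampling with conflict detection.
import random

def parallel_round(cycles, num_threads):
    lengths = [len(c) for c in cycles]                      # unweighted: stretch ~ length
    picks = random.choices(range(len(cycles)), weights=lengths, k=num_threads)
    edges_in_use, active = set(), []
    for thread, c in enumerate(picks):
        cycle_edges = {frozenset(e) for e in cycles[c]}
        if cycle_edges & edges_in_use:
            continue                                        # conflict: this thread idles
        edges_in_use |= cycle_edges
        active.append((thread, c))                          # these cycles update concurrently
    return active
```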
5 Experiments and results

5.1 Experimental design

We performed experiments on a variety of unweighted graphs from the UF Sparse Matrix Collection (the same set as in Section 3.1, shown in Appendix Table 1). Again we distinguish between mesh-like graphs and irregular graphs. We also use a small test set for weak scaling experiments, consisting of 2D grids and BTER graphs.

We continue to use our Python/Cython implementation of DRK, without a guaranteed low stretch spanning tree or a cycle update data structure. The code does not run in parallel, but we simulate parallelism on multiple threads by selecting and updating edge-disjoint cycles at every iteration, as described above.

Our experiments consist of two sets of strong scaling experiments, using the spanning tree cycles with and without local greedy cycles, up to 32 threads. We set a relative residual tolerance of 10^-3. Again we sacrifice accuracy to run more experiments on larger graphs. We consider the same four cycle update cost metrics as in Section 3.1: cycle length, log n, log(cycle length), and unit cost. However, in the case of the local greedy cycles, which cannot use the log n update data structure, we always charge the number of edges in the cycle. For all the cost models, we measure the total work required for convergence and the number of parallel steps taken to converge. For metric 4 these will be the same. A condensed subset of the scaling results is shown in Appendix Table 2.

5.2 Experimental results

First, we examine the effects of using an expanded cycle set in the sequential algorithm. We estimate the usefulness of extra cycles as the length of the largest cycle in the fundamental set normalized by the number of cycles in the fundamental set. This is because we suspect the large cycles to be a barrier to performance, as they are updated the most frequently, and at the highest cost. The performance of the local greedy cycles for the two different graph types, using metrics 1 and 4, is shown in Figure 9. These plots show the ratio between the work with and without the expanded cycle set as a function of the estimated usefulness. Points below the line indicate that adding local greedy cycles improved performance.

We examine how the local greedy cycles perform as graph size increases with weak scaling experiments on 2D grid graphs and BTER graphs. The 2D grids used for this experiment are the same as in Figure 6, and the BTER graphs were generated with the parameters: average degree 20, maximum degree √n, global clustering coefficient 0.15, and maximum clustering coefficient 0.15. The performance under cost metric 1 as graph size scales is shown in Figure 10.

Figure 11 shows examples of our results on three of the graphs. In Figure 11(a) the parallel steps (under the four different metrics) are plotted as a function of the number of threads used for the barth5 graph. The total edges (metric 1) is at the top of the plot, while the unit cost (metric 4) is at the bottom. These results are shown for both the fundamental and the extended cycle sets. Figure 12 shows the effect of adding threads on the total work.

To measure the parallel performance across multiple graphs we look at the average speedup of the parallel steps across all graphs. Speedup is defined as the sequential work on one thread divided by the number of parallel steps on multiple threads. The speedup with and without extended cycles is shown in Figure 13. Note that without local greedy cycles the metric 2 and metric 4 speedups are the same, since those costs differ only by a factor of log n. We compare the speedup between the different cycle sets for the different graph types in Figure 14. Results are shown only for metric 1.
The speedup of using 8 threads without local greedy cycles is plotted against the speedup of using 8 threads with local greedy cycles.

5.3 Experimental analysis

In the sequential results shown in Figure 9, there seems to be a threshold of largest cycle length above which local greedy cycles can be useful, but below which there is not much difference. However, there is not a clear scaling with the cycle length ratio, indicating that this is still a crude guess as to where the extended cycle set is useful. Also note that mesh-like graphs tend to have a larger maximum cycle length than irregular graphs, leading to local greedy cycles working better on meshes. The local greedy cycle improvement is slightly better for metric 1, where we count every edge update. At the other extreme, when updating large cycles has the same (unit) cost as the small cycles added by local greedy, the local cycles are less effective; however, there is still an improvement in the number of cycles updated. We were unable to find a good measure of the usefulness of a single local greedy cycle.

The weak scaling experiments shown in Figure 10 show that, with the exception of a BTER outlier, the work scales nicely with graph size, and results for both cycle sets scale similarly. The extended cycle set benefits the 2D grid graphs, while the BTER graphs see little improvement or are worse. This is consistent with Figure 9, since the BTER graphs are irregular and the 2D grids are mesh-like.

Figure 9: Sequential Comparison of Cycle Set Work, under (a) metric 1 and (b) metric 4: The ratio of DRK work with and without local greedy cycles, on one thread, is plotted against an estimate of the usefulness of extra cycles (largest fundamental cycle length divided by the number of fundamental cycles). Points below the line indicate that adding local greedy cycles helped.

Figure 10: Weak Scaling of Cycle Set Work Under Cost Metric 1, for (a) 2D grids and (b) BTER graphs: The DRK work with and without local greedy cycles, on one thread, is plotted against the graph size in vertices.

Figure 11: Parallel Steps Scaling, shown for three example graphs, (a) barth5, (b) tuma1, and (c) email: As threads are added, the parallel steps decrease for both cycle sets (a steeper slope indicates better scaling).

Figure 12: Total Work Scaling of the email Graph: As threads are added, the total work increases for both cycle sets (ideally it would stay constant).
Figure 13: Average Parallel Steps Speedup, for (a) mesh-like graphs and (b) irregular graphs: The ratio of sequential work on one thread to parallel steps on multiple threads is plotted for up to 32 threads, with and without local greedy cycles, under the four cost metrics.

Figure 14: 8 Thread Speedup Comparison: The speedup without local greedy cycles is plotted against the speedup with local greedy cycles at 8 threads for all graphs (below the line the local greedy speedup is better).

The parallel steps scaling plots show a variety of behaviors on the example graphs. On the mesh-like barth5 graph (shown in Figure 11(a)), the local greedy cycles improve both sequential performance and the scaling of the parallel steps: at the left of the plot the extra cycles improve the sequential results, and as threads are added the steeper slope indicates that the local greedy cycles also improve the scaling of the parallel steps. On the tuma1 graph (shown in Figure 11(b)), the local greedy cycles improve sequential performance, but result in similar or worse scaling: at the left of the plot the extra cycles improve results sequentially, but when scaled to 32 threads performance is similar. On the email graph (shown in Figure 11(c)), the local greedy cycles do not improve sequential performance, there is little difference between the cycle sets, and scaling is poor with both, quickly flattening out by about four threads. For a better understanding of the poor parallel steps scaling on the email graph, we examine the total work scaling (shown in Figure 12), which shows how much extra work we have to do when skewing the probability distribution. This extra work quickly increases, limiting the parallel performance.

In the average parallel steps speedup plot (shown in Figure 13), we see similar speedup for both cycle sets. On mesh-like graphs the local greedy cycles do slightly better on all cost metrics beyond 16 threads. However, on the irregular graphs, only with cycle cost metric 4 do the local greedy cycles perform better, and under metric 3 they perform worse. (Again note that without local greedy cycles the metric 2 and metric 4 speedups are the same.) We hypothesized that giving the solver smaller, extra cycles would improve the parallel performance compared to the fundamental cycles. However, this seems to be true only for mesh-like graphs, and even then the improvement is minimal. An interesting thing to note is that the speedup is better with the log n cost model. This is probably due to overcharging small cycles, which is less problematic when there are more threads to pick potentially larger cycles.

Taking a snapshot of the parallel steps speedup results on eight threads (shown in Figure 14), we see that there are some irregular graphs which do not have much speedup for either cycle set (bottom left of the plot), but there are mesh-like and irregular graphs which enjoy a speedup for both cycle sets (top right of the plot). It is difficult to say on which graphs different cycle sets will aid parallelism.

6 Conclusion

We have done an initial comparison of Kelner et al.'s DRK algorithm with PCG and PRK.
These preliminary results, measuring algorithm work by the number of edges touched or by the number of cycles updated, do not at present support the practical utility of DRK. For mesh-like graphs, PCG usually takes less work than DRK, even if DRK is charged only one unit of work per cycle update. This suggests that the fast cycle update data structure proposed by Kelner et al. (or any undiscovered fast update method) will not be enough to make DRK practical. It does seem that DRK is an improvement over PRK on several graphs, mostly irregular graphs.

One promising result of these experiments is that DRK converges to a small actual error similarly to the residual error, while PCG sometimes does not. More PCG iterations are required when solving to a low actual error, while DRK work does not increase very much. More work should be done to understand this behavior.

The experiments in this paper were limited to unweighted graphs for simplicity. Experiments with weighted graphs should be run for more complete results. An open question is whether there is a class of graphs with high condition number, but with practical low stretch trees, where DRK will perform significantly better in practice.

We suggest techniques for improving DRK in practice. One possible improvement is to use a spanning set, including non-fundamental cycles, to accelerate convergence. Using the facial cycles of a two-dimensional grid graph greatly reduces the required number of edge updates compared to the fundamental cycle basis. We try to generalize these cycles by finding small local greedy cycles. These cycles can accelerate convergence, especially for mesh-like graphs. It is difficult to measure the usefulness of any one cycle in the basis, so it is difficult to determine where and which extra cycles are useful.

We also consider how DRK could be implemented in parallel to take advantage of simultaneous updates of edge-disjoint cycles. We describe a model in which threads select cycles, and go idle if a conflicting edge is found. While this can increase total work, it can often reduce the number of parallel steps. However, there is a limit to this parallelism. Furthermore, the scaling behavior seems to be similar with or without local greedy cycles.

References

[1] I. Abraham and O. Neiman (2012) Using petal-decompositions to build a low stretch spanning tree, Proceedings of the 44th Annual ACM Symp. on Theory of Comp., ACM, New York, NY, USA, pp. 395–406.

[2] R. Agaev and P. Chebotarev (2005) On the spectra of nonsymmetric Laplacian matrices, Linear Algebra and its Appl., Elsevier, pp. 157–168.

[3] N. Alon, M. Karp, D. Peleg, and D. West (1995) A graph-theoretic game and its application to the k-server problem, SIAM Journal on Comp., SIAM, pp. 78–100.

[4] M. Bern, J. Gilbert, B. Hendrickson, N. Nguyen, and S. Toledo (2006) Support-graph preconditioners, SIAM Journal on Matrix Anal. Appl., SIAM, pp. 930–951.

[5] E. G. Boman, D. Chen, B. Hendrickson, and S. Toledo (2004) Maximum-weight-basis preconditioners, Numerical Linear Algebra Appl., Wiley, pp. 695–721.

[6] E. G. Boman, B. Hendrickson, and S. Vavasis (2008) Solving elliptic finite element systems in near-linear time with support preconditioners, SIAM Journal on Numerical Anal., SIAM, pp. 3264–3284.

[7] D. Chen and S. Toledo (2003) Vaidya's preconditioners: Implementation and experimental study, Electronic Transactions on Numerical Anal. [electronic only], pp. 30–49.
[8] P. Christiano, J. A. Kelner, A. Madry, D. A. Spielman, and S.-H. Teng (2011) Electrical flows, Laplacian systems, and faster approximation of maximum flow in undirected graphs, Proceedings of the 43rd Annual ACM Symp. on Theory of Comp., ACM, San Jose, CA, USA, pp. 273–282.

[9] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein (2009) Introduction to Algorithms, MIT Press and McGraw-Hill.

[10] T. A. Davis and Y. Hu (2011) The University of Florida sparse matrix collection, ACM Transactions on Mathematical Software, ACM, pp. 1:1–1:25.

[11] R. Diestel (2012) Graph Theory, Springer.

[12] L. Grady (2006) Random walks for image segmentation, IEEE Transactions on Pattern Anal. and Machine Intelligence, IEEE, pp. 1768–1783.

[13] K. Gremban (1996) Combinatorial Preconditioners for Sparse, Symmetric, Diagonally Dominant Linear Systems, PhD thesis, Carnegie Mellon University, Pittsburgh, PA, USA.

[14] D. Hoske, D. Lukarski, H. Meyerhenke, and M. Wegner (2015) Is nearly-linear the same in theory and practice? A case study with a combinatorial Laplacian solver, Computing Research Repository [electronic only], http://arxiv.org/abs/1502.07888.

[15] S. Kaczmarz (1937) Angenäherte Auflösung von Systemen linearer Gleichungen, Bulletin International de l'Academie Polonaise des Sciences et des Lettres, pp. 355–357.

[16] J. A. Kelner, L. Orecchia, A. Sidford, and Z. A. Zhu (2013) A simple, combinatorial algorithm for solving SDD systems in nearly-linear time, Proceedings of the 45th ACM Symp. on Theory of Comp., ACM, Palo Alto, CA, USA, pp. 911–920.

[17] N. L. D. Khoa and S. Chawla (2015) A scalable approach to spectral clustering with SDD solvers, Journal of Intelligent Info. Sys., Springer, pp. 289–308.

[18] T. G. Kolda, A. Pinar, T. Plantenga, and C. Seshadhri (2014) A scalable generative graph model with community structure, SIAM Journal on Scientific Comp., SIAM, pp. C424–C452.

[19] I. Koutis, G. L. Miller, and R. Peng (2014) Approaching optimality for solving SDD systems, SIAM Journal on Comp., SIAM, pp. 337–354.

[20] I. Koutis, G. L. Miller, and D. Tolliver (2011) Combinatorial preconditioners and multilevel solvers for problems in computer vision and image processing, Computer Vision and Image Understanding, Elsevier, pp. 1638–1646.

[21] O. E. Livne and A. Brandt (2012) Lean algebraic multigrid (LAMG): Fast graph Laplacian linear solver, SIAM Journal on Scientific Comp., SIAM, pp. B499–B522.

[22] J. McCann and N. S. Pollard (2008) Real-time gradient-domain painting, ACM Transactions on Graphics, ACM, pp. 93:1–93:7.

[23] P. A. Papp (2014) Low-Stretch Spanning Trees, Undergraduate thesis, Eötvös Loránd University, Budapest, Hungary.

[24] D. A. Spielman and N. Srivastava (2011) Graph sparsification by effective resistances, SIAM Journal on Comp., SIAM, pp. 1913–1926.

[25] D. A. Spielman and S. Teng (2004) Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems, Proceedings of the 36th Annual ACM Symp. on Theory of Comp., ACM, New York, NY, USA, pp. 81–90.

[26] T. Strohmer and R. Vershynin (2009) A randomized Kaczmarz algorithm with exponential convergence, Journal of Fourier Anal. and Appl., Springer, pp. 262–278.

[27] Z. A. Zhu, S. Lattanzi, and V. S. Mirrokni (2013) A local algorithm for finding well-connected clusters, Proceedings of the 30th International Conference on Machine Learning, JMLR Workshop and Conference Proceedings, Atlanta, GA, USA, pp. 396–404.
Graph Nodes Edges 2-core 2-Core Greedy Probability of Largest (Collection) Nodes Edges Cycles Selecting Greedy Cycle Length jagmesh3 (HB) 1.09k 3.14k 1.09k 3.14k 1.92k 0.2419 77 lshp1270 (HB) 1.27k 3.70k 1.27k 3.70k 2.17k 0.4712 95 rail_1357 1.36k 3.81k 1.36k 3.81k 1.85k 0.2507 55 (Oberwolfach) 50 x 50 grid 2.50k 4.90k 2.50k 4.90k 2.40k 0.5000 120 data (DIMACS10) 2.85k 15.1k 2.85k 15.1k 7.43k 0.1760 92 100 x 100 grid 10.0k 19.8k 10.0k 19.8k 9.80k 0.5000 230 20 x 20 x 20 grid 8.00k 22.8k 8.00k 22.8k 3.57k 0.1941 122 L-9 (A-G Monien) 18.0k 35.6k 18.0k 35.6k 17.6k 0.4992 411 tuma1 23.0k 37.2k 22.2k 36.5k 10.7k 0.0610 420 (GHS_indef) barth5 (Pothen) 15.6k 45.9k 15.6k 45.9k 29.9k 0.1765 375 cti (DIMACS10) 16.8k 48.2k 16.8k 48.2k 7.27k 0.0501 172 aft01 (Okunbor) 8.21k 58.7k 8.21k 58.7k 26.6k 0.6680 105 30 x 30 x 30 grid 27.0k 78.3k 27.0k 78.3k 8.35k 0.1399 202 wing (DIMACS10) 62.0k 122k 62.0k 122k 27.9k 0.0301 605 olesnik0 88.3k 342k 88.3k 342k 220k 0.1327 363 (GHS_indef) tube1 (TKK) 21.5k 438k 21.5k 438k 0 0.0000 102 fe_tooth (DIMACS10) 78.1k 453k 78.1k 453k 217k 0.3673 286 dawson5 (GHS_indef) 51.5k 480k 20.2k 211k 19.8k 0.0941 165 (a) Mesh-like Graphs Graph Nodes Edges 2-core 2-core Greedy Probability of Largest (Collection) Nodes Edges Cycles Selecting Greedy Cycle Length EVA (Pajek) 8.50k 6.71k 314 492 84 0.2346 18 bcspwr09 (HB) 1.72k 2.40k 1.25k 1.92k 651 0.3276 54 BTER1 981 4.85k 940 4.82k 510 0.0465 18 davg = 10, dmax = 30 ccmax = .3, ccglobal = .1 USpowerGrid (Pajek) 4.94k 6.59k 3.35k 5.01k 1.68k 0.2997 80 email (Arenas) 1.13k 5.45k 978 5.30k 362 0.0433 11 uk (DIMACS10) 4.82k 6.84k 4.71k 6.72k 1.97k 0.2488 211 as-735 (SNAP) 7.72k 13.9k 4.02k 10.1k 3.83k 0.0822 9 ca-GrQc (SNAP) 4.16 13.4k 3.41k 12.7k 4.43k 0.2315 22 BTER2 4.86k 25.1k 4.54k 24.8k 2.69k 0.0468 17 davg = 10, dmax = 70 ccmax = .3, ccglobal = .1 gemat11 (HB) 4.93k 33.1k 4.93k 33.1k 9.72k 0.0011 42 BTER3 4.94k 37.5k 4.66k 37.2k 4.79k 0.0518 18 davg = 15, dmax = 70 ccmax = .6, ccglobal = .15 dictionary28 (Pajek) 52.7k 89.0k 20.9k 67.1k 20.2k 0.1410 36 astro-ph (SNAP) 16.7k 121k 11.6k 111k 13.2k 0.0786 18 cond-mat-2003 31.2k 125k 25.2k 114k 32.5k 0.1533 23 (Newman) BTER4 999 171k 999 171k 33 0.0002 7 davg = 15, dmax = 30 ccmax = .6, ccglobal = .15 HTC_336_4438 (IPSO) 226k 339k 64.1k 192k 32.9k 0.0339 990 OPF_10000 (IPSO) 43.9k 212k 42.9k 211k 122k 0.3146 53 ga2010 (DIMACS10) 291k 709k 282k 699k 315k 0.1466 941 coAuthorsDBLP 299k 978k 255k 934k 297k 0.1524 36 (DIMACS10) citationCiteseer 268k 1.16M 226k 1.11M 150k 0.0484 56 (DIMACS10) (b) Irregular Graphs Appendix Table 1: Statistics of All Graphs Used in Experiments. Evaluating the Dual Randomized Kaczmarz. . . 
Informatica 40 (2016) 95–107 107 Graph Sequential Work 2 Thread Parallel Steps 8 Thread Parallel Steps and Metric (with Local Greedy) (with Local Greedy) (with Local Greedy) jagmesh3 (Metric 1) 2.73M (1.72M) 2.02M (1.26M) 1.07M (28.3K) jagmesh3 (Metric 4) 127K (101K) 69.7K (51.2K) 632K (17.4K) lshp1270 (Metric 1) 6.80M (4.35M) 4.87M (3.50M) 3.41M (2.29M) lshp1270 (Metric 4) 192K (150K) 104K (85.1K) 61.0K (41.9K) rail_1357 (Metric 1) 1.64M (1.26M) 1.21M (909K) 653K (550K) rail_1357 (Metric 4) 119K (114K) 63.8K (57.0K) 143K (143K) 50 x 50 grid (Metric 1) 9.42M (4.32M) 9.42M (4.43M) 3.55M (1.50M) 50 x 50 grid (Metric 4) 213K (125K) 115K (62.5K) 45.0K (20.0K) data (Metric 1) 17.9M (16.4M) 13.1M (13.1M) 8.43M (7.41M) data (Metric 4) 815K (878K) 416K (470K) 185K (168K) 100 x 100 grid (Metric 1) 84.0M (38.1M) 59.7M (29.1M) 27.2M (13.1M) 100 x 100 grid (Metric 4) 1.11M (610K) 560K (310K) 180K (90.0K) 20 x 20 x 20 grid (Metric 1) 64.7M (63.3M) 45.3M (46.0M) 30.9M (28.6M) 20 x 20 x 20 grid (Metric 4) 1.55M (1.61M) 816K (856K) 448K (416K) L-9 (Metric 1) 820M (382M) 557M (266M) 346M (124M) L-9 (Metric 4) 3.92M (2.03M) 2.07M (1.04M) 1.10M (396K) tuma1 (Metric 1) 597M (282M) 362M (186M) 147M (75.9M) tuma1 (Metric 4) 3.36M (1.69M) 1.60M (845K) 512K (267K) barth5 (Metric 1) 282M (149M) 212M (119M) 118M (54.9M) barth5 (Metric 4) 3.11M (1.98M) 1.61M (1.03M) 655K (312K) cti (Metric 1) 204M (195M) 142M (143M) 87.7M (103M) cti (Metric 4) 3.87M (3.89M) 2.02M (2.09M) 1.01M (1.20M) aft01 (Metric 1) 127M (118M) 90.8M (88.3M) 45.4M (43.5M) aft01 (Metric 4) 4.38M (4.32M) 2.19M (2.22M) 763k (738k) 30 x 30 x 30 grid (Metric 4) 401M (394M) 284M (283M) 152M (142M) 30 x 30 x 30 grid (Metric 4) 6.51M (6.62M) 3.35M (3.40M) 1.35M (1.27M) wing (Metric 1) 4.98B (4.16B) 3.41B (2.99B) 2.58B (2.08B) wing (Metric 4) 21.8M (18.8M) 12.2M (10.8M) 8.31M (6.70M) olesnik0 (Metric 1) 2.69B (1.71B) 1.99B (1.36B) 978M (630M) olesnik0 (Metric 4) 33.8M (24.6M) 16.9M (1.28M) 5.47M (3.62M) tube1 (Metric 1) 2.74B (N/A) 1.98B (N/A) 1.59B (N/A) tube1 (Metric 4) 56.1M (N/A) 32.0M (N/A) 22.9M (N/A) fe_tooth (Metric 1) 6.56B (5.76B) 4.46B (4.09B) 3.04B (2.87B) fe_tooth (Metric 4) 65.5M (62.1M) 34.3M (32.7M) 19.8M (18.8M) dawson5 (Metric 1) 1.49B (1.45B) 1.06B (1.05B) 650M (658M) dawson5 (Metric 4) 24.9M (24.7M) 12.8M (12.8M) 6.11M (6.19M) (a) Mesh-like Graphs Graph Sequential Work 2 Thread Parallel Steps 8 Thread Parallel Steps and Metric (with Local Greedy) (with Local Greedy) (with Local Greedy) EVA (Metric 1) 41.4K (25.1K) 22.5K (23.8K) 18.5K (15.1K) EVA (Metric 4) 6.28K (4.08K) 2.83K (3.14K) 1.88K (1.88K) bcspwr09 (Metric 1) 308K (242K) 199K (165K) 130K (115K) bcspwr09 (Metric 4) 26.2K (24.9K) 12.5K (12.5K) 4.99K (4.99K) BTER1 (Metric 1) 2.26M (2.25M) 1.78M (1.76M) 1.59M (1.64M) BTER1 (Metric 4) 246K (253K) 243K (253K) 333K (397K) USpowerGrid (Metric 1) 1.06M (887K) 735K (557K) 297K (279K) USpowerGrid (Metric 4) 73.8K (77.1K) 36.9K (33.5K) 10.1K (10.1K) email (Metric 1) 1.22M (1.23M) 871K (862K) 639K (638K) email (Metric 4) 199K (204K) 127K (127K) 87.0K (87.0K) uk (Metric 1) 7.30M (3.88M) 5.51M (3.29M) 3.04M (1.62M) uk (Metric 4) 160K (108K) 84.8K (61.2K) 33.0K (18.8K) as-735 (Metric 1) 911K (887K) 498K (550K) 293K (267K) as-735 (Metric 4) 201K (201K) 96.6K (109K) 48.3K (44.2K) ca-GrQc (Metric 1) 2.98M (3.39M) 1.92M (2.16M) 923K (950K) ca-GrQc (Metric 4) 413K (509K) 201K (242K) 71.7K (75.1K) BTER2 (Metric 1) 11.3M (11.7M) 8.21M (8.31M) 6.52M (6.60M) BTER2 (Metric 4) 1.30M (1.39M) 876K (894K) 658K (667K) gemat11 (Metric 1) 63.7M (62.2M) 50.4M 
(49.4M) 45.7M (44.9M) gemat11 (Metric 4) 3.30M (3.22M) 2.38M (2.33M) 2.07M (2.04M) BTER3 (Metric 1) 18.5M (19.0M) 14.6M (14.1M) 13.4M (12.4M) BTER3 (Metric 4) 2.13M (2.27M) 1.59M (1.54M) 1.39M (1.29M) dictionary28 (Metric 1) 38.0M (33.8M) 24.4M (24.6M) 16.4M (15.2M) dictionary28 (Metric 4) 3.20M (3.18M) 1.65M (1.78M) 941K (878K) astro-ph (Metric 1) 25.3M (25.9M) 15.7M (16.3M) 8.63M (8.57M) astro-ph (Metric 4) 4.45M (4.74M) 2.27M (2.43M) 1.01M (1.01M) cond-mat-2003 (Metric 1) 38.8M (40.9M) 27.4M (30.3M) 21.0M (20.5M) cond-mat-2003 (Metric 4) 4.57M (5.35M) 2.62M (3.10M) 1.74M (1.72M) BTER4 (Metric 1) 25.4M (26.4M) 15.1M (15.1M) 8.53M (8.01M) BTER4 (Metric 4) 6.78M (7.06M) 3.67M (3.67M) 1.84M (1.73M) HTC_336_4438 (Metric 1) 14.2B (13.7B) 8.50B (8.89B) 4.10B (3.67B) HTC_336_4438 (Metric 4) 32.2M (32.0M) 15.9M (16.9M) 6.80M (6.09M) OPF_10000 (Metric 1) 47.3M (49.1M) 31.8M (33.2M) 16.0M (16.5M) OPF_10000 (Metric 4) 6.86M (8.66M) 3.34M (4.29M) 857K (1.07M) ga2010 (Metric 1) 8.77B (6.16B) 6.88B (4.97B) 3.21B (2.59B) ga2010 (Metric 4) 40.8M (33.5M) 20.8M (16.9M) 6.48M (5.35B) coAuthorsDBLP (Metric 1) 539M (552M) 356M (371M) 264M (237M) coAuthorsDBLP (Metric 4) 44.4M (51.3M) 23.5M (26.3M) 15.3M (13.8M) citationCiteseer (Metric 1) 1.08B (1.07B) 716M (723M) 506M (482M) citationCiteseer (Metric 4) 74.4M (76.2M) 40.7M (41.8M) 25.3M (24.2M) (b) Irregular Graphs Appendix Table 2: Condensed Results of Scaling Experiments.