Fast Execution of Simultaneous Breadth-First Searches on Sparse Graphs

Adam McLaughlin
School of Electrical and Computer Engineering
Georgia Institute of Technology
Atlanta, Georgia 30332–0250
Adam27X@gatech.edu

David A. Bader
School of Computational Science and Engineering
Georgia Institute of Technology
Atlanta, Georgia 30332–0250
bader@cc.gatech.edu

Abstract—The construction of efficient parallel graph algorithms is important for quickly solving problems in areas such as urban planning, social network analysis, and hardware verification. Existing GPU implementations of graph algorithms tend to be monolithic and thus contributions from the literature are typically rebuilt rather than reused. Recent work has focused on traversal-based abstractions that efficiently execute a single breadth-first search or enact algorithms in the “think like a vertex” paradigm. However, graph analytics such as the all-pairs shortest paths problem, diameter computations, betweenness centrality, and reachability querying require the execution of many such graph traversals. Typically, these traversals are independent of one another and can thus be executed in parallel.

This paper presents multi-search, a simple abstraction that is designed for graph algorithms requiring many breadth-first searches that can be executed simultaneously. Although algorithms have implicitly leveraged this abstraction in the past, we provide an explicit, reusable implementation that efficiently maps this abstraction to the GPU, performing more than twice as fast as previous approaches on large graphs of varying diameter. This approach allows us to scale our APSP implementation to graphs with millions of vertices using a single GPU whereas prior approaches were either constrained to much smaller graph instances or required large supercomputers to process graphs of similar size. To show the flexibility of our abstraction, we use it to express betweenness centrality and achieve more than a 5.82x average speedup over parallel CPU implementations from existing frameworks and a 2.24x average speedup over a manual, highly optimized GPU implementation of the algorithm.

I. INTRODUCTION

Graph analysis is a useful abstraction that can be applied to a variety of problems including epidemiology [28], hardware verification [19], and the modeling of physical phenomena [6]. Real world graph analytics require both scalability to large graph instances and high-performance implementations. Plenty of individual graph algorithms have been successfully accelerated using GPUs [9], [32], [33]; however, these manual implementations tend to make code reuse difficult compared to their CPU counterparts. Code reuse is tremendously important, because its absence results in tremendous effort spent duplicating and extracting work that has already been completed. Attempts to solve this issue have presented a number of abstractions that fit certain classes of graph algorithms. The Gather-Apply-Scatter (GAS) abstraction, for example, uses a generalized addition operator to pull in and accumulate information from neighboring vertices (gather), update active vertices (apply), and propagate updates to neighboring vertices (scatter) [17]. Traversal-based abstractions tend to hide the performance-sensitive details of graph traversal while requiring the user to provide application-specific functions for visiting vertices, edges, or sets of vertices or edges [34], [39], [42]. There has also been a significant effort to standardize classical graph algorithms in the context of linear algebra [24]. These abstractions tend to focus on the efficient execution of a single breadth-first search, so for problems requiring many such searches, these approaches miss out on opportunities for coarse-grained parallelism and thus, additional performance gains.

In this study we focus on graph algorithms requiring many breadth-first searches that can be executed independently in parallel. This focus isn’t contrived; many analytics require searching from several (or even all) vertices in a graph for the purpose of path-counting or analyzing a graph from multiple perspectives. For example, querying which sets of vertices can reach one another can be implemented through a series of breadth-first searches, one from each vertex in the set. The All-Pairs Shortest Path (APSP) problem finds the length of the shortest path between every pair of vertices, which has been used to trace routes in transportation networks [36]. Betweenness centrality, a popular analytic for determining the most influential vertices in a graph, builds upon the APSP problem and thus also fits into this paradigm. The diameter of a graph as well as other useful graph metadata can also be determined from the solution of these problems.

We present the multi-search abstraction, which is a simple methodology for formulating algorithms that execute many simultaneous breadth-first searches. By providing a small number of typically short functions, the user can define his or her own algorithms that leverage this abstraction and our efficient implementation. We consider the multi-search abstraction to be a complement rather than a replacement for GAS and traversal-based paradigms. Although existing paradigms can be used to implement algorithms fitting the multi-search abstraction, we show that taking advantage of coarse-grained parallelism offered by performing many BFSs at once leads to better performance.

Based on the above, this paper presents the following contributions:

- We present the multi-search abstraction, a simple, yet efficient methodology for expressing algorithms that execute many graph traversals.
• We provide an efficient, cooperative implementation of this abstraction, and show that it outperforms existing implicit GPU methods by greater than a factor of 2 for large graphs of varying diameter.

• Using our abstraction, we show that a single GPU can be used to solve the APSP problem on sparse graphs with millions of vertices whereas prior art required large distributed systems to scale to graphs of similar size.

• We implement betweenness centrality using our abstraction and show more than a 5.82x average speedup over existing parallel CPU frameworks, a 3.07x average speedup over an existing GPU framework, and a 2.24x average speedup over a manual, heavily optimized GPU implementation for a diverse set of graphs.

II. BACKGROUND

A. Terminology

A graph $G = (V, E)$ is an abstract representation of data consisting of a set of vertices $V$ and a set of edges $E$ connecting pairs of vertices. Let $n = |V|$ be the number of vertices and $m = |E|$ be the number of edges in the graph. The work in this paper focuses on graphs with uniform edge weight, but can be extended to handle graphs with arbitrary edge weights. For simplicity, we consider graphs with undirected edges here, noting that our implementation can handle graphs with either directed or undirected edges. An undirected edge between vertices $u$ and $v$ can be represented as two directed edges, one from $u$ to $v$ and the other from $v$ to $u$.

A path $p$ from a vertex $u$ to a vertex $v$ is a set of edges starting at $u$ and ending at $v$. The degree of a vertex $u$ is the number of edges incident to $u$ or the number of vertices neighboring $u$. The diameter of a graph is the length of the longest shortest path between any two vertices. The vertices of a scale-free graph exhibit a power-law degree distribution such that a small number of vertices have a large number of neighbors and a large number of vertices have a small number of neighbors [2]. Graphs exhibiting the small world phenomenon (also known as six degrees of separation) have a diameter that is logarithmic in the number of vertices [43]. Such graphs are often, but not necessarily, scale-free. Finally, a vertex frontier is a subset of vertices that are currently active during an iteration of a graph traversal and an edge frontier is the set of outgoing edges from the current vertex frontier.

B. The Multi-Search Abstraction

At a high-level, the multi-search abstraction fits any problem that requires multiple, independent breadth-first searches. We consider it a generalization of traversal-based approaches, which abstract the details of graph traversals from the operations that need to be performed on vertex frontiers. This abstraction allows domain experts in parallel algorithm design to construct the performance sensitive code that handles graph traversals and allows end users to write a small number of functions that are applied to the active vertices at each level. The end user may still need to be aware of some details of parallel programming, such as when atomic operations are necessary to avoid race conditions, but their code is typically concise and not substantially performance sensitive.

The users of our abstraction declare the set of vertices in the graph that traversals are to be executed from in addition to defining the functions that would be used in a standard traversal-based abstraction. Hence, the multi-search abstraction leverages coarse-grained parallelism because each search can be executed independently as well as fine-grained parallelism because the searches themselves can also be parallelized. In the context of GPU computing, this method of abstraction is especially useful for graphs that are too sparse to fully occupy the GPU with a single breadth-first search.

C. Multi-Search Algorithms

This subsection describes several fundamental graph algorithms that can be built on top of the multi-search abstraction. Many of these algorithms are used as subroutines themselves, showing the variety of use cases for the abstraction.

1) All-Pairs Shortest Paths: The All-Pairs Shortest Paths (APSP) problem finds the shortest paths between all pairs of vertices in a graph. The results can be represented as either the specific distances between each pair of vertices or as the paths themselves. For the latter representation, each vertex can store its parent in the breadth-first search tree that starts from the source. Note however, that the latter representation is nondeterministic as multiple valid parents may exist for a given vertex. For our experiments the choice of representation has a negligible impact on performance, so we choose to compute distances as has been done in other work in this area [7], [11], [29].

\begin{algorithm}[H]
\caption{Algorithm 1: The Floyd-Warshall Algorithm}
\begin{algorithmic}[1]
  \For{$k \leftarrow 0 \ldots n - 1$}
  \For{$i \leftarrow 0 \ldots n - 1$}
  \For{$j \leftarrow 0 \ldots n - 1$}
  \State $d[i][j] \leftarrow \min(d[i][j], d[i][k] + d[k][j])$
  \EndFor
  \EndFor
\EndFor
\end{algorithmic}
\end{algorithm}

A canonical approach to solving the APSP problem is to use the Floyd-Warshall (FW) algorithm, shown in Algorithm 1. Algorithm 1 assumes that $d[u][v]$ is initialized to 0 when $u = v$ and the weight of the edge $u \rightarrow v$ otherwise (or $\infty$ if no such edge exists). Having an $O(n^3)$ complexity that is independent of the number of edges in the graph makes this approach well-suited to dense graphs. In this paper we instead focus our attention on sparse graphs, as graphs found in real world applications tend to be sparse [8], [10].

\begin{algorithm}[H]
\caption{Algorithm 2: Simplified Version of Johnson’s Algorithm for Sparse Graphs}
\begin{algorithmic}[1]
  \If{$s \in V$}
  \State $Q.enqueue(s)$
  \While{$\neg Q.empty()$}
  \State $v \leftarrow Q.dequeue()$
  \For{$w \in succ(v)$}
  \If{$d[s][u] = \infty$}
  \State $Q.enqueue(w)$
  \State $d[s][w] \leftarrow d[s][v] + 1$
  \EndIf
  \EndFor
  \EndWhile
\EndIf
\end{algorithmic}
\end{algorithm}
Algorithm 2 shows a simplified version of Johnson’s algorithm that two extremes [41]. Determining whether vertices can reach one has even proposed alternative methods that fall in between these \(O\) result but takes \(O\) from the source of the query in an attempt to find the destination \(O\) the entire transitive closure as a matrix, using it is not. A number of trade-offs exist for computing the simple: if \(d\) \(d\) for certain classes of parallel algorithms [30].

The computer networks in order to minimize latency, cost, and a graph’s diameter is useful for designing the topology of \(d\) vertex is reachable from \(\delta\) \(\delta\) \(\delta\) \(\delta\) \(\delta\) \(\delta\) \(\delta\) \(\delta\) \(\delta\) \(\delta\) \(\delta\) \(\delta\)

### 4) Betweenness Centrality

Betweenness Centrality is a metric that attempts to determine the most influential or important vertices (or edges) in a network. The metric quantitatively measures importance by comparing the number of shortest paths passing through a particular vertex to the total number of shortest paths found in the graph. If we let \(\sigma[s][t]\) be the number of shortest paths from a vertex \(s\) to another vertex \(t\) and \(\sigma[s][t](v)\) be the number of those paths that pass through a third vertex \(v\), we can define the BC score of \(v\) as follows:

\[
BC[v] = \sum_{s \neq t \neq v} \frac{\sigma[s][t](v)}{\sigma[s][t]}
\]

Brandes defined the dependency, which relates the BC scores of a vertex to its successors (from the perspective of a given source vertex \(s\)) [5]:

\[
\delta[s][v] = \sum_{w \in succ(v)} \frac{\sigma[s][w]}{\sigma[s][v]} (1 + \delta[s][w])
\]

This recursive relationship allows for the computation of betweenness centrality in two steps of the multi-search abstraction. The first is a downward traversal from \(s\) that solves the APSP problem and counts the number of shortest paths between all pairs of vertices. The second uses this information to sum dependencies between each pair of vertices back up the BFS tree until \(s\) is reached. The BC scores can be redefined in terms of the dependencies as follows:

\[
BC[v] = \sum_{s \neq v} \delta[s][v]
\]

With applications in electronic design automation, urban planning, and the analysis of the human brain, betweenness centrality has received much attention in recent literature [6], [21], [37]. BC has also been used a building block for more complicated algorithms, such as community detection [16].

### III. RELATED WORK

#### A. All-Pairs Shortest Paths

Noting that the APSP problem is a precursor to many other graph algorithms, it is not surprising that it has received significant attention in the literature. Prior implementations of the APSP problem tend to focus on dense graphs, distributed memory systems, and graphs containing fewer than 100,000 vertices. The APSP problem has been implemented on a rather diverse set of architectures, including FPGAs [4], GPUs [23], heterogeneous systems [29], and supercomputers [40]. We focus our work on GPU implementations of the APSP problem for large, sparse graphs representative of unstructured data found in real world applications [8].

Bondhugula et al. present an FPGA-based APSP implementation based on motivation from an application in bioinformatics...
They developed a tiled approach to solving the Floyd-Warshall algorithm, so their methods are most effective when applied to dense graphs. Katz and Kider implement the APSP algorithm on the NVIDIA G80 architecture, also using a tiled version of FW [23]. They present results for graphs with up to \( n = 11,264 \) vertices and show a method for handling graphs that are larger than the amount of memory provided by a single GPU. Buluç et al. use a blocked recursive elimination strategy to solve the APSP problem on the GPU by noting that APSP corresponds to finding the matrix closure of the graph’s adjacency matrix on the tropical semiring [7]. Using an NVIDIA GeForce 880 Ultra GPU, the authors provide an implementation that runs more than two orders of magnitude faster than an Opteron CPU. Matsumoto et al. provide yet another approach to computing the FW algorithm in a blocked fashion, this time on a heterogeneous CPU-GPU system [29]. Their results scale to graphs as large as \( n = 43,776 \) vertices, exceeding 1 TFLOPS in single-precision performance. Solomonik et al. implement a communication-avoiding block-cyclic APSP algorithm on a Cray XE6 supercomputer [40]. Djidjev et al. partition the input graph, solving the APSP problem independently on each component and unifying the results in a post-processing stage [11]. Although they focus on planar graphs, the authors present results on graphs with millions of vertices on a cluster with hundreds of GPUs.

The collection of work above focuses on algorithms that require \( O(n^2) \) intermediate storage space (i.e., they use a chunk of size \( n \)). This choice of chunk size causes scalability issues on shared memory systems, requiring the deployment of these algorithms on large distributed systems to analyze graphs similar in size to the ones we study in this paper. Okuyama et al. instead use an approach similar to that of Johnson’s algorithm and found that the cost of such an approach was that the edge distribution plays a fundamental role in the performance of the algorithm [35]. However, the presented results are shown for graphs with up to only \( n = 32,768 \) vertices. Our approach contributes to this area by scaling to much larger graphs with only a single GPU through the use of a smaller chunk size, achieving performance comparable to that of previous work on large distributed systems. For instance, Solomonik et al. solve the APSP problem for a graph with 65,536 vertices in roughly two minutes on a system with 1,024 nodes (24,576 cores) [40]. They neglect to mention the density or structure of this graph, but since their implementation is matrix-based, their performance should be roughly equivalent regardless of the graph’s sparsity. For a randomly generated Delaunay mesh and Kronecker graph of the same size, our implementation requires just 41 and 99 seconds, respectively, on a single GPU. Hence, we provide a cost-effective, scalable, and fast solution to the APSP problem for sparse graphs. Furthermore, since we design our implementation as a general abstraction, other problems can leverage its results.

### B. Parallel Abstractions for Graph Analysis

Recently, a number of shared memory and distributed graph programming frameworks have been developed in order to abstract the complicated details of conducting high-performance graph algorithms on parallel architectures [14], [24], [34], [39], [42]. Many of these frameworks were inspired in part by the Parallel Boost Graph Library [18]. These frameworks tend to employ abstractions in the context of linear algebra, graph traversal, or the Gather-Apply-Scatter (GAS) paradigm. Popularized by the GraphLab framework [17], the Gather-Apply-Scatter (GAS) abstraction executes algorithms by repeatedly applying three steps:

1. **Gather**: Collect information about vertices and edges that are adjacent to the active frontier.
2. **Apply**: Update the vertices in the active frontier based on the gathered information.
3. **Scatter**: Use these updates to determine the vertices and edges that belong to the next frontier.

Alternatively, traversal-based abstractions have the algorithm developer provide code that is applied to vertices or edges in the current frontier and sets up the frontier for the following search iteration. Finally, linear algebraic abstractions formulate graph algorithms as operations on vectors and matrices. For instance, a breadth-first search can be represented as a SpMV between the adjacency matrix of the graph and a vector representing the active vertex frontier on the \((\min, +)\), or tropical, semiring. The GraphBLAS standardizes a common set of building blocks for graphs through the language of linear algebra [24]. Our abstraction differs from a sparse matrix product in that the user writes a function to visit vertices instead of defining a semiring, and is perhaps more general because of this difference. We also consider it more intuitive for the user to write a function in terms of vertex frontiers than to define a semiring.

### IV. Multi-Search Implementation

Our work complements existing paradigms as it provides a related abstraction that focuses on problems that require many graph traversals that can be executed independently. We view the multi-search abstraction as a generalization of the traversal-based abstraction: the more coarse-grained parallelism that is available, the more likely one will benefit through the use of the multi-search paradigm rather than the traversal-based paradigm. In the event that only one search is desired, the multi-search abstraction simply reduces to the traversal-based abstraction. This section presents an efficient, cooperative implementation of the multi-search abstraction that can be used to develop a number of useful graph algorithms (such as the ones described in Section II-C). Similar to GAS and traversal-based methods, the multi-search abstraction derives its utility from decoupling the complicated details of the underlying graph traversals from the specific updates that need to occur for the higher-level algorithm at hand. Hence, users are only required to implement a few small functions that can all utilize the same device kernel for graph traversals, encouraging code reuse and alleviating programmers from having to implement their own error-prone sets of parallel graph traversals.

Algorithm 3 shows our cooperative implementation of the multi-search abstraction. Careful implementation of the abstraction is rather important, since it will profoundly affect the performance of all the algorithms that leverage the abstraction. Users of the abstraction implement a small number of functions:

- \texttt{init()}: Initialize data structures at the beginning of a search from source \( i \).
- \texttt{prior()}: Handle any computation that may occur just prior to a search iteration.
Algorithm 3: Pseudocode for the Multi-Search Abstraction Kernel

// Loop across SMS
for i ∈ S do in parallel
    Qprev[0] ← i
    Qprev_len ← 1
    Qnext_len ← 0
    init(i)
    barrier()
    while Qprev_len ≠ 0 do
        prior() -
        barrier()
        for v ∈ Qprev do
            // Loop across threads
            for w ∈ neighbors(v) do in parallel
                swap(Qprev, Qnext)
                barrier()
                Qprev_len ← Qnext_len
                Qnext_len ← 0
                post()
                barrier()
            finalize(i)

- visitVertex(): When an edge (u, v) is traversed from source i, update the appropriate data structures in terms of u, v, and i.
- post(): Handle any computation that may occur at the end of a search iteration.
- finalize(): Handle any computation that may occur after the search from source i is complete.

These functions are made available for generality and convenience to the end user; it isn’t expected that the user will necessarily need all of these functions for their applications. Algorithm 3 represents just one implementation of the abstraction for a set of hardware problems. Others are possible, and users of the abstraction won’t need to change their code when implementations of the abstraction are improved.

The for loop on Line 1 is executed in parallel across the Streaming Multiprocessors (SMs) of the GPU. In practice, the user’s case may only require a subset of vertices to search, so we define the variable S to be this user-defined set. For the APSP problem, S = V. The number of vertices in the graph vastly outnumbers the number of SMs on the GPU, so each SM sequentially processes many iterations of this for loop. The while loop on Line 7 is executed by all of the threads within each SM. We organize work at the warp level, where a warp is a group of (as of this writing) 32 threads that execute in lockstep within each SM. Initially we assigned each thread to its own queue element and had warps cooperatively process the adjacency lists of these elements one at a time. Although this approach sufficiently balanced the work among threads within each warp, it left an imbalance of work between warps. For sufficiently small queues one warp could be left processing the entire frontier while the other warps idle. Improving upon this approach, we implemented a dynamic scheduling policy that has warps asynchronously dequeue vertices in the current vertex frontier. Instead of being statically assigned explicit batches of vertices within each vertex frontier to process, warps dequeue the next vertex after processing their current vertex. Hence, a warp only will idle when there is no work remaining in the current frontier.

The threads within each warp cooperatively process the edges outgoing from to the dequeued vertex collected by that warp. This cooperation leverages the __shfl() intrinsic introduced by NVIDIA’s Kepler architecture. The shuffle intrinsic allows for fast communication within a warp without requiring the use of shared memory. For instance, __shfl(x, y) returns the value of x held by thread y to all of the other threads in the warp. Each thread in the warp traverses consecutive outgoing edges from that queue element.

Algorithm 4: Implementation of init() for the All-Pairs Shortest Paths Problem

for k ∈ [V] do in parallel
    if k = i then
        d[i][k] ← 0
    else
        d[i][k] ← ∞

Implementing the user-defined functions to implement the APSP algorithm on top of the multi-search abstraction is fairly straightforward. Algorithms 4 and 5 show APSP-specific implementations of init() and visitVertex(), respectively. The implementations for prior(), post(), and finalize() can be left empty for this algorithm but may be necessary for more complicated algorithms. Algorithm 4 simply initializes d[i][v] ∀ u, v ∈ V × V. Algorithm 5 simply checks if w has been visited. If not, it is atomically added to Qnext to avoid race conditions. Note that duplicate entries in the queue are possible, since multiple threads may see w as unvisited before the first of these threads writes to d[i][w]. In practice, these duplicates are rare because they require either duplicate edges or multiple warps to simultaneously execute the same instruction. In our tests we found that the atomicAdd() on Line 3 was faster than having each warp prefix scan whether or not it found an unvisited vertex. This result makes sense because the atomic operation is with respect to a location in shared memory and because warps would have to perform their own scans since each warp expands a different queue element. For algorithms that require the number of shortest paths between each pair of vertices, an atomic Compare and Swap (CAS) operation must be used to set the distances of unvisited vertices. Otherwise, certain paths could be double counted, leading to incorrect results. Although atomic operations are required here, in the future we plan to provide our own queue data structure
that provides a “safe” enqueuing function to abstract such operations away from end users.

Several other methods for solving problems fitting this paradigm on the GPU are essentially algorithms that solve the APSP problem without the explicit use of this abstraction. For instance, Jia et al. present vertex and edge-parallel methods for computing betweenness centrality, noting that the edge-parallel approach better maximizes the memory bandwidth achieved by the GPU [20]. Similarly, McLaughlin and Bader provide an approach that works particularly well for computing the betweenness centrality of high-diameter graphs [30]. Both of these methods employ an approach to BC that reflects the multi-search abstraction in that the APSP problem is solved for each vertex before dependencies are computed. Hence, we can compare their mappings of simultaneous graph traversals to the GPU.

Figure 1 shows an example of how these methods differ for a simple graph. Consider a streaming multiprocessor that has been assigned a breadth-first search from vertex 4. In the first iteration of this search, vertex 4 will enqueue vertices 3 and 5, which become the active vertex frontier for the next iteration of the search. Figure 1 reflects this state, as vertices 3 and 5 are marked as active (shaded). Beneath the picture of the graph in Figure 1 we show how the work-efficient approach [30], the edge-parallel approach [20], and our cooperative approach from Algorithm 3 assign the threads within the SM to edge traversals. The work-efficient approach assigns warps within each SM of the GPU to vertices on the active frontier. Hence, the first thread traverses the three outgoing edges from vertex 3 and a second thread traverses the outgoing edge from vertex 5. Note that although this method only traverses edges in the active edge frontier, threads have data-dependent amounts of work to do and hence the amount of work per thread can vary tremendously using this approach, leading to potentially severe load imbalances. The edge-parallel approach simply assigns a thread to every edge in the graph, regardless of whether or not it is in the active frontier. This approach easily occupies the GPU as many threads are needed to process iterations of moderate size; however, not all of these threads are contributing to the progress of the algorithm. Finally, our cooperative approach assigns warps to the adjacency lists of each vertex in the active frontier one by one. Hence, the three outgoing edges of vertex 3 are processed by 3 threads of the one warp and the lone edge from vertex 5 is then processed by one thread from a second warp, all in parallel. When graphs are sufficiently large such that the average adjacency list of a vertex tends to be greater than the warp size of the architecture, the utilization of each warp is high and the only edges processed by threads within each warp are edges in the current edge frontier. Furthermore, since this process is cooperative, threads have a well-partitioned amount of work.

Figure 2 shows the scalability of our cooperative approach presented in Algorithm 3 when it is used to solve the All-Pairs Shortest Paths problem. We use randomly generated Erdős-Rényi graphs and vary the number of vertices and edges to see how these changes impact performance. The edge factor influences the probability $p$ of an edge selection in the $G(n, p)$ Erdős-Rényi random model [13]. Intuitively, a graph generated with twice the edge-factor will contain twice as many edges; however, the edge-factor should not be mistaken for average degree. The average degree of graphs with edge-factor 5 used to generate Figure 2 is approximately 28.

Figure 2 shows that our methodology is robust to scaling both the number of vertices and the number of edges in the graph. On average, increasing the edge factor by two results in a 1.96x increase in execution time. One reason for why this increase in execution time is slightly less than the expected theoretical increase of 2x is that additional edges can provide better warp occupancy. For instance, if the degree of a vertex modulo the warp size of the architecture is small (but nonzero), additional edges will only give unoccupied threads work until all threads are occupied. Increasing the number of vertices by a factor of two (which, for the same edge factor, also increases the number of edges by a factor of two) results in an increase in execution time of 4.64x on average. Since the computational complexity of the algorithm is $O(mn)$, we theoretically expect to see a factor of 4 increase in execution
We show results for CPU tests using 4 threads, since the use was run on NVIDIA GeForce GTX Titan and NVIDIA Tesla 7.0 toolkit, which we leverage for C++11 support in device memory bandwidth as the GeForce GTX Titan.

The additional 0.64x increase that we see in practice could result from contention in resources when accessing memory atomically as well as from potential load imbalances among SMs.

V. EXPERIMENTAL SETUP

In the next section we present performance results that show the scalability of the multi-search abstraction as well as how it performs for several classes of real-world graphs. To show the utility of the abstraction itself we implement betweenness centrality on top of it and compare the performance of our implementation to that of recent literature. CPU results were run on an Intel Core i7-2600K processor. The Core i7-2600K has a frequency of 3.4 GHz, and 8 MB last level cache, four physical processor cores and a peak memory bandwidth of 21 GB/s. We show results for CPU tests using 4 threads, since the use of hyperthreading didn’t improve performance. GPU results were run on NVIDIA GeForce GTX Titan and NVIDIA Tesla K40c GPUs. The GeForce GTX Titan is a single GPU, we can scale to significantly larger graph instances allowing for work-efficiency as well as sufficient load-balancing among concurrent threads.

CPU code was compiled using g++ version 4.8.1 and OpenMP. GPU code was compiled using nvcc and the CUDA 7.0 toolkit, which we leverage for C++11 support in device functions, allowing us to use lambda functions to implement the user-defined portions of the multi-search abstraction.

We present results based on publicly available graph data sets from the 10th DIMACS Challenge [1], the Stanford Network Analysis Platform [27], and the University of Florida Sparse Matrix Collection [10]. Table I shows more information about the set of graphs we perform tests on, including the number of vertices and number of (directed) edges for each graph, the significance of each data set, and finally the sparsity pattern of each data set. Note that we use both real-world and randomly generated graphs with highly varying connectivity from regular numerical meshes to irregular scale-free graphs.

For scaling experiments, we compare against the work-efficient [30] and edge-parallel approaches [20] described in Section IV and contrasted in Figure 1. These techniques are GPU-based and all of these experiments were run on the Tesla K40c GPU. For the experiments on the benchmarks in Table I, we compare against both CPU and GPU implementations, where all GPU experiments were run on the GeForce GTX Titan GPU.

VI. EXPERIMENTAL RESULTS

A. All-Pairs Shortest Paths

Figure 3 compares APSP execution times using the three methods of graph traversal shown in Figure 1. Figure 3a shows results for a high-diameter Delaunay mesh, which typically requires hundreds of search iterations to find all reachable vertices from a given source vertex. In contrast, Figure 3b shows results for a low-diameter, scale-free Kronecker graph, which typically requires fewer than ten search iterations to complete a Breadth-First Search. We can see that for the Delaunay mesh, the work-efficient approach from [30] is preferential to the edge-parallel approach from [20] whereas for the Kronecker graph, the opposite is true. This notion that the graph structure has significant performance implications lead to the hybrid approach presented in [30]. However, we can see that for both of these classes of graphs, our cooperative approach from Algorithm 3 is more robust to the structure of the graph, performing more than twice as fast on large scales of both high and low-diameter graphs. The work-efficient approach performs poorly on scale-free networks since the power-law distribution of vertex degree leads to severe load imbalances among threads. The edge-parallel approach, in contrast, does better on scale-free networks because a larger percentage of edges are active at once and since threads have an equivalent amount of work. However, this approach performs poorly on the smaller vertex frontiers seen in high-diameter graphs since a majority of threads will be assigned to edges that don’t actually need to be inspected. The cooperative approach alleviates these issues by having the warps within each SM asynchronously process adjacency lists, allowing for work-efficiency as well as sufficient load-balancing among concurrent threads.

Table II shows the time required to solve the APSP problem as well as the maximum degree for our set of benchmark graphs. We include the maximum degree simply to enhance the information given in Table I, where we could not fit it. Using a single GPU, we can scale to significantly larger graph instances in comparison to existing methods, since our approach uses a relatively small chunk of simultaneous source vertices. Existing implementations tend to either use large distributed systems to

---

1. Source code: https://github.com/Adam27X/graph-utils/
scale to graphs this large [11], [40] or restrict their studies to smaller instances of graphs on shared memory machines [4], [7], [23], [29].

It is intriguing to note that our performance on ldoor is better than that of kron_g500-logn19 despite the fact that ldoor is a slightly larger graph. The irregularity of Kronecker graphs makes them particularly challenging to process. Significant warp divergences may occur when one warp processes the vertex with the largest number of edges, causing other warps in the block to idle before moving on to the next vertex frontier. Splitting vertices with especially large frontiers into virtual vertices is one potential improvement that could alleviate this issue [38] that we intend to explore for future work.

B. Betweenness Centrality

Since we present this work as an abstraction that can be applied to a number of problems requiring many simultaneous breadth-first searches, it is important to show the performance of such problems under our abstraction. We compare our approach to four implementations of Betweenness Centrality from recent literature: Ligra [39], Galois [34], Gunrock [42], and a hybrid GPU implementation from [30]. Ligra is a shared-memory CPU graph framework that uses a traversal-based abstraction that allows users to write graph algorithms that map over frontiers of edges and vertices (or subsets thereof). Galois is a CPU-based system that provides the user with parallel set iterators, allowing the user to write sequential code that specifies loops that should be run in parallel. Rather than using the bulk synchronous parallel model of execution, Galois uses worklists to implement asynchronous execution. Gunrock has a similar programming interface to Ligra, but is written in CUDA for execution on GPU backends. It includes an advance stage that visits the current vertex frontier as well as a filter stage that generates the next frontier. Ligra, Galois, and Gunrock provide their own implementations of betweenness centrality, which we use for our experiments. Finally, the hybrid_BC GPU implementation of betweenness centrality uses an online approach to determine whether a graph will benefit more from either the work-efficient or edge-parallel methods that were shown in Figure 1. In terms of programmability, Galois is the most general as it can implement any worklist-based algorithm, Ligra and Gunrock are specialized to traversal-based graph algorithms, and hybrid_BC is a manual implementation that is specialized solely for betweenness centrality. Our multi-search abstraction is meant for algorithms requiring many graph traversals, but could be specialized to act similarly to Gunrock or Ligra in the event that a sufficient number of traversals aren’t available for the user’s application.

Table III shows timing results for each of these baselines as well as our own cooperative approach. Table IV summarizes the results in Table III by showing the average speedup our cooperative approach attains over the methods that we compare to from prior literature. For all tests we approximate BC scores to from prior literature. For all tests we approximate BC scores that aren’t available for the user’s application.

<table>
<thead>
<tr>
<th>Graph</th>
<th>APSP Time (s)</th>
<th>Maximum Degree</th>
</tr>
</thead>
<tbody>
<tr>
<td>333SP</td>
<td>93150</td>
<td>28</td>
</tr>
<tr>
<td>adaptive</td>
<td>342253</td>
<td>4</td>
</tr>
<tr>
<td>as-Skitter</td>
<td>28333</td>
<td>35455</td>
</tr>
<tr>
<td>auto</td>
<td>2291</td>
<td>37</td>
</tr>
<tr>
<td>delaunay_n21</td>
<td>27432</td>
<td>23</td>
</tr>
<tr>
<td>ecology1</td>
<td>9187</td>
<td>4</td>
</tr>
<tr>
<td>hollywood-2009</td>
<td>43082</td>
<td>11469</td>
</tr>
<tr>
<td>kron_g500-logn19</td>
<td>10236</td>
<td>80674</td>
</tr>
<tr>
<td>ldoor</td>
<td>7802</td>
<td>76</td>
</tr>
<tr>
<td>roadNet-CA</td>
<td>34358</td>
<td>12</td>
</tr>
<tr>
<td>rgg_n_2_21_s0</td>
<td>49991</td>
<td>37</td>
</tr>
<tr>
<td>thermal2</td>
<td>13349</td>
<td>10</td>
</tr>
</tbody>
</table>

Fig. 3. Comparison of multi-search traversal techniques for two classes of networks.

**TABLE II. BENCHMARK RESULTS FOR SOLVING THE APSP PROBLEM.**
TABLE III. BENCHMARK RESULTS FOR COMPUTING BETWEENNESS CENTRALITY. TIMES ARE IN SECONDS. THE FASTEST RESULT FOR EACH GRAPH IS PRESENTED IN BOLD.

<table>
<thead>
<tr>
<th>Framework</th>
<th>333SP</th>
<th>adaptive</th>
<th>as-Skitter</th>
<th>auto</th>
<th>delaunay_n21</th>
<th>ecology1</th>
</tr>
</thead>
<tbody>
<tr>
<td>Galois</td>
<td>4651</td>
<td>7086</td>
<td>1167</td>
<td>637</td>
<td>2004</td>
<td>906</td>
</tr>
<tr>
<td>Ligra</td>
<td>3005</td>
<td>3442</td>
<td>1241</td>
<td>665</td>
<td>992</td>
<td>635</td>
</tr>
<tr>
<td>Gunrock</td>
<td>1999</td>
<td>4851</td>
<td>N/A</td>
<td>161</td>
<td>712</td>
<td>1458</td>
</tr>
<tr>
<td>hybrid_BC</td>
<td>781</td>
<td>993</td>
<td>518</td>
<td>407</td>
<td>373</td>
<td>176</td>
</tr>
<tr>
<td>Cooperative</td>
<td>352</td>
<td>601</td>
<td>275</td>
<td>74</td>
<td>174</td>
<td>104</td>
</tr>
</tbody>
</table>

TABLE IV. AVERAGE SPEEDUP OF THE COOPERATIVE APPROACH OVER EXISTING FRAMEWORKS.

<table>
<thead>
<tr>
<th>Framework</th>
<th>hollywood-2009</th>
<th>kron_g500-logn19</th>
<th>ldoor</th>
<th>roadNet-CA</th>
<th>rgg_n_2_21_50</th>
<th>thermal2</th>
</tr>
</thead>
<tbody>
<tr>
<td>Galois</td>
<td>2058</td>
<td>1868</td>
<td>1240</td>
<td>1498</td>
<td>3518</td>
<td>1088</td>
</tr>
<tr>
<td>Ligra</td>
<td>4318</td>
<td>623</td>
<td>1751</td>
<td>700</td>
<td>2808</td>
<td>899</td>
</tr>
<tr>
<td>Gunrock</td>
<td>630</td>
<td>406</td>
<td>395</td>
<td>N/A</td>
<td>N/A</td>
<td>277</td>
</tr>
<tr>
<td>hybrid_BC</td>
<td>1591</td>
<td>522</td>
<td>621</td>
<td>403</td>
<td>1066</td>
<td>204</td>
</tr>
<tr>
<td>Cooperative</td>
<td>602</td>
<td>523</td>
<td>183</td>
<td>145</td>
<td>399</td>
<td>115</td>
</tr>
</tbody>
</table>

Compared to the parallel CPU implementations of Galois and Ligra our implementation does very well, averaging 7.66x and 5.82x speedups, respectively. The results in comparison to Gunrock are more interesting in that they vary tremendously. Since Gunrock uses a chunk size of 1 (i.e. it only leverages fine-grained parallelism for BC), it does particularly poorly on graphs with low average degree, such as ecology1 and adaptive. On the other hand, Gunrock performs well on graphs that do offer lots of fine-grained parallelism, such as hollywood-2009, where its performance is competitive with ours and kron_g500-logn19 where its performance is even better than our own. The entries denoted “N/A” in Table III for Gunrock correspond to graphs that caused a memory access violation on the GPU. For the graphs that we could compare, our implementation was 3.07x faster on average than Gunrock. Finally, our GPU abstraction is competitive with GPU code that is specialized for computing BC scores. The hybrid_BC implementation is never significantly faster than that of our own yet for ldoor our approach is 3.41x faster. Overall, our cooperative approach is 2.24x faster than hybrid_BC on average and is much more easily leveraged for the development of algorithms that can take advantage of the multi-search abstraction. Even though hybrid_BC is specialized for BC, our approach is faster because of our efficient implementation of graph traversals shown in Algorithm 3 and Figure 1.

VII. CONCLUSIONS

In this paper, we present and provide an efficient implementation of an abstraction for processing many simultaneous breadth-first searches in parallel on the GPU. We implement the abstraction by enlisting the threads within each warp to cooperatively traverse the edges in elements in the active vertex frontier. This approach is more than twice as fast as previous GPU approaches that were used to schedule threads for simultaneous graph traversals for large graphs of both low and high diameter. Furthermore, our approach scales to graphs with millions of vertices using a single GPU whereas previous approaches used large clusters to solve problems of similar size in greater amounts of time. Finally, our abstraction can efficiently implement more complicated algorithms. We show that an implementation of betweenness centrality that leverages our abstraction achieves an average speedup of 7.66x and 5.82x over the Galois and Ligra multi-core graph frameworks, a 3.07x speedup over the Gunrock GPU graph framework, and an average speedup of 2.24x over a heavily optimized on-line GPU implementation of betweenness centrality.

The literature on parallel implementations of graph algorithms is beginning to shift from manual, hand-tuned implementations of specific algorithms to libraries that provide abstractions for certain classes of parallel algorithms. The appropriate choice of an abstraction depends on the problem that needs to be solved, the way in which each abstraction is mapped to hardware, and the graph being analyzed. We consider the definition of new abstractions and the unification existing abstractions into a general parallel graph analytics framework to be an exciting area of future work.

ACKNOWLEDGMENT

The work depicted in this paper was partially sponsored by Defense Advanced Research Projects Agency (DARPA) under agreement #HR0011-13-2-0001. The content, views and conclusions presented in this document do not necessarily reflect the position or the policy of DARPA or the U.S. Government, no official endorsement should be inferred. Distribution Statement A: “Approved for public release; distribution is unlimited.” This work was also partially sponsored by NSF Grant ACI-1339745 (XScala). Finally, we would like to thank NVIDIA Corporation for their donation of GeForce GTX Titan and Telsa K40 GPUs.

REFERENCES
