Implementing Breadth-first Search Over gRPC
deps.cloud is an open source project that I started.
It’s a tool that helps companies understand how their projects relate to one another.
It does this by parsing files like package.json and storing the contents in a graph.
While graph databases do exist, finding administrative and engineering support for them is often hard. Adding to the complexity, graph databases come in a variety of flavors. Since I wanted the workload to be portable, adopting a graph database was a non-starter.
On the other hand, finding support for relational databases is easy. The problem is that implementing graphs on top of relational databases tends to be slow. While there have been previous efforts, I felt gRPC was able to alleviate many of the problems they faced. In this post, I share lessons I learned while implementing such a graph database.
The Database Schema
Before talking about the service layer, we first need to cover how we store data.
A common approach is to separate the data into two tables. One table stores the nodes, or entities, of the graph. The other stores the edges, or associations.
The problem with this approach is that it often complicates the query logic.
In order to get a full picture, you need to join across 3 tables (nodes to edges and back to nodes). Instead, everything lives in a single table keyed by three columns: k1, k2, and k3.
- When k1 == k2, the row is a node.
- When k1 != k2, the row is an edge.
- The k3 field allows nodes to have multiple edges between them.
- Keys have a maximum length to keep the index small.
- A secondary key improves the performance of queries in the inverse direction.
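As a rough sketch, the single-table layout might map to a struct like the following. The field and method names here are my own for illustration; the real deps.cloud schema may differ.

```go
package main

import "fmt"

// GraphItem is a hypothetical row in the single graph table.
type GraphItem struct {
	K1   string // primary index leads with (k1, k2) for egress scans
	K2   string // secondary index leads with (k2, k1) for ingress scans
	K3   string // distinguishes multiple edges between the same pair
	Data []byte // encoded node or edge payload
}

// IsNode reports whether the row represents a node rather than an edge.
func (g GraphItem) IsNode() bool { return g.K1 == g.K2 }

func main() {
	node := GraphItem{K1: "a", K2: "a"}
	edge := GraphItem{K1: "a", K2: "b", K3: "depends-on"}
	fmt.Println(node.IsNode(), edge.IsNode()) // true false
}
```

Storing a node as an edge from a key to itself is what lets one table, and one pair of indexes, serve both roles.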
Early on, I wanted a very straightforward interface. In some of my previous work, the data felt more like key/value data. I often used bulk updates or bulk deletes to manage my information. Scanning the data was a semi-common task. Finally, I needed to do prefix scans on both indexes. That is:
- (k1, k2) when querying for the egress (upstream) edges.
- (k2, k1) when querying for the ingress (downstream) edges.
As a result, I wound up with a pretty minimal key/value interface. While small, it provides the necessary building blocks for other calls.
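I can't reproduce the project's exact interface here, but a minimal sketch in Go might look like this. The names (Store, Pair, memStore) are mine, and the in-memory implementation simply stands in for a SQL-backed one.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Pair is a single (k1, k2, k3) entry in the store.
type Pair struct{ K1, K2, K3 string }

// Store is a minimal key/value interface supporting bulk writes,
// bulk deletes, and prefix scans over the primary index.
type Store interface {
	Put(pairs ...Pair) error
	Delete(pairs ...Pair) error
	// Scan returns all pairs whose k1 matches the prefix, covering the
	// egress direction. An inverse index keyed by (k2, k1) would
	// support the same scan in the ingress direction.
	Scan(k1Prefix string) ([]Pair, error)
}

// memStore is an in-memory implementation for illustration.
type memStore struct{ pairs map[Pair]struct{} }

func newMemStore() *memStore { return &memStore{pairs: map[Pair]struct{}{}} }

func (m *memStore) Put(pairs ...Pair) error {
	for _, p := range pairs {
		m.pairs[p] = struct{}{}
	}
	return nil
}

func (m *memStore) Delete(pairs ...Pair) error {
	for _, p := range pairs {
		delete(m.pairs, p)
	}
	return nil
}

func (m *memStore) Scan(k1Prefix string) ([]Pair, error) {
	var out []Pair
	for p := range m.pairs {
		if strings.HasPrefix(p.K1, k1Prefix) {
			out = append(out, p)
		}
	}
	// a real index would return rows in key order already
	sort.Slice(out, func(i, j int) bool { return out[i].K2 < out[j].K2 })
	return out, nil
}

func main() {
	s := newMemStore()
	s.Put(Pair{"a", "b", "dep"}, Pair{"a", "c", "dep"}, Pair{"x", "y", "dep"})
	got, _ := s.Scan("a")
	fmt.Println(len(got)) // 2
}
```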
In addition to the key/value interface, I needed a way to query the graph efficiently. I started by implementing topological sorting as a client-only feature. The problem was this required constant communication between the client and server. This pushed me more towards a stream-based API. gRPC provides three different types of streaming calls:
- Client (e.g. file upload)
- Server (e.g. file download)
- Bidirectional (e.g. chat app)
For this use case, either a server or bidirectional streaming API would work. The main goal is to reduce the amount of back and forth between the client and server. Many traversal algorithms start with a set of nodes and a strategy to apply.
For example, consider breadth-first search. Engineers often use it to navigate tree- and graph-based data structures. It works by taking a starting node, traversing its children, and repeating the process for each child. What we see here is that we start with a single request that quickly turns into a lot of data. Hence, the server streaming capability.
In my implementation, I chose a bidirectional stream. The reason is that these search algorithms often terminate early, once they find the target node. I wanted clients to be able to send along a stop request to prevent any further queries.
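To make the cancellation idea concrete, here is a sketch of how a server might watch the client side of a bidirectional stream for a stop request. The Request and SearchStream types are stand-ins for what gRPC would generate from a protobuf service definition, not the project's actual API.

```go
package main

import (
	"fmt"
	"io"
)

// Request is a stand-in for a generated protobuf message.
type Request struct {
	Start  []string // nodes to begin the traversal from
	Cancel bool     // client asks the server to stop traversing
}

// SearchStream mimics the receive side of the bidirectional stream
// gRPC would generate for a call like:
//
//	rpc Search(stream Request) returns (stream Response);
type SearchStream interface {
	Recv() (*Request, error)
}

// watchCancel consumes client messages until it sees a cancel request
// or the stream ends, then closes the returned channel so the
// traversal loop can stop early.
func watchCancel(stream SearchStream) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		for {
			req, err := stream.Recv()
			if err != nil || req.Cancel {
				return
			}
		}
	}()
	return done
}

// mockStream feeds canned requests, standing in for a live client.
type mockStream struct{ in chan *Request }

func (m *mockStream) Recv() (*Request, error) {
	req, ok := <-m.in
	if !ok {
		return nil, io.EOF
	}
	return req, nil
}

func main() {
	m := &mockStream{in: make(chan *Request, 1)}
	m.in <- &Request{Cancel: true}
	<-watchCancel(m)
	fmt.Println("traversal cancelled")
}
```

The traversal loop can then select on this channel between tiers and bail out as soon as the client has what it needs.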
Implementing Breadth-first Search
In my opinion, breadth-first search is the easier of the two functions. Depth-first search requires further conversation around ordering and how that impacts the results.
In this section, I’m going to use an in-memory map to represent the graph.
You can imagine this being swapped out with the key/value store described earlier.
Before we introduce streaming concepts to the code, let's start with an iterative BFS.
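Here's a sketch of that iterative version in Go, walking an adjacency map tier by tier. The graph contents and node names are mine, chosen for illustration.

```go
package main

import "fmt"

// bfs performs an iterative breadth-first search over an adjacency
// map, returning nodes in the order they were first discovered.
func bfs(graph map[string][]string, start string) []string {
	seen := map[string]bool{start: true}
	tier := []string{start}
	var result []string

	for len(tier) > 0 {
		var next []string
		for _, node := range tier {
			result = append(result, node)
			for _, child := range graph[node] {
				if !seen[child] {
					seen[child] = true
					next = append(next, child)
				}
			}
		}
		// the previous tier can be discarded here, keeping memory low
		tier = next
	}
	return result
}

func main() {
	graph := map[string][]string{
		"a": {"b", "c"},
		"b": {"d"},
		"c": {"d"},
	}
	fmt.Println(bfs(graph, "a")) // [a b c d]
}
```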
An iterative solution is key here.
It allows you to buffer and discard layers, thus keeping memory requirements low.
While there are a few inefficiencies in the code, it illustrates the overall structure.
Next, we can introduce the streaming concepts. With these in place, we no longer need to track the results of the function. Instead, we can write the current tier out to the client as we go.
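A sketch of that change might look like the following. The Tier and Sender types mimic a gRPC server stream's Send side; in a real service, stream.Send would serialize each tier back to the client.

```go
package main

import "fmt"

// Tier is a stand-in for a streamed response message carrying one
// layer of the traversal.
type Tier struct{ Nodes []string }

// Sender mimics the Send side of a gRPC server stream.
type Sender interface {
	Send(*Tier) error
}

// streamBFS walks the graph tier by tier, writing each tier to the
// stream as it goes instead of accumulating results in memory.
func streamBFS(graph map[string][]string, start string, stream Sender) error {
	seen := map[string]bool{start: true}
	tier := []string{start}

	for len(tier) > 0 {
		if err := stream.Send(&Tier{Nodes: tier}); err != nil {
			return err // e.g. the client disconnected
		}
		var next []string
		for _, node := range tier {
			for _, child := range graph[node] {
				if !seen[child] {
					seen[child] = true
					next = append(next, child)
				}
			}
		}
		tier = next // the previous tier is discarded here
	}
	return nil
}

// collectSender records tiers in memory for demonstration.
type collectSender struct{ tiers [][]string }

func (c *collectSender) Send(t *Tier) error {
	c.tiers = append(c.tiers, t.Nodes)
	return nil
}

func main() {
	graph := map[string][]string{"a": {"b", "c"}, "b": {"d"}}
	c := &collectSender{}
	if err := streamBFS(graph, "a", c); err != nil {
		panic(err)
	}
	fmt.Println(c.tiers) // [[a] [b c] [d]]
}
```

Because Send returns an error when the stream breaks, the server also stops traversing as soon as the client goes away.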
Unlike many previous implementations, this allows the server to progressively navigate the graph. It keeps the server's footprint low, and requires clients to size accordingly for their query. It improves the latency of calls by reducing round trips between clients and servers. This was just a first pass at implementing such a service. Throughout the process, I've come up with other ways to continue improving the performance. All in all, I'm excited to see how these types of operations scale as the amount of data stored grows.