Implementing k-means clustering from scratch in C++

I have a somewhat complicated history when it comes to C++. When I was 15 and teaching myself to code, I couldn’t decide between python and C++ and as a result tried to learn both at the same time. One of my first non-trivial projects was a C++ program to compute orbits – looking back on it now, I can see that what I was actually doing was a (horrifically inefficient) implementation of Euler’s method. I just couldn’t wrap my head around fixed-size arrays (not to mention pointers!). In any case, I soon realised that juggling C++ and python was untenable – not only was I new to the concepts (such as type systems and OOP), I was having to learn two sets of syntax in addition to two flavours of these concepts. I decided to commit to python and haven’t really looked back since.

Now, almost 6 years later (tempus fugit!), having completed the first-year computer science course at Cambridge, I feel like I am in a much better place to have a proper go at C++. My motivation is helped by the fact that all of the second-year computational practicals for physics are done in C++, not to mention that C++ is incredibly useful in quantitative finance (which I am deeply interested in).

To that end, I decided to jump straight in and implement a machine learning algorithm from scratch. I chose k-means because of its personal significance to me: when I was first learning about ML, k-means was one of the first algorithms that I fully grokked and I spent quite a while experimenting with different modifications and implementations in python. Also, given that the main focus of this post is to learn C++, it makes sense to use an algorithm I understand relatively well.

Please let me add the disclaimer that this is certainly not going to be an optimal solution – this post is very much a learning exercise for me and I’d be more than happy to receive constructive criticism. As always, all code for this project can be found on GitHub.

What is k-means clustering?

I have decided to give four brief explanations with increasing degrees of rigour. Nothing beyond the first explanation is really essential for the rest of this post, so feel free to stop whenever.

  1. k-means clustering allows us to find groups of similar points within a dataset.
  2. k-means clustering is the task of finding groups of points in a dataset such that the total variance within groups is minimised.
  3. k-means clustering is the task of partitioning feature space into k subsets to minimise the within-cluster sum of squares (WCSS), which is the sum of squared Euclidean distances between each datapoint and the centroid of its cluster.
  4. Formally, k-means clustering is the task of finding a partition $S = \{S_1, S_2, \ldots, S_k\}$ where $S$ satisfies:

     $$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2,$$

     where $\mu_i$ is the mean (centroid) of the points in $S_i$.

The k-means algorithm

The k-means clustering problem is actually incredibly difficult to solve. Let’s say we just have $N=120$ and $k=5$, i.e we have 120 datapoints which we want to group into 5 clusters. The number of possible partitions is more than the number of atoms in the universe ($5^{120} \approx 10^{83}$) – for each one, we then need to calculate the WCSS (read: variance) and choose the best partition.

Clearly, any kind of brute-force solution is intractable (to be specific, the problem has exponential complexity). Hence, we need to turn to approximate solutions. The most famous approximate algorithm is Lloyd’s algorithm, which is often confusingly called the “k-means algorithm”. In this post I will silence my inner pedant and interchangeably use the terms k-means algorithm and k-means clustering, but it should be remembered that they are slightly distinct. With that aside, Lloyd’s algorithm is incredibly simple:

1. Initialise the clusters

The algorithm needs to start somewhere, so we need to come up with a crude way of clustering points. To do this, we randomly select k points which become ‘markers’, then assign each datapoint to its nearest marker point. The result of this is k clusters. While this is a naive initialisation method, it does have some nice properties - more densely populated regions are more likely to contain centroids (which makes logical sense).

2. Compute the centroid of each cluster

Technically Lloyd’s algorithm computes the centroid of each cell of the partition via integration, but we use the reasonable approximation of computing the centre of mass of the points in a given cell. The rationale behind this is that the centroid of a cluster ‘characterises’ the cluster in some sense.

3. Assign each point to the nearest centroid and redefine the cluster

If a point currently in cluster 1 is actually closer to the centroid of cluster 2, surely it makes more sense for it to belong to cluster 2? This is exactly what we do, looping over all points and assigning them to clusters based on which centroid is the closest.

4. Repeat steps 2 and 3

We then repeatedly recompute centroids and reassign points to the nearest centroid. There is actually a very neat proof that this converges: essentially, there is only a finite (though massive) number of possible partitions, and each k-means update never increases the WCSS. Hence the algorithm cannot keep improving forever and must converge.

Implementation

Our goal today is to implement a C++ version of the k-means algorithm that successfully clusters a two-dimensional subset of the famous mall customers dataset (available here). It should be noted that the k-means algorithm certainly works in more than two dimensions (the Euclidean distance metric easily generalises to higher dimensional space), but for the purposes of visualisation, this post will only implement k-means to cluster 2D data. A plot of the raw data is shown below:

By eye, it seems that there are five different clusters. The question is whether our k-means algorithm can successfully figure this out. We are actually going to cheat a little bit and tell the algorithm that there will be five clusters (i.e $k=5$). There are methods to avoid this, but they essentially involve testing different values of k and finding the best fit, so they don’t add much value to this post.

C++ preambles

Firstly, we need to define our imports and namespace.

#include <ctime>     // for a random seed
#include <fstream>   // for file-reading
#include <iostream>  // for file-reading
#include <sstream>   // for file-reading
#include <vector>

using namespace std;

In general, using namespace std is not considered best practice (particularly in larger projects) because it can lead to ambiguity (for example, if I define a function or variable called vector) and unexpected behaviour. However, the alternative is to have things like std::cout or std::vector everywhere – for an educational post, the loss in clarity is worse than the potential ambiguity.
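For instance, without the using-declaration, even simple statements would need fully qualified names:

std::vector<double> values;
std::cout << values.size() << std::endl;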

Representing a datapoint

To represent a datapoint for this program, we will be using a C++ struct. Structs caused me a great deal of confusion when I was learning about C++ because I couldn’t quite figure out how they differ from classes. As it happens, they are really quite similar – possibly the only relevant difference is that members of a struct are public by default. In any case, I would think of a struct as a way of defining a more complicated data type, though it is more than just a container for primitive datatypes because you can also define some functionality.
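As a tiny illustration of that one difference:

struct S { int x; };  // x is public by default
class C { int x; };   // x is private by default

With that in mind, here is the Point struct for this program: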

struct Point {
    double x, y;     // coordinates
    int cluster;     // no default cluster
    double minDist;  // default infinite dist to nearest cluster

    Point() : 
        x(0.0), 
        y(0.0),
        cluster(-1),
        minDist(__DBL_MAX__) {}
        
    Point(double x, double y) : 
        x(x), 
        y(y),
        cluster(-1),
        minDist(__DBL_MAX__) {}

    double distance(Point p) {
        return (p.x - x) * (p.x - x) + (p.y - y) * (p.y - y);
    }
};

The first few lines are self-explanatory: we define the coordinates of a point, as well as the cluster it belongs to and the distance to that cluster. Here the default values are set via constructor initialisation lists (since C++11 you can also write default member initialisers like double x = 0 directly in the struct, but initialisation lists work in older standards too). Initially the point belongs to no cluster, so we arbitrarily set that to -1. Accordingly, we set minDist to infinity (or the next best thing, __DBL_MAX__).

We also define a distance function, which computes the (square) euclidean distance between this point and another. Our Point struct can be used as follows:

// Define new point at the origin
Point p1 = Point(0.0, 0.0);
cout << p1.x << endl;  // print the x coordinate

// Define another point and compute square distance
Point p2 = Point(3.0, 4.0);
cout << p1.distance(p2) << endl;  // prints 25 (the squared distance)

If we wanted to represent a datapoint in p-dimensions, we could replace the x and y members with a vector or array of doubles, with each entry corresponding to a coordinate in a given dimension. The distance function would similarly need to be modified to loop over the vectors/arrays and sum all of the squared differences.
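For illustration, here is a minimal sketch of what that might look like (PointND is a hypothetical name, not part of the program above; the same includes and using namespace std are assumed):

struct PointND {
    vector<double> coords;  // one coordinate per dimension
    int cluster;
    double minDist;

    PointND(vector<double> c) :
        coords(c),
        cluster(-1),
        minDist(__DBL_MAX__) {}

    // Squared Euclidean distance, summed over all dimensions
    double distance(PointND p) {
        double sum = 0.0;
        for (size_t i = 0; i < coords.size(); ++i) {
            double diff = p.coords[i] - coords[i];
            sum += diff * diff;
        }
        return sum;
    }
};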

Reading in data from a file

Having decided how we are going to store datapoints within our C++ script, we must then read in the data from a CSV file. This is rather unexciting, but actually took me a long time to figure out. Essentially, we loop over all the lines in the CSV file and break them down at the commas.

vector<Point> readcsv() {
    vector<Point> points;
    string line;
    ifstream file("mall_data.csv");

    while (getline(file, line)) {
        stringstream lineStream(line);
        string bit;
        double x, y;
        getline(lineStream, bit, ',');
        x = stod(bit);
        getline(lineStream, bit, '\n');
        y = stod(bit);

        points.push_back(Point(x, y));
    }
    return points;
}

Note that the readcsv function returns a vector of points. I decided to use a vector instead of an array because vectors handle all of the memory management for you (though are slightly less performant) and are functionally quite similar to python lists.

Pointers: an old enemy revisited

Suppose your friend wants to visit your house. You have two options (the relevance of this thought experiment will become clear shortly).

  1. Give them your postcode and let them find your house.
  2. Hire a team of builders to replicate your house brick-for-brick right outside their front door.

The readcsv function returns a vector of points. One might assume that we can then just pass this to whatever k-means function we define and be done with it.

vector<Point> points = readcsv();  // read from file
kMeansClustering(points);          // pass values to function

However, we must be aware that depending on the size of our dataset, points might take up quite a large chunk of memory, so we should handle it carefully. The problem with the above code is that we are passing the points to the function by value, i.e we are making a copy of them, which is inefficient from a memory perspective. Luckily, C++ offers a way around this: instead of giving the function the values themselves, we pass the location (read: postcode) of the points vector in memory (in this program, via a pointer).

vector<Point> points = readcsv(); 
kMeansClustering(&points); // pass address of points to function

The prototype of our kMeansClustering function is then as follows:

void kMeansClustering(vector<Point>* points, int epochs, int k);

Because we are now passing an address (and thus not technically a vector<Point>), the parameter type must include an asterisk. Read the first argument as “a pointer to a vector of Point objects”.

I have also added two other arguments:

  • epochs is the number of iterations over which we will do our main k-means loop
  • k is the number of clusters.

Initialising the clusters

We first need to assign each point to a cluster. The easiest way of doing this is to randomly pick 5 “marker” points and give them labels 1-5 (or actually 0-4 since our arrays index from 0).

The code for this is quite simple. We will use another vector of points to store the centroids (markers), where the index of the centroid is its label. We then select a random point from the points vector we made earlier (from reading in the csv) and set that as a centroid.

vector<Point> centroids;
srand(time(0));  // need to set the random seed
int n = points->size();  // number of datapoints
for (int i = 0; i < k; ++i) {
    centroids.push_back(points->at(rand() % n));
}

One brief C++ note: because points is actually a pointer rather than a vector, in order to access an item at a certain index we can’t write points[i] – we first have to ‘dereference’ the pointer with (*points)[i]. This is quite ugly, so fortunately we have the syntactic shortcut points->at(i).
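For example, a small snippet just to illustrate the equivalent forms:

vector<Point> data = readcsv();
vector<Point>* points = &data;

Point first = (*points)[0];   // dereference the pointer, then index
Point same = points->at(0);   // equivalent, with bounds checking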

Once the centroids have been initialised, we can begin the k-means algorithm iterations. We now turn to the “meat” of k-means: assigning points to a cluster and computing new centroids.

Assigning points to a cluster

The logic here is quite simple. We loop through every datapoint and assign it to its nearest centroid. Because there are k centroids, the result is a partition of the datapoints into k clusters.

In terms of the actual code, I had to spend some time thinking about the best way to represent that a point belonged to a certain cluster. In my python implementation (now many years old), I used a dictionary with cluster IDs as keys and a list of points as values. However, for this program I decided to use a quicker solution: I gave each point a cluster attribute which can hold an integer ID. We then set this ID to the index of the cluster that is closest to the point.

for (vector<Point>::iterator c = begin(centroids); 
     c != end(centroids); ++c) {
    // quick hack to get cluster index
    int clusterId = c - begin(centroids);

    for (vector<Point>::iterator it = points->begin();
         it != points->end(); ++it) {
             
        Point p = *it;
        double dist = c->distance(p);
        if (dist < p.minDist) {
            p.minDist = dist;
            p.cluster = clusterId;
        }
        *it = p;
    }
}

Computing new centroids

After our first iteration, the clusters are really quite crude – we’ve randomly selected 5 points then formed clusters based on the closest random point. There’s no reason why this should produce meaningful clusters and indeed it doesn’t. However, the heart of k-means is the update step, wherein we compute the centroids of the previous cluster and subsequently reassign points.

As previously stated, we are going to majorly simplify the problem by computing the centroid of the points within a cluster rather than the partition of space. Thus all we really have to do is compute the mean coordinates of all the points in a cluster.

To do this, I created two new vectors: one to keep track of the number of points in each cluster and the other to keep track of the sum of coordinates (then the average is just the latter divided by the former).

vector<int> nPoints;
vector<double> sumX, sumY;

// Initialise with zeroes
for (int j = 0; j < k; ++j) {
    nPoints.push_back(0);
    sumX.push_back(0.0);
    sumY.push_back(0.0);
}

We then iterate through all the points and increment the correct indices of the above vectors (based on the point’s cluster ID). Importantly, now is a convenient time to reset the minDist attribute of the point, so that the subsequent iteration works as intended.

// Iterate over points to append data to centroids
for (vector<Point>::iterator it = points->begin(); 
     it != points->end(); ++it) {
    int clusterId = it->cluster;
    nPoints[clusterId] += 1;
    sumX[clusterId] += it->x;
    sumY[clusterId] += it->y;

    it->minDist = __DBL_MAX__;  // reset distance
}

// Compute the new centroids
for (vector<Point>::iterator c = begin(centroids); 
     c != end(centroids); ++c) {
    int clusterId = c - begin(centroids);
    c->x = sumX[clusterId] / nPoints[clusterId];
    c->y = sumY[clusterId] / nPoints[clusterId];
}

Now that we have the new centroids, the k-means algorithm repeats: we recompute distances, reassign points to their nearest centroids, compute the new centroids, and so on.
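To see how everything slots together, here is a condensed sketch of the whole kMeansClustering function, assembled from the snippets above (the writing-to-file step described below is left out, and the full version on GitHub may differ in small details):

void kMeansClustering(vector<Point>* points, int epochs, int k) {
    int n = points->size();

    // Randomly initialise the centroids
    vector<Point> centroids;
    srand(time(0));
    for (int i = 0; i < k; ++i) {
        centroids.push_back(points->at(rand() % n));
    }

    for (int e = 0; e < epochs; ++e) {
        // Assign each point to its nearest centroid
        for (vector<Point>::iterator c = begin(centroids);
             c != end(centroids); ++c) {
            int clusterId = c - begin(centroids);
            for (vector<Point>::iterator it = points->begin();
                 it != points->end(); ++it) {
                Point p = *it;
                double dist = c->distance(p);
                if (dist < p.minDist) {
                    p.minDist = dist;
                    p.cluster = clusterId;
                }
                *it = p;
            }
        }

        // Recompute each centroid as the mean of its cluster
        // (assumes no cluster ends up empty)
        vector<int> nPoints(k, 0);
        vector<double> sumX(k, 0.0), sumY(k, 0.0);
        for (vector<Point>::iterator it = points->begin();
             it != points->end(); ++it) {
            nPoints[it->cluster] += 1;
            sumX[it->cluster] += it->x;
            sumY[it->cluster] += it->y;
            it->minDist = __DBL_MAX__;  // reset for the next epoch
        }
        for (vector<Point>::iterator c = begin(centroids);
             c != end(centroids); ++c) {
            int clusterId = c - begin(centroids);
            c->x = sumX[clusterId] / nPoints[clusterId];
            c->y = sumY[clusterId] / nPoints[clusterId];
        }
    }
}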

Writing to a file

One final detail: after all of our k-means iterations, we would like to be able to write the output to a file so that we can analyse the clustering. This is quite simple - we will just iterate through the points then print their coordinates and cluster IDs to a csv file.

ofstream myfile;
myfile.open("output.csv");
myfile << "x,y,c" << endl;

for (vector<Point>::iterator it = points->begin(); 
     it != points->end(); ++it) {
    myfile << it->x << "," << it->y << "," << it->cluster << endl;
}
myfile.close();

Testing

In order to test that my k-means implementation was working properly, I wrote a simple plotting script. I am somewhat embarrassed (in the context of a C++ post) to say that I wrote this in python.

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Before clustering
df = pd.read_csv("mall_data.csv", header=None)
df.columns = ["Annual income (k$)", "Spending Score (1-100)"]
sns.scatterplot(x=df["Annual income (k$)"], 
                y=df["Spending Score (1-100)"])
plt.title("Scatterplot of spending (y) vs income (x)")

# After clustering
plt.figure()
df = pd.read_csv("output.csv")
sns.scatterplot(x=df.x, y=df.y, 
                hue=df.c, 
                palette=sns.color_palette("hls", n_colors=5))
plt.xlabel("Annual income (k$)")
plt.ylabel("Spending Score (1-100)")
plt.title("Clustered: spending (y) vs income (x)")

plt.show()

The result is quite pretty and it shows that – bar a few contentious points around the centre cluster – the clustering has worked as expected.

Conclusion

In conclusion, we have successfully implemented a simple k-means algorithm in C++. Obviously there is much that could be improved about my program. Firstly, many simplifications were made: for example, we restricted the problem to two dimensions and also pre-set the number of clusters. However, there are more subtle issues that we neglected to discuss, including the random initialisation, which may result in suboptimal clusters. In fact, there are algorithms like k-means++ that offer major improvements over k-means by specifying better procedures to find the initial clusters.

It is also worth mentioning the fundamental difficulties with k-means: it acutely suffers from the ‘curse of dimensionality’, as data becomes more sparse in high dimensions, and it is relatively inefficient, since there are four nested loops (over iterations, points, clusters, and dimensions), giving a cost of roughly $O(\text{iterations} \times n \times k \times d)$. However, k-means is often a great solution for quickly clustering small data and the algorithm is just about simple enough to explain to business stakeholders.

In any case, the merits/disadvantages of k-means aside, writing this program has given me a lot more confidence in C++ and I am keen to develop a more advanced understanding. I think it’s a good complement to my current interests in scientific/financial computing and it is pleasing to see that I am making more progress than I was a few years back.


What we learnt building an enterprise-blockchain startup

It has been almost a year since the idea of HyperVault was first conceived. In that time, we built HyperVault up from a single sentence, gained and lost team members along the way, developed a functional proof-of-concept over the short winter holidays, crashed out of a few competitions (also won a couple of prizes), and finally decided to open source. This post aims to be an honest reflection on the journey – highlighting both the good bits and the times we wanted to give up.

What is HyperVault?

At a high level, the idea behind HyperVault is to use distributed ledger technology (we preferred to avoid the B-word) to provide a secure access layer to sensitive digital resources. If your company has some super-secret documents, what’s the best way of sharing them with other people? From our market research, we found that most people would just chuck it into an email as an attachment. This is terrible from a security standpoint: system administrators can see everything that goes through and you have absolutely no access record or control whatsoever once the email goes into your ‘sent’ mailbox. HyperVault provides a provably tamper-proof system with smart sharing features, such as scheduled or location-based access.

The Beginning: October-November 2018

HyperVault was born at a Hackbridge cluster – a fortnightly meet-up for Cambridge students who want to build side projects. Li Xi had been thinking about the applications of private blockchains in file sharing and met Robert and a group of other undergrads who were interested in joining the team.

By far the hardest part of our HyperVault journey was balancing our desire to build something with the Cambridge term structure. An 8-week term, packed with lectures and supervisions, doesn’t give you much time for entrepreneurial activities. However, we were excited enough about the idea to have weekly team meetings in a comfortable room at Trinity College, where we focused on defining what exactly we wanted the product to be.

At the same time, we were on a constant lookout for startup competitions that we could enter. Fortunately, Li Xi and Robert were also on the Cambridge Blockchain Society – one of whose sponsors (Stakezero Ventures) was launching the Future of Blockchain competition. This proved to be critical to HyperVault’s progress because it gave us a “fixed point” on the timeline – whatever happened, we needed to have a working product and polished pitch by March 2019, when the competition was scheduled to happen.

Winter, literally and figuratively.

December was tough. We had to transition from talking at a high level about what kind of features we wanted to offer, how cool it’d be if we could do XYZ, into actually building a working prototype. To ensure that everyone was on the same page regarding the expected time commitments, in the last couple of weeks of term, we made a detailed project plan that outlined the deliverables, due dates, and expected time commitments (we concluded that it’d be at least 12 hours a week each).

Three team members dropped out at this point. There weren’t any hard feelings – it is understandable for people to have conflicting priorities and it is better that they can say so upfront. This left Li Xi and Robert with the considerable task of building a proof of concept, while Andrew took the lead on the business plan. With the reduced manpower (and a good helping of classic underestimation), the weekly commitment was closer to 30 hours. It was a real hustle having to balance this with academics (and of course, winter festivities), but by the end of the holidays, we had our proof-of-concept. The main lesson to be drawn from this is the importance of having that “hard conversation” with your team. Give people the option to call it quits at the start, but have a “no ifs, no buts” understanding that come what may, your prototype/MVP must be done by a certain date.

The build-up to the competition (January – March)

At this stage, our proof-of-concept was something like a Google Drive or Dropbox, except with a blockchain backend. We were on track, so moved into “Phase 2” – doing real market research and nailing down a business plan. Andrew tapped into his network at the Judge Business School to get feedback from his peers, which was incredibly useful. One of them mentioned that our product offering was quite similar to something they had used in a previous job. This was very worrying, and after some further googling we were deeply troubled to find out that there was a whole class of products – virtual data rooms – that already did exactly what we were offering. Robert and Li Xi were a little discouraged by this, but Andrew reminded us of a very important fact of entrepreneurship: you never have to be first, you just have to do things better.

Additionally, one of the oft-repeated pieces of advice from successful entrepreneurs is the need to focus on a very specific niche (a ‘vertical’) rather than trying to build a one-size-fits-all product. We saw that the incumbents had dominated the market for financial companies, so we decided to devote our attention to legal companies. With this in mind, we sat down as a team to really think through our business plan – we found the concept of a “lean business canvas” enthralling: its almost scientific approach to iterating your product via hypothesis and experimentation really appealed to us (indeed, all three of us have a STEM background).

Competition time (March, April)

In the run-up to March, our academic commitments were noticeably picking up, but there was no time to waste: it was competition season. We agreed as a team that for any competition, at least two of us would need to be there (three was unrealistic).

The first competition we entered did not go so well – we failed to convince the judges, who traditionally favoured biotech ideas, that blockchain was anything more than a fad. Despite making it to the finals, it was clear that there was very little engagement with our idea. At the time, we blamed the judges for being too closed-minded, but in retrospect, we see that it’s really our responsibility in the pitch to sell the vision we have. Nevertheless, we used the feedback to refine our business plan and tidy up our proof-of-concept.

The main competition in March, however, was much more engaging and challenging (more than 100 teams participated). Due to a last-minute change of competition dates, Li Xi cut short his travel plans and flew back from Geneva for the Finals – luckily, he made it back just in time for the pitch with only a few hours to spare and the pitch went quite smoothly. The judges were intrigued by the idea and asked us some particularly challenging technical questions regarding encryption, which Li Xi dealt with deftly. Coming out of it, we won the £2000 Future of Blockchain Nucypher prize, which was a strong vote of confidence.

Q&A at the Future of Blockchain Finals

The last competition we attended was the R3 Global Pitch Competition. R3 is an enterprise blockchain software firm working with a large ecosystem of more than 300 of the world’s largest companies. Even before the submission of our deck, R3 provided excellent guidance by connecting us to their legal teams and having numerous calls with us to refine our business plan, for which we were immensely grateful. We did well in the competition, winning a place in the global finals and subsequently being offered free office space at R3’s office in London to continue building HyperVault.

The end?

With this opportunity in mind, we were talking to VCs about the idea of scaling up. In particular, we had good chemistry with the folks at Stakezero and Wilbe Ventures, and they gave us a lot of very thoughtful advice, emphasising the importance of not just identifying a vertical, but also targeting the size of the customer and the key stakeholders in a company that could make things happen. We hadn’t put too much thought into this before; I guess we were implicitly making the naïve assumption that if we built a cool product, it would sell itself. Up until that point, in our team meetings, we had just said that we’d target “SMEs” (small and medium-sized enterprises). But after a lot more research and discussion, we came to the troubling realisation that our ideal customer was, in fact, an “elephant” – an organisation that would spend more than $100,000 a year on our product (see this excellent short blog post for more) – they are the only ones who have 1) enough secure documents and 2) a need to share these documents with scalability.

Image from Christopher Janz's blog

Regardless of our successes in talking with smaller companies, big companies are an entirely different ball game. You need enterprise sales teams and a polished, audited product. In principle, there was nothing stopping us. We could have used R3’s office space to turn our proof of concept into a sleek MVP and earnestly begin the search for venture capital on the back of that. But ultimately, none of us felt particularly excited by this prospect. A startup is hard work – many hours spent fixing bugs, refining pitches, cold calling hundreds of companies, etc. The only thing keeping it together is the deep desire to build something and a fundamental belief that your product can “change the world”. Faced with the prospect of an enterprise sales cycle and a product which is really an improvement over current technology instead of something brand new (don’t get us wrong, we do still think it’s a major improvement), we realised that HyperVault as a commercial startup had run its course, at least until we had more experience.

What comes next

We truly believe that distributed ledgers applied to file-sharing could be a significant improvement over current technologies. To that end, we decided as a team that the best way forward would be to open-source our codebase, such that it becomes an example of a real-world enterprise blockchain app for others to build on.

We published most of our code on GitHub a few months ago and have already had someone reach out to us expressing interest in incorporating HyperVault into their product. We plan to continue cleaning up the codebase within the next couple of months, including comprehensive documentation and a contributors’ guide. Do leave a clap or comment below if you think this is something we should devote more time to!

In any case, rather than inserting some banal Edison quote about failure, we’d like to end this post simply by saying that we are extremely grateful for the friendships we made, the people we met, and the chance to tell ourselves that we built something from nothing.


This post has been cross-posted on medium. Check out our website at hypervault.tech to learn more!


Graph algorithms and currency arbitrage, part 2

In the previous post (which should definitely be read first!) we explored how graphs can be used to represent a currency market, and how we might use shortest-path algorithms to discover arbitrage opportunities. Today, we will apply this to real-world data. It should be noted that we are not attempting to build a functional arbitrage bot, but rather to explore how graphs could potentially be used to tackle the problem. Later on we’ll discuss why our methodology is unlikely to result in actionable arbitrage.

Rather than using fiat currencies as presented in the previous post, we will examine a market of cryptocurrencies because it is much easier to acquire crypto order book data. We’ll narrow down the problem further by making two more simplifications. Firstly, we will focus on arbitrage within a single exchange. That is, we’ll look to see if there are pathways between different coins on an exchange which leave us with more of a coin than we started with. Secondly, we will only be considering a single snapshot of data from the exchange. Obviously markets are highly dynamic, with thousands of new bids and asks coming in each second. A proper arbitrage system needs to constantly be scanning for opportunities, but that’s out of the scope of this post.

With all this in mind, the overall implementation strategy was as follows:

  1. For a given exchange, acquire the list of pairs that will form the vertices.
  2. For each of these pairs, download a snapshot of the bid/ask.
  3. Process these values accordingly, assigning them to directed edges on the graph.
  4. Using Bellman-Ford, find and return negative-weight cycles if they exist.
  5. Calculate the arbitrage that these negative-weight cycles correspond to.

The full code for this project can be found in this GitHub repo. If you find this post interesting, don’t forget to leave a star!


Raw data

For the raw data, I decided to use the CryptoCompare API which has a load of free data compiled across multiple exchanges. To get started, you’ll need to register to get a free API key.

As mentioned previously, we will only look at data from Binance. I chose Binance not because it has a large selection of altcoins, but because most altcoins can trade directly with multiple pairs (e.g BTC, ETH, USDT, BNB). Some exchanges have many altcoins but you can only buy them with BTC – this is not well suited for arbitrage.

Firstly, we need to find out which pairs Binance offers. This is done with a simple call (AUTH is your API key string):

import requests
import json 

def top_exchange_pairs():
    url = (
        "https://min-api.cryptocompare.com/data/v3/all/" + 
        "exchanges?topTier=true&api_key=" + AUTH
    )
    r = requests.get(url)
    with open("pairs_list.json", "w") as f:
        json.dump(r.json(), f)

This is an excerpt from the resulting JSON file – for each exchange, the pairs field lists all other coins that the key coin can be traded with:

"Data":{  
   "Binance":{  
      "isActive":true,
      "isTopTier":true,
      "pairs":{  
         "ETH":["PAX", "TUSD", "USDT", "USDC", "BTC"],
         "ONGAS":["BTC", "BNB", "USDT"],
         "PHX":["ETH","BNB","BTC"]
      }
   },
   "Coinbase":{  
      "isActive":true,
      "isTopTier":true,
      "pairs":{  
         "ETH":["DAI", "USD", "USDC", "EUR", "GBP", "BTC"],
         "BCH":["BTC", "GBP", "EUR", "USD"]
      }
   }
}

I then filtered out coins with three or fewer tradable pairs. These coins are unlikely to participate in arbitrage – we would rather have a graph that is more connected.

def binance_connected_pairs():
    with open("pairs_list.json", "r") as f:
        data = json.load(f)
    pairs = data["Data"]["Binance"]["pairs"]
    return {k: v for k, v in pairs.items() if len(v) > 3}

We are now ready to download a snapshot of the available exchange rates for each of these coins.

import os
from tqdm import tqdm  # progress bar

def download_snapshot(pair_dict, outfolder):
    if not os.path.exists(outfolder):
        os.makedirs(outfolder)

    # Download data and write to files
    for p1, p2s in tqdm(pair_dict.items()):
        url = (
            "https://min-api.cryptocompare.com/data/" 
            + f"ob/l1/top?fsyms={p1}&tsyms={','.join(p2s)}"
            + "&e=Binance&api_key=" + AUTH
        )
        r = requests.get(url)
        with open(f"{outfolder}/{p1}_pairs_snapshot.json", "w") as f:
            json.dump(r.json(), f)

We can then run all of the above functions to produce a directory full of the exchange rate data for the listed pairs.

top_exchange_pairs()
connected = binance_connected_pairs()
download_snapshot(connected, "binance_data")

Here is an excerpt from one of the resulting snapshot files, showing the best bid and ask for each coin that EOS trades against:

"EOS": {
    "BNB": {
        "BID": ".2073",
        "ASK": ".2077"
    },
    "BTC": {
        "BID": ".0007632",
        "ASK": ".0007633"
    },
    "ETH": {
        "BID": ".02594",
        "ASK": ".025964"
    },
    "USDT": {
        "BID": "7.0441",
        "ASK": "7.046"
    },
    "PAX": {
        "BID": "7.0535",
        "ASK": "7.07"
    }
}

This excerpt reveals something that we glossed over completely in the previous post. As anyone who has tried to exchange currency on holiday will know, there are actually two exchange rates for a given currency pair depending on whether you are buying or selling the currency. In trading, these two prices are called the bid (the current highest price someone will buy for) and the ask (the current lowest price someone will sell for). As it happens, this is very easy to deal with in the context of graphs.

Preparing the data

Having downloaded the raw data, we must now prepare it so that it can be put into a graph. This effectively means parsing it from the raw JSON and putting it into a pandas dataframe. We will arrange it in the dataframe such that it constitutes an adjacency matrix:

  • Column ETH row BTC is the bid:
    • i.e someone will pay x BTC to buy my 1 ETH
    • this is then the weight of the ETH $\to$ BTC edge.
  • Column BTC row ETH is the ask:
    • i.e I have to pay y BTC to buy someone’s 1 ETH
    • the reciprocal of this is the weight of the BTC $\to$ ETH edge.

I chose this particular row-column scheme because it results in intuitive indexing: df.X.Y is the amount of Y gained by selling 1 unit of X, and df.A.B * df.B.C * df.C.D is the total amount of D gained by trading 1 unit of A when trading via $A \to B \to C \to D$.
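As a quick sanity check of this convention, using the EOS quotes from the excerpt above (and assuming df holds the adjacency matrix described here):

df.EOS.BTC               # 0.0007632: BTC received for selling 1 EOS (the bid)
df.BTC.EOS               # 1 / 0.0007633: EOS received for 1 BTC (reciprocal of the ask)
df.EOS.BTC * df.BTC.EOS  # ~0.99987 < 1: a round trip loses money, so no arbitrage here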

The column headers will be the same as the row headers, consisting of all the coins we are considering. The function that creates the adjacency matrix is shown here:

import pandas as pd

def create_adj_matrix(pair_dict, folder, outfile="snapshot.csv"):
    # Union of 'from' and 'to' pairs
    flatten = lambda l: [item for sublist in l for item in sublist]
    keys, values = pair_dict.keys(), pair_dict.values()
    all_pairs = list(set(keys).union(flatten(values)))

    # Create empty df
    df = pd.DataFrame(columns=all_pairs, index=all_pairs)

    for p1 in pair_dict.keys():
        with open(f"{folder}/{p1}_pairs_snapshot.json", "r") as f:
            res = json.load(f)
        quotes = res["Data"]["RAW"][p1]
        for p2 in quotes:
            try:
                df[p1][p2] = float(quotes[p2]["BID"])
                df[p2][p1] = 1 / float(quotes[p2]["ASK"])
            except KeyError:
                print(f"Error for {p1}/{p2}")
                continue
    df.to_csv(outfile)

Putting the data into a graph

We will be using the NetworkX package, an intuitive yet extremely well documented library for dealing with all things graph-related in python.

In particular, we will be using nx.DiGraph, which is just a (weighted) directed graph. I was initially concerned that it’d be difficult to get the data in: python libraries often adopt their own weird conventions and you have to modify your data so that it is in the correct format. This was not really the case with NetworkX – it turns out that we had already done most of the hard work when we put the data into our pandas adjacency matrix.

Firstly, we take negative logs as discussed in the previous post. Secondly, in our dataframe we currently have NaN whenever there is no edge between two vertices. To make a valid nx.DiGraph, we need to set these to zero. Lastly, we transpose the dataframe because NetworkX uses a different row/column convention. We then pass this processed dataframe into the nx.DiGraph constructor. Summarised in one line (plus the imports we need):

import networkx as nx
import numpy as np

g = nx.DiGraph(-np.log(df).fillna(0).T)

Bellman-Ford

To implement Bellman-Ford, we make use of the funky defaultdict data structure. As the name suggests, it works exactly like a python dict, except that if you query a key that is not present you get a certain default value back. The first part of our implementation is quite standard: we just do the $n-1$ rounds of edge relaxation, where $n$ is the number of vertices.
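As a tiny illustration of that behaviour:

from collections import defaultdict
import math

d = defaultdict(lambda: math.inf)
d["BTC"] = 0
print(d["BTC"])  # 0
print(d["ETH"])  # inf -- unseen keys get the default value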

But because the ‘classic’ Bellman-Ford does not actually return negative-weight cycles, the second part of our implementation is a bit more complicated. The key idea is that if an edge can still be relaxed after $n-1$ rounds, then it must lead to a negative-weight cycle. So to find this cycle we walk back along the predecessors until a cycle is detected, then return the cyclic portion of that walk. In order to prevent subsequent redundancy, we mark these vertices as ‘seen’ via another defaultdict. This check adds one more pass over the edges, so the overall asymptotic complexity remains $O(VE)$.

from collections import defaultdict
import math

def bellman_ford_return_cycle(g, s):
    n = len(g.nodes())
    d = defaultdict(lambda: math.inf)  # distances dict
    p = defaultdict(lambda: -1)  # predecessor dict
    d[s] = 0

    for _ in range(n - 1):
        for u, v in g.edges():
            # Bellman-Ford relaxation
            weight = g[u][v]["weight"]
            if d[u] + weight < d[v]:
                d[v] = d[u] + weight
                p[v] = u  # update pred

    # After the n-1 rounds, find negative-weight cycles if they exist
    all_cycles = []
    seen = defaultdict(lambda: False)

    for u, v in g.edges():
        if seen[v]:
            continue
        # If we can relax further there must be a neg-weight cycle
        weight = g[u][v]["weight"]
        if d[u] + weight < d[v]:
            cycle = []
            x = v
            while True:
                # Walk back along preds until a cycle is found
                seen[x] = True
                cycle.append(x)
                x = p[x]
                if x == v or x in cycle:
                    break
            # Slice to get the cyclic portion
            idx = cycle.index(x)
            cycle.append(x)
            all_cycles.append(cycle[idx:][::-1])
    return all_cycles

As a reminder, this function returns all negative-weight cycles reachable from a given source vertex (returning the empty list if there are none). To find all negative-weight cycles, we can simply call the above procedure on every vertex then eliminate duplicates.

def all_negative_cycles(g):
    all_paths = []
    for v in g.nodes():
        all_paths.append(bellman_ford_return_cycle(g, v))
    flattened = [item for sublist in all_paths for item in sublist]
    return [list(i) for i in set(tuple(j) for j in flattened)]

Tying it all together

The last thing we need is a function that calculates the value of an arbitrage opportunity given a negative-weight cycle on a graph. This is easy to implement: we just find the total weight along the path then exponentiate the negative total (because our weights are the negative log of the exchange rates).

def calculate_arb(cycle, g, verbose=True):
    total = 0
    for (p1, p2) in zip(cycle, cycle[1:]):
        total += g[p1][p2]["weight"]
    arb = np.exp(-total) - 1
    if verbose:
        print("Path:", cycle)
        print(f"{arb*100:.2g}%\n")
    return arb


def find_arbitrage(filename="snapshot.csv"):
    df = pd.read_csv(filename, header=0, index_col=0)
    g = nx.DiGraph(-np.log(df).fillna(0).T)

    if nx.negative_edge_cycle(g):
        print("ARBITRAGE FOUND\n" + "=" * 15 + "\n")
        for p in all_negative_cycles(g):
            calculate_arb(p, g)
    else:
        print("No arbitrage opportunities")

Running this function gives the following output:

ARBITRAGE FOUND
===============

Path: ['USDT', 'BAT', 'BTC', 'BNB', 'ZEC', 'USDT']
0.087%

Path: ['BTC', 'XRP', 'USDT', 'BAT', 'BTC']
0.05%

Path: ['BTC', 'BNB', 'ZEC', 'USDT', 'BAT', 'BTC']
0.087%

Path: ['BNB', 'ZEC', 'USDT', 'BAT', 'BTC', 'BNB']
0.087%

Path: ['USDT', 'BAT', 'BTC', 'XRP', 'USDT']
0.05%

0.09% is not exactly a huge amount of money, but it is still risk-free profit, right?

Why wouldn’t this work?

Notice that we haven’t mentioned exchange fees at any point. In fact, Binance charges a standard 0.1% commission on every trade. It is easy to modify our code to incorporate this: we just multiply each exchange rate by 0.999. But we don’t need to compute anything to see that we would certainly be losing much more money than gained from the arbitrage.
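For instance, a minimal sketch of folding a flat 0.1% fee into the graph construction, simply scaling every quoted rate by 0.999 before taking logs:

fee = 0.001
g = nx.DiGraph(-np.log(df * (1 - fee)).fillna(0).T)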

Secondly, it is likely that this whole analysis is flawed because of the way the data was collected. The function download_snapshot makes a request for each coin in sequence, taking a few seconds in total. But in these few seconds, prices may move – so really the above “arbitrage” may just be a result of our algorithm selecting some of the price movements. This could be fixed by using timestamps provided by the exchange to ensure that we are looking at the order book for each pair at the exact same moment in time.

Thirdly, we have assumed that you can trade an infinite quantity of the bid and ask. An order consists of a price and a quantity, so we will only be able to fill a limited quantity at the ask price. Thus in practice we would have to look at the top few levels of the order book and consider how much of it we’d eat into.

It is not difficult to extend our methodology to arb between different exchanges. We would just need to aggregate the top of the order book from each exchange, then put the best bid/ask onto the respective edges. Of course, to run this strategy live we would have to manage our inventory not just on a currency level but per currency per exchange, and factors like the congestion of the bitcoin network would come into play.

Lastly, this analysis has only been for a single snapshot. A proper arbitrage bot would have to constantly look for opportunities simultaneously across multiple order books. I think this could be done by having a websocket stream which keeps the graph updated with the latest quotes, and using a more advanced method for finding negative-weight cycles that does not involve recomputing the shortest paths via Bellman-Ford.

Conclusion

All this begs the question: why is it so hard to find arbitrage? The simple answer is that other people are doing it smarter, better, and (more importantly) faster. With highly optimised algorithms (probably implemented in C++), ‘virtual colocation’ of servers, and proper networking software/hardware, professional market makers are able to exploit these simple arbitrage opportunities extremely rapidly.

In any case, the point of this post was not to develop a functional arbitrage bot but rather to demonstrate the power of graph algorithms in a non-standard use case. Hope you found it as interesting as I did!