Graph algorithms and currency arbitrage, part 1

Arbitrage is the holy grail for traders and the bedrock of financial academia. Let’s say you are in an open marketplace with Alice selling oranges for \$1 each and Bob buying them for \$2. As a cunning trader, you realise you can buy an orange from Alice and immediately sell it to Bob for \$1 of “risk-free” profit. However, as you keep reaping this \$1 profit by buying up Alice’s oranges, she raises her prices. Eventually, Alice would be charging \$2 per orange and the arbitrage would be shut. In other words, acting on arbitrage opportunities causes the market to become more efficient, with everything as close to its fair price as possible. Because of this, the only way to profit from true arbitrage is to find and execute opportunities faster than everybody else.

Currency exchange is a natural place to search for arbitrage because there are many different pairs that we can trade. For example, assume we are able to convert GBP to USD, then USD to EUR, then EUR to AUD for an effective rate of 1:2 GBP:AUD. If we were then able to directly swap AUD back to GBP in the ratio 2:1.1, this would be an arbitrage opportunity because we end up with more GBP than we started with while taking zero risk.

Our goal is to develop a systematic method for detecting arbitrage opportunities by framing the problem in the language of graphs. This is a reasonably large topic, so we’ll split it into two parts. Part 1 (this post) will lay the theoretical groundwork, introducing graph algorithms and giving an overview of their application to currency arbitrage. In Part 2 we will present an actual implementation of these ideas in python, applied to cryptocurrencies.

Graphs

Formally, we define a graph as a set of vertices (also called nodes) and edges. These edges can be directed (i.e have little arrows on them) and may have weights. It is completely up to us to decide what the vertices, edges, and weights represent. For example, if we are trying to model a road network we might say that the vertices are cities (or junctions), the edges are the roads themselves, and the weights are the length of the roads. The diagram below shows how a weighted directed graph is typically depicted.

It is useful to formulate real-world problems in terms of graphs because we have about three centuries worth of relevant theory – they were first investigated by Euler in 1736. In particular, there exist many efficient algorithms related to finding the shortest path along a graph, which have widespread applications e.g in mapping.

The Bellman-Ford algorithm finds the minimum weight path from a single source vertex to all other vertices on a weighted directed graph. For the purposes of this post, we don’t need to know anything other than the fact that once the algorithm terminates, each vertex is labelled with the total weight of the minimum weight path from the source to that vertex.

Actually, the above statement has an incredibly important caveat. What happens if we try to find the minimum weight path from vertices C to D in the above graph? The obvious guess would be that it is the direct path along the CD edge, which has total weight 2.1. However, notice that we can in fact reduce this further by following the path CDECD, which would have a total weight of -1. But why stop there? If we go around the loop again, we can reduce it stil further. In fact, each time we traverse the loop, our minimum total weight will reduce by 1. Hence this is called a negative-weight cycle, the existence of which means that the shortest path between C and D is not defined.

This is actually the main reason why we are choosing to use Bellman-Ford over another shortest path algorithm like Dijkstra, despite the latter being asymptotically more efficient. Bellman-Ford is able to detect if there is a negative-weight cycle and as will be seen shortly, this will be the key to detecting arbitrage opportunities.

Graphs for currency arbitrage

How can graphs be used to model currency markets? At a very high level, we will assign currencies to different vertices, and let the edge weight represent the exchange rate. A simple example is presented below: in this market, we can convert 1 GBP to 1.27 USD, 1 USD to 1.43 AUD, etc.

To find the exchange rate between two currencies that do not share an edge, we simply find a path between the two currency vertices and walk along it, multiplying by each successive edge weight (exchange rate). If there are multiple possible paths, choose the path that has the greatest weight product, which corresponds to the most favourable rate.

What would happen if we start with 1 GBP and convert it along GBP $\to$ USD $\to$ AUD $\to$ GBP? Well, multiplying out the edges we see that our initial 1 GBP becomes $1.27 \times 1.43 \times 0.55 \approx 0.999$ GBP. That is to say, we lose a tenth of a penny on the round trip. However, if this number were greater than 1, it would mean that by following this path we end up with more GBP than we started with – this is textbook arbitrage.

So the problem of finding arbitrage can thus be formally posed as follows: find a cycle in the graph such that multiplying all the edge weights along that cycle results in a value greater than 1. In fact we have already described an algorithm that can find such a path – the problem is equivalent to finding a negative-weight cycle, provided we do some preprocessing on the edges.

Firstly, we note that Bellman-Ford computes the path weight by adding the individual edge weights. To make this work for exchange rates, which are multiplicative, an elegant fix is to first take the logs of all the edge weights. Thus when we sum edge weights along a path we are actually multiplying exchange rates – we can recover the multiplied quantity by exponentiating the sum. Secondly, Bellman-Ford attempts to find minimum weight paths and negative edge cycles, whereas our arbitrage problem is about maximising the amount of currency received. Thus as a simple modification, we must also make our log weights negative.

With these two tricks in hand, we are able to apply Bellman-Ford. The minimal weight between two currency vertices corresponds to the optimal exchange rate, the value of which can be found by by exponentiating the negative sum of weights along the path. As a corollary, we have the wonderful insight that: a negative-weight cycle on the negative-log graph corresponds to an arbitrage opportunity.

Conclusion

We have seen that graphs provide an elegant framework for thinking about currency exchange. Concretely, by representing different currencies as vertices and using the negative log of the exchange rate as the edge weight we can apply shortest-path algorithms to find negative-weight cycles if they exist, which correspond to arbitrage opportunities. In the next post, we code up the methodology presented in this post and apply it to some real-world data. Should be fun!


Portfolio optimisation: lessons learnt

Over the past few months I have been busy doing a mixture of blockchain consulting and quantitative finance work for a couple of companies in South East Asia. In particular, I have had the opportunity to investigate the interesting problem of portfolio management for cryptoassets – it was not my first experience with portfolio optimisation, having implemented efficient frontier portfolios at a roboadvisor startup, but this time I took the opportunity to do a deep dive into the subject.

Though I am not an expert, I do believe I am at the stage where I can do a kind of write-up, recounting some of the things I’ve learnt along the way. A lot of my experience has been baked into PyPortfolioOpt, a comprehensive portfolio optimisation package which I recently revamped, but this post will act as an informal guidebook for those interested in pursing the subject.

1. Be careful with in-sample vs out-of- testing

For most people, ‘portfolio optimisation’ basically refers to mean-variance optimisation (MVO) in the style of Markowitz. MVO will tell you the mathematically optimal portfolio allocation, which sounds great, but the caveat is that it requires the (future) expected returns and the covariance matrix as inputs. For those of us who aren’t clairvoyant, these quantities are inaccessible, so we must make do with estimates. The quality of these estimates will be discussed in the next section, but for now I would like to highlight a common mistake, which is backtesting with a forward-looking bias.

The standard way (in textbooks or pedagogical materials) to estimate the expected returns is to take the mean annual return over the past few years. However, it is surprisingly easy to accidentally let future data worm its way into the estimate, resulting in vastly overinflated performance. One of the early mistakes I made was to optimise an equity portfolio using data from 2006-2010, then testing it from 2006-2015. The problem here is the overlap – the 2006 test has had access to data up to 2010, so performance in the 2006-2010 portion of the backtest will be very good.

To be fair, this is often quite an obvious mistake to diagnose, because your portfolios will outperform the benchmark by a ridiculous margin and you will realise that something is wrong. But there are also more subtle ways that you can include future data (e.g survivorship bias), in which case it may not be so clear.

2. Forget about expected returns

Despite their strong theoretical guarantees, efficient frontier portfolios often have poor real-life performance owing to estimation errors in the inputs.

Anyone who has ever tried to pick stocks will know how absurd it is to expect that a stock’s mean return over the past few years will be a good indicator of its future returns. Such simple relationships have almost certainly been arbitraged away – I therefore contend that mean historical returns are almost pure noise. The problem with this is that MVO has no way of distinguishing noise and signal, so often what happens is that the optimiser highlights the noise in the input. This is why there is a ‘running joke’ in the portfolio optimisation literature that a mean-variance optimiser is really just an “error maximiser”1.

In practice, I have found that standard MVO can perform at least in line with the benchmark for equities, but for other asset classes (particularly those which are primarily driven by speculation, like cryptocurrency), the mean historical return is a useless estimator of future return. However, in the next section we will see a simple way round this.

3. You can do better than 1/N

A significant body of research suggests that, in light of the aforementioned failures of MVO, we should instead use 1/N portfolios2 – it is noted that they often beat efficient frontier optimisation significantly in out-of-sample testing.

However, an interesting paper by Kritzman et al. (2010) finds that it is not the case that there is anything special about 1/N diversification: it is just that the expected returns are an incredibly poor estimator and any optimisation scheme which relies on them will likely go astray.

The easiest way to avoid this problem is to not provide the expected returns to the optimiser, and just optimise on the sample covariance matrix instead. Effectively we are saying that although previous returns won’t predict future returns, previous risks might predict future risks. This is intuitively a lot more reasonable – the sample covariance matrix really seems like it should contain a lot of information. Empirical results support this, showing that minimum variance portfolios outperform both standard MVO and 1/N diversification.

In my own work I have found that the standard minimum variance portfolio is a very good starting point, from which you can try a lot of new things:

  • Shrinkage estimators on the covariance matrix
  • Exponential weighting
  • Different historical windows
  • Additional cost terms in the objective function (e.g small-weights penalty)

4. Don’t ignore rebalance costs

When it comes to investment strategy or algorithmic trading, most people leave the “realities” like commission, slippage, and latency to the last stage. However, because of the general high turnover nature of efficient frontier portfolios, this could be quite a dangerous mistake.

In the context of portfolio management, the two main realities that must be accounted for in all backtests are commission and slippage. Latency is not very important because of the longer time horizons involved. Commissions are not hard to analyse: just multiply your broker’s percentage commission by the turnover. Slippage, on the other, hand is a little bit more subtle (though it is only really an issue in less liquid markets). There are a number of variables that affect slippage, but the important ones are transaction volume and urgency (refer to models such as that of Almgren for more).

Specifically, in the case of cryptoasset portfolios, one major problem is the difference in liquidity between bitcoin and a small altcoin like OmiseGo. A portfolio that shifts positions too much among the altcoins may look like it performs well in backtests, but when traded in real life the 0.5% slippage will be a killer.

Once all transaction costs have been included, you may find that it can be difficult to outperform a buy-and-hold market-cap weighted benchmark. Incidentally, this is another failure with 1/N portfolios: for reasonably volatile assets, there is high turnover at each rebalance and thus a large transaction cost.

5. Overfitting

One thing that I respect a lot about machine learning education is that they are fundamentally honest about one of the major issues in the subject: overfitting. Essentially, it is possible to encourage most classifiers to wiggle their decision boundary to accommodate all of the training examples at the cost of generalisation ability. This is done by adjusting the hyperparameters (which do things like specify learning rates) and seeing which ones perform better. Hyperparameter tuning is not a bad thing, but it must be done responsibly lest we simply choose the best parameters that fit the particular train/test setup.

When it comes to standard portfolio optimisation, one doesn’t really have to worry about this because there aren’t any hyperparameters. However, once you start adding different terms to the cost function (each scaled by some parameter), things get a little bit more complicated. For example, let us consider the simple example of minimum variance optimsiation with an L2 regularisation term.

The performance of this portfolio varies greatly depending on the choice of $\gamma$. A natural instinct is to run the optimisation for 10 different values of $\gamma$, but if this is not done carefully, we are probably fitting to the noise within the train and test sets.

Remember that ceteris paribus, a simpler model should be preferred. Black-Litterman, or optimisation with an exotic objective with multiple parameters may be seem to improve performacne on the backtest, but the proof will always be in the out-of-sample pudding. Approach portfolio optimisation like you would any financial machine learning task: with a healthy dose of skepticism.

Conclusion

I still have a lot to learn about portfolio optimisation, particularly with respect to optimisation of higher moments (skew and kurtosis) and things like copula, but the lessons highlighted in this post definitely still apply. If you found this interesting, check out:

PyPortfolioOpt   Star

References

  1. Michaud, R. (1989). “The Markowitz Optimization Enigma: Is Optimization Optimal?” Financial Analysts Journal 45(1), 31–42. 

  2. Demiguel, Victor & Garlappi, Lorenzo & Uppal, Raman. (2009). Optimal Versus Naive Diversification: How Inefficient is the 1/N Portfolio Strategy?. Review of Financial Studies. 22. 10.1093/rfs/hhm075. 


Exponential Covariance

For the past few months, I have been doing a lot of research into portfolio optimisation, whose main task can be summarised as follows:

Is there a way of combining a set of risky assets to produce superior risk-adjusted returns compared to a market-cap weighted benchmark?

The answer of Markowitz (1952) is in the affirmative, with some major caveats. Given the expected returns and the covariance matrix (which encodes asset volatilities and correlations), one can find the combination of asset weights which maximises the Sharpe ratio. But because we don’t know the expected returns or future covariance matrix a priori, we commonly replace these with the mean historical return and sample covariance matrix. The problem is that these are very noisy estimators (especially the mean return), so much so that a significant body of research suggests that a naive diversification strategy (giving each asset equal weight) outperforms most weighting schemes. However, work by Kritzman, Page and Turkington (2010) affirms the intuition that there must be some information in the sample covariance matrix; accordingly, they observe that minimum variance portfolios can beat 1/N diversification.

In my own research, I have found this to be true. Minimum variance optimisation and its variants (no pun intended) can significantly outperform both 1/N diversification and a market-cap weighted benchmark. The nice thing about minimum variance optimisation is that success largely depends on how well you can estimate the covariance matrix, which is easier than estimating future returns. The most common methods are

  • Sample covariance – standard, unbiased and efficient, but it is known to have high estimation error which is particularly dangerous in the context of a quadratic optimiser.
  • Shrinkage estimators – pioneered by Ledoit and Wolf, shrinkage estimators attempt to reduce the estimation error by blending the sample covariance matrix with a highly structured estimator.
  • Robust covariance estimates – estimators that are robust to recording errors, such as Rousseeuw’s Minimum Covariance Determinant .

I have been experimenting with a new alternative, which I call the exponential covariance matrix (to be specific, the exponentially-weighted sample covariance matrix). In this post, I will give a brief outline of the motivation and conceptual aspects of the exponential covariance. However, because I am currently using this professionally, I will be intentionally vague regarding implementation details.

Motivation

In many technical indicators, we see the use of an exponential moving average (EMA) rather than the simple moving average (SMA). The EMA captures the intuition that recent prices are (exponentially) more relevant than previous prices. If we let $p_0$ denote today’s price, $p_1$ denote yesterday’s price, $p_n$ denote the price n days ago, the exponentially weighted mean is given by:

$\alpha$ parameterises the decay rate ($0 < |\alpha| < 1$): by observation we note that higher $\alpha$ gives more weight to recent results, and lower $\alpha$ causes the exponential mean to tend to the arithmetic mean. Additionally, because $1/\alpha = 1 + (1-\alpha) + (1-\alpha)^2 + \ldots$, we note that ‘weights’ sum to one.

In practice, we do not compute the infinite sum above. Rather, observing that the weights rapidly become negligible, we limit the calculation to some window. This window is not to be confused with the span of the EMA, which is another way of specifying the decay rate – a good explanation can be found on the pandas documentation.

The EMA is useful because it ‘reacts’ to recent data much better than the SMA owing to the exponential weighting scheme, while still preserving the memory of the timeseries.

Covariance

Covariance, like correlation, measures how two random variables X and Y move together. Let us think about how we would define this metric. For variables that have high covariance, we would expect that when $x$ is high, so is $y$. When $y$ is low, $x$ should be too. To capture this idea, we can proceed as follows.

For each pair $(x_i, y_i)$ in the population, we measure how far away $x_i$ and $y_i$ are from their respective means, then multiply these distances together. If these differences have the same sign, i.e $x_i$ is greater than $\bar{x}$ and $y_i$ is greater than $\bar{y}$, there is a positive contribution to the covariance. If $x_i$ is less than $\bar{x}$ but $y_i$ is greater than $\bar{y}$, there is a negative contribution. We can then sum over all observations, and divide by the number of observations to get some kind of ‘average co-variation’. In fact, this is the exact definition of the population covariance:

Please note that in practice you would always use the sample covariance instead, which has a factor of $1/(N-1)$ rather than $1/N$, but this is not something we will worry about here.

Covariance of asset returns

One would think that the easiest approach is to take two price series (e.g stock prices for AAPL and GOOG), then compute the daily percentage change or log returns, before feeding these into a covariance calculation. This does work, and is the standard approach. But in my view, this is carelessly throwing away a good deal of information, because:

Covariance does not preserve the order of observations.

You will get the same covariance whether you provide $(x_2, y_2), (x_{17}, y_{17}), (x_8, y_8), \ldots$ or $(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots$.

However, in the case of time series, the order of the returns is of fundamental importance. Thus we need some way of incorporating the sequential nature of the data into the definition of covariance. Fortunately, it is simple to apply our intuition of the EMA to come up with a similar metric for covariance.

We will rewrite our previous definition as follows:

Rather than letting $(x_i, y_i)$ be any observations from the dataset, let us preserve the order by saying that $(x_i, y_i)$ denotes the returns of asset X and Y $i$ days ago. Thus $(x_1 - \bar{x})(y_1 - \bar{y})$ specifically refers to the co-variation of the returns yesterday.

The next step should now be clear. We simply give each co-variation term an exponential weight as follows:

Or more simply:

And we are done! This simple procedure is all that is required to incorporate the temporal nature of asset returns into the covariance matrix.

Conclusion

This post has presented a modification of the covariance matrix especially suited to time series like asset returns. It is simple to extend this to an exponential covariance matrix and use it in portfolio optimisation – it is reasonable to suggest that this matrix will be positive definite if the sample covariance matrix is positive definite.

I have used the exponential covariance to great effect regarding portfolio optimisation on real assets. Backtested results have affirmed that the exponential covariance matrix strongly outperforms both the sample covariance and shrinkage estimators when applied to minimum variance portfolios.

At some stage in future, I will consider implementing this in my portfolio optimisation package PyPortfolioOpt, but for the time being this post will have to suffice. Please drop me a note if you’d like to use the ideas herein for any further research or software.

Update: as of 20/9/18, exponential covariance has been added to PyPortfolioOpt!