Exponential Covariance

For the past few months, I have been doing a lot of research into portfolio optimisation, whose main task can be summarised as follows:

Is there a way of combining a set of risky assets to produce superior risk-adjusted returns compared to a market-cap weighted benchmark?

The answer of Markowitz (1952) is in the affirmative, with some major caveats. Given the expected returns and the covariance matrix (which encodes asset volatilities and correlations), one can find the combination of asset weights which maximises the Sharpe ratio. But because we don’t know the expected returns or future covariance matrix a priori, we commonly replace these with the mean historical return and sample covariance matrix. The problem is that these are very noisy estimators (especially the mean return), so much so that a significant body of research suggests that a naive diversification strategy (giving each asset equal weight) outperforms most weighting schemes. However, work by Kritzman, Page and Turkington (2010) affirms the intuition that there must be some information in the sample covariance matrix; accordingly, they observe that minimum variance portfolios can beat 1/N diversification.

In my own research, I have found this to be true. Minimum variance optimisation and its variants (no pun intended) can significantly outperform both 1/N diversification and a market-cap weighted benchmark. The nice thing about minimum variance optimisation is that success largely depends on how well you can estimate the covariance matrix, which is easier than estimating future returns. The most common methods are

  • Sample covariance – standard, unbiased and efficient, but it is known to have high estimation error which is particularly dangerous in the context of a quadratic optimiser.
  • Shrinkage estimators – pioneered by Ledoit and Wolf, shrinkage estimators attempt to reduce the estimation error by blending the sample covariance matrix with a highly structured estimator.
  • Robust covariance estimates – estimators that are robust to recording errors, such as Rousseeuw’s Minimum Covariance Determinant .

I have been experimenting with a new alternative, which I call the exponential covariance matrix (to be specific, the exponentially-weighted sample covariance matrix). In this post, I will give a brief outline of the motivation and conceptual aspects of the exponential covariance. However, because I am currently using this professionally, I will be intentionally vague regarding implementation details.


In many technical indicators, we see the use of an exponential moving average (EMA) rather than the simple moving average (SMA). The EMA captures the intuition that recent prices are (exponentially) more relevant than previous prices. If we let $p_0$ denote today’s price, $p_1$ denote yesterday’s price, $p_n$ denote the price n days ago, the exponentially weighted mean is given by:

$\alpha$ parameterises the decay rate ($0 < |\alpha| < 1$): by observation we note that higher $\alpha$ gives more weight to recent results, and lower $\alpha$ causes the exponential mean to tend to the arithmetic mean. Additionally, because $1/\alpha = 1 + (1-\alpha) + (1-\alpha)^2 + \ldots$, we note that ‘weights’ sum to one.

In practice, we do not compute the infinite sum above. Rather, observing that the weights rapidly become negligible, we limit the calculation to some window. This window is not to be confused with the span of the EMA, which is another way of specifying the decay rate – a good explanation can be found on the pandas documentation.

The EMA is useful because it ‘reacts’ to recent data much better than the SMA owing to the exponential weighting scheme, while still preserving the memory of the timeseries.


Covariance, like correlation, measures how two random variables X and Y move together. Let us think about how we would define this metric. For variables that have high covariance, we would expect that when $x$ is high, so is $y$. When $y$ is low, $x$ should be too. To capture this idea, we can proceed as follows.

For each pair $(x_i, y_i)$ in the population, we measure how far away $x_i$ and $y_i$ are from their respective means, then multiply these distances together. If these differences have the same sign, i.e $x_i$ is greater than $\bar{x}$ and $y_i$ is greater than $\bar{y}$, there is a positive contribution to the covariance. If $x_i$ is less than $\bar{x}$ but $y_i$ is greater than $\bar{y}$, there is a negative contribution. We can then sum over all observations, and divide by the number of observations to get some kind of ‘average co-variation’. In fact, this is the exact definition of the population covariance:

Please note that in practice you would always use the sample covariance instead, which has a factor of $1/(N-1)$ rather than $1/N$, but this is not something we will worry about here.

Covariance of asset returns

One would think that the easiest approach is to take two price series (e.g stock prices for AAPL and GOOG), then compute the daily percentage change or log returns, before feeding these into a covariance calculation. This does work, and is the standard approach. But in my view, this is carelessly throwing away a good deal of information, because:

Covariance does not preserve the order of observations.

You will get the same covariance whether you provide $(x_2, y_2), (x_{17}, y_{17}), (x_8, y_8), \ldots$ or $(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots$.

However, in the case of time series, the order of the returns is of fundamental importance. Thus we need some way of incorporating the sequential nature of the data into the definition of covariance. Fortunately, it is simple to apply our intuition of the EMA to come up with a similar metric for covariance.

We will rewrite our previous definition as follows:

Rather than letting $(x_i, y_i)$ be any observations from the dataset, let us preserve the order by saying that $(x_i, y_i)$ denotes the returns of asset X and Y $i$ days ago. Thus $(x_1 - \bar{x})(y_1 - \bar{y})$ specifically refers to the co-variation of the returns yesterday.

The next step should now be clear. We simply give each co-variation term an exponential weight as follows:

Or more simply:

And we are done! This simple procedure is all that is required to incorporate the temporal nature of asset returns into the covariance matrix.


This post has presented a modification of the covariance matrix especially suited to time series like asset returns. It is simple to extend this to an exponential covariance matrix and use it in portfolio optimisation – it is reasonable to suggest that this matrix will be positive definite if the sample covariance matrix is positive definite.

I have used the exponential covariance to great effect regarding portfolio optimisation on real assets. Backtested results have affirmed that the exponential covariance matrix strongly outperforms both the sample covariance and shrinkage estimators when applied to minimum variance portfolios.

At some stage in future, I will consider implementing this in my portfolio optimisation package PyPortfolioOpt, but for the time being this post will have to suffice. Please drop me a note if you’d like to use the ideas herein for any further research or software.

Update: as of 20/9/18, exponential covariance has been added to PyPortfolioOpt!

Stormy Seas for Proof of Work

In this post we will be examining one of the main problems with Proof of Work (PoW) – not the energy inefficiency (as it is debatable how much of a problem this really is), but something more fundamental with the consensus process. In the past couple of months we have seen a number of cryptocurrencies fall victim to 51% attacks. Verge, Bitcoin Gold, ZenCash, and Electroneum are just a few coins that have been targeted, resulting in a total equivalent theft of $5 million (not to mention the subsequent loss in market value of the coins).

51% attacks are a basic problem in distributed ledger technology, covered in any crypto 101 course. Essentially, each individual node in a decentralised network is responsible for validating transactions and optionally submitting a block of these transactions to the blockchain – doing so requires the node to solve hash puzzles (this is why it is called Proof of Work). The beauty of this system is that each node has a say in what happens, proportional to the amount of hash power they contribute – thus the system is a democracy of sorts. However, a natural corollary of this is that any node or group of nodes that achieves a majority of the hash power can ‘outvote’ the rest of the network, allowing them to conduct a 51% attack.

Standard theory dictates that if there are enough independent nodes on a distributed ledger, we can reap the benefits of democracy while knowing that it would be immensely costly for a malicious party to achieve 51% of the hash power. This may be true for cryptocurrencies with many active nodes (like Bitcoin and Ethereum), but with the proliferation of multitudinous altcoins, we may not be able to say the same for more obscure tokens. It is thus the case that 51% attacks have gone from being a textbook problem to something very real, actively destroying both reputation and value in the crypto space.

In this post, we will explore how an attacker might go about conducting a 51% attack, examine the features that make a coin most vulnerable, and comment on prevention and mitigation. An obvious disclaimer: nothing in this post should be construed as a recommendation or a practical guide on conducting a 51% attack.

Malicious actors

It is no secret that cryptocurrency has attracted a large number of malicious parties, ranging from exchange-hackers to phishing scammers. Part of the reason for this is the underlying anonymity/psuedonymity of the system. If someone were to steal one million USD, they would likely require a network of offshore bank accounts to get away with it. However, if you were to steal 200BTC, all you’d have to do is convert it to a privacy coin like monero on any one of the exchanges that doesn’t require KYC, and law enforcement would have a very hard time catching up with you.

In this post we are dealing with a narrow subset of malicious actors: those who play by the rules. There are multiple ways a 51% attacker can ‘legally’ (in terms of the blockchain protocol) use their majority hash power to behave maliciously, one example being the denial of service to users in a network. But by far the most direct way of benefiting at the cost of others is to double spend, which is the when the same coin is sent to multiple parties in a network.

How a 51% attacker might double spend

Once an attacker has identified a suitable target (more on this later), this is a rough sketch of how they might profit. Again, this is purely educational and hypothetical – by no means is it a practical guide.

I will refer to the target coin as TCOIN. Many of these steps also require an anonymous “exit coin” – I will use Monero (XMR) as an example.

  1. Acquire some TCOIN anonymously, e.g via an offline swap of fiat for XMR then XMR for TCOIN.
  2. Set up an account with exchange A and exchange B. Clearly these exchanges should have minimal KYC.
  3. Send 1000 TCOIN from your TCOIN address to that of exchange A, then immediately cash it out to XMR.
  4. Acquire 51% of the TCOIN network’s hash power, then make a new TCOIN transfer to exchange B. This transaction should be put into a block that orphans the previous block, so although exchange A thinks they have received your TCOIN and you have cashed it out to XMR, in reality it is exchange B that has received the TCOIN.
  5. On exchange B, convert TCOIN to XMR and send it to your monero wallet.

In general terms, this describes how the double spend lets you manufacture 1000 TCOIN from thin air (at the expense of the first exchange). An optional additional step is to first short TCOIN, because we have seen that 51% attacks severely reduce public trust in the coin’s development team and tend to lead to a sudden price drop.

Thus far this has all been theoretical – we have presented a textbook situation of how a double spend might work in theory. In the next section, we will understand the worrying practical reality of the situation.

Vulnerability of different coins

One would think that there should be a perfect correlation between the market cap of a coin and its overall hash rate – that is, the demand for a coin should be correlated with the ‘strength’ of the network validation. However, empirically this is not the case. Because of the rise of ill-informed speculation, we see that some coins with a high market cap (a few hundred million USD) have a surprisingly low hash rate. This is the clear yet disturbing message of crypto51.app, a simple site which displays the cost of becoming a 51% attacker (over 1 hour) for different PoW cryptocurrencies.

I submit that there are four main features an attacker would look for in an ideal target coin, apart from the obvious factor of having a smaller hash rate (and thus a lower attack cost).

  • High market cap (as a proxy for interest/liquidity in the coin)
  • Low transaction time (i.e very few block confirmations)
  • Support on KYC-free exchanges
  • A common hashing algorithm, or at least one that is supported on a mining marketplace like NiceHash. The advantage of this for an attacker is that they can just rent hash power anonymously rather than having to acquire hardware.

Using python’s requests and beautifulsoup, I scraped the data from crypto51 in order to do my own analysis (available on a jupyter notebook) of which coins were most vulnerable:

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.crypto51.app/")
soup = BeautifulSoup(r.text, "lxml")
data = []
headers = []
table = soup.find('table')
column_names = table.find('thead').find_all('th')
headers = [i.text.strip() for i in column_names]
rows = table.find('tbody').find_all('tr')
for row in rows:
    data.append([i.text.strip() for i in row.find_all('td') if i])
df = pd.DataFrame(data, columns=headers)

I then processed the data and generated a plot of the attack cost versus the market cap, coloured by NiceHash-ability (all log-transformed). This is very easy to do using pandas plotting:

df.plot.scatter("Log Market Cap", "Log Attack Cost", 
                c="Log NiceHash", colormap="plasma")

In a graph like this, a target coin should be as close to the lower-right quadrant as possible (high market cap but low attack cost). The lighter the datapoint, the easier it is to attack via NiceHash, which may or may not be important to an attacker.

A linear trendline fits the data with $R^2 = 0.65$, and because a straight line on log-log axes corresponds to a power law relationship, we can calculate the coefficients as follows:

Actually, it is not the general relationship that matters but rather the specific outliers – altcoins below the trendline are those with especially low attack costs relative to their market cap.

The top portion is self explanatory. In the bottom-left quadrant are coins that are easy to attack, but also have such small market caps that they are likely not worth attacking (e.g very poor exchange support). I thus determine the danger zone to be coins with slightly higher market caps (in the order of USD 10 million to 100 million). However, I happened to notice that some of the recently attacked altcoins (listed in the introduction) formed a narrow band slightly above my danger zone. Perhaps this is a liquidity sweet-spot that makes it easier for attackers to exit onto another exchange, or perhaps it’s a coincidence.

In any case, if I were holding something like BTCP (Bitcoin Private), I’d be a little bit worried. It has a market cap of more than \$200 million, and reasonably liquid trading pairs, but costs less than \$500 an hour to attack. If the attacker were unwilling to set up hardware, they might prefer to target something like EMC2 (Einsteinium), which has a higher NiceHash-ability.

Now, in a perfectly efficient market, one would imagine that the possible financial benefit from conducting a 51% attack would outweigh the financial costs. However, recent attacks show that this is clearly not the case (especially because one can now short the target coin prior to the attack). So there is a clear problem in the crypto markets.

What can we do about it?

An obvious solution is to ditch PoW and go with Proof of Stake (PoS). Much easier said than done. Arguably the most mature PoS development effort is that of Ethereum, in the form of Casper. Yet despite the undeniable talent of the dev team, solving the nothing-at-stake problem and ironing out the wrinkles in the implementation is not proving to be a straightforward task. Delegated PoS may be a stepping stone, but the concessions with regard to decentralisation may be a bit off-putting. So assuming that PoW is still the de-facto consensus mechanism, what can the ecosystem do to reduce 51% attacks? Here are some ideas.

Speculators/investors should consider the hash rate of any PoW coins they are looking to invest in, and be aware that a low hash rate makes that blockchain vulnerable for a 51% attack – certainly not good for their investment.

New PoW altcoins should not use the same hashing algorithm as big coins, even if it’s easier to implement – a large miner can just point their hash power at the new blockchain and immediately become a majority. Dev teams for these projects should implement programmatic checks for 51% attacks: a quick response can be critical. Fiat reserves should be maintained, which can quickly be used to add hashing power to the network if a potential 51% attack is detected.

Exchanges should be wary of what tokens they list, and should increase the required block confirmations for transactions, making it more costly for attackers to rewrite the recent history of a blockchain. Yes, this would make it slightly more inconvenient for users, but 51% attacks can directly lead to material losses.


Proof of Work is a genius solution to the distributed consensus problem, and I often have to remind myself just how amazing the original bitcoin protocol is. But I believe that in PoW, 51% attacks should be possible – they are a direct consequence of democracy. For large decentralised networks like that of bitcoin or Ethereum this is not a problem, but owing to the speculative interest in cryptocurrencies, the demand for some altcoins has separated entirely from the security of their network, resulting in a kind of arbitrage which allows malicious actors to profit by conducting a double spending 51% attack.

This post’s analysis of coin vulnerability has not been perfect: the main flaw is our use of market cap as a proxy for the profitability of a 51% attack. In reality, the profitability is a function of block confirmation times and exchange liquidity. Market cap is arguably correlated with the two, but it is not as direct.

I don’t claim to have all the solutions to the problem of 51% attacks, but it is evident that no single party can solve the issue alone. Until the different players (miners, users, exchanges, investors) in the ecosystem contribute their part, coins with low hash rates will continue to be low hanging fruit for the aspiring 51% attacker; I’d wager that we’ll see a few more big attacks before the year ends.

Evolving cellular automata to solve problems, part 2

We will be picking up where the previous post left off. As a brief summary, we are attempting to replicate the results of Evolving Cellular Automata with Genetic Algorithms (Mitchell, Crutchfield and Das 1996), dealing with the density classification task for 1D binary cellular automata (CAs). To put it simply, we are trying to design a ruleset such that the final configuration of a cellular automaton after M iterations is either all 1s or all 0s depending on which class was more common in the initial state. The caveat is that each cell in the universe can only make its decision based on the three neighbours to the left and right.

We examined the failures of the majority ruleset, an ‘obvious’ solution in which each cell updates to the local majority among its neighbourhood, and thus see the need for something more sophisticated. At the same time, we saw that the space of possible solutions is in the order of $10^{38}$, meaning that any kind of brute force method will not be feasible. This invites the use of genetic algorithms (GAs), with which we could evolve a CA to perform this task without ever explicitly devising a ruleset.

Genetic algorithms in the context of CAs

A complete treatment of genetic algorithms is not in the scope of this post (the wikipedia page is a good start), but I will attempt to cover the salient points:

  • We start with a population of different possible rulesets. This is a very important point – the individuals in the population are not the cellular automata, they are what we are trying to optimise, which in this case is the ruleset.
  • Rulesets in the population have different fitnesses, corresponding to how well they solve the $\rho_c = 1/2$ task. We will define the fitness F to be the proportion of random ICs on which the ruleset succeeds (becomes all 1 or all 0 correctly). So if we test a ruleset on 100 random ICs and 35 are classified correctly, $F = 0.35$
  • In each generation, we compute the fitness for each ruleset in the population. The fittest rulesets (the elites) are cloned to the next generation, but we also crossover (i.e breed) these rulesets to form offspring, which are then mutated before joining the next generation.
  • Over many generations, we should see the fitness increase, as the traits of the favourable rulesets are passed onto the next generation.

The GA acts as a more efficient way to traverse the huge space of possible rulesets in search of one that can suitably solve the density classification task – it does not guarantee that the global optimum will be reached.


For the most part, the functions used in this post are the same as in part 1, though took care to make obvious optimisations because the evolution of the CAs will involve repeated calls of the same functions.

The initial population consists of random rulesets. To generate a single random ruleset, we zip the the rule inputs (7-bit binary strings) with random 1s or 0s which are obtained from the uniform_random_binary function we defined. For convenience, we can uniquely refer to a given ruleset with a hexadecimal ID, which is made by converting the binary string of all rule outputs (assuming ordered inputs) into hexadecimal.

def random_ruleset(n_rules):
    rule_values = uniform_random_binary(n_rules)
    hex_id = hex(int(rule_values, 2))
    return hex_id, dict(zip(rule_keys(n_rules), rule_values))

To generate the initial population of 100 (or more generally, n_ic) ICs, it is not as simple as running random_ruleset 100 times because we want exactly half of the ICs to have a majority of 1s.

def generate_random_ICs(universe_size, n_ic=100):
    ics = []
    cutoff = int(0.5 * universe_size) + 1
    # Exactly half of the initial conditions will have p < 0.5
    for _ in range(n_ic // 2):
        minority = uniform_random_binary(universe_size, 0, cutoff)
        majority = uniform_random_binary(universe_size, cutoff, 
                                         universe_size + 1)
    return ics

Prior to implementing crossover and mutation, we will define a helper function that converts a hex ID into a ruleset, by converting the hexadecimal to binary and zipping this with a list of 7-bit rule inputs:

def hex_to_ruleset(hex_id, n_rules=128):
    rule_values = bin(int(hex_id, 16))[2:].zfill(n_rules)
    return dict(zip(rule_keys(n_rules), rule_values))

As discussed in the paper, crossover involves pointwise swaps between two parent rulesets, and mutation involves randomly switching bits.

def crossover_and_mutate(hex1, hex2, n_rules=128, 
                         n_crossover_points=1, n_mutations=2):
    bin1 = list(bin(int(hex1, 16))[2:].zfill(n_rules))
    bin2 = list(bin(int(hex2, 16))[2:].zfill(n_rules))
    # Crossover: get indices of unequal, then
    # randomly select an index to swap
    unequal_indices = [i for i, digit in enumerate(bin1) if 
                       bin2[i] != digit]
    if len(unequal_indices) > n_crossover_points:
        crossover_points = random.sample(unequal_indices, 
        for c in crossover_points:
            bin1[c] = bin2[c]
    # Mutation
    for _ in range(n_mutations):
        idx = random.randrange(n_rules)
        # Flip bit
        bin1[idx] = str(int(not bool(bin1[idx])))
    return hex(int("".join(bin1),2))


As with most optimisation methods, genetic algorithms have a number of metaparameters that can drastically change the resulting behaviour, for example the number of generations, the mutation rate, the size of the initial population of chromosomes, etc.

GEN = 50
N_CHROMOSOMES = 100  # number of rulesets in each generation

In addition, we shouldn’t forget the CA parameters:

N = 59  # universe size
MAX_ITER = 100  # max number of iterations to perform the task
N_IC = 100  # number of ICs each ruleset is tested on
N_RULES = 128  # number of rules in each ruleset

These are mostly similar to the choices of Mitchell et al., though I have opted for fewer generations and a much smaller universe (they use 100 and 149 respectively), which should be faster to compute and serve as a proof-of-concept.

It is important to note that genetic algorithms can be quite sensitive to initial conditions, which is why we will parameterise each run by a random seed.


We are now ready to implement the actual genetic algorithm. The first thing to do is to set the random seed and generate the initial population. We will use a dictionary, with the keys being hex IDs and values being the rulesets.

population = {}
for _ in range(N_CHROMOSOMES):
    id, ruleset = random_ruleset(N_RULES)
    population[id] = ruleset

In each generation, we will first compute the fitness of each ruleset on 100 random ICs:

generation_ics = generate_random_ICs(N, N_IC)
fitness_log = []

for hex_id, chromosome in population.items():
    # Evaluate performance on random ICs.
    fitness = 0
    for config in generation_ics:
        # Count the majority
        majority = config.count("1") > len(config) // 2
        # Iterate automaton
        hist = iterate_automata(config, chromosome, MAX_ITER, 
        # If the CA has classified correctly, increment fitness
        if hist[-1] == str(int(majority)) * len(config):
            fitness += 1
    print("hex_id", hex_id, "  Fitness", fitness)
    fitness_log.append((hex_id, fitness))

Next, we determine who are the elites (the top 20 rulesets):

# Find top 20 in population
fitness_log.sort(key=lambda x: x[1], reverse=True)
elites = [i[0] for i in fitness_log[:20]]
print("Generation best:", fitness_log[0])
print("Least fit elite:", fitness_log[20])

As suggested by the paper, we will copy these elites directly to the next generation, then form the rest of the population by breeding these elites with replacement.

# Copy the top 20 to the next generation
new_population = {}
for i in elites:
    new_population[i] = hex_to_ruleset(i)

# Crossover and mutate random pairs of elites
for _ in range(N_CHROMOSOMES - 20):
    parent1, parent2 = random.sample(elites, 2) 
    child_hex = crossover_and_mutate(parent1, parent2, 
    new_population[child_hex] = hex_to_ruleset(child_hex)

population = new_population

We then repeat the above for many generations, and hopefully should see the fitnesses increase.

Initial observations

Judging by the success of the previous post, I was expecting the script to work immediately and yield interesting results. Alas, programming is rarely so simple. The genetic algorithm was slow (I’m used to the 5-10 seconds it takes to train most sklearn models), but what is worse, it was not able to solve the task. After 50 generations, the GA has not managed to beat random guessing:

Gen 49 
hex_id 0x455d652de3a5534054ed51a12093d25   Fitness 50
hex_id 0x455d652de3a5534054ec51a12093d27   Fitness 50
hex_id 0x455d652de3a5534054ed51212093d25   Fitness 50
hex_id 0x455d652de3a55b40146d51a12093d27   Fitness 50
Generation best: ('0x455d652de3a5534054ed51a12093d25', 50)
Least fit elite: ('0x455d652de3a55b40146d51a12093d277', 50)

A closer analysis revealed that many of the final rulesets corresponded to the naïve strategy of mapping any possible input to a 1, which results in the ruleset being correct about 50% of the time.

It should be noted that Mitchell et al. trained their genetic algorithm across 300 different runs, so it could be the case that this was one of the unlucky runs in which the GA fails to learn. But testing multiple random seeds yielded no further improvements.

A possible fix

I hypothesised that the initial population contained too many rulesets that had a clear majority of 1s or 0s, disproportionately mapping their input to one of the two regardless of the IC. Remember that this was by design, we chose to sample rulesets and ICs from a distribution uniform over the density of 1s rather than the unbiased distribution which is binomial over the density of 1s. In light of the poor empirical results, I modified the random_ruleset function to use the unbiased distribution instead:

def random_ruleset2(n_rules):
    random_binary = bin(random.randint(0, 2**n_rules - 1))
    rule_values = str(random_binary)[2:].zfill(n_rules)
    hex_id = hex(int(rule_values, 2))
    return hex_id, dict(zip(rule_keys(n_rules), rule_values))

A brief interjection: because it seemed that I would have to test multiple parameters, I refactored the GA script into a function:

evolve_automata(59, n_crossover_points=1, n_mutations=6,       
                random_seed=0, n_ic=100, n_gen=50, 

Using this new ruleset generator, we are much less likely to have naïve outputs of all 1s or all 0s. My guess was that although this would result in the GA starting with poorer fitnesses, they would not get ‘stuck’ with naïve random-guessing strategies. However, testing revealed that fitness still converged to 50%, just at a much slower rate.

To combat this, I reasoned that the simplest way to escape local optima is to pump up the mutation rate in the hope that elites will be found that beat random guessing, and these will pass on their traits to the next generation.

Finally, we achieved some progress. I noticed that the combination of a high mutation rate and an unbiased random ruleset meant that the GA didn’t always get stuck with a 50% fitness. For N_MUTATIONS = 6, tested over 20 different seeds, mean elite fitnesses in the last 10 generations were:

[83.5, 52.2, 92.6, 51.2, 51.1, 50.0, 51.5, 51.5, 51.7, 79.2, 
 83.1, 51.7, 73.5, 57.2, 51.6, 51.2, 52.1, 50.0, 78.7, 82.9]

Eight of the 20 runs resulted in fitnesses beyond the 50% level, however only one of the 20 beat 90%. This falls very far short of Mitchell et al’s claim to have at least 90% fitness on all 300 of the runs, but it is certainly progress.

Interestingly, increasing N_MUTATIONS to 12 produced worse results, with the maximum mean elite fitness being 80.2 and only six of the 20 runs resulting significantly beating 50%. N_MUTATIONS=18 does even worse, with only one of the 20 beating 50%. I think this is because having many mutations reduces the impact of crossover (which is only at a single point), so we don’t get the benefits of ‘meiosis’.

Analysis of evolved CAs

We will examine one of the rulesets which achieved a fitness of above 90%. For some different initial configurations, this is what we observe:

Remember that purple is alive and white is dead; I have displayed the actual majority class above each subplot. Thus the fifth subplot represents an error, because despite the majority of 1s, the final state is all 0s.

As in the paper, the ruleset works by quickly reaching a steady state of all 0s unless large blocks of 1s are found – it is not really the sophisticated emergent behaviour that I was hoping to see. The only exception is the first subplot, in which some semblance of the particle strategy is achieved.


Though we did achieve some interesting results, I don’t think that this exploration overall was a success. I could not replicate the results of Mitchell et al., even using a simple universe of 59 cells rather than 149. The default setup used in the paper kept getting stuck at 50% fitness, and it was only after I deviated from their suggestions that I managed to do better.

I have become somewhat disillusioned regarding the scientific value of the paper (though I still find it fascinating). The sophisticated particle strategy discovered by Mitchell et al. is remarkable, yet they seem to downplay the fact that it was observed only in a few of the 300 runs of the genetic algorithm. This is hardly reproducible, and really feels to me like grabbing at patterns in the random darkness.

Additionally, the evolution of CAs with genetic algorithms is a very slow process (I had to leave my code running over a few nights), and the results are very inconsistent. I can only imagine how long it would have taken for Mitchell et al. to conduct 300 runs with a universe size of 149 (before the turn of the millennium!). This severely limits the practical applications, and the prospect that evolved CAs could tackle more complex tasks like image processing (as suggested in the paper’s conclusion) seems chimerical.

Nevertheless, I am happy to have attempted to replicate the results, because it has been a while since I’ve played around with genetic algorithms. It was also some extra practice with python’s standard language features, of which I make extensive use. Perhaps some time in future I may realise that my implementation had a flaw and author a part 3, but for now I’m content to stay in this local optimum.

The code for this post, along with plots and data, is available as a jupyter notebook on GitHub.