Assessing Human Error Against a Benchmark of Perfection

Anderson, A., & Kleinberg, J. (2016). Assessing Human Error Against a Benchmark of Perfection. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 705–714). San Francisco, California, USA: ACM.

Can we analyse a large dataset of human decisions to discover the instances in which people are most likely to make errors? Instead of using ML to make optimal decisions, we can use it to predict when humans are more prone to error.

Chess is an ideal system to analyse:

In order to use chess as a model system, there are three obvious possible approaches (each with fatal flaws):

  1. Construct problems with a well-defined ground truth (e.g tactics puzzles). However, it is difficult to amass a sufficient dataset.
  2. Use chess databases and chess engines to evaluate where people’s decisions have differed from those of the engines. However, it is still difficult to find a mapping between the engine’s move choice and a determination of human error.
  3. Calculating the minimax value of every position: any move that decreases the minimax value more than a certain amount can be called a blunder. However, this is computationally intractable.

A solution is to exploit the fact that chess has been solved for all positions with at most k pieces on the board (currently $k=7$, from Moscow State University), in the form of tablebases. These are construted by working backwards from terminal positions.

We can start with the large database of recorded games, then restrict them to the positions with less than or equal to $k$ pieces, and compare the move played with the best move according to the tablebase. This gives the added advantage of allowing blunder analysis on fixed positions – seeing how different people behave on the same position.

The features investigated are as follows:

Position difficulty

Player skill




I thought this was a very detailed and honestly fascinating investigation into errors of human decision-making. Of course, it may be difficult to generalise the results beyond chess, but it really is an impressive start.

This doesn’t really feel like your typical comp-sci paper, it’s more of a Kahneman/Tversky -esque piece on decision theory. However, because it is classified as a comp-sci paper on arXiv, I would have liked to see a bit more about the exact implementation, be it code or otherwise.