Advances in Financial Machine Learning – Marcos Lopez de Prado

These notes are not comprehensive: they aim to be an executive summary of the concepts presented in the book, which is very detailed and extensive. Please refer to the textbook itself for implementation details and code.

1. Financial ML as a distinct subject

Chapter 1 is available for free online at SSRN. I highly recommend reading it in its entirety, as the notes below do not do it justice.

2. Financial Data Structures

2.1 Bars

2.1.1 Information-driven bars

\[E_0(\theta_T) = E_0(T)E_0(b_t) = E_0(T)[2P(b_t=1)-1]\]
\[|\theta_T| \geq |E_0(\theta_T)|\]
\[\theta_T = \sum_{t=1}^T b_t v_t\]
\[\theta_T = \max \left( \sum_{t | b_t = 1}^T b_t, - \sum_{t | b_t = -1}^T b_t \right)\]
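
As a rough illustration of the tick imbalance case, the sketch below accumulates the running sum of tick signs and closes a bar once its magnitude reaches the expected imbalance. The function name and the fixed `expected_T` and `p_buy` inputs are my own simplifying assumptions; the book re-estimates both quantities as new bars are formed.

```python
def tick_imbalance_bars(signs, expected_T, p_buy):
    """Return the tick indices at which a tick imbalance bar closes.

    signs      : sequence of tick signs b_t, each +1 or -1
    expected_T : assumed E(T), the expected number of ticks per bar
    p_buy      : assumed P(b_t = 1)
    """
    # |E(theta_T)| = |E(T) * (2 * P(b_t = 1) - 1)|, held fixed in this sketch
    threshold = abs(expected_T * (2 * p_buy - 1))
    bar_ends = []
    theta = 0
    for i, b in enumerate(signs):
        theta += b
        if abs(theta) >= threshold:
            bar_ends.append(i)  # close the bar at this tick
            theta = 0           # start accumulating the next bar
    return bar_ends


signs = [1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1]
bar_ends = tick_imbalance_bars(signs, expected_T=10, p_buy=0.75)
```

Note that with `p_buy = 0.5` the threshold is zero and every tick closes a bar, which is one reason a fixed estimate is only suitable for illustration.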

2.2 Multi-product series

2.3 Sampling features

\[S_t = \max(0, S_{t-1} + y_t - E_{t-1}(y_t))\]
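
A minimal sketch of the upside filter above, assuming the expectation of y_t is simply the previous observation y_{t-1}, and that an event is sampled (and S reset to zero) whenever S_t reaches a threshold `h`. The threshold, the reset, and the function name are assumptions made for illustration; a symmetric filter would track a downside sum in the same way.

```python
def cusum_events(y, h):
    """Return indices t at which the upside CUSUM S_t reaches h."""
    events = []
    s = 0.0
    for t in range(1, len(y)):
        expected = y[t - 1]  # assumption: E_{t-1}(y_t) = y_{t-1}
        s = max(0.0, s + y[t] - expected)
        if s >= h:
            events.append(t)  # sample this observation
            s = 0.0           # reset the filter after an event
    return events


events = cusum_events([100, 101, 102, 101, 103, 106], h=2)
```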

3. Labelling features

\[y =\left\{\begin{aligned} &-1,& r < - \tau\\ &0,& |r| \leq \tau \\ &1,& r > \tau \end{aligned}\right.\]
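
The rule above translates almost directly into code. This sketch assumes `r` is the return over some fixed horizon and `tau` is a constant threshold; both names are illustrative.

```python
def fixed_horizon_label(r, tau):
    """Label a return r as -1, 0 or 1 using a symmetric threshold tau."""
    if r < -tau:
        return -1
    if r > tau:
        return 1
    return 0  # |r| <= tau


labels = [fixed_horizon_label(r, tau=0.01) for r in (-0.02, 0.005, 0.03)]
```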

4. Sample weights

5. Fractionally differentiated features

6. Ensemble methods

7. Cross-validation

8. Feature importance

Marcos’ First Law: Backtesting is not a research tool. Feature importance is.

9. Hyperparameter tuning

\[\ln x \sim U(\ln a, \ln b)\]
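
To draw a hyperparameter from this log-uniform distribution, one can sample uniformly in log space and exponentiate. The sketch below uses only the standard library; the function name and the example bounds are assumptions for illustration.

```python
import math
import random


def sample_log_uniform(a, b, rng=random):
    """Draw x such that ln(x) ~ U(ln(a), ln(b)), for 0 < a < b."""
    return math.exp(rng.uniform(math.log(a), math.log(b)))


rng = random.Random(0)  # seeded only to make the example repeatable
draws = [sample_log_uniform(1e-3, 1e3, rng) for _ in range(10000)]
```

With bounds of 1e-3 and 1e3, roughly half the draws fall below 1, whereas a plain uniform draw over the same range would almost never produce a value that small.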

10. Bet sizing

This is a hard chapter to summarise because it is terse and highly mathematical. It is best to consult the original text to learn more.

11. The dangers of backtesting

A backtest is not an experiment. It is a sanity check for behaviour under realistic conditions. There are many errors one can make, but these are the 7 deadly sins:

  1. Survivorship bias
  2. Lookahead
  3. Storytelling (explaining a historical event post-hoc)
  4. Data snooping
  5. Ignoring transaction costs (e.g. slippage)
  6. Succeeding based on outliers, which may not ever happen again
  7. Shorting without considering the costs and consequences.

Marcos’ Second Law: Backtesting while researching is like drink driving. Do not research under the influence of a backtest.

12. Backtesting through CV

13. Backtesting on synthetic data

14. Backtest statistics

General characteristics

Performance metrics

Runs and drawdowns

Implementation shortfall


Break down returns by sector, risk class, timeframe to gain better insight as to where positive and negative performance is coming from.


Marcos’ Third Law: Every backtest must be reported with all trials involved in its production.

15. Understanding strategy risk

This is an important chapter which teaches us how to think about the overall strategy risk, and the sensitivity of the outcome to our initial assumptions.

16. Machine learning asset allocation

This chapter focuses on hierarchical risk parity (HRP) portfolio optimisation. I have found in practice that HRP portfolios don’t seem to be the panacea that Lopez de Prado implies they may be, but nevertheless it is a good technique to have in the toolbox. Additionally, this chapter feels very out of place, so I have left it out of these notes.

17. Structural breaks

18. Entropy Features

\[MI(X, Y) = E_{f(x,y)}[\log \frac{f(x,y)}{f(x)f(y)}] = H(X) + H(Y) - H(X, Y)\] \[MI(X, Y ) = -\frac 1 2 \log(1-\rho^2)\]
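
Both expressions can be sketched in a few lines. The first function below is a naive plug-in estimate of H(X) + H(Y) - H(X, Y) for already-discretised series; the second evaluates the closed form in terms of the correlation, which holds when X and Y are jointly normal. The function names are mine, and the plug-in estimator is an assumption chosen for simplicity rather than the estimator used in the book.

```python
import math
from collections import Counter


def entropy(symbols):
    """Plug-in Shannon entropy (in nats) of a sequence of discrete symbols."""
    n = len(symbols)
    return -sum((c / n) * math.log(c / n) for c in Counter(symbols).values())


def mutual_information(xs, ys):
    """MI(X, Y) = H(X) + H(Y) - H(X, Y) for paired discrete observations."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def gaussian_mi(rho):
    """Closed-form MI of a bivariate normal with correlation rho, |rho| < 1."""
    return -0.5 * math.log(1 - rho ** 2)


mi_identical = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])    # equals H(X)
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])  # close to zero
```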

19. Microstructural features

19.1 Price sequences

\[b_t =\left\{\begin{aligned} &1,& \Delta p_t > 0 \\ &-1,& \Delta p_t < 0 \\ &b_{t-1},& \Delta p_t = 0 \end{aligned}\right.\]
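
The tick rule assigns +1 to an uptick, -1 to a downtick, and carries the previous sign forward when the price is unchanged. A minimal sketch follows; the `initial_sign` seed used before the first price change is an arbitrary assumption of mine.

```python
def tick_rule(prices, initial_sign=1):
    """Classify each price change as buyer-initiated (+1) or seller-initiated (-1)."""
    signs = []
    prev = initial_sign  # assumption: arbitrary seed until the first price change
    for i in range(1, len(prices)):
        dp = prices[i] - prices[i - 1]
        if dp > 0:
            prev = 1
        elif dp < 0:
            prev = -1
        # dp == 0: keep the previous sign
        signs.append(prev)
    return signs


signs = tick_rule([10, 11, 11, 10, 10, 12])
```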

19.2 Strategic Models

\[|\Delta \log (p_\tau)| = \lambda \sum_{t \in B_\tau}(p_t V_t) + \epsilon_\tau\]
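
One simple way to estimate lambda in the relation above is least squares through the origin, regressing each bar's absolute log-price change on that bar's dollar volume. This is a sketch under that assumption (no intercept, a single pooled lambda), not necessarily the estimation procedure used in the book; the function and argument names are illustrative.

```python
def estimate_lambda(abs_log_returns, dollar_volumes):
    """No-intercept least-squares slope of |delta log p| on dollar volume per bar."""
    sum_xy = sum(x * y for x, y in zip(dollar_volumes, abs_log_returns))
    sum_xx = sum(x * x for x in dollar_volumes)
    return sum_xy / sum_xx


# Toy data in which the absolute log return is exactly 0.002 per unit of dollar volume
lam = estimate_lambda([0.002, 0.004, 0.006], [1.0, 2.0, 3.0])
```

A larger lambda means prices move more per unit of dollar volume traded, i.e. the market is less liquid.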

19.3 Sequential models

\[VPIN = \frac{\sum_{\tau = 1}^n |v_\tau^B - v_\tau^S|}{nV}\]
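
Given buy and sell volumes that have already been classified for each of n equal-volume buckets, the ratio above is straightforward to compute. The sketch assumes every bucket holds the same total volume V (passed in as `bucket_volume`) and leaves the buy/sell classification itself out of scope; all names are illustrative.

```python
def vpin(buy_volumes, sell_volumes, bucket_volume):
    """VPIN = sum(|v_B - v_S|) / (n * V) over n equal-volume buckets."""
    n = len(buy_volumes)
    imbalance = sum(abs(b - s) for b, s in zip(buy_volumes, sell_volumes))
    return imbalance / (n * bucket_volume)


# Three buckets of 100 units each, with order imbalances of 20, 0 and 60
value = vpin([60, 50, 80], [40, 50, 20], bucket_volume=100)
```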

19.4 Other microstructural features

20-23 High performance computing

Though I read the chapters with interest, I didn’t feel the need to take notes as the majority of the content here is not very applicable to small players.