Author: Alex Labram

Next in the series of articles on Data Science. Find all articles and further information on the Data Science MIG page.

In our previous article “Statistics vs ML”, we introduced you to the model fitting framework used by machine learning practitioners.  We discussed the train / validate / test split, selection of an appropriate accuracy metric, tuning of hyperparameters against the validation dataset, and scoring of the final best-of-breed model against the test dataset. 

This general framework can – in principle, and often with modifications – be used with any ML algorithm: penalised regression, random forests, support vector machines, Bayesian networks, neural networks, etc.  Notably, it requires minimal awareness of the inner workings of a given learning algorithm.

At this point, many data scientists would stop worrying about the details of their modelling approach.  Why would they need to understand what goes on inside the black box of the algorithm, when it demonstrably works?  However, as actuaries, we typically work in contexts (e.g. regulatory) and with other practitioners (e.g. classical statisticians) that require us to know, and to explain, what’s going on under the hood.

In this article, I will talk you through the theory and application of a particularly popular statistical learning algorithm called XGBoost.

History

XGBoost was first released in 2014 by then-PhD student Tianqi Chen.  Chen, a previous winner of the KDDCup data science competition, had found that his preferred approach (“boosted decision trees” – discussed below) was poorly supported by existing software.

His alternative, released as a stand-alone command-line program, gained prominence later that year when it jumped to the top of a Kaggle contest leaderboard.  The Higgs Boson Machine Learning contest asked participants to explore the properties of this particle after its discovery in 2012, particularly focusing on the identification of Higgs decay events in simulated data.  XGBoost quickly produced the model with the highest predictive power.

After this initial success, the algorithm quickly became immensely popular in machine learning circles: by 2015 it was used in over half of winning Kaggle contest entries.  Since then it has demonstrated state-of-the-art performance on tasks ranging from store sales prediction to motion detection to malware classification.

Theory

Decision trees

XGBoost is part of a family of machine learning algorithms based around the concept of a “decision tree”.  A decision tree is a simple rule-based system, built around a hierarchy of branching true/false statements.  For example: “are they over 70 years old?  If so, are they female?  If so, the probability of death is 2.19%”.

Considerable effort has been put into finding ways to let computers build the best decision tree automatically.  The current best-of-breed approach is CART (“Classification and Regression Trees”), which takes a recursive approach: it finds the best split for the entire dataset, then finds the best split for each half, then for each quarter, and so on until further splits cease to add value.

“Best” here is typically defined in terms of Gini impurity, a close relative of the Gini coefficient you may remember from economics.  In this context it acts as a measure of how well-stratified the resulting sub-populations are, penalising splits that leave wildly different values of the target variable mixed together within a sub-population.
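
To make this concrete, here is a minimal sketch of fitting a single CART-style tree in R, using the widely-used “rpart” package and a small simulated mortality-style dataset (rather than the credit card data discussed later):

library(rpart)

# Simulated data: age and sex drive a binary death indicator
set.seed(42)
n    <- 5000
age  <- sample(20:90, n, replace = TRUE)
sex  <- factor(sample(c("F", "M"), n, replace = TRUE))
died <- rbinom(n, 1, plogis(-9 + 0.08 * age + 0.4 * (sex == "M")))
df   <- data.frame(age, sex, died)

# Fit a single classification tree; rpart splits on Gini impurity by default
tree <- rpart(died ~ age + sex, data = df, method = "class",
              control = rpart.control(maxdepth = 3, cp = 0.001))

print(tree)   # text view of the fitted splits and leaf probabilities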

Ensemble methods

Single decision trees are very easily explainable to non-technical stakeholders; however, they tend to dramatically over-fit or under-fit the data with no easy way of telling which has occurred!  Thus, more recent work in this area has focused on “ensemble” learning methods, which generate multiple decision trees and in some way take a consensus view.

Ensemble methods are widely used across the field of statistical learning: conventional wisdom is that an ensemble of several different models will almost always perform better than the models in isolation.  Common ensemble learning methods include:

  • Bootstrap aggregation (usually abbreviated to “bagging”).  This method involves repeatedly sampling, with replacement, from a population.  A model is trained on each sample, and an average of the models’ predictions is taken as the ensemble prediction.  A common learning algorithm combining bagging and decision trees is the Random Forests classifier.
  • Boosting.  This method involves training a model on the data, determining the model’s predictive inaccuracy for each datum, then training another model to better handle that inaccuracy, and so on.  A common algorithm combining boosting and decision trees is the AdaBoost (“Adaptive Boosting”) algorithm.
  • Stacking.  This method involves training multiple diverse models on the data, then training a final “meta-model” to determine which model (or set of models) is likely to give the best fit in a given situation.

XGBoost, per its name, is a boosted decision tree algorithm.  It is an extension of an approach called Gradient Boosting, which itself is an extension of the AdaBoost algorithm mentioned above.

The differences are technical.  AdaBoost’s approach was to fit a decision tree, look at its prediction errors, then use these as weights on the data when fitting the next tree.  This ensured that model points that had been handled poorly by previous trees would be scrutinised more closely in future.

Gradient Boosting reformulated the problem slightly: rather than build each tree against a re-weighted version of the same data, it would calculate the prediction error for each data point and then treat those residual errors as the target variable for the next tree.  The “gradient” in Gradient Boosting refers to the fact that these residuals are (for squared-error loss) the negative gradient of the loss function, so each new tree nudges the ensemble’s predictions in the direction of steepest descent.  In addition, after each tree has been calibrated, its contribution is scaled down a little – the “learning rate” – which leaves more elbow room for the next tree.  This change was motivated by the observation that AdaBoost only worked so well because it approximated this approach!
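
To illustrate the mechanics, the sketch below hand-rolls gradient boosting with squared-error loss, using “rpart” trees on simulated data.  This is purely illustrative – in practice you would use a library implementation such as XGBoost itself:

library(rpart)

# Simulated regression data
set.seed(1)
n <- 2000
d <- data.frame(x1 = runif(n), x2 = runif(n))
y <- sin(6 * d$x1) + 2 * d$x2 + rnorm(n, sd = 0.3)

n_rounds <- 100
eta      <- 0.1                  # learning rate: scale down each tree's contribution
pred     <- rep(mean(y), n)      # start from a constant prediction
trees    <- vector("list", n_rounds)

for (i in seq_len(n_rounds)) {
  d$resid <- y - pred            # residuals = negative gradient of squared-error loss
  trees[[i]] <- rpart(resid ~ x1 + x2, data = d,
                      control = rpart.control(maxdepth = 3, cp = 0))
  pred <- pred + eta * predict(trees[[i]], d)   # small step towards the residuals
}

mean((y - pred)^2)               # training error falls as rounds are added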

The Gradient Boosting algorithm had its own issues: it included a number of ad-hoc parameters governing the growth of decision trees.  XGBoost replaced these with a more principled, regularised training objective, which lent itself to better generalisation of results[1].

Optimisations

XGBoost includes a range of other changes intended to speed up calculation or improve goodness-of-fit.  For example, a common problem with boosting algorithms (unlike bagging algorithms) is that they cannot be parallelised: you can’t carve the workload in half and hand the halves off to different CPUs to run at the same time.  This is because each new tree builds on the results of the previous trees, attempting to eliminate their residual error.  If tree #27 is being trained at the same time as tree #26, how is it to know the final value of tree #26’s residuals?

XGBoost sidesteps this problem.  Rather than attempting to parallelise the training of whole trees, it parallelises the training of different nodes within each tree.  One processor can consider the possibility that the ideal splitting point for a “policyholder age” field is between 20 and 40, whilst another processor considers the possibility that it is between 40 and 100.
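
In the R package this within-tree parallelism happens automatically, using all available cores by default; the nthread parameter can be used to restrict it.  A minimal illustration (on a built-in toy dataset, purely to show the parameter):

library(xgboost)

# 'am' (automatic/manual transmission) is a 0/1 label; all other columns are numeric
dtrain <- xgb.DMatrix(as.matrix(mtcars[, -9]), label = mtcars$am)
bst <- xgb.train(params = list(objective = "binary:logistic",
                               nthread = 2),   # CPU cores used to search for splits
                 data = dtrain, nrounds = 10)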

Other key optimisations include:

  • Faster search for each tree node’s ideal splitting point (using the “weighted quantile sketch” approach)
  • Better identification of when to stop growing a tree (via the “pruning” method)
  • Effective handling of sparse data such as “one-hot” variables

Strengths and weaknesses

The XGBoost algorithm has many strong advantages:

  • It is very fast at generating predictions, and relatively fast to train compared to most learning algorithms.
  • In particular, it can – to an extent – be parallelised, permitting the training of larger models across multiple processors.
  • It can handle sparse data, which many algorithms struggle with.
  • It gives demonstrably good results in a broad range of circumstances.
  • An implementation is available as open source, and is well-supported by both R and Python.

However, it does have a few issues:

  • It’s a black box: determining the logic behind any given prediction is relatively hard (although methods to crack open the box do exist).
  • It performs a very shallow analysis, making no attempt to understand the nature of the underlying system.
  • In particular, predictions are discontinuous – they jump from one value to another at splitting points, even when both the explanatory and target variables are continuous.
  • It is slightly slower than more recent competitors such as Microsoft’s LightGBM and Yandex’s CatBoost.

Overall, XGBoost deserves a place in any data scientist’s shortlist of ML algorithms, and wouldn’t look out of place in first position on that list.

Hyperparameters

XGBoost’s structural parameters – those that set the context in which individual trees are fitted – are as follows:

  • Number of rounds.  The number of decision trees to layer on top of each other, with each boosting the last’s performance.  Default: no default – values between 2 and 200 are reasonable.
  • Early stopping rounds.  The number of rounds after which to stop boosting if no performance improvement is seen.  Useful with a high number of rounds.  Default: null (feature not enabled).
  • Learning rate (eta).  A scaling multiplier applied to each tree, to reduce overfitting and give future trees more room to grow.  Default: 0.3.

Parameters specific to the fitting of individual decision trees are as follows:

  • Maximum depth or maximum number of leaf nodes.  Controls the size of each tree.  Default: 6 layers, or 64 (= 2⁶) leaf nodes.
  • Minimum child weight.  The smallest population a leaf node is permitted to have.  Default: 1 (feature not enabled).
  • Minimum loss reduction (gamma).  The smallest impact that a split is permitted to have on the tree’s goodness-of-fit.  Default: 0 (feature not enabled).
  • Subsample.  Randomly-selected proportion of the dataset (rows / observations) to use in the construction of each tree, to reduce over-fitting.  Default: 1 (feature not enabled).
  • Column sample by tree. Randomly-selected proportion of the dataset (columns / features) to use in the construction of each tree, to reduce over-fitting.  Default: 1 (feature not enabled).

Finally, the two regularisation parameters (representing “L2” and “L1” regularisation respectively – details in a future article) are as follows:

  • Lambda.  Penalises trees with larger squared leaf values, to reduce over-fitting.  Default: 1.
  • Alpha.  Penalises trees with larger absolute leaf values, to reduce over-fitting.  Default: 0 (feature not enabled).
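
Translated into the R package’s parameter names, the tree-specific and regularisation settings above might be collected as follows (a sketch only, showing the default values discussed above rather than a tuned configuration):

library(xgboost)

params <- list(
  objective        = "binary:logistic",  # classification task (see the example below)
  eta              = 0.3,    # learning rate
  max_depth        = 6,      # maximum tree depth
  min_child_weight = 1,      # smallest permitted leaf population (by hessian weight)
  gamma            = 0,      # minimum loss reduction required to keep a split
  subsample        = 1,      # proportion of rows sampled per tree
  colsample_bytree = 1,      # proportion of columns sampled per tree
  lambda           = 1,      # L2 penalty on leaf values
  alpha            = 0       # L1 penalty on leaf values
)

# The structural parameters are supplied to the training call itself, e.g.:
# bst <- xgb.train(params, dtrain, nrounds = 200,
#                  watchlist = list(val = dval), early_stopping_rounds = 20)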

A thorough hyper-parameter tuning process will first explore the structural parameters, finding the most effective number of rounds at an initial high learning rate, then seek the best tree-specific and regularisation parameters, and finally re-train the model with a lower learning rate and higher number of rounds.

Example

In another previous article (“The Data Science Process”), we briefly introduced you to the data science process in its entirety: identifying and cleansing data, performing exploratory analysis, feature engineering, model fitting, communication of findings to stakeholders, deployment of models into production and, finally, monitoring of performance over time.  Future articles will also discuss issues like auditability, regulation and ethical concerns.

For the purposes of this article, we’ll assume that none of that is an issue!  We’re just going to focus in on applying XGBoost in R.  We’ll use a simple train / validate / test data split, with the validation dataset only being split out for the purpose of fitting hyperparameters.  (In practice, a cross-validation approach would be more appropriate.)
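
A minimal sketch of such a split, assuming the credit card data has been read into a data frame called “credit” with a 0/1 “default” column (the exact names will depend on how the file is read in):

set.seed(2021)
idx <- sample(c("train", "validate", "test"), nrow(credit), replace = TRUE,
              prob = c(0.6, 0.2, 0.2))    # 60 / 20 / 20 split

train    <- credit[idx == "train", ]
validate <- credit[idx == "validate", ]   # used only for hyperparameter tuning
test     <- credit[idx == "test", ]       # touched once, to score the final model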

As our use case, we will consider default event prediction for a credit card dataset[2].  This is similar to many actuarial use cases, and offers a slightly richer set of explanatory variables than most freely-available actuarial datasets, whilst still being conceptually very simple – no need to worry about frequency/severity splits!  Note, however, that small datasets like this won’t necessarily showcase XGBoost at its best.

Since this is a binary classification problem, we will configure XGBoost for classification rather than regression, and will use the “area under ROC curve” (AUC) measure of model effectiveness.  The AUC, a very popular measure for classification, is – in brief – the proportion of the time that our model correctly assigns higher default probabilities to defaulting debtors than to non-defaulting debtors.  AUC has come in for some criticism over the years[3], but is still the de-facto standard for evaluating classifiers.
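
That definition translates almost directly into code.  The helper below is a sketch that computes the AUC from the ranks of the predicted probabilities, assuming a 0/1 outcome vector y and a vector of predictions p; in practice a package such as pROC will do this for you:

auc <- function(y, p) {
  # Proportion of (defaulter, non-defaulter) pairs in which the defaulter
  # receives the higher predicted probability (ties count as a half)
  r  <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# equivalently: pROC::auc(y, p)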

We’ll consider two hyperparameters: the maximum tree depth and the learning rate (eta).  Rather than simply seeking the best value of each individual hyperparameter, it is necessary to seek the best pair of values, for example by an exhaustive grid search over all possible combinations.  This is because the hyperparameters can and do interact in quite complex ways: the best value of eta for a given max depth will generally not be the best value over all max depths.
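
Such a grid search might look something like the sketch below, re-using the assumed “train” and “validate” data frames from earlier (xgb.DMatrix expects a numeric feature matrix, which this dataset provides):

library(xgboost)

dtrain <- xgb.DMatrix(as.matrix(train[, setdiff(names(train), "default")]),
                      label = train$default)
dval   <- xgb.DMatrix(as.matrix(validate[, setdiff(names(validate), "default")]),
                      label = validate$default)

grid     <- expand.grid(max_depth = c(2, 4, 6, 8), eta = c(0.05, 0.1, 0.3))
grid$auc <- NA

for (i in seq_len(nrow(grid))) {
  params <- list(objective = "binary:logistic", eval_metric = "auc",
                 max_depth = grid$max_depth[i], eta = grid$eta[i])
  bst <- xgb.train(params, dtrain, nrounds = 200,
                   watchlist = list(val = dval),
                   early_stopping_rounds = 20, verbose = 0)
  grid$auc[i] <- max(bst$evaluation_log$val_auc)
}

grid[which.max(grid$auc), ]   # best (max_depth, eta) pair on the validation set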

As with many algorithms and workflows in R, there is a wide range of implementation options available for our calculation.  In particular, I’m a huge fan of the “mlr” package, which massively streamlines the creation of an ML model pipeline, the “caret” package which preceded it, and of course the “tidyverse” packages by Hadley Wickham (whose work over the past few years has almost singlehandedly turned R into a modern programming language).  For the purposes of this example, though, we’ll keep the package count to a bare minimum.

The final script[4] takes up about 100 lines of R code.  It trains XGBoost models on both a default set of hyperparameters and a “tuned” set, and compares the outcome with a simple logistic regression model trained on the same data.
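
For comparison, the logistic regression baseline is only a few lines (again a sketch, using the assumed data frames above, the auc() helper sketched earlier, and “bst” as the final tuned XGBoost model):

# Classical baseline: logistic regression via glm
logit   <- glm(default ~ ., family = binomial, data = train)
p_logit <- predict(logit, newdata = test, type = "response")

# XGBoost predictions on the same held-out test set
dtest <- xgb.DMatrix(as.matrix(test[, setdiff(names(test), "default")]),
                     label = test$default)
p_xgb <- predict(bst, dtest)

c(logistic = auc(test$default, p_logit), xgboost = auc(test$default, p_xgb))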

Results

This simple analysis gave the following scores:

Model                                AUC
Logistic regression                  71.7%
XGBoost (default hyperparameters)    75.8%
XGBoost (tuned hyperparameters)      77.9%

That’s a clean victory for XGBoost in general, with a slight further improvement given by parameter tuning.

It’s worth noting that even the simple logistic model performed relatively well – that’s pretty common in our experience, especially for small datasets.  Against this baseline the six-percentage-point increase in AUC from moving to XGBoost may seem underwhelming, but – since credit defaults are fundamentally random – we wouldn’t be surprised if the theoretical maximum were not far past this point.  And, for large credit portfolios, that increase in model effectiveness could be worth millions.

Next steps

As extensions to this model, we could consider tuning more hyperparameters, using a more sophisticated validation framework, or considering a broader range of measures.  More broadly, and depending on how the model is to be used, we might want to integrate more data sources into our dataset (e.g. economic time series), reshape the data (better exploiting its longitudinal nature), study what is driving predictions (introspection), develop confidence intervals for our predictions, and investigate outliers and anomalies.

Even without these, though, we are left with a model that is demonstrably better-performing than the corresponding classical statistical model.  Machine learning in general, and XGBoost in particular, has proven its worth.

For further information, please see the following resources:

[1] The technical term for a strategy designed to improve generalisation of results is “regularisation”.  Regularisation strategies are seen throughout statistical learning – for example in penalised regression (LASSO, Ridge, ElasticNet) and in deep neural networks (drop-out).  XGBoost employs a similar approach to penalised regression, tweaking the tree evaluation process to promote trees with fewer, lower-valued leaves.  See the 2016 academic paper for more detail.

[2] https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

[3] See for example AUC: A Misleading Measure of the Performance of Predictive Distribution Models (Global Ecol. Biogeogr. 2007)

[4] XGBoost example