Author: Alex Labram
Machine learning can often feel like a black box. What’s inside, and can you trust it? This article briefly covers six key groups of algorithms, discussing their key variants, strengths and weaknesses, and giving you the context – and confidence – to apply these algorithms in the field.
Next in the series of articles on Data Science. Find all articles and further information on the Data Science MIG page.
In our previous article “Statistics vs Machine Learning”, we discussed a general framework for solving problems – and validating solutions to those problems – using machine learning (ML). However, we deliberately refrained from going into much detail about the specific learning algorithms used.
This is obviously insufficient. As actuaries, we can’t just take these black boxes on trust, even if we can demonstrate conclusively that they contain good solutions. We want to understand what’s inside each box: its rationale, its strengths and its limitations. More urgently, as machine learning practitioners, we want a decent range of ideas we can throw at the wall to see which sticks!
This article will discuss a sample of key machine learning algorithms in widespread use today, and when they might be of use. Let’s unwrap those boxes…
Box 1: Penalised regression
Most actuaries will be familiar with regression, in both the straightforward linear variety and the more complex family of generalised linear models (GLMs).
Practitioners will also be familiar with the immense amount of pain involved in whittling down the explanatory variables for a given problem to a manageable number. Do healthcare claims depend strongly on a policyholder’s gender, or just their age? Are scheme members in London visibly more inclined to transfer out than those in Derby?
Penalised regression attempts to solve that problem at the source. Where traditional regression can be seen as minimising some error statistic – sum of squared errors for linear regression – penalised regression changes the statistic to include a measure of how gratuitously complicated the model is. This gives us a best solution that is, hopefully, more straightforward.
There are three commonly-used penalisation schemes, each of which may be used with either linear or generalised linear regression.
- Lasso regression punishes regression models based on the sum of the absolute values of their coefficients, and thus encourages the removal of extraneous terms.
- Ridge regression punishes the sum of squares of coefficients, and has proven more stable than traditional regression for multicollinear data.
- ElasticNet regression uses both the above penalties in combination, and is often thought to be the best of both worlds.
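To see how the first two penalties behave, here is a minimal sketch for the simplest possible case: one explanatory variable and no intercept, where both penalised solutions can be written down directly. The function names, the penalty weight `lam` and the toy data are illustrative assumptions, not taken from any particular library.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold, stopping at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_slope(x, y, penalty="none", lam=0.0):
    """Slope b minimising sum((y - b*x)**2) plus the chosen penalty on b."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    if penalty == "ridge":  # penalty term: lam * b**2
        return sxy / (sxx + lam)
    if penalty == "lasso":  # penalty term: lam * abs(b)
        return soft_threshold(sxy, lam / 2.0) / sxx
    return sxy / sxx  # ordinary least squares


# Toy data lying exactly on the line y = 2x.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

ols = fit_slope(x, y)
ridge = fit_slope(x, y, "ridge", lam=14.0)
lasso_small = fit_slope(x, y, "lasso", lam=28.0)
lasso_large = fit_slope(x, y, "lasso", lam=56.0)
print(ols, ridge, lasso_small, lasso_large)
```

In this toy case the ridge penalty shrinks the slope but never quite removes it, whereas a large enough lasso penalty sets it to exactly zero. That is the mechanism by which lasso discards extraneous terms.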
Box 2: Decision trees
A decision tree is a simple rule-based system, built around a hierarchy of branching true/false statements. E.g. “Is the policyholder 70 years old? If so, are they female? If so, probability of death is 2.19%”.
Decision trees can be – and often are – developed entirely by humans. However, considerable research has gone into finding ways for computers to build their own. The leading approach is CART (“Classification and Regression Trees”), which takes a recursive approach: it finds the best split for the entire dataset; then finds the best split for each half; then for each quarter; and so on until further splits cease to add value.
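As a rough illustration of one step of that recursion, the sketch below searches for the best single split on one numeric variable, using the sum of squared errors as the measure of fit. The data and function names are made up for the example; a full CART implementation would apply the same search to each resulting subset in turn, across all variables.

```python
def sse(values):
    """Sum of squared errors around the mean (zero for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, error) for the rule 'x <= threshold' with lowest total SSE."""
    best = (None, sse(ys))  # start from the "no split" error
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        error = sse(left) + sse(right)
        if error < best[1]:
            best = (threshold, error)
    return best


# Hypothetical data: claim cost jumps for policyholders above age 50.
ages = [30, 40, 50, 60, 70, 80]
claims = [1.0, 1.2, 1.1, 3.0, 3.2, 3.1]

threshold, error = best_split(ages, claims)
print(threshold, error)
```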
Single decision trees are very easily explainable to non-technical stakeholders; however, they tend to dramatically over-fit or under-fit the data, with no prima facie way of telling which has occurred! Thus, more recent work in this area has focused on “ensemble” learning methods, which generate multiple decision trees and in some way take a consensus view. In particular:
- The Random Forest algorithm fits a large number of decision trees to different samples from a population, and takes an average of their predictions. This approach is called “bootstrap aggregation” or “bagging”.
- Adaptive Boosting (“AdaBoost”) and Gradient Boosting (e.g. “XGBoost”, “LightGBM” and “CatBoost”) fit a series of trees to the data and take a weighted sum of their predictions. Each new tree in the series is chosen to further refine the accuracy of the ensemble. This approach is called “boosting”.
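The boosting idea can be sketched in a few lines. The example below is a deliberately simplified form of gradient boosting for squared error, using one-split trees ("stumps") on a single variable: each new stump is fitted to the errors left by the ensemble so far, and its prediction is added with a fixed weight. The names `fit_stump`, `boost` and `learning_rate`, and the toy data, are assumptions for illustration; production libraries such as XGBoost are considerably more elaborate.

```python
def fit_stump(xs, targets):
    """Find the one-split tree minimising squared error.

    Returns (threshold, left_mean, right_mean) for the rule 'x <= threshold'.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_trees=50, learning_rate=0.3):
    """Fit a series of stumps, each one to the residuals of the ensemble so far."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        predictions = [
            p + learning_rate * (left_mean if x <= threshold else right_mean)
            for x, p in zip(xs, predictions)
        ]
    return base, learning_rate, stumps


def predict(model, x):
    """Weighted sum of the stump predictions, on top of the overall mean."""
    base, learning_rate, stumps = model
    return base + sum(
        learning_rate * (left_mean if x <= threshold else right_mean)
        for threshold, left_mean, right_mean in stumps
    )


def mse(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [x * x for x in xs]  # a curve that no single stump can fit well

model = boost(xs, ys)
baseline_error = mse(ys, [sum(ys) / len(ys)] * len(ys))
boosted_error = mse(ys, [predict(model, x) for x in xs])
print(baseline_error, boosted_error)
```

A bagging approach would instead fit each tree independently to a resampled copy of the data and average the results.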
Box 3: Bayesian networks
One of the Holy Grails of modern machine learning researchers is the detection of causality – not just correlation – in data. A strong stab at this was made in the 1980s with Bayesian Networks (BNs), which attempt to represent patterns of causality as graphs.
For example, if we’re looking at the risk of cardiovascular disease, a strong association with low socioeconomic status is apparent. But how is this effect conveyed: by under-medication, by poor food quality, or by “diseases of despair” such as opiate abuse? And do any of these affect each other, with (for example) drug abusers avoiding medical professionals? A BN analysis would hypothesise several different patterns of causality, then seek to find the one that best matched the available data.
Unfortunately, the more powerful classes of BNs have proven to be computationally intractable: the number of possible networks increases exponentially with the number of causes or intermediate steps under consideration. Whilst some solutions have been proposed, and some specific variants are still useable, this approach has largely fallen out of fashion in recent years.
However, in the wreckage of BNs, one diamond has been found. By making some ridiculously simple assumptions – all possible causes act independently of each other, only affecting the final outcome – we get a remarkably effective tool called the “naïve Bayes classifier”. This has been used to great effect in spam filtering, for example.
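A minimal sketch of such a classifier, applied to a toy spam-filtering problem, might look as follows. Each word is treated as an independent piece of evidence for or against each label. The training messages, function names and the use of add-one ("Laplace") smoothing for unseen words are all assumptions made for the example.

```python
import math
from collections import Counter


def train(messages):
    """messages: list of (label, text) pairs. Returns the counts the classifier needs."""
    word_counts = {}
    label_counts = Counter()
    vocabulary = set()
    for label, text in messages:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocabulary.update(words)
    return word_counts, label_counts, vocabulary


def classify(model, text):
    """Pick the label with the highest log-probability, treating words as independent."""
    word_counts, label_counts, vocabulary = model
    total_messages = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        score = math.log(label_counts[label] / total_messages)  # prior
        n_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing so an unseen word does not zero out the whole product.
            likelihood = (word_counts[label][word] + 1) / (n_words + len(vocabulary))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up training data.
model = train([
    ("spam", "win cash prize now"),
    ("spam", "cheap prize offer now"),
    ("ham", "pension scheme meeting agenda"),
    ("ham", "claims report for meeting"),
])

print(classify(model, "win a prize now"))
print(classify(model, "meeting agenda report"))
```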
Box 4: Splines
Anyone who has browsed an art supplies store has probably seen a spline. They’re long thin strips of heavy plastic that take a bit of effort to bend – and resist being bent too sharply – but once bent retain their shape. They are used to draw wonderfully smooth curves in a variety of elegant shapes.
In the context of ML, a spline is… pretty similar really. It’s an algorithm that attempts to draw a smooth (low-curvature) line through data, for example by gluing together a bunch of piecewise-cubic curves. Historically, splines were most commonly used for interpolation – drawing a curve that passes precisely through every point in the dataset – but we’re more interested in their use for smoothing of noisy data.
Whilst simple cubic splines can be relatively effective for this, they leave a number of open questions. How many cubic equations do you knot together to make the spline? Where should the knots be located? How do we handle input data with more than one field?
More recent research has provided standardised answers to these questions, with the current industry leader being an algorithm developed in the ‘90s called Multivariate Adaptive Regression Splines (“MARS”). This algorithm uses a similar recursive approach to decision trees, repeatedly laying down additional piecewise-linear functions until no further improvement is possible. MARS has proven quite effective compared to other more sophisticated ML algorithms, being both very fast to train and relatively good at fitting arbitrary data.
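The piecewise-linear building blocks used by MARS are usually called "hinge" functions: flat on one side of a knot, and a straight line on the other. The sketch below shows what a small fitted model of this kind might look like; the knot at age 65 and the coefficients are entirely hypothetical, and the search for good knots (the hard part of the algorithm) is not shown.

```python
def hinge(x, knot, direction=1):
    """Hinge basis function: max(0, x - knot), or max(0, knot - x) if direction is -1."""
    return max(0.0, direction * (x - knot))


def mars_like_model(age):
    """Hypothetical fitted model: an intercept plus a mirrored pair of hinges at age 65."""
    return 1.0 + 0.5 * hinge(age, 65) + 0.1 * hinge(age, 65, direction=-1)


for age in (55, 65, 75):
    print(age, mars_like_model(age))
```

The result is a continuous curve with a kink at the knot: a gentle slope on one side and a steeper slope on the other.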
Alternatively, for very noisy data, a traditional generalised linear model may be modified to use splines. Instead of taking a weighted sum of the input variables, these “generalised additive models” (GAMs) instead take a sum of spline functions of the input variables before applying the link function. GAMs are far more flexible than traditional GLMs, as they decouple the non-linearity of predicted values from the non-normality of error distributions, but take correspondingly more effort to train.
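In symbols, and using notation that is assumed here rather than taken from the article ($g$ for the link function, $x_j$ for the input variables, $\beta_j$ for the coefficients and $f_j$ for the spline functions), the difference between the two model families can be sketched as:

```latex
% GLM: the link function of the expected value is a weighted sum of the inputs
g\big(\mathbb{E}[y]\big) = \beta_0 + \sum_{j=1}^{p} \beta_j x_j

% GAM: each weighted input is replaced by a smooth (spline) function of that input
g\big(\mathbb{E}[y]\big) = \beta_0 + \sum_{j=1}^{p} f_j(x_j)
```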
Box 5: Support vector machines
What splines do for regression problems, Support Vector Machines (SVMs) do for classification problems. SVMs formalise the idea of just drawing a line between the data points in each category.
The devil is, of course, in the detail. In its original form, the SVM algorithm attempts to identify the points that are closest to an ideal dividing line between two categories – that line’s “support vectors” – and uses these to decide where to cut. Later variants could handle overlapping categories (a so-called “soft margin”) and regression problems (where the margins are taken relative to an underlying gradient).
More challengingly, modern SVMs can also handle curved lines between categories. This is done via a transformation called a “feature map”, which unfolds your data out into additional dimensions – turning an Ordnance Survey map into a landscape – and only then slices through it. As anyone who has used a map’s contour lines to check their altitude knows, this approach can give us regions of pretty much any shape; even circular if you’re climbing a hill. The magic of SVMs is that, via the so-called “kernel trick”, they can perform all this analysis relatively efficiently, for a range of unfolding functions (“kernels”) including polynomial and radial / Gaussian.
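The kernel trick is easiest to see with a small worked case. The sketch below uses a degree-2 polynomial kernel on two-dimensional points, and checks that the kernel gives the same answer as explicitly unfolding each point into three dimensions and taking a dot product there. The specific feature map and the sample points are chosen for illustration only.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def feature_map(point):
    """Explicitly unfold a 2-D point into the 3-D space of degree-2 terms."""
    x1, x2 = point
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def polynomial_kernel(p, q):
    """Degree-2 polynomial kernel: works in the original 2-D space only."""
    return dot(p, q) ** 2


p, q = (1.0, 2.0), (3.0, 4.0)
explicit = dot(feature_map(p), feature_map(q))  # unfold first, then compare
implicit = polynomial_kernel(p, q)              # never leaves two dimensions
print(explicit, implicit)
```

Both routes give the same similarity score (up to floating-point rounding), but the kernel never has to construct the higher-dimensional points. For kernels such as the Gaussian, where the unfolded space is infinite-dimensional, this shortcut is what makes the calculation possible at all.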
However, once this rather intricate machinery is sorted out, the resulting algorithm proves to be surprisingly general. For example, if the input data is a series of 28x28-pixel greyscale images, we can treat the classification problem as drawing the best cut through a 784-dimensional space. In this context, SVM was at one point the leading algorithm for handwriting recognition, although it has since been superseded by deep learning architectures.
Thinking outside the box: Deep Learning
The field of machine learning draws from a wide range of academic sources: operational research, decision theory, graph theory, computational biology, Bayesian statistics, and many more. However, in recent years a particular sub-field has emerged as having its own techniques, use cases and community. This is the field of “deep learning”.
Deep learning is a euphemism for the use of “neural networks”: a biologically-inspired arrangement consisting of multiple layers of artificial neurons. Each neuron takes a number of inputs – which may include the output of a previous layer’s neurons – and returns a weighted sum of these, transformed by a simple “activation function”. For comparison, a single neuron with a sigmoidal (S-shaped) activation function is basically a logistic regression model… but a typical network would have hundreds, thousands or even millions of coefficients or “weights”.
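A single artificial neuron is simple enough to write out in full. The sketch below assumes a logistic (S-shaped) activation function, which is the case that corresponds to logistic regression; the weights, bias and inputs are made-up numbers.

```python
import math


def logistic(z):
    """S-shaped activation function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through the activation function."""
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    return logistic(z)


# With all-zero inputs and no bias the weighted sum is zero, so the output is 0.5.
print(neuron([0.0, 0.0], [1.5, -2.0], 0.0))

# Hypothetical inputs: a positive weighted sum pushes the output towards 1.
print(neuron([1.0, 0.5], [1.5, -2.0], 0.25))
```

A layer is just many such neurons sharing the same inputs, and a network is a stack of layers in which each layer's outputs become the next layer's inputs.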
The network’s prior knowledge is captured in these weights, which may be trained by a process called “back-propagation”: determining the contribution of a given weight to the error in a given prediction, using the kind of differential calculus that most of us will have learned in secondary school, and then adjusting the weight so as to minimise that error. The adjustment itself is generally handled via “gradient descent”: a fancy way of saying that we only apply a fraction of the theoretically-correct adjustment, to prevent our learning process from over-shooting the best solution.
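For a network consisting of just one neuron, the whole training loop fits in a few lines. The sketch below assumes a logistic activation function and a squared-error measure, applies the chain rule by hand to find each weight's contribution to the error, and then nudges the weight by a fraction (`learning_rate`) of the indicated adjustment. The data, parameter values and function names are assumptions for the example.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_neuron(data, learning_rate=0.5, epochs=200):
    """Train one neuron (one weight, one bias) by gradient descent on squared error."""
    weight, bias = 0.0, 0.0
    for _ in range(epochs):
        for x, target in data:
            prediction = logistic(weight * x + bias)
            # Chain rule: d(error)/d(weight) = d(error)/d(prediction)
            #                                  * d(prediction)/d(z) * d(z)/d(weight)
            error_grad = prediction - target  # from 0.5 * (prediction - target)**2
            z_grad = error_grad * prediction * (1.0 - prediction)
            weight -= learning_rate * z_grad * x
            bias -= learning_rate * z_grad
    return weight, bias


# Toy classification data: negative inputs are class 0, positive inputs are class 1.
data = [(-2.0, 0.0), (-1.0, 0.0), (1.0, 1.0), (2.0, 1.0)]
weight, bias = train_neuron(data)

print(logistic(weight * 2.0 + bias))   # should be close to 1
print(logistic(weight * -2.0 + bias))  # should be close to 0
```

In a multi-layer network the same chain-rule calculation is repeated layer by layer, working backwards from the output, which is where the name "back-propagation" comes from.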
Until recently, neural networks could be extended to only four or five layers: after this, they quickly became un-trainable. However, this problem was definitively solved over the last two decades, with improvements to dataset sizes and training speeds. Oddly, one critical innovation turned out to be the way in which weights were randomly initialised: the approaches used prior to 2006 tended to result in weights dropping to zero (“dying”) or shooting off to infinity. Another unexpectedly huge gain came from moving to simpler (and thus more quickly differentiable) activation functions. From these apparently trivial changes was born the era of massively multi-layer neural networks, also known as deep learning.
Compared to other algorithms (often collectively referred to as “statistical learning”, in deference to one of the better textbooks on the subject, or as “symbolic learning”), deep learning can solve problems where the data has a more sophisticated internal structure. Where inputs need to be analysed at multiple levels of abstraction simultaneously – pixels to edges to shapes to cat photos – neurons in different layers can learn to capture different parts of the problem. These multi-level problems include image recognition, machine translation, protein structure prediction (“folding”) and game AI, for all of which deep learning currently leads the pack.
However, its strengths in these areas are less apparent for typical actuarial datasets such as claims or longevity analysis. And its high resource requirements – in terms of data, training time, computer power and specialist expertise – mean it is usually impractical for our purposes. This is reflected by the fact that most sophisticated neural network “architectures” (sets of design decisions around neuron behaviour and interaction and layer organisation) are optimised for visual or text analysis.
This may change in future, as the deep learning community is starting to downscale its tools to smaller use cases. In the meantime, a relatively shallow neural network, with between one and five layers, is a perfectly practical machine learning algorithm in its own right.
…But what shape of box?
For the purposes of this article, we have limited ourselves quite rigidly to the kind of algorithms one would use with typical actuarial data. That is to say: a two-dimensional (“tabular”) dataset, consisting of a number of explanatory variables (or “features”) and a single target variable, with each row (or “observation”) representing a distinct data point such as a policy-holder.
Algorithms suitable for static datasets with a predefined target variable are often referred to as “supervised learning”. They fit cleanly into the machine learning validation framework we outlined in our previous article: splitting the dataset into training and testing subsets, fitting the model and its hyperparameters against the training set, then using the testing set to determine overall performance.
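As a reminder of what that framework looks like in practice, here is a bare-bones sketch. The "model" is deliberately trivial (it predicts the average target value seen in training), so that the focus stays on the split itself; the data, the 80/20 split and the function names are assumptions for illustration.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows reproducibly, then hold back a fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_mean_model(train_rows):
    """Deliberately trivial 'model': always predict the mean training target."""
    return sum(target for _, target in train_rows) / len(train_rows)


def mean_squared_error(prediction, rows):
    return sum((target - prediction) ** 2 for _, target in rows) / len(rows)


# Made-up (feature, target) pairs standing in for a tabular dataset.
rows = [(age, 0.01 * age) for age in range(20, 70, 5)]

train_rows, test_rows = train_test_split(rows)
model = fit_mean_model(train_rows)                  # fitted on training data only
test_error = mean_squared_error(model, test_rows)   # judged on unseen data only
print(len(train_rows), len(test_rows), test_error)
```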
We haven’t looked at “unsupervised” learning – the use of ML to split data into clusters without reference to a set of pre-existing labels. This is the school of ML techniques that, applied to a dataset of online pictures, infamously identified the cat photo as the quintessential internet image. It also includes dimensionality reduction techniques, such as the classic Principal Components Analysis, and anomaly detection. We have also excluded “reinforcement” learning: the iterative use of ML to react to a changing environment, as used recently to develop best-of-breed board game and computer game AIs.
We have assumed that our dataset is neither too large for standard techniques to function, nor too small for reasonable conclusions to be drawn via an ML validation framework. Our envisaged dataset would be entirely known at the time our analysis started; whilst sophisticated tools exist for dealing with “streaming” and “real-time” data analysis, they are too specialised for most actuarial use cases.
Finally, we have made the strong assumption that our data points are reasonably independent: different rows do not represent different claims on the same policy (“longitudinal” data), different pixels of the same image, or different properties of the same entity (such as connections on social media). Standard ML algorithms can often be remarkably effective for these data types, but it’s not guaranteed.
All of these exclusions are fascinating topics in their own right, which we would thoroughly encourage you to investigate.
Wrapping up the box
In this article, we have briefly covered six groups of algorithms that you may find yourself using in your own data science projects:
- Lasso, Ridge and ElasticNet penalised regression
- Decision trees, Random Forests and Gradient Boosting
- The Naïve Bayes classifier
- Multivariate Adaptive Regression Splines and Generalised Additive Models
- Support Vector Machines with linear, polynomial and Gaussian kernels
- Shallow neural networks
This isn’t a complete list by any means; there are many other algorithms out there, with new variations constantly emerging. Ultimately it is impossible for most actuaries to keep up with them – we all have day jobs!
Rather, the goal is to be ML-literate: to stand on the shoulders of giants, and thus ensure that the innate capabilities of our modelling approach are rarely if ever the limiting factor on our analysis. By getting to grips with the algorithms in this article – and learning how to use a handful of them in the field – an actuary can comfortably meet this standard.
 See this 2016 plenary deck for technical details.
 See for example Comparison of Four Machine Learning Methods for Generating the GLASS Fractional Vegetation Cover Product from MODIS Data (Yang, Jia, Liang, Liu and Wang, 2016).
 See What is the Kernel Trick? Why is it important? on Medium.com.
 See A ‘Brief’ History of Neural Nets and Deep Learning – highly recommended as a technically-literate yet accessible introduction to the field.
 A common sigmoidal activation function is the hyperbolic tangent: $\tanh(\alpha) = (e^{2\alpha} - 1)/(e^{2\alpha} + 1)$. Sigmoidal activation functions have fallen out of favour in recent years, displaced by simpler piecewise-linear functions, which have proven equally effective and much faster to analyse.
 An Introduction to Statistical Learning with Applications in R (James, Witten, Hastie and Tibshirani, 2013) – available for free download via the author’s website.
 For a full history of the distinction, see the paper Neurons Spike Back (Cardon, Cointet & Mazières, 2018). In general it may be said that statistical (a.k.a. symbolic) learning attempts to imitate the behaviour of the human mind, whilst deep (a.k.a. connectionist) learning attempts to imitate the behaviour of the human brain.
 See for example the ArXiv paper Neural Oblivious Decision Ensembles (Popov, Morozov and Babenko, 2019), which develops a neural network that operates along similar lines to a random forest.
 Google computer works out how to spot cats – BBC News.
 For chess, go and shogi, the AlphaZero algorithm is industry-leading. For StarCraft II, which has emerged as a useful test rig for game AI methods, the AlphaStar algorithm – still under very active development – is already capable of beating most human players.