Author: Alex Labram
Find all articles and further information on the Data Science MIG page.
In our previous article “What Makes a Data Scientist?”, we discussed two key tools: classical statistics and machine learning (ML). At the time, we took it as a given that both were important.
However, our actuarial readers, who will already be very familiar with statistics, may wonder: why is machine learning even needed? Statistics has served our purposes for hundreds of years; why must we learn more?
Where's the beef?
Classical statistics has several key limitations. Firstly, solutions are very specific: small changes in the problem can lead to wildly different mathematical treatment. As a basic example, you may recall from your actuarial exams that the standardised sample mean of a normally-distributed dataset can be treated as a normal variable itself... but only if the population variance is known or the sample is large. Otherwise, with the variance estimated from a small sample, it has to be analysed using a Student's t-distribution, which has a completely different density function, no moment generating function at all, and so on.
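To make the contrast concrete, here are the two cases in symbols. The notation is an assumption introduced for this illustration: a sample X_1, ..., X_n from a normal population with mean mu and variance sigma^2, with sample mean X-bar and unbiased sample variance S^2.

```latex
\[
Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \sim N(0, 1)
\qquad \text{versus} \qquad
T = \frac{\bar{X} - \mu}{S / \sqrt{n}} \sim t_{n-1},
\qquad
S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 .
\]
```

The only change in the problem is that sigma has been replaced by its estimate S, yet the reference distribution changes family.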
Secondly, there is no easy way to fit "hyperparameters": parameters that specify the structure of the model. For example, when fitting a linear regression model, we need to first choose which explanatory variables - of which there may be hundreds in a large dataset - to consider. Traditional approaches such as stepwise regression can partly resolve this problem, but are typically inelegant, imperfect and/or slow. As large datasets have steadily become easier to gather and retain in recent decades, this problem has only gained in importance.
Finally, it is all too easy to over-fit the data. As John von Neumann famously said: "With four parameters I can fit an elephant, and with five I can make him wiggle his trunk." Statistical models are normally constructed iteratively, with each attempt improving the model's fit, and it's not always clear where to stop! Formal measures like the Akaike or Bayesian information criteria may provide limited guidance, but ultimately the decision typically comes down to expert judgement.
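For reference, the two information criteria are usually written as follows. The symbols are assumptions introduced here: k is the number of fitted parameters, n the number of observations, and L-hat the maximised likelihood of the model.

```latex
\[
\mathrm{AIC} = 2k - 2 \ln \hat{L},
\qquad
\mathrm{BIC} = k \ln n - 2 \ln \hat{L}
\]
```

In both cases a lower value is preferred: the log-likelihood term rewards fit, while the term in k penalises each extra parameter. Neither criterion says how large a difference is material, which is where judgement comes back in.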
Wouldn't it be helpful to have an alternative approach that simplifies our goodness-of-fit calculations, protects against over-fitting, and permits easy hyperparameter tuning?
Divide and conquer
The solution presented by machine learning comes in two parts: a general framework for testing the accuracy of a given model, and a set of sophisticated models that can be used within that framework. The models themselves will be discussed in a later article; here I want to focus on the testing framework.
The approach taken by ML modellers to solving a given well-defined problem is actually pretty straightforward:
- Split the data into two sets: a "training" set (typically 70% of the data), and a "test" set (the remaining 30%).
- Pick a performance measure (e.g. mean squared error).
- Use the training dataset to calibrate your model, determining the values of its parameters.
- Use your model to generate predictions for the test set.
- Finally, apply the performance measure to these predictions, and use the resulting number as a measure of goodness-of-fit.
This is a very simple framework, but already it lets us compare models from different families (e.g. linear vs generalised-linear) against a given dataset. In particular, we are somewhat protected against over-fitting, since any model that too enthusiastically captures the specific properties of the training dataset will fail dismally against the test dataset.
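The five steps can be sketched in a few lines of Python. This is a minimal illustration rather than a recommended implementation: the data is synthetic (a straight line plus noise, an assumption made purely for the example), the model is a one-variable least-squares fit, and the helper names `train_test_split`, `fit_line` and `mean_squared_error` are hypothetical, written out here with the standard library only.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows and return (train, test) lists."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_line(rows):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    b = sxy / sxx
    return mean_y - b * mean_x, b

def mean_squared_error(rows, a, b):
    """Average squared prediction error of the line a + b * x on the rows."""
    return sum((y - (a + b * x)) ** 2 for x, y in rows) / len(rows)

# Synthetic data (assumed for illustration): y = 2 + 3x + standard normal noise.
rng = random.Random(42)
data = []
for _ in range(200):
    x = rng.uniform(0, 10)
    data.append((x, 2 + 3 * x + rng.gauss(0, 1)))

train, test = train_test_split(data)          # step 1: 70/30 split
a, b = fit_line(train)                        # step 3: calibrate on training data
test_mse = mean_squared_error(test, a, b)     # steps 2, 4, 5: score on test data
print(f"intercept={a:.2f}, slope={b:.2f}, test MSE={test_mse:.2f}")
```

The key design point is that `fit_line` never sees the test rows, so `test_mse` measures how well the model generalises rather than how well it memorises.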
Unfortunately, although we're protected against over-fitting of parameters, we are still not protected against over-fitting of hyperparameters. Imagine for a moment that we have a dataset with 100 explanatory variables, none of which has any real connection to the target variable. In this situation, we could keep trying different combinations of explanatory variables until we found a model that, when calibrated against the training dataset, just happened to give better-than-average predictions on the test dataset. At this point we would - wrongly - conclude that we'd learned something useful about the data!
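This thought experiment is easy to simulate. The sketch below makes several simplifying assumptions that are not in the scenario above: every variable is independent standard normal noise, the "combinations" tried are just the 100 single-variable linear models, the train and test sets are deliberately small, and a large batch of fresh data stands in for genuinely unseen data. All names are hypothetical.

```python
import random

rng = random.Random(2019)
N_VARS, N_TRAIN, N_TEST, N_FRESH = 100, 70, 30, 2000

def simulate(n):
    """Rows of (features, target), all independent standard normal noise."""
    return [([rng.gauss(0, 1) for _ in range(N_VARS)], rng.gauss(0, 1))
            for _ in range(n)]

def fit_line(rows, j):
    """Least-squares fit of target = a + b * feature_j; returns (a, b)."""
    n = len(rows)
    mean_x = sum(f[j] for f, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((f[j] - mean_x) ** 2 for f, _ in rows)
    sxy = sum((f[j] - mean_x) * (y - mean_y) for f, y in rows)
    b = sxy / sxx
    return mean_y - b * mean_x, b

def mse(rows, j, a, b):
    """Mean squared error of the model a + b * feature_j on the rows."""
    return sum((y - (a + b * f[j])) ** 2 for f, y in rows) / len(rows)

train, test, fresh = simulate(N_TRAIN), simulate(N_TEST), simulate(N_FRESH)

# Baseline: ignore every feature and predict the training mean (slope zero).
base_a = sum(y for _, y in train) / len(train)
baseline_test = mse(test, 0, base_a, 0.0)
baseline_fresh = mse(fresh, 0, base_a, 0.0)

# The flawed search: fit each candidate on the training set, then keep
# whichever one happens to score best on the TEST set.
fits = {j: fit_line(train, j) for j in range(N_VARS)}
best_j = min(fits, key=lambda j: mse(test, j, *fits[j]))
a, b = fits[best_j]

gain_on_test = baseline_test - mse(test, best_j, a, b)
gain_on_fresh = baseline_fresh - mse(fresh, best_j, a, b)
print(f"chosen variable {best_j}: gain on test {gain_on_test:.3f}, "
      f"gain on fresh data {gain_on_fresh:.3f}")
```

Because the winner is chosen by its test score, its apparent improvement over the baseline on the test set is a selection effect; on fresh data the improvement should largely disappear, since there was never any signal to find.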
To protect against this spurious conclusion, we typically split the test dataset further into "validation" and "hold-out" datasets. The validation dataset is used to calibrate each model's hyperparameters (e.g. to choose which explanatory variables our linear model will consider).
The hold-out dataset, on the other hand, is used only once per family of models - we might use it once for the best-of-breed linear model, once for the best GLM, once for the best gradient-boosted decision tree, once for the best support vector machine, and so on. Thus we preserve its ability to act as an independent check on our work.
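A minimal sketch of the three-way workflow follows. It uses a k-nearest-neighbours regressor purely as a stand-in model because it has one obvious hyperparameter (the number of neighbours); the synthetic sine-wave data, the 60/20/20 proportions, the candidate values and all function names are assumptions made for this illustration.

```python
import math
import random

def split_three(rows, fractions=(0.6, 0.2), seed=0):
    """Shuffle the rows and return (train, validation, hold-out) lists."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * fractions[0])
    n_valid = round(len(shuffled) * fractions[1])
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, rows, k):
    """Mean squared error of the k-NN model (built on train) on the rows."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in rows) / len(rows)

# Synthetic data (assumed for illustration): y = sin(x) + noise.
rng = random.Random(7)
data = []
for _ in range(500):
    x = rng.uniform(0, 2 * math.pi)
    data.append((x, math.sin(x) + rng.gauss(0, 0.3)))

train, valid, holdout = split_three(data)

# The validation set chooses the hyperparameter...
candidate_ks = [1, 3, 5, 10, 20, 50]
valid_scores = {k: mse(train, valid, k) for k in candidate_ks}
best_k = min(valid_scores, key=valid_scores.get)

# ...and the hold-out set is consulted exactly once, for the chosen model only.
holdout_mse = mse(train, holdout, best_k)
print(f"best k = {best_k}, hold-out MSE = {holdout_mse:.3f}")
```

The discipline lies in the last step: `holdout` plays no part in choosing `best_k`, so `holdout_mse` remains an honest estimate for this model family.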
A common train/validate/hold-out split would be something like 60/20/20. Be aware that, in practice, there are additional subtleties: the train/validate split is often replaced by "k-fold cross-validation", data splits may need to be stratified, special rules apply for longitudinal data, etc. But these are pragmatic tweaks to the basic framework.
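As an illustration of the first of these tweaks, here is a small sketch of k-fold cross-validation: the data is dealt into k folds, each fold takes a turn as the validation set while the rest is used for training, and the k scores are averaged. The synthetic data, the two candidate models and the helper names are all assumptions made for this example; stratification and time-aware splitting are not shown.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_mse(rows, fit, k=5):
    """Average validation MSE over k folds; fit(train) must return a predictor."""
    scores = []
    for fold in k_fold_indices(len(rows), k):
        held = set(fold)
        train = [r for i, r in enumerate(rows) if i not in held]
        valid = [rows[i] for i in fold]
        predict = fit(train)
        scores.append(sum((y - predict(x)) ** 2 for x, y in valid) / len(valid))
    return sum(scores) / k

def fit_mean(train):
    """Candidate 1: ignore x and always predict the training mean."""
    m = sum(y for _, y in train) / len(train)
    return lambda x: m

def fit_line(train):
    """Candidate 2: one-variable least-squares line."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return lambda x: a + b * x

# Synthetic data (assumed for illustration): y = 2 + 3x + standard normal noise.
rng = random.Random(1)
data = []
for _ in range(100):
    x = rng.uniform(0, 10)
    data.append((x, 2 + 3 * x + rng.gauss(0, 1)))

cv_mean = cross_val_mse(data, fit_mean)
cv_line = cross_val_mse(data, fit_line)
print(f"CV MSE, mean-only model: {cv_mean:.2f}; linear model: {cv_line:.2f}")
```

Every row is used for validation exactly once and for training k - 1 times, which makes better use of a small dataset than a single fixed train/validate split, at the cost of fitting each candidate k times.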
What's not to like?
So should we all be using this framework in place of our current statistical approaches? Hold your horses: the train/validate/hold-out approach also has its limitations.
Firstly, and most importantly for actuaries, the machine learning models enabled by this framework tend to be quite opaque. It is often much easier to apply actuarial judgement to statistical models, and to explain their findings to non-technical stakeholders. In later articles we'll discuss how to validate and communicate these "black-box" models.
Secondly, the data requirements are greater, since our training dataset is a mere 60% of the size it would be if we were relying on classical statistics. Where datasets are already very small, or where our conclusions can legitimately be dominated by rare outliers (e.g. Tweedie-distributed claims), this can seriously weaken our conclusions.
Thirdly, the computational requirements are comparatively huge. Now, in theory this shouldn't be such a big deal. This is 2019, after all, when the phones in our pockets have more processing power than a 1980s Cray supercomputer. However, in practice the average low-end work laptop can have trouble running the latest word processing software, let alone fitting a neural network to large datasets... There are ways round this - parallelisation, co-processors, cloud computing - but these create a certain amount of operational overhead and may require actuaries to go further outside their comfort zone into the realm of computer science.
The choice of statistics vs machine learning thus becomes a matter of expert judgement. A sophisticated actuary will use whichever approach is more appropriate to the situation, employing as needed the range of modern software platforms and packages (many being free and/or open-source) that permit both approaches to be executed and compared in the same environment.
To conclude: classical statistics is a powerful tool in the actuary's arsenal, but it has inherent limitations. For some problems - where datasets are large, computer power is cheap, and stakeholders are supportive - it may be more convenient to reframe the problem in terms of a train/test split.
This approach lets us rapidly iterate over more diverse and more intricate models, finding a solution that says more about the data than it does about our preconceptions. With this flexibility, we can derive more accurate predictions that better serve our companies' and clients' needs.
For further information, please see the following resources:
- Statistical Modeling: The Two Cultures (Leo Breiman, Statistical Science, Vol. 16, No. 3, Aug. 2001)
- What is the Difference Between Test and Validation Datasets? (Machine Learning Mastery)
- Three Simple Theories to Help Us Understand Overfitting and Underfitting in Machine Learning Models (Towards Data Science)
- The reusable holdout: Preserving validity in adaptive data analysis (Google AI)
- Training, validation, and test sets (Wikipedia)
- The Most Powerful Idea in Data Science (Towards Data Science)