This article discusses an appropriate framework to approach a data science problem. In the context of a health insurance company, we walk through the necessary sequence of steps: framing the research question; collecting and cleaning the data; exploring the data using graphs and other exploratory techniques; finding appropriate models to fit the data; evaluating each model; and, finally, communicating the results to the audience. Second article in the Data Science series. Find all articles and further information on the Data Science MIG page.
Actuaries running or participating in data science projects may find this a useful guide, placing their work in a broader context.
The data science process
Author: Kanika Bhalla
In our previous articles, we discussed what a data scientist does: collecting data, writing algorithms, solving complex problems and communicating the results to their audience in the simplest possible manner.
Data science is an iterative process, with opportunistic improvements made to data, features, models and visualisations throughout a project’s duration. Whilst there are different ways to approach a data science problem, here I would like to discuss an appropriate data science framework in the context of a health insurance company, with the help of a case study: predicting whether or not a person has heart disease. More advanced variants of the data science process will be discussed in later articles.
- Framing the research question
- Data Collection
- Data Cleansing
- Exploratory Data Analysis
- Feature Engineering
- Model Fitting
- Model Evaluation
- Communication of Results
Framing the research question
It is important to have a clear understanding of the objective of the data science process. This typically involves a number of discussions between subject matter experts and senior management of the company. The parties deciding the subject of study will focus on maximising profits while keeping research costs in mind. The research question in our study focuses on using predictive analytics to predict whether or not a person has heart disease.
Finding the relevant data
Data scientists can either rely on already existing data sets or gather data using different approaches. If the study utilises existing data, it is important to consider whether that data is sufficiently relevant and credible. If, instead, the information is gathered externally for the study, the cost of collecting the required data should be taken into consideration.
One of the more popular ways to share data is via web APIs (Application Programming Interfaces). These allow users to request data directly from the database underlying a particular website; the API acts as a link between a website and its users. Various government organisations, as well as companies, have published their own public APIs. At the same time, more traditional ways to gather data, such as questionnaires, can also be used.
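As a concrete sketch, the snippet below parses a JSON payload of the kind a web API might return. The endpoint, field names and values are hypothetical, for illustration only, not a real service:

```python
import json

# A canned JSON payload, standing in for the body a web API might return.
# Field names and values are hypothetical.
response_body = '''
[
  {"age": 54, "max_heart_rate": 150, "heart_disease": 1},
  {"age": 41, "max_heart_rate": 172, "heart_disease": 0}
]
'''

# In practice the payload would come from a live request, e.g.:
#   import requests
#   response_body = requests.get("https://api.example.com/v1/patients").text
records = json.loads(response_body)

for record in records:
    print(record["age"], record["heart_disease"])
```

The same parsed records could then be loaded into a data frame for the cleaning and exploration steps that follow.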
In addition, there are various data repositories, both online and offline, where relevant data can be found. An example of a publicly available online repository is Kaggle.com, which provides datasets associated with its various competitions. Various packages in R and Python can import data directly from available data sources for offline analysis. As our case study entails predicting whether or not a person has heart disease, the data can be collected internally by the health insurance company specifically for this study, or the company can rely on existing data collected for some other purpose.
Finding the relevant data to complete our research is one of the most crucial tasks in the whole data science process.
Data cleansing
Data needs to be refined before moving forward in the modelling process. The task of cleaning up the raw data generally involves:
- Substituting in data for any missing values (“imputation”). There are various methods available to populate missing values: using a natural default value (e.g. “Other” for a categorical variable), taking the average of the present values, or even fitting a miniature statistical model to back out the most likely value based on other fields. Choosing the right substitute for a missing value is highly situational.
- Correcting any clearly wrong information by performing some reasonableness checks on the data. In our case study, one of the fields we can check is age: if someone’s age is 2 years, that would raise an alarm and might need to be adjusted.
This step might also include converting data from one format to another and consolidation of data. For example, if there are multiple Excel/CSV files available, consolidating these into one single dataset.
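A minimal pandas sketch of the cleaning steps above, on invented toy records (field names and values are illustrative, not from any real dataset):

```python
import pandas as pd

# Toy patient records with gaps and one implausible value (all values invented).
df = pd.DataFrame({
    "age": [54, 2, None, 63],          # age 2 should fail a reasonableness check
    "chest_pain_type": ["typical", None, "atypical", "typical"],
    "cholesterol": [233.0, 250.0, None, 205.0],
})

# Impute: a natural default for the categorical field, the mean for the numeric one.
df["chest_pain_type"] = df["chest_pain_type"].fillna("Other")
df["cholesterol"] = df["cholesterol"].fillna(df["cholesterol"].mean())

# Flag clearly wrong values rather than silently keeping them.
df["age_suspect"] = df["age"] < 18

print(df)
```

Consolidating multiple Excel/CSV files would similarly reduce to `pd.read_csv` on each file followed by `pd.concat`.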
In this step it’s important to decide which tool should be used for the data refining process – for example R, SQL, or Python. All these tools are more-or-less equally equipped to perform the required data refining as well as further basic data analysis. The choice of tool depends on the availability of expertise, the costs involved in using a particular tool, and so on.
Exploratory data analysis
Before fitting a model to the data, it is vital to understand the structure of the data being used. This begins with studying the key dependent and independent variables and their data types: categorical (nominal or ordinal) and numerical being the most common. In our case study, the dependent variable is heart disease status (present or absent) and the independent variables are age, gender, chest-pain type, blood pressure, cholesterol, maximum heart rate achieved, etc.
There are various data visualisation tools and techniques available to understand the distribution of each variable and the relationship between the dependent variable and the independent variables. For example, the ggplot2 package in R can be used to draw boxplots, density plots, barplots, etc. To understand the relationship between heart disease status and maximum heart rate achieved in our case study, the two can be plotted against each other. If the graph shows the proportion of the population with heart disease increasing with heart rate, that is evidence the data behaves as expected.
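The same check can be done in tabular form. The sketch below, on invented data, bands maximum heart rate and computes the disease proportion per band, the numerical counterpart of the plot described above:

```python
import pandas as pd

# Invented sample: heart-disease status against maximum heart rate achieved.
df = pd.DataFrame({
    "max_heart_rate": [120, 135, 150, 160, 175, 185],
    "heart_disease":  [0,   0,   1,   0,   1,   1],
})

# Band the heart rates, then look at the disease proportion per band.
df["hr_band"] = pd.cut(df["max_heart_rate"],
                       bins=[100, 150, 200], labels=["<=150", ">150"])
proportions = df.groupby("hr_band", observed=True)["heart_disease"].mean()
print(proportions)
```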
Data can be rearranged into a more algorithm-friendly format using the art of feature engineering. There are various packages available in R and Python which help in this analysis. For example, the dplyr package in R is used for aggregating, filtering, creating new variables, etc, and the expss package in R helps in producing data tables.
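In Python, the pandas equivalents of those dplyr verbs look like the following sketch (toy data, illustrative only):

```python
import pandas as pd

# Toy records; the pipeline below filters rows, derives a new variable,
# and aggregates, mirroring dplyr's filter / mutate / summarise verbs.
df = pd.DataFrame({
    "age": [29, 45, 61, 70],
    "cholesterol": [180, 240, 190, 210],
    "heart_disease": [0, 0, 1, 1],
})

summary = (
    df[df["age"] >= 40]                                    # filter
      .assign(high_chol=lambda d: d["cholesterol"] > 200)  # create a new variable
      .groupby("high_chol")["heart_disease"].mean()        # aggregate
)
print(summary)
```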
It is equally important to understand the correlation between the key variables used in the analysis. A high degree of correlation is a problem because many model types assume the explanatory variables are independent of one another, and strong dependency between variables can destabilise the fitted model (multicollinearity). Therefore, one of the steps before the actual model fit is often to eliminate variables with a high degree of correlation.
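One common way to do this is to compute the correlation matrix and drop one member of each highly correlated pair. A sketch with a deliberately near-duplicate variable and an assumed threshold of 0.9:

```python
import numpy as np
import pandas as pd

# Synthetic data: bp_rescaled is an almost exact linear copy of blood_pressure.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "blood_pressure": x,
    "bp_rescaled": 2 * x + rng.normal(scale=0.01, size=200),
    "cholesterol": rng.normal(size=200),
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = df.drop(columns=to_drop)
print(to_drop)
```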
Model fitting
Building a data model to answer the research question involves deciding which validation framework and modelling approach to use. As discussed in earlier articles, one of the most popular classical approaches to modelling data is the humble regression model. Modern machine learning approaches such as decision trees, random forests and neural networks are just as readily available and widely used. The choice of modelling approach depends on various factors such as the research question, the type of dependent variable (binary, categorical, ordinal or numerical), the data in hand, costs, etc. Some models are particularly suited to dealing with a large number of variables while others are not. Model fitting therefore often becomes a trial-and-error process of finding the right fit to the data.
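As an illustration, the snippet below fits a logistic regression, one of the regression-family models mentioned above, to synthetic data shaped like our case study; the feature and outcome are simulated, not real heart-disease data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the heart-disease data: one rescaled feature
# (standing in for maximum heart rate) driving a binary outcome.
rng = np.random.default_rng(42)
hr_scaled = rng.uniform(-5, 5, size=300)       # rescaled maximum heart rate
p = 1 / (1 + np.exp(-hr_scaled))               # higher rate -> higher disease probability
y = (rng.uniform(size=300) < p).astype(int)

# Fit one candidate model; in practice several model families would be trialled.
model = LogisticRegression()
model.fit(hr_scaled.reshape(-1, 1), y)
preds = model.predict(hr_scaled.reshape(-1, 1))
print(model.coef_, model.intercept_)
```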
Model evaluation
There are a number of ways to evaluate a machine learning model: accuracy metrics and confusion matrices, ROC curves summarised by the AUC (area under the curve), pseudo R-squared, etc. As discussed in previous articles, we would typically split our dataset into training and test datasets. The training dataset is used in the model fitting process to develop a predictive model of heart disease. The test dataset is then used to compare out-of-sample predictions from the fitted model with the actual values in the test data.
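A sketch of that train/test workflow with scikit-learn, again on synthetic rather than real data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic binary-outcome data with three predictors (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
logit = X @ np.array([1.5, -1.0, 0.5])
y = (rng.uniform(size=400) < 1 / (1 + np.exp(-logit))).astype(int)

# Hold out a test set, fit on the training set, evaluate out of sample.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"accuracy={accuracy:.2f}, AUC={auc:.2f}")
```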
Communication of results
Presentation of results from the data science process to a non-technical audience – possibly including senior management, shareholders and directors – is the final and one of the most vital steps in the analysis. The results should be presented keeping in mind both the research question framed at the start and the audience’s ability to interpret output from the model.
The results from the data science process should be presented in a way that provides the stakeholders with actions that can be practically incorporated into the company’s systems and processes, along with ongoing support for the stakeholders involved. For example, once a model is fitted to the heart disease data, the health insurance company can feed health data from its current customers into the model to predict each customer’s probability of having heart disease. This helps the company reserve for the risk involved and charge premiums accordingly.
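Scoring current customers might then look like the following sketch; the model, features and customer values are all synthetic stand-ins, not a real pricing basis:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Fit a toy model on synthetic data (two standardised features, e.g. age
# and cholesterol -- both hypothetical).
rng = np.random.default_rng(7)
X_train = rng.normal(size=(200, 2))
y_train = (X_train[:, 0] + X_train[:, 1] + rng.normal(size=200) > 0).astype(int)
model = LogisticRegression().fit(X_train, y_train)

# Score two invented current customers: the model returns a probability of
# heart disease for each, which could feed into reserving and pricing.
current_customers = np.array([[0.2, 1.1], [-1.3, -0.4]])
probabilities = model.predict_proba(current_customers)[:, 1]
print(probabilities)
```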
Crucial data-driven decisions are best made by formulating a clear approach to the data science problem. These decisions require a clear understanding of the process from start to finish.
Data scientists can therefore approach a problem using the above sequence of steps. However, the data science process is not linear in every situation: there may be cases where the original research question turns out not to be the right one, and it is reframed during the data analysis itself.
Even so, this framework can be an extremely useful guideline to start a data science project, especially for newcomers to the data science world.