
WHAT IS CAUSAL INFERENCE AND WHY SHOULD DATA LEADERS AND DATA SCIENTISTS PAY ATTENTION? BY GRAHAM HARRISON

Graham Harrison is a hands-on data scientist with 20+ years of programming experience. He’s a Technology and Data Executive, an Institute of Directors Digital Ambassador, and Founder of the Data Science Consultancy. Graham’s qualifications include an Executive MBA in Leadership and Management from the University of Nottingham, and an Emeritus certificate in Applied Data Science from Columbia.

In this post, Graham explores the concept of causal inference, and the ways it can be used in machine learning. Causal inference, Graham argues, can have important applications for organisations in a number of ways:

Causal inference is the combined application of statistics, probability, machine learning and computer programming to answering the question “why?”.

In my work as a data scientist I have developed and implemented many machine learning algorithms that produced accurate predictions that have added significant value to organisational outcomes.

For example, accurate predictions of staff churn allow proactive intervention to support and encourage likely churners to stay and that insight can increase staff productivity and decrease recruitment costs.

However, that may not be enough. Following one successful machine learning prediction project, one of the business domain experts approached me and asked, “Why are the staff members identified as churners leaving the organisation?”

Dipping into my Data Science tool bag I used SHAP (SHapley Additive exPlanations) to show what features were contributing the greatest weights to the overall prediction and to individual cases.

This helped the customer to understand more about the way the algorithm worked and prompted their next question – “What do I need to change to stop churn happening in the first place, rather than just intervening for staff that might leave?” That question sent me off to do some research which led to some revelatory conclusions.

PREDICTION DOES NOT IMPLY CAUSATION

We all know the saying “correlation does not imply causation” but at the time I had not appreciated that this is equivalent to “prediction does not imply causation” for machine learning models.

Predictive models use the available features greedily to make the most accurate predictions against the training data. It may be that none of the features that enable the best predictions have any causal link to the outcome, and in the world of big data that may not matter.

If a retailer knows that every time it rains sales of butter will increase and this prediction is reliable, they will not care that there is no likely causal link; they will just stock the shelves with dairy products whenever the storm clouds gather.

Another example is the data-driven finding that people who go to bed in their shoes are more likely to have a headache the next day, even though the shoes have no causal link.

Rather, the causality may be attributable to the previous night’s alcohol-related activity, but an algorithm that relies on the “shoes” feature can still make accurate predictions, even though taking your shoes off is no headache cure.

A more serious implication is that I could not use the churn algorithm to answer the “why?” question for my customer. Prediction does not imply causation and hence taking pre-emptive action based on the correlative features in the model has the potential to cause neutral or even harmful outcomes.

The revelation that predictive models could not be used reliably to suggest preemptive and preventative changes to organisations led me to begin a learning journey into causal inference that has lasted ever since.

DESCRIPTIVE, PREDICTIVE AND PRESCRIPTIVE ANALYTICS

Descriptive analytics is the science of looking backwards at things that have already happened and making sense of them through a variety of techniques including graphs, charts, tables, interactive dashboards and other rich forms of data visualisation.

Good descriptive analytics enables leaders to make informed decisions that positively impact organisational outcomes whilst avoiding bias, noise and gut instinct.

Some approaches like data warehousing may involve data that is hours or even days out of date but even the most current descriptive data systems, for example the heads-up display in a fast jet, are still a few microseconds behind the real world.

Predictive analytics bucks that trend by probabilistically predicting what entities of interest like customers, suppliers, products, demand, staff etc. are likely to be doing in the next hour, day, month or year.

The insight gained through those predictions can then be used to inform interventions to improve organisational impact and outcomes (for example more sales or less staff churn).

Prescriptive analytics goes beyond understanding the past and making predictions about the future. It is the business of making recommendations for change informed by data, models and domain expertise that can improve outcomes by fundamentally altering the future that the models were predicting would happen.

For example, in the staff churn model the month in which staff started their employment was identified as a feature that the model used to inform its predictions.

That naturally leads to the question – “if staff start dates were delayed or accelerated to match the start month associated with the least churn, would churn decrease?”

This is the sort of question that causal inference techniques are starting to ask and answer, and the potential to add this type of analysis to the Data Science tool bag is why Data Scientists and leaders should pay attention to causal inference.

WHAT IS CAUSAL INFERENCE?

One potential definition of causal inference is “the study of understanding cause-and-effect relationships between variables while taking into account potential confounding factors and biases”.

Central to this idea is that causality cannot be established from the data alone, it needs to be supplemented by additional modelling elements to allow cause-and-effect to be proposed, explored, tested, established and used to prescribe outcomes.

For example, imagine we collected binary data recording the days on which a cockerel crowed and the sun rose, and asked the simple question “is the sun causing the crowing or vice-versa?” How would we know?

The answer seems obvious and trivial, but primitive civilisations had different theories about the cause of celestial events and our intuitive answer is informed by domain expertise – we know that the sun is a celestial object some 1.4 million kilometres across with a core temperature of around 15 million degrees Celsius, and that the cockerel is a male chicken that cannot influence the sun.

We have supplemented the observed data with domain knowledge that can be formalised into a diagram like the following…

This type of diagram is called a “Directed Acyclic Graph” (or DAG), commonly referred to as a “Causal Graph” or just a “Graph”, and it is the combination of a DAG and the observed data that forms the building blocks of causal inference techniques.
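
To make the idea concrete, here is a minimal sketch of the cockerel-and-sun DAG in Python using the networkx library; the node names are purely illustrative:

```python
import networkx as nx

# Domain knowledge, not the data, tells us which way the arrow points:
# the sun rising causes the crowing, never the other way round.
dag = nx.DiGraph()
dag.add_edge("SunRises", "CockerelCrows")

print(list(dag.edges()))                  # [('SunRises', 'CockerelCrows')]
print(nx.is_directed_acyclic_graph(dag))  # True - no cycles, so it is a valid DAG
```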

You may be thinking “is this cheating?” as we have leapt from no understanding of the causality to a hard statement about the cause-and-effect. In real-world examples DAGs are developed in consultation between Data Scientists and domain experts and once proposed there are methods for testing or “refuting” them to ensure that the proposal is reasonable.

There is another advantage to developing an understanding of DAGs, one that is not discussed in books and articles and that does not require any technical understanding of causal inference or the associated statistical and machine learning techniques.

As a Data Scientist and senior manager, I often find myself in management meetings listening intently to leaders’ views on what is happening in their organisations and what they believe the impacts and outcomes are.

Almost unconsciously I have found myself starting to doodle the causal relationships in a rough DAG and then backtracking down the arrows from effect to cause to underlying cause and interjecting in those meetings to ask, “is A the cause of B and how do you know?”

In this respect I am embedding causality into my daily habits and enabling and encouraging others to think the same way. Thinking about causality is adding value and delivering tangible benefits informally without going anywhere near a machine learning model!

CONTROLLED TRIALS AND CONFOUNDING

There is nothing new about causal methods per se; they have been around for a long time and are tried and tested in statistical and observational methods.

For example, if the causal effects of a new drug need to be established then A/B testing groups can be set up with group A given the drug and group B given no drug or a placebo.

The recovery outcomes of the two groups can then be measured and conclusions drawn about the efficacy of the new drug to inform whether it should be approved, withdrawn or recommended for further testing.

There are several problems with this approach. First, let us imagine that individuals were given the choice whether they wanted to join Group A and take a new drug or Group B and take the placebo.

One likely issue is that the groups could become self-selecting. It might be that fitter, healthier people choose to join Group A, and they are more likely to have a better recovery irrespective of the drug. A potential solution is to randomly assign individuals to the groups and not tell them whether the pill they are taking is the drug or a placebo, which overcomes the self-selection problem.

There are other challenges though. What if assigning individuals into the groups were immoral or unethical?

For example, if the study is looking at the effect of smoking or the effect of obesity how could the individuals be forced to smoke or to be obese? In this instance it is impossible to avoid the self-selecting problem without serious ethical concerns.

Yet another problem is that the trial we are interested in may already have taken place. It is possible that the individuals have been assigned, the drug taken and the outcomes and observations carefully recorded but it is too late to influence the group membership.

It turns out that all of these problems can be addressed by understanding and applying causal inference techniques including the apparent magic trick of going back in time to simulate everyone either taking or not taking the drug.

BACKDOOR ADJUSTMENT

Returning to the drug example, what if something that cannot be controlled is having an impact on the trial?

For example, what if males are more likely to take the drug but females have a better recovery rate? In this instance gender will influence both the treatment (whether the drug is taken) and the outcome (the recovery rate).

That sounds like an intractable problem, but the starting point is always a Directed Acyclic Graph…

The terminology used in causal inference is that G is confounding D and R. Simply stated, the isolated effect of the drug on recovery is mixed in with the effect of gender on both taking the drug and recovery.
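
Expressed in code (again a minimal, illustrative sketch using networkx), the DAG described above is just three directed edges:

```python
import networkx as nx

# The drug-trial DAG (Figure 2): gender (G) influences both taking the
# drug (D) and recovery (R), and taking the drug also influences recovery.
drug_dag = nx.DiGraph([("G", "D"), ("G", "R"), ("D", "R")])

print(sorted(drug_dag.edges()))                 # [('D', 'R'), ('G', 'D'), ('G', 'R')]
print(nx.is_directed_acyclic_graph(drug_dag))   # True
```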

The desired outcome is to “de-confound” or isolate the impact of the drug on recovery so that an informed choice can be made about whether to recommend or withdraw the drug.

The pattern of causality in Figure 2 is called “the backdoor criterion” because there is a backdoor link between D and R through G and this causes the true effect of the drug to be lost because it is mixed in with the effect of gender.

This could be the end of the road for the usefulness of the historical observational data but causal inference techniques can be applied that are capable of simulating the following –

  1. Travelling back in time and forcing everyone in the trial to take the drug.
  2. Observing and recording the impact on recovery.
  3. Repeating the time-travel trick and this time forcing everyone to avoid the drug.
  4. Observing and recording the impact again.
  5. Performing a simple subtraction of the first set of results from the second to reveal the true effect of the drug on recovery.

The “interventions” at points 1 and 3 are expressed as –

  1. P(Recovery=1 | do(Drug=1)) i.e., the probability of recovery given an intervention forcing everyone to take the drug
  2. P(Recovery=1 | do(Drug=0)) i.e., the probability of recovery given an intervention forcing everyone to not take the drug.

Causal inference implements this magic trick by applying something called “the backdoor adjustment formula” –
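
In its standard form, for a treatment X, an outcome Y and an observed confounder Z, the formula states that the interventional probability can be computed by averaging over the confounder:

P(Y=y | do(X=x)) = Σz P(Y=y | X=x, Z=z) · P(Z=z), where the sum runs over the values z of the confounder Z.

In the drug trial example, X is the drug (D), Y is recovery (R) and Z is gender (G).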

A detailed explanation of the maths can be found in my article on the Towards Data Science website ¹ but the key takeaway is that the intervention on the left-hand side, i.e. P(Y | do(X)), can be rearranged and expressed purely in terms of observational data on the right-hand side.
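
As a minimal sketch of what that right-hand side looks like in practice, using a small made-up dataset, the adjusted recovery probabilities and the resulting treatment effect can be computed directly with pandas:

```python
import pandas as pd

# Hypothetical observational data: one row per trial participant.
df = pd.DataFrame({
    "gender":    ["M", "M", "M", "F", "F", "F", "M", "F"],
    "drug":      [1,   1,   0,   1,   0,   0,   1,   0],
    "recovered": [1,   0,   0,   1,   1,   1,   1,   0],
})

def p_recovery_do_drug(data, drug_value):
    """Backdoor adjustment: P(R=1 | do(D=d)) = sum_g P(R=1 | D=d, G=g) * P(G=g)."""
    total = 0.0
    for g, p_g in data["gender"].value_counts(normalize=True).items():
        subset = data[(data["gender"] == g) & (data["drug"] == drug_value)]
        if len(subset) > 0:
            total += subset["recovered"].mean() * p_g
    return total

# The effect of the drug on recovery, de-confounded from gender.
effect = p_recovery_do_drug(df, 1) - p_recovery_do_drug(df, 0)
print(f"Estimated causal effect of the drug on recovery: {effect:.2f}")
```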

From this point there are only a few more steps to be able to de-confound or isolate the effect of any treatment on any effect no matter how complex the causal relationships captured in the DAG or how big the dataset is.

When I first understood the implications of backdoor adjustment it was a light-bulb moment for me. I suddenly saw the potential for starting to answer all those “what-if?” type questions that domain users inevitably ask in their desire not only to intervene on predictions but to create a new future for their organisations.

However, the magic of causal inference does not stop there, it gets even better!

FRONT-DOOR ADJUSTMENT

Returning to the drug example again, let us assume that the gender of the participants was confounding both the drug taking and recovery but that it had not been recorded during the observations –

This pattern is called an “unobserved confounder” and it is very common in causal inference. For example, in the staff churn example the data team became convinced that “staff commitment” was having a causal effect on churn but not only was it not measured but no-one had any idea how to measure it.

In the example in Figure 5 below, if something is confounding both the drug taking and recovery, but nothing is known about it and no measurements have been taken, surely it must be impossible to repeat the magic of backdoor adjustment?

Well, not quite. If the causal relationships are limited to just the treatment and outcome then the effect of the unobserved confounder cannot be isolated, but if there is an intermediary between drug taking and recovery it can be done. Here is an example –

In this example, taking the drug (D) has a causal impact on blood pressure (B) and this change in blood pressure is then having a causal impact on recovery (R). The confounder of both taking the drug and recovery remains unobserved.

When this pattern exists, front-door adjustment can be applied to isolate the effect of D on R even where an unobserved confounder affects both –
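
In its standard form, for a treatment X, an outcome Y and a mediator Z, the front-door adjustment formula reads:

P(Y=y | do(X=x)) = Σz P(Z=z | X=x) · Σx′ P(Y=y | X=x′, Z=z) · P(X=x′), where the outer sum runs over the values z of the mediator Z and the inner sum over the values x′ of the treatment X.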

If you are interested in the maths, please check out my article on the Towards Data Science website ² but again the key takeaway is that the “intervention” expressed on the left-hand side can be re-written and expressed solely in terms of observational data on the right-hand side. (Please note that in the drug trial example D=X, R=Y and B=Z.)
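
A minimal sketch of that calculation, again with a small made-up dataset and with blood pressure simplified to a binary “high”/“normal” reading, looks like this:

```python
import pandas as pd

# Hypothetical observational data: drug taken (D), blood pressure reading (B)
# and recovery (R). The unobserved confounder is, by definition, not a column.
df = pd.DataFrame({
    "D": [1, 1, 1, 1, 0, 0, 0, 0, 1, 0],
    "B": ["high", "high", "normal", "high", "normal",
          "normal", "high", "normal", "normal", "high"],
    "R": [1, 1, 0, 1, 0, 0, 1, 0, 1, 0],
})

def p_recovery_do_drug_frontdoor(data, d):
    """Front-door adjustment:
    P(R=1 | do(D=d)) = sum_b P(B=b | D=d) * sum_d' P(R=1 | D=d', B=b) * P(D=d')."""
    p_d_prime = data["D"].value_counts(normalize=True)                        # P(D=d')
    p_b_given_d = data.loc[data["D"] == d, "B"].value_counts(normalize=True)  # P(B=b | D=d)
    total = 0.0
    for b, p_b in p_b_given_d.items():
        inner = 0.0
        for d_prime, p_dp in p_d_prime.items():
            subset = data[(data["D"] == d_prime) & (data["B"] == b)]
            if len(subset) > 0:
                inner += subset["R"].mean() * p_dp
        total += p_b * inner
    return total

effect = p_recovery_do_drug_frontdoor(df, 1) - p_recovery_do_drug_frontdoor(df, 0)
print(f"Front-door estimate of the drug's effect on recovery: {effect:.2f}")
```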

This was another revelatory moment for me. At the time I began researching front-door adjustment I was working on a project that had shown a clear correlation between students engaging in physical activity and positive learning outcomes. However, a theory emerged that more committed students may have chosen to engage in the activity and also studied harder for their exams, which might have confounded the effect of the physical activity.

“Learner commitment” was an unobserved confounder and hence the front-door adjustment formula was applied to demonstrate that the activities were having a positive impact on outcomes irrespective of any confounding.

I hope by now you are starting to feel a stir of excitement about the possible applications of causal inference techniques and thinking about how you might apply them to deliver meaningful impact and outcomes.

THE CHALLENGES AND LIMITATIONS OF CAUSAL INFERENCE TECHNIQUES

As with any branch of machine learning there are challenges and limitations that need to be understood and appreciated in order to know when causal inference is and is not an appropriate tool.

One of the biggest challenges is the relatively immature state of the publicly available libraries that provide implementations of causal inference for Python and other programming languages.

There are a number of libraries that are emerging as the front runners. Two that I have used extensively are pgmpy ³ and DoWhy ⁴.

pgmpy is more lightweight, making it easy to use, whilst DoWhy has more advanced features but is more difficult to get started with. Both have challenges and limitations.

Machine learning algorithms like linear regression and classification have been standardised in the scikit-learn library. All the predictors in sklearn implement the same interface so they are easy for Data Scientists to learn and the interfaces are so intuitive that it is not necessary to have a deep knowledge of the maths in order to use them.
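
For example (with made-up data), every sklearn predictor follows the same fit/predict convention, so swapping one model for another requires no change to the surrounding code:

```python
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

X = [[1], [2], [3], [4]]   # toy feature matrix
y = [1.9, 4.1, 6.2, 7.8]   # toy target values

for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
    model.fit(X, y)                      # identical interface for every predictor
    print(type(model).__name__, model.predict([[5]]))
```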

In contrast there is no standardisation across causal inference libraries and whilst DoWhy can do some impressive things it requires a lot of dedicated research and effort to become competent. Also, the documentation is weak and there are nowhere near as many coding examples as there are for more traditional machine learning algorithms.
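
To give a flavour, here is a sketch of a typical DoWhy workflow for the drug example. It assumes DoWhy’s CausalModel interface and simulated data; the exact method names and options may differ between library versions, so please check the current documentation:

```python
import numpy as np
import pandas as pd
from dowhy import CausalModel

# Simulated observational data: gender confounds both drug-taking and recovery.
rng = np.random.default_rng(0)
gender = rng.integers(0, 2, 1000)                                    # 1 = male
drug = rng.binomial(1, 0.3 + 0.4 * gender)                           # males more likely to take the drug
recovered = rng.binomial(1, 0.2 + 0.3 * drug + 0.3 * (1 - gender))   # females recover better
df = pd.DataFrame({"gender": gender, "drug": drug, "recovered": recovered})

# The drug-trial DAG expressed in DOT notation.
model = CausalModel(
    data=df,
    treatment="drug",
    outcome="recovered",
    graph="digraph {gender -> drug; gender -> recovered; drug -> recovered;}",
)

# Identify the estimand (backdoor via gender), estimate the effect, then try to refute it.
estimand = model.identify_effect(proceed_when_unidentifiable=True)
estimate = model.estimate_effect(estimand, method_name="backdoor.linear_regression")
refutation = model.refute_estimate(estimand, estimate, method_name="random_common_cause")

print(estimate.value)
print(refutation)
```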

Beyond the challenges of getting started, all the current causal inference libraries have functionality limitations. For example, pgmpy can perform a backdoor adjustment but it does not work for unobserved confounders. It does not implement front-door adjustment or another common technique called instrumental variables.

DoWhy does implement backdoor, front-door and instrumental variable for a causal calculation called “Average Treatment Effect” (ATE) but it does not work for unobserved confounders in “do” operations.

So if you want to develop causal inference solutions you will have to spend more time learning the theory than you would for a regressor or classifier and you will likely have to dance around the limitations of the existing libraries.

However, I have generally found that the answers are out there in books and online articles if you look hard enough, and the available resources are increasing in volume and quality all the time.

Another limitation is that the DAG must accurately capture the causal relationships or the calculations will be wrong. There are emerging techniques for testing or “refuting” a proposed DAG but this stage will always require domain expertise and hard work as DAGs cannot be established from the data alone.

In my experience though, domain users enjoy getting involved in working out the causal relationships and well-facilitated workshops and analysis sessions usually produce good DAG models to use in the calculations.

There are also moral and ethical concerns but these are common to all machine learning and artificial intelligence and can be addressed by considering transparency, giving the control to the customers (i.e., automatic opt-out, voluntary opt-in) and by building solutions that deliver clear customer benefit.

THE FUTURE OF CAUSAL INFERENCE

It is natural for human beings to think about “what-if?” type questions.

While there is no empirical proof for this theory, it is reasonable to assume that human beings may create a version of a DAG inside their minds and then re-run different scenarios to imagine what today would be like if yesterday had been different, or what tomorrow might look like based on the choices available today.

Descriptive and predictive analytics will always be mainstays of Data Science, but causal inference will add another set of tools into the Data Science tool bag with the potential to contribute to organisational and societal outcomes by answering the big questions like “why?” and “what if?”.

After all, if you had the choice of having a regression model that could interpolate the expansion of the universe back to 300,000 years after the big bang or a causal model that could tell you why the universe was created, which one would provide you with the most startling insight?
