Predictive analytics - underlying logic is a causal model

The June issue of PT In Motion has an article called “The Power of Prediction.” The article proclaims: “While the use of predictive analytics in physical therapy is still in its infancy, early adopters see great potential for it to transform the profession and society.”

Predictive analytics relies on an underlying causal model, either implicitly or explicitly. The value of making that model explicit is that users have an opportunity to learn about the assumptions in the model, and hence in the predictions made from it. These assumptions are separate from the statistical estimates of the standard error in the model, and thus separate from the estimates of the confidence interval.

Recall from an earlier post on models - “All models are wrong, some models are useful.” George Box.

PTs will need to be well versed in modeling (causal modeling) and statistical inference to be informed consumers of the various predictive analytics products available to them. For example, being told the confidence interval of a model’s estimate is very different from being told how accurately the model predicted the outcome in a training data set, versus a testing data set, versus some external data set. Training and testing data sets are created when the original sample from which data were collected is randomly divided into two groups. One group is used to train the model (that is, to derive the parameters for the equations linking the variables), and the second group is used to test the model by attempting to predict the outcome from those derived parameters.

Using training and testing data sets has the benefit that both come from the same sample. If the sample is large enough, and if it is randomly divided, then there should be no statistically significant differences in attributes between the testing and training groups. So high predictive ability with a testing data set is an important first step, but it still does not tell you how the predictive model will fare against another sample. What differences between the samples matter the most? Those that relate to assumptions in the model, that is, differences in unmeasured, unobserved, latent variables.
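The random division into training and testing groups can be sketched in a few lines of Python. This is only an illustration under made-up assumptions: the simulated data, the single continuous predictor, the straight-line model, and every function name here are hypothetical and are not taken from any product or from the article.

```python
import random

def simulate_sample(n, rng):
    """Hypothetical data: one continuous predictor x and a noisy outcome y."""
    sample = []
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)
        y = 2.0 + 0.5 * x + rng.gauss(0.0, 1.0)  # assumed relationship, illustration only
        sample.append((x, y))
    return sample

def random_split(sample, rng, train_fraction=0.5):
    """Randomly divide one sample into a training group and a testing group."""
    shuffled = sample[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def fit_line(train):
    """Derive intercept and slope from the training group (ordinary least squares)."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def mean_squared_error(group, intercept, slope):
    """Average squared gap between the observed and the predicted outcome."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in group) / len(group)

rng = random.Random(42)
sample = simulate_sample(2000, rng)
train, test = random_split(sample, rng)

# Parameters come from the training group only.
intercept, slope = fit_line(train)

# The same parameters are then used to predict in both groups.
train_mse = mean_squared_error(train, intercept, slope)
test_mse = mean_squared_error(test, intercept, slope)
print(f"training MSE: {train_mse:.3f}, testing MSE: {test_mse:.3f}")
```

Because both groups come from the same sample, the two error values should land close together. That closeness says nothing yet about a sample drawn from somewhere else.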

Using a previously posted model (here) published on fall risk:

dagitty-model

A predictive model of fall risk based on this causal structure would simply need two variables: Decreased LE Power (binary) or LE Power (continuous), and Balance (binary or continuous). A model could be trained and tested in a sample with similar characteristics, let’s say all elderly subjects (over 70 years) living in an assisted living community. Training group data yield the parameters for the model equations; testing group data are used to test whether the model equations accurately predict fall risk in the testing group. These two groups, drawn from the same sample, are similar in any unmeasured ways that characterize people over 70 living in an assisted living community. This model can therefore be expected to perform best when applied to the same sort of sample: not a sample living in a nursing home, or a sample living at home, and not a group more or less active than the training and testing sample. Where people attempt to walk will influence their risk (i.e., an active lifestyle) in a way that this model does not account for, and the impact of having very low power is outside the range of this model’s parameters for a group in a nursing home. This model also does not account for cognitive status, which could be very important in a nursing home sample, where attempting to get up when they shouldn’t is a major cause of falling.
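A small simulation can make the point about unmeasured variables concrete. Everything in this sketch is assumed for illustration: the fall probabilities, the prevalences in each setting, and the function names are invented and are not estimates from any study. The model sees only the two measured variables (low LE power and poor balance), while cognitive status affects falls but is never recorded.

```python
import random

def simulate_setting(n, p_low_power, p_poor_balance, p_impaired_cognition, rng):
    """Hypothetical residents as (low_power, poor_balance, fell) tuples.
    Cognitive status influences falling but is not recorded,
    mimicking an unmeasured variable."""
    people = []
    for _ in range(n):
        low_power = rng.random() < p_low_power
        poor_balance = rng.random() < p_poor_balance
        impaired = rng.random() < p_impaired_cognition
        # Made-up fall probabilities, for illustration only.
        p_fall = 0.10 + 0.20 * low_power + 0.20 * poor_balance + 0.35 * impaired
        people.append((low_power, poor_balance, rng.random() < p_fall))
    return people

def fit_cell_rates(train):
    """'Train' the model: observed fall rate for each power/balance combination."""
    counts = {}
    for low_power, poor_balance, fell in train:
        falls, total = counts.get((low_power, poor_balance), (0, 0))
        counts[(low_power, poor_balance)] = (falls + fell, total + 1)
    return {cell: falls / total for cell, (falls, total) in counts.items()}

def calibration_gap(group, rates):
    """Observed fall rate minus mean predicted risk (near 0 when calibrated on average).
    Assumes every power/balance combination appeared in the training group."""
    observed = sum(fell for _, _, fell in group) / len(group)
    predicted = sum(rates[(lp, pb)] for lp, pb, _ in group) / len(group)
    return observed - predicted

rng = random.Random(7)

# Assisted living sample, randomly divided into training and testing groups.
assisted = simulate_setting(10000, 0.4, 0.4, 0.05, rng)
rng.shuffle(assisted)
train, test = assisted[:5000], assisted[5000:]

# External sample: nursing home, where impaired cognition is far more common.
nursing = simulate_setting(5000, 0.7, 0.7, 0.60, rng)

rates = fit_cell_rates(train)
test_gap = calibration_gap(test, rates)
external_gap = calibration_gap(nursing, rates)
print(f"testing gap: {test_gap:+.3f}, nursing home gap: {external_gap:+.3f}")
```

By construction, the model should underestimate the nursing home fall rate by roughly 0.35 × (0.60 − 0.05) ≈ 0.19 on average, even though it looks well calibrated in the testing group. No amount of testing within the assisted living sample would reveal that gap, because the variable responsible for it was never measured.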

Predictive analytics are based on causal models. Data allow us to test for associations and to derive parameters. What is also necessary is a rational process of interpretation that fits the data and uses them to test the assumptions of a causal model. Training complex models without any consideration of the causal model makes for a predictive analytic tool that is less helpful to the underlying clinical reasoning. Saying that a big data predictive model is “associational” (that is, purely based on statistical associations and not causes) is an assumption that undermines the entire purpose of a predictive model. Only when we understand the cause-effect structure can we claim to predict future events based on values of causes and the resultant (future) changes to effects. The causal structure helps determine the generalizability of the model and the samples in which it is most likely to achieve the same predictive accuracy.

For students taking pathology, what does this mean? It means that if you are looking to make a DAG for prognosis (prediction), you simply make a DAG of the cause-effect structure that runs from pathomechanisms to clinical manifestations, since clinical manifestations would almost always include the set of outcomes we are interested in predicting. Remember, clinical manifestations include the consequences of pathomechanisms.
