Context Compatibility in Data Analysis

Roger Peng · 2018/05/24

All data arise within a particular context and often as a result of a specific question being asked. That is all well and good until we attempt to use that same data to answer a different question within a different context. When you match an existing dataset with a new question, you have to ask if the original context in which the data were collected is compatible with the new question and the new context. If there is context compatibility, then it is often reasonable to move forward. If not, then you either have to stop or come up with some statistical principle or assumption that makes the two contexts compatible. Good data analysts will often do this in their heads quickly and may not even realize they are doing it. However, I think explicitly recognizing this task is important for making sure that data analyses can provide clear answers and useful conclusions.

Understanding context compatibility is increasingly important as data science and the analysis of existing datasets continue to take off. Existing datasets all come from somewhere, and it’s important for the analyst to know where that is and whether it’s compatible with where they’re going. If there is an incompatibility between the two contexts, which in my experience is almost always the case, then any assumption or statistical principle invoked will likely introduce uncertainty into the final results. That uncertainty should at least be communicated to the audience, if not formally considered in the analysis. In some cases, the original context of the data and the context of the new analysis will be so incompatible that it’s not worth using the data to answer the new question. Explicit recognition of this problem can save a lot of wasted time analyzing a dataset that is ultimately a poor fit for answering a given question.

I wanted to provide two examples from my own work where context compatibility played an important role.

Example: Data Linkage and Spatial Misalignment

Air pollution data tends to be collected at monitors, which can reasonably be thought of as point locations. Air pollution data collected by the US EPA is primarily collected to monitor regulatory compliance. The idea here (roughly) is that we do not want any part of a county or state to be above a certain threshold of air pollution, and so we strategically place the monitors in certain locations and track their values in relation to air quality standards. The monitors may provide representative measurements of air pollution levels in the general area surrounding the monitor, but how representative they are depends on the specific pollutant being measured and the nature of pollution sources in the area. Ultimately, for compliance monitoring, it doesn’t really matter how representative the monitors are because an exceedance at even one location is still a problem (the regulations have ways of smoothing out transient large values).

Health data tends to be measured at an aggregate level, particularly when it is coming from administrative sources. We might know daily counts of deaths or hospitalizations in a county or province or post code. Linking health data with air pollution data is not directly possible because of a context mismatch: health data are measured areally (counts of people living within some political boundary) and pollution data are measured at point locations, so there is an incompatibility in the spatial measurement scale. We can only link these two data sources together if we do one of the following:

  1. Assume that the monitor values are representative of population exposure in the entire county

  2. Develop a model that can make predictions of pollution levels at all points in the county and then take the average of those predictions as representative of average county levels

This problem is well-known in spatial statistics and is referred to as spatial misalignment or as change of support. The misalignment of the pollution and health data is the context mismatch here and arises because of the different measurement schemes that we use for each type of data. As a result, we must invoke either an assumption or a statistical model to link the two together.

The assumption of representativeness is easier to make because it requires no additional work, but it can introduce unknown uncertainties into the problem if the pollution values are not representative of population exposure. If a pollutant is regional in nature and is spatially homogeneous, then the assumption may be reasonable. But if there are a lot of hyper-local sources of pollution that introduce spatial heterogeneity, the assumption will not hold. The statistical modeling approach is more work, but it is straightforward (in principle) and may offer the ability to explicitly characterize the uncertainty introduced by the modeling. In both cases, there is a statistical price to pay for linking the datasets together.
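To make the first option concrete, here is a minimal sketch in base R using made-up monitor and mortality data (the counties, dates, and pollutant values are purely illustrative): the point-level readings are collapsed to a within-county daily average, which is then merged onto the areal health counts under the representativeness assumption.

```r
## Minimal illustrative sketch with simulated data (not a real analysis):
## link point-level monitor readings to county-level daily death counts
## by assuming the within-county monitor average represents population exposure.
set.seed(1)
dates <- seq.Date(as.Date("2018-01-01"), by = "day", length.out = 30)

## Four point-location monitors, two in each of two hypothetical counties
monitors <- data.frame(
  county  = rep(c("A", "B"), each = 2 * 30),
  monitor = rep(1:4, each = 30),
  date    = rep(dates, 4),
  pm25    = rnorm(120, mean = 12, sd = 4)
)

## Daily death counts measured at the county (areal) level
deaths <- data.frame(
  county = rep(c("A", "B"), each = 30),
  date   = rep(dates, 2),
  deaths = rpois(60, lambda = 20)
)

## Option 1: average the monitors within each county, then merge onto the
## health data -- the representativeness assumption does all the work here
exposure <- aggregate(pm25 ~ county + date, data = monitors, FUN = mean)
linked   <- merge(deaths, exposure, by = c("county", "date"))
head(linked)
```

Option 2 would replace the simple within-county average with predictions from a spatial model of the pollution surface, which is more work but makes it possible to quantify the uncertainty that the linkage introduces.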

Data linkage is a common place to encounter context mismatches because different datasets are rarely collected with each other in mind. Therefore, careful attention must be paid to the context within which each dataset was collected and to what assumptions or modeling must be done in order to achieve context compatibility.

Example: The GAM Crisis

A common way to investigate the acute or short-term associations between air pollution levels and health outcomes is via time series analysis. The general idea is that you take a time series of air pollution levels, typically from an EPA monitor, and then correlate it with a time series of some health outcome (often death) in a population of interest. The tricky part, of course, is adjusting for the variety of factors that may confound the relationship between air pollution and health outcomes. While some factors can be measured and adjusted for directly (e.g. temperature, humidity), other factors are unmeasured and we must find some reasonable proxy to adjust for them.

In the late 1990s, investigators started using generalized additive models (GAMs) to account for unmeasured temporal confounders in air pollution time series models. With GAMs, you could include smooth functions of time itself in order to adjust for any (smoothly) time-varying factors that may confound the relationship between air pollution and health. It wasn’t a perfect solution, but it was a reasonable and highly flexible one. It didn’t hurt that there was already a nice S-PLUS software implementation that could be easily run on existing data. By 2000, most investigators had standardized on using the GAM approach in air pollution time series studies.
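To give a rough sense of what these models looked like, here is a sketch using simulated data and the R gam package (a descendant of the original S implementation); the variable names, degrees of freedom, and effect sizes are all made up for illustration.

```r
## Illustrative sketch only: simulated daily time series, not real data.
## Uses the R 'gam' package, which fits GAMs by backfitting as in the
## original S/S-PLUS software.
library(gam)

set.seed(1)
n    <- 365 * 5                                            # five years of daily data
time <- seq_len(n)
temp <- 20 + 10 * sin(2 * pi * time / 365) + rnorm(n)      # seasonal temperature
pm10 <- 30 + 5 * sin(2 * pi * time / 365) + rnorm(n, sd = 10)
mu   <- exp(3 + 0.1 * sin(2 * pi * time / 365) + 0.0005 * pm10)
dat  <- data.frame(deaths = rpois(n, mu), time, temp, pm10)

## Poisson GAM: smooth functions of time and temperature stand in for
## unmeasured, smoothly varying confounders; PM10 enters linearly
fit <- gam(deaths ~ s(time, df = 30) + s(temp, df = 6) + pm10,
           family = poisson, data = dat)
summary(fit)
```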

In 2002, investigators at Johns Hopkins discovered a problem with the GAM software related to its default convergence criterion. The criterion that determined whether the backfitting algorithm used to fit the model had converged was set by default to 0.0001, which is more than sufficient for most applications of GAMs. The typical application was scatterplot smoothing to look at potential nonlinearities in the relationship between the outcome and a predictor. However, in models where the nonparametric terms were highly correlated (a situation referred to as “concurvity”), the default criterion was not stringent enough. It turned out that concurvity was quite common in the time series models, and the end result was that most models published in the air pollution literature had not actually converged in the fitting process.
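Tightening the convergence criterion amounts to changing a control parameter when fitting the model. The sketch below continues the simulated example above; the argument names are those of the R gam package’s gam.control() (they may differ from the original S-PLUS settings), and the specific values are arbitrary, chosen only to be far stricter than the default.

```r
## Refit the same model with a much stricter convergence criterion for both
## the local scoring and backfitting loops (values chosen to be conservative).
## Reuses `dat` and `fit` from the sketch above.
library(gam)

strict <- gam(deaths ~ s(time, df = 30) + s(temp, df = 6) + pm10,
              family = poisson, data = dat,
              control = gam.control(epsilon = 1e-10, bf.epsilon = 1e-10,
                                    maxit = 1000, bf.maxit = 1000))

## With high concurvity, the pollution coefficient can shift noticeably
## between the default and strict settings
cbind(default = coef(fit)["pm10"], strict = coef(strict)["pm10"])
```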

Around this time, the US EPA was reviewing the evidence from time series studies and asked the investigators who had published most of the studies to redo their analyses using alternative approaches (including a more stringent convergence criterion). To make a long story short, many of the published risk estimates dropped by around 40%, but the overall story remained the same. There was still strong evidence of an association between air pollution and health; it just wasn’t as large an effect as we once thought. Francesca Dominici wrote a comprehensive post-mortem of the saga that contains many more details.
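One way to sidestep the backfitting issue entirely, and one of the kinds of alternative approaches used in the reanalyses, is to replace the nonparametric smooths with natural cubic splines in a fully parametric Poisson regression. The sketch below again continues the simulated example; the degrees of freedom are arbitrary.

```r
## Fully parametric alternative: natural cubic splines in a Poisson GLM,
## which involves no backfitting step whose convergence needs checking.
## Reuses the simulated data frame `dat` from the sketches above.
library(splines)

par_fit <- glm(deaths ~ ns(time, df = 30) + ns(temp, df = 6) + pm10,
               family = poisson, data = dat)
coef(summary(par_fit))["pm10", ]
```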

The underlying problem here was an undetected shift in context with respect to the GAM software. In previous uses of GAMs, the default convergence criterion was likely fine because there were not strong dependencies between the various smoother components in the model and the relationships being modeled did not have time series characteristics. However, when the same GAM software was used in a totally different context, one the original authors likely did not foresee, suddenly the same convergence criterion was inadequate. The low-concurvity environment of previous GAM analyses was incompatible with the high-concurvity environment of air pollution time series analysis. The lesson here is that software used in a context different from the one in which it was developed is essentially new software. And like any new software, it requires testing and validation.

Summary

Context shifts are critically important to recognize because they can often determine whether the analyses you are conducting are valid or not. They are particularly important to discuss in data science applications, where often the data are pre-existing but are being applied to a new problem or question. Methodologies and analytic approaches that are totally reasonable under one context can be inappropriate or even wrong under a different context. Finally, any assumptions made or models applied to achieve context compatibility can have an effect on the final results, typically in the form of increased uncertainty. These additional uncertainties should not be forgotten in the end, but rather should be communicated to the audience or formally incorporated in the analysis.

You can hear more from me and the JHU Data Science Lab by subscribing to our weekly newsletter Monday Morning Data Science.