Over the last few months, I’ve had a lot of conversations with people about the use of winsorization to deal with heavy-tailed data that is positively skewed because of large outliers. After a conversation with my friend Chris Said this past week, it became clear to me that I needed to do some simulation studies to understand the design space of techniques for dealing with outliers.
In this post, I’m going to write up my current understanding of the topic. To motivate my perspective, I’m going to describe a specific inferential problem. Trying to solve that problem will require me to explore a small portion of the design space of algorithms for dealing with outliers. It is not at all clear to me how well the conclusions I reach will generalize, but my hope is that a thorough examination of one specific problem will touch on some of the core issues that would arise in other settings. I’ve also put the code for this simulation study on GitHub so that others can easily build variants of it.
To simplify things, I’m going to define an outlier as any value greater than a certain percentile of all of the observed data points. For example, I might say that all observations above the 90th percentile are outliers. In this case, I will say that the “outlier window” is 10%, because I will consider all observations above the (100 – 10)th percentile to be outliers. This outlier window parameter is the first dimension of the design space that I want to understand. Most applications employ a very gentle outlier window, which labels no more than 5% of observations as outliers. But sometimes one sees more radical windows used. In this post, I will consider windows ranging from 1% up to 40%.
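To make that concrete, here’s a minimal sketch of the threshold computation in Python, assuming numpy (the function name is illustrative, not necessarily what the GitHub code uses):

```python
import numpy as np

def outlier_threshold(xs, window):
    """Return the (100 - window)th percentile of xs, where window is the
    outlier window in percent. Values above this threshold are outliers."""
    return np.percentile(xs, 100 - window)

# With a 10% window, anything above the 90th percentile counts as an outlier.
xs = np.random.lognormal(mean=0.0, sigma=1.0, size=1_000)
is_outlier = xs > outlier_threshold(xs, window=10)
```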
Given that we’ve flagged some observations as outliers, what do we do with them? I’ll consider three options:
- Do nothing and leave the data unadjusted.
- Discard all of the outliers. The removal of extreme values is usually called trimming or truncation.
- Replace all of the outliers with the largest value that is not considered an outlier. The replacement of extreme values is usually called winsorization. (I sketch the two adjustments in code after this list.)
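Here’s a rough sketch of the two adjustments, again assuming numpy and illustrative names:

```python
import numpy as np

def trim(xs, threshold):
    """Trimming/truncation: discard every observation above the threshold."""
    return xs[xs <= threshold]

def winsorize(xs, threshold):
    """Winsorization: replace every observation above the threshold with the
    largest value that is not itself an outlier."""
    cap = xs[xs <= threshold].max()
    return np.where(xs > threshold, cap, xs)
```

Note that this sketch caps at the largest non-outlier observation rather than at the percentile itself; because percentiles are typically computed by interpolation, the two can differ slightly.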
As a further simplification, I’m only going to consider a single inferential problem: estimating the difference in means between two groups of observations. In this problem, I will generate data as follows:
- Draw (n) IID observations from a distribution (D). Label this set of observations as Group A.
- Draw (n) additional IID observations from the same distribution (D) and add a constant, (\delta), to every one of them. Label this set of observations as Group B.
In this post, the distribution, (D), that I will use is a log-normal distribution with mean 1.649 and standard deviation 2.161. These strange values correspond to a mean of 0 and a standard deviation of 1 on the pre-exponentiation scale used by my RNG.
Given that distribution, (D), I’ll consider three values of (\delta): 0.00, 0.05 and 0.10. Each value of (\delta) defines a new distribution, (D^{\prime}).
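A minimal sketch of this data-generating process (numpy again; the seed and the function name are illustrative):

```python
import numpy as np

def generate_groups(n, delta, rng):
    """Draw two groups of n IID log-normal observations whose
    pre-exponentiation scale has mean 0 and standard deviation 1,
    then shift every observation in Group B by the constant delta."""
    a = rng.lognormal(mean=0.0, sigma=1.0, size=n)
    b = rng.lognormal(mean=0.0, sigma=1.0, size=n) + delta
    return a, b

rng = np.random.default_rng(1)
a, b = generate_groups(n=1_000, delta=0.05, rng=rng)
```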
To contrast these distributions, I want to consider the quantity: (\delta = \mathbb{E}[D^{\prime}] - \mathbb{E}[D]). The outlier approaches I’ll consider will lead to several different estimators. I’ll refer to each of these estimators as (\hat{\delta}). From each estimator, I’ll construct point estimates and interval estimates; I’ll also conduct hypothesis tests based on the estimators using t-tests.
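As a sketch of this estimation step (assuming scipy; I use Welch’s t-test and a normal-approximation interval here for illustration):

```python
import numpy as np
from scipy import stats

def estimate_delta(a, b):
    """Point estimate of delta as the difference in group means, with a
    normal-approximation 95% interval and a Welch two-sample t-test."""
    delta_hat = np.mean(b) - np.mean(a)
    se = np.sqrt(np.var(a, ddof=1) / len(a) + np.var(b, ddof=1) / len(b))
    ci = (delta_hat - 1.96 * se, delta_hat + 1.96 * se)
    p_value = stats.ttest_ind(b, a, equal_var=False).pvalue
    return delta_hat, ci, p_value
```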
Given that we have two data sets, (A) and (B), applying outlier detection requires us to make another decision about our position in design space: which data sets do we use to compute the percentile that defines the outlier threshold?
I’ll consider three options:
- Compute a single percentile from a new data set formed by combining A and B. In the results, I’ll call this the “shared threshold” rule.
- Compute a single percentile using data from A only. In the results, I’ll call this the “threshold from A” rule.
- Compute two percentiles: one using data from A only and another using data from B only. We will then apply the threshold from A to A’s data and the threshold from B to B’s data. In the results, I’ll call this the “separate thresholds” rule. (All three rules are sketched in code after this list.)
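Here’s a sketch of the three rules (names are illustrative), each returning a pair of thresholds, one to apply to A and one to apply to B:

```python
import numpy as np

def thresholds(a, b, window, rule):
    """Compute the outlier thresholds for Groups A and B under one of the
    three rules. Returns a (threshold_for_a, threshold_for_b) pair."""
    q = 100 - window
    if rule == "shared threshold":
        t = np.percentile(np.concatenate([a, b]), q)
        return t, t
    if rule == "threshold from A":
        t = np.percentile(a, q)
        return t, t
    if rule == "separate thresholds":
        return np.percentile(a, q), np.percentile(b, q)
    raise ValueError(f"unknown rule: {rule}")
```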
Combining this new design decision with the others already mentioned, the design space I’m interested in has three dimensions:
- The outlier window.
- The method for addressing outliers, which is either (a) no adjustment, (b) trimming, or (c) winsorization.
- The rule for computing the outlier threshold, which is either (a) the shared threshold rule, (b) the threshold-from-A rule, or (c) the separate thresholds rule.
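Putting the three dimensions together, a simulation just walks the full grid; the sketch below enumerates it (the intermediate window values are illustrative, since above I only committed to a range of 1% up to 40%):

```python
from itertools import product

windows = [1, 5, 10, 20, 40]  # outlier windows in percent (illustrative grid)
methods = ["no adjustment", "trimming", "winsorization"]
rules = ["shared threshold", "threshold from A", "separate thresholds"]

for window, method, rule in product(windows, methods, rules):
    # Each (window, method, rule) triple is one cell of the design space; a
    # full simulation generates many data sets per cell and records how the
    # resulting estimator of delta behaves.
    print(window, method, rule)
```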