It's an unfortunate fact of life that data are not well-behaved. "Outliers"--unusual data values--crop up in most research projects involving data collection.
This is especially true in observational studies where data may naturally have unusual values, even if they come from reputable sources. Data entry errors or rare events (such as a thermometer left in the sun, a change in accounting practices, or a subject who has a sudden muscle spasm)--all these and many more are reasons for outliers to exist in a dataset.
Likely Sources of Outliers
Data errors. When looking for the source of outlying observations, first check for data recording or entry errors. To reduce the occurrence of data recording errors, use a spreadsheet program such as Excel for data entry. With large datasets, computer programs can be written to identify data entry errors. SAS is a particularly good tool for this purpose.
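As a rough illustration of such a checking program, here is a minimal Python sketch (not SAS) that flags values that are missing, non-numeric, or outside a plausible range. The field names, the limits, and the sample records are all hypothetical; real limits have to come from knowledge of the subject matter.

```python
# Hypothetical plausibility limits for two made-up fields.
LIMITS = {"age_years": (0, 120), "temp_f": (-60, 130)}

def find_entry_errors(records, limits=LIMITS):
    """Return (row index, field, value) for values that are missing,
    non-numeric, or outside the plausible range for that field."""
    problems = []
    for i, row in enumerate(records):
        for field, (low, high) in limits.items():
            value = row.get(field)
            if not isinstance(value, (int, float)) or not (low <= value <= high):
                problems.append((i, field, value))
    return problems

# Made-up records for illustration only.
records = [
    {"age_years": 34, "temp_f": 98.6},
    {"age_years": 340, "temp_f": 98.6},  # likely an extra keystroke
    {"age_years": 51, "temp_f": None},   # missing reading
]
print(find_entry_errors(records))
```

A check like this only finds impossible values. A value that is wrong but still plausible will pass, which is why the detection methods discussed later are still needed.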
"Rare" event syndrome. Another reason for outliers is the "rare" event syndrome--extreme observations that for some legitimate reason do not fit within the typical range of other data values. Such unusual observations might include
- a 70 degree day in January in Oregon
- a 500 point rise/drop in a stock market index
- an unusually high score on an aggressiveness scale for a troubled child
All these events may be quite unusual, but they're still part of the overall picture.
Why Are They a Problem?
Developing techniques to look for outliers and understanding how they impact data analysis are extremely important parts of a thorough analysis, especially when statistical techniques are applied to the data.
For example, in the presence of outliers, any statistical test based on sample means and variances can be distorted. Estimated regression coefficients that minimize the Sum of Squares for Error (SSE) are very sensitive to outliers.
There are several other problematic effects of outliers, including
- bias or distortion of estimates
- inflated sums of squares (which make it unlikely you'll be able to partition sources of variation in the data into meaningful components)
- distortion of p-values (statistical significance, or lack thereof, can be due to the presence of a few--or even one--unusual data value)
- faulty conclusions (it's quite possible to draw false conclusions if you haven't looked for indications that there was anything unusual in the data)
The following example may seem a bit extreme, but real data with this feature actually exist. The results vividly demonstrate the potential problems that lurk in the background due to unusual data values.
| | Value 1 | Value 2 | Value 3 | Value 4 | Value 5 | Median | Mean | Variance | 95% CI for the Mean |
| Real Data | 1 | 3 | 5 | 9 | 12 | 5 | 6.0 | 20.00 | [0.45 to 11.55] |
| Data w/Error | 1 | 3 | 5 | 9 | 120 | 5 | 27.6 | 2676.8 | [-36.63 to 91.83] |
The first four data values across each row contain the same numbers. However, in the second row, the fifth entry has a large discrepancy when compared to the value in the first row. Note that in the presence of one outlier, the median does not change in this example.
The median is called robust (i.e., it usually does not vary greatly) in the presence of a small number of outliers and is often the preferred summary statistic for the "center" of a skewed distribution. Notice how just one outlier can greatly distort the mean, variance, and 95% confidence interval for the mean. Similar results apply to regression, analysis of variance, or any technique that uses sums of squares in the calculations.
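The figures in the example can be reproduced with a short Python sketch. It assumes the confidence interval is the usual t-based interval for a mean, with the critical value 2.776 (Student's t, 4 degrees of freedom, two-sided 95%) hard-coded. The function name `summarize` is just an illustrative choice.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 4 degrees of freedom.
T_CRIT_DF4 = 2.776

def summarize(values, t_crit=T_CRIT_DF4):
    """Return the median, mean, sample variance, and t-based 95% CI for the mean."""
    n = len(values)
    mean = statistics.mean(values)
    variance = statistics.variance(values)  # uses the n - 1 denominator
    half_width = t_crit * math.sqrt(variance / n)
    return {
        "median": statistics.median(values),
        "mean": mean,
        "variance": variance,
        "ci95": (mean - half_width, mean + half_width),
    }

real = [1, 3, 5, 9, 12]
with_error = [1, 3, 5, 9, 120]

for label, data in (("Real Data", real), ("Data w/Error", with_error)):
    s = summarize(data)
    print(label, s["median"], s["mean"], s["variance"],
          [round(x, 2) for x in s["ci95"]])
```

The hard-coded critical value only applies to samples of five observations. For other sample sizes it would have to be looked up or computed.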
How to Detect Outliers
The "normal" distribution myth. For many statistical modeling purposes, input data do not necessarily require a "normal" or symmetric, bell-shaped distribution. (This feature applies primarily to residuals from a statistical model--a subject for future articles.) Discrete data or counts, by definition, will not usually look very "normal."
In fact, for data to be used in a linear regression model, the independent or explanatory variables need not have a normal distribution. It can be demonstrated mathematically that normality is neither required nor even desirable for this type of data. What is important is to check for data values that lie well outside the range of other data (called leverage points) and that can have an undue influence on the results. Your objective should be to collect data with a distribution that allows you to make the best inferences possible about the population under study.
Visual aids. Check the distribution of data values by levels of a categorical variable, if available. This procedure should always be one of the first steps in data analysis, as it will quickly reveal the most obvious outliers.
For continuous or interval data, visual aids such as a dotplot or scatterplot are good methods to examine how severe any outlying observations actually are. A boxplot is another very helpful tool, since it makes no distributional assumptions nor does it require any prior estimate of a mean or standard deviation. Values that are extreme in relation to the rest of the data are easily identified.
Univariate tests check for the presence of outliers; however, many of them are designed to check for the presence of only one outlier, and they also make distributional assumptions which are often not relevant (e.g., they assume a normal distribution when you have very skewed non-negative data). They often require that a location (mean) or scale (standard deviation) parameter be estimated from the data. As shown earlier, outliers can greatly affect their values. This is one reason why "eliminating data that exceed two or three standard deviations" may not be a good, or even a reasonable, rule of thumb.
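The weakness of the standard-deviation rule can be seen with the erroneous row from the earlier example (1, 3, 5, 9, 120). The sketch below, with the illustrative function name `flag_by_sd`, applies a two-standard-deviation cutoff. Because the value 120 inflates both the mean and the standard deviation, it ends up less than two standard deviations from the mean and is not flagged.

```python
import statistics

def flag_by_sd(values, k=2.0):
    """Flag values more than k sample standard deviations from the sample mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * sd]

with_error = [1, 3, 5, 9, 120]

# Standardized distance of the suspect value from the mean.
z_of_120 = (120 - statistics.mean(with_error)) / statistics.stdev(with_error)
print(round(z_of_120, 2))      # about 1.79
print(flag_by_sd(with_error))  # empty list: 120 is not flagged
```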
IQR computation. It is quite simple to compute the inter-quartile range (IQR) and then use a multiple of it as a number that defines what values are considered outliers. A boxplot uses this technique to identify outliers. Using a boxplot is an extremely effective approach, especially when working with large datasets that have continuous data.
One way to implement an IQR computation is to use PROC UNIVARIATE with SAS and save the order statistics available with its OUTPUT statement. The first quartile (q1), third quartile (q3), and inter-quartile range (IQR) can be saved in an output file. You can use them to flag observations that lie outside of q1-(1.5*iqr) and q3+(1.5*iqr) as potential outliers and anything outside of q1-(3*iqr) and q3+(3*iqr) as problematic outliers.
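The same fences can be sketched in Python rather than SAS. This is an analogue of the idea, not the PROC UNIVARIATE code. It assumes the `"inclusive"` quartile method of Python's `statistics.quantiles`. Statistical packages use differing quartile definitions, so borderline flags may not match another program's output. The function name `iqr_flags` is illustrative.

```python
import statistics

def iqr_flags(values, inner=1.5, outer=3.0):
    """Classify values using q1 - k*iqr and q3 + k*iqr fences.

    Values beyond the outer (3*iqr) fences are 'problematic';
    values beyond only the inner (1.5*iqr) fences are 'potential'.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    potential, problematic = [], []
    for v in values:
        if v < q1 - outer * iqr or v > q3 + outer * iqr:
            problematic.append(v)
        elif v < q1 - inner * iqr or v > q3 + inner * iqr:
            potential.append(v)
    return {"q1": q1, "q3": q3, "iqr": iqr,
            "potential": potential, "problematic": problematic}

print(iqr_flags([1, 3, 5, 9, 120]))  # 120 falls beyond the outer fence
print(iqr_flags([1, 3, 5, 9, 12]))   # nothing is flagged
```

Unlike a standard-deviation cutoff, the fences are built from quartiles, which a single extreme value barely moves.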
Multivariate outliers can also lurk undetected in an analysis. Univariate tests for outliers are not designed to identify multivariate outliers. For an observation measured on two variables, x1 and x2, neither value may be considered a univariate outlier when examined with a univariate test as described above.
However, the combination of the two values can lie outside the periphery of the range of data in two-dimensional space. In this case the observation is called an influential or leverage point that, for example, can exert a strong impact on the computation of regression coefficients.
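One common way to look for such points, not named in this article and offered here only as an illustrative choice, is the Mahalanobis distance, which measures how far an observation lies from the center of the data after allowing for the correlation between variables. The sketch below handles just two variables and uses made-up data in which x1 and x2 rise together, except for the last point.

```python
import statistics

def mahalanobis_2d(points):
    """Mahalanobis distance of each (x1, x2) point from the sample mean,
    using the 2x2 sample covariance matrix inverted by hand."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = statistics.variance(xs)
    syy = statistics.variance(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (len(points) - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(d2 ** 0.5)
    return distances

# Made-up data: the last point is unremarkable on x1 alone and on x2 alone,
# but it breaks the pattern shared by the other points.
points = [(1, 1.2), (2, 1.9), (3, 3.1), (4, 4.2), (5, 4.8),
          (6, 6.1), (7, 6.9), (8, 8.2), (2.5, 7.0)]

distances = mahalanobis_2d(points)
most_unusual = max(range(len(points)), key=lambda i: distances[i])
print(points[most_unusual])  # the last point has the largest distance
```

Because the mean and covariance are themselves estimated from data that include the unusual point, the distance can understate how extreme it is. Robust versions of the estimate exist for that reason.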
What Should You Do About Them?
Effectively working with outliers in numerical data can be a rather difficult and frustrating experience. Neither ignoring them nor deleting them at will is a good solution. If you do nothing, you will end up with a model that describes essentially none of the data--neither the bulk of the data nor the outliers. Even though your numbers may be perfectly legitimate, if they lie outside the range of most of the data, they can cause computational and inference problems. Some possible approaches to working with outliers are listed below.
Transformation. Transforming data is one way to soften the impact of outliers since the most commonly used expressions, square roots and logarithms, shrink larger values to a much greater extent than they shrink smaller values. However, transformations may not fit into the theory of the model or they may affect its interpretation. Taking the log of a variable does more than make a distribution less skewed; it changes the relationship between the original variable and the other variables in your model. In addition, most commonly used transformations require non-negative data or data that are greater than zero, so they are not always the answer.
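A small Python sketch shows the shrinking effect on the values 1, 3, 5, 9, 120 by comparing the gap between the two largest values on each scale. The helper name `log_transform` is illustrative. It refuses zero or negative input, which is the restriction noted above for logarithms.

```python
import math

def log_transform(values):
    """Natural log of each value; rejects data that are not strictly positive."""
    if any(v <= 0 for v in values):
        raise ValueError("log transform requires values greater than zero")
    return [math.log(v) for v in values]

values = [1, 3, 5, 9, 120]
logged = log_transform(values)
roots = [math.sqrt(v) for v in values]

# Gap between the largest value and the next largest, on each scale.
print(values[-1] - values[-2])            # 111 on the raw scale
print(round(roots[-1] - roots[-2], 2))    # about 7.95 on the square-root scale
print(round(logged[-1] - logged[-2], 2))  # about 2.59 on the log scale
```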
Deletion. Only as a last resort should you delete outliers, and then only if you find they are genuine errors that can't be corrected, or they lie so far outside the range of the remainder of the data that they distort statistical inferences. When in doubt, you can report model results both with and without outliers to see how much they change.
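A with-and-without comparison might look like the following Python sketch for a simple linear regression. The data are made up (y is roughly 2x except for the last observation), and `ols_fit` is an illustrative helper that computes the ordinary least-squares line by hand.

```python
def ols_fit(xs, ys):
    """Least-squares (intercept, slope) for a simple linear regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Made-up illustrative data: y is roughly 2x, except the last point.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 30.0]
suspect = 5  # index of the observation in question

with_all = ols_fit(xs, ys)
without = ols_fit(xs[:suspect] + xs[suspect + 1:],
                  ys[:suspect] + ys[suspect + 1:])

print("slope with all points:   ", round(with_all[1], 2))
print("slope without the suspect:", round(without[1], 2))
```

Here one observation more than doubles the estimated slope. Reporting both fits lets readers judge how much the conclusions depend on it.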
Data transformations and deletion are important tools, but they shouldn't be viewed as a cure-all for distributional problems associated with outliers. Transformations and/or outlier elimination should be an informed choice, not a routine task.
Accommodation. One very effective plan is to use methods that are robust in the presence of outliers. Nonparametric statistical methods fit into this category and should be more widely applied to continuous or interval data. When outliers are not a problem, simulation studies have indicated that their ability to detect significant differences is only slightly lower than that of the corresponding parametric methods. There are also various forms of robust regression models and computer-intensive approaches that deserve attention.
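As one example of a robust regression approach, chosen here for illustration and not named in this article, the Theil-Sen estimator takes the median of the slopes between all pairs of points. The Python sketch below applies it to made-up data in which y is roughly 2x except for one wild value. Taking the intercept as the median of y minus slope times x is one common convention and is an assumption of this sketch.

```python
import statistics
from itertools import combinations

def theil_sen(xs, ys):
    """Theil-Sen fit: the slope is the median of all pairwise slopes;
    the intercept is the median of y - slope * x."""
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
              if x2 != x1]
    slope = statistics.median(slopes)
    intercept = statistics.median(y - slope * x for x, y in zip(xs, ys))
    return intercept, slope

# Made-up illustrative data: y is roughly 2x, except the last point.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 30.0]

ts_intercept, ts_slope = theil_sen(xs, ys)
print(round(ts_slope, 2))  # stays close to 2 despite the wild last value
```

Since only 5 of the 15 pairwise slopes involve the wild value, the median slope is set by the well-behaved pairs.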
Summary
Despite the difficulties, exploring why outliers exist can provide many clues to the development of better models. In fact, many great discoveries in human history can be traced to a researcher exploring some outlying or unusual value. Outliers may indicate that an important range of the data has been ignored that is worth knowing about. This article only skims the surface of dealing with outliers. It's presented with the hope that looking for unusual data values will become a regular part of your analysis, and that your research objectives and knowledge of your subject matter will help you decide what to do with them once you find them. The "common sense" test is often the best solution.
Always apply exploratory data analysis techniques that look for both univariate and multivariate outliers and then evaluate their impact on the results. This will help you reach conclusions that are in line with your research objectives.

