
Outliers and Data Having Undue Influence

Author: unknown  Date: 2004-12-04

The powerful analysis techniques we have learned carry a potential problem for which we must be constantly vigilant: SSE, and the estimators that minimize SSE, are extremely sensitive to outliers.

Outliers are extreme observations that, for one reason or another, do not belong with the other observations in the data.

If least squares regression estimators are routinely applied to data that contain a few wild observations, the resulting estimates can be seriously misleading.

It is therefore critically important to investigate the data for the presence of outliers whenever least squares regression procedures are used.
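
The sensitivity of least squares to a single wild observation can be seen in a small sketch. The data are made up for illustration (they are not from Judd and McClelland), and `fit_line` is a hypothetical helper for a model with an intercept and one predictor.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) minimizing SSE for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Made-up data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2 * x + 1 for x in xs]
clean_fit = fit_line(xs, ys)

# Same data, but the last value is mistyped as 170 instead of 17.
wild_ys = ys[:-1] + [170]
wild_fit = fit_line(xs, wild_ys)

print("clean (intercept, slope):", clean_fit)
print("wild  (intercept, slope):", wild_fit)
```

One mistyped value out of eight is enough to pull the estimated slope far away from 2.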


Causes of Outliers

There are several ways outliers can be caused.
  1. Data entry errors can produce outliers.*
  2. Sometimes the cases are not a homogeneous set to which a single model will apply, but rather a heterogeneous set of two or more types of cases. One of these types will be far more frequent than the other, forcing the few to be identified as outliers.
  3. The third cause of outliers is error distributions with "thick" tails, in which extreme observations occur with greater frequency than expected for a normal distribution. Least squares solutions are fairly robust to violations of the assumption that the errors are normally distributed, except when the violation is that the distribution has thick tails. Ironically, sampling distributions that look quite different from a normal distribution cause little trouble, while these thick-tailed distributions raise Cain with F interpretations.
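
To give a feel for what "thick" tails mean in the third cause, the sketch below compares the chance of landing more than three standard deviations from the mean under a normal distribution and under a Laplace (double exponential) distribution with the same variance. The Laplace distribution is an assumption chosen for illustration; the text does not name a particular thick-tailed distribution.

```python
import math

# P(|X| > 3 standard deviations) under a normal distribution.
normal_tail = math.erfc(3 / math.sqrt(2))

# Same probability under a Laplace distribution scaled to unit variance:
# variance = 2 * b**2, so the scale is b = 1 / sqrt(2).
b = 1 / math.sqrt(2)
laplace_tail = math.exp(-3 / b)

print(f"normal:  {normal_tail:.4f}")
print(f"Laplace: {laplace_tail:.4f}")
print(f"ratio:   {laplace_tail / normal_tail:.1f}")
```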

* This is the most frequent cause of outliers that I have found. People often do not pay enough attention during data entry and, with a little heavy pressure on the keys, type in values like 880 for an IQ that was really supposed to be 80.
To explore outliers, the text provides Exhibit 9.2 (see page 211 in Judd and McClelland).

Scatter Plots

It is often easy to detect outliers using simple scatter plots. For example, when the Exhibit 9-2 data are graphed as a scatter plot, with high school rank (HRANK) predicting Scholastic Aptitude Test (SAT) scores, the outlier is clearly evident.


Detecting Outliers

There are essentially three different characteristics that produce outliers.
  1. The predictor(s) value(s) for the data point can be unusual.
  2. The criterion value for the data point can be unusual.
  3. The data point is unduly influencing the parameter estimates of the predictions.

If a data point has any of these three characteristics, it can be considered an outlier which requires attention from the researcher.

Unusual Predictor Values?

Levers identify outliers that have this characteristic. To answer the question whether the values for a set of predictors are unusual, you only need to consider the value of the lever.

We do not have formal rules for determining the cut off for the lever, but there are informal guidelines.

According to Judd and McClelland, Velleman & Welsch (1981) suggest that levers two or three times the average ought to be considered large and in need of further attention in the data analysis.

Again according to Judd and McClelland, Huber (1981) suggests that any lever over .2 deserves special attention whenever n is reasonably large.
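
A minimal sketch of how levers and these informal cut-offs could be computed for a model with an intercept and one predictor. The predictor values are made up, and `levers` is a hypothetical helper, not part of Statlets or the text.

```python
def levers(xs):
    """Lever (hat value) of each case for an intercept plus one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return [1 / n + (x - mean_x) ** 2 / sxx for x in xs]

# Made-up predictor values; the last case is far from the rest.
xs = [10, 12, 13, 15, 16, 18, 19, 21, 22, 24, 25, 27, 80]
h = levers(xs)
average = sum(h) / len(h)  # equals 2 / n here: two estimated parameters

# Velleman & Welsch guideline: two or three times the average lever.
over_twice = [i for i, v in enumerate(h) if v > 2 * average]
over_thrice = [i for i, v in enumerate(h) if v > 3 * average]
# Huber guideline: any lever over .2.
over_huber = [i for i, v in enumerate(h) if v > 0.2]

print(over_twice, over_thrice, over_huber)
```

With these values every guideline flags only the last case (index 12), whose predictor value sits far from all the others.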

Statlets Output

The multiple regression solution provides a Leverage tab that automatically calculates leverage values for any unusual data points. Below is the output using Exhibit 9-2 along with the interpretation.
Influential points
------------------------------------------------------
                           Mahalanobis
Row           Leverage        Distance           DFITS
------------------------------------------------------
6             0.759815         33.8814          8.2275
------------------------------------------------------
Average leverage = 0.15384615384615385
The table of influential data points lists all observations which
have leverage values greater than 3 times that of an average data
point, or which have an unusually large value of DFITS. Leverage is
a statistic which measures how influential each observation is in
determining the coefficients of the estimated model. DFITS is a
statistic which measures how much the estimated coefficients would
change if each observation was removed from the data set. In this
case, an average data point would have a leverage value equal to
0.15384615384615385. There is one data point with more than 3 times
the average leverage, but none with more than 5 times. There is one
data point with an unusually large value of DFITS. 
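
The figures in this output can be checked by hand. The levers of a regression always sum to the number of estimated parameters p, so the average lever is p/n. The reported average is consistent with p = 2 (an intercept and one predictor) and n = 13 cases; these two values are inferred from the output, not stated in it.

```latex
\bar{h} = \frac{p}{n} = \frac{2}{13} \approx 0.1538, \qquad
3\bar{h} \approx 0.4615 \;<\; h_6 = 0.759815 \;<\; 5\bar{h} \approx 0.7692
```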

Is Yi Unusual?

The first question is: unusual with respect to what? Obviously, we want to know whether a criterion value is unusual with respect to the model we are using.

Residuals measure the difference between the model's prediction and the criterion value.

Raw residuals have two major problems. First, their values depend on the scale of the variables. Second, unusual criterion values tend to "grab" the regression line, producing smaller residuals than would be obtained if the point were left out of the analysis.

Studentized deleted residuals solve both of these problems. In addition, none of the other commonly reported transformed residuals detects outliers any better than the studentized deleted residual. Finally, the square of the studentized deleted residual is the appropriate F statistic (remember that "Student" developed the t statistic) for testing an outlier model, in which a single parameter is devoted to the case in question, against a model in which that parameter is not used.
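
A minimal sketch of how studentized deleted residuals could be computed for a model with an intercept and one predictor, using a standard closed form that avoids refitting the model once per case. The data are made up, and the function name is hypothetical.

```python
import math

def studentized_deleted_residuals(xs, ys):
    """Studentized deleted residual of each case, intercept plus one predictor."""
    n, p = len(xs), 2  # p counts the intercept and the slope
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sse = sum(e * e for e in residuals)
    result = []
    for x, e in zip(xs, residuals):
        lever = 1 / n + (x - mean_x) ** 2 / sxx
        # Closed form: equivalent to dividing the deleted residual by its
        # standard error estimated without the case in question.
        result.append(e * math.sqrt((n - p - 1) / (sse * (1 - lever) - e * e)))
    return result

# Made-up data near y = 2x; the fifth case is 4 units too high.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [2.1, 3.9, 6.2, 7.8, 14.0, 12.1, 13.9, 16.2, 17.8]
t = studentized_deleted_residuals(xs, ys)
print([round(v, 2) for v in t])
```

Because each value is divided by its own standard error, rescaling the criterion (for example, multiplying every y by 100) leaves the results unchanged.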

Different Residuals

Blue = All nine, Magenta = first eight values


On the left you see an applet where two regression lines are drawn. As you move the data values along the y-axis, the blue regression line is formed using all nine values, while the magenta regression line is formed using only the first eight values. To see how this works, first form a typical plot using all the values, and then move the last (ninth) data value far away from the others near it.

On the far right of the applet two different residual values are plotted. The blue line indicates the difference between this ninth value and the blue regression line which includes this ninth value in its calculation (regular residual). The magenta line indicates the difference between the ninth value and the magenta regression line that is formed without using the ninth value (deleted residual). Note how these two residuals can be drastically different from one another.

This second magenta residual is not exactly a studentized deleted residual, but it gives a good idea of the two different residuals and their differences.
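
The two residuals described for the applet can be reproduced numerically. The sketch below uses made-up data with nine cases; `fit_line` is a hypothetical helper, and the variable names mirror the blue and magenta lines.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) minimizing SSE for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Made-up data: eight cases near y = 2x and a ninth moved far away.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8, 30.0]

a_all, b_all = fit_line(xs, ys)      # "blue" line: all nine cases
a_8, b_8 = fit_line(xs[:8], ys[:8])  # "magenta" line: first eight cases

regular_residual = ys[8] - (a_all + b_all * xs[8])
deleted_residual = ys[8] - (a_8 + b_8 * xs[8])

# The two residuals are linked through the lever of the ninth case.
mean_x = sum(xs) / len(xs)
sxx = sum((x - mean_x) ** 2 for x in xs)
lever_9 = 1 / len(xs) + (xs[8] - mean_x) ** 2 / sxx

print(regular_residual, deleted_residual, regular_residual / (1 - lever_9))
```

The deleted residual is larger in magnitude because the ninth case no longer pulls the line toward itself; dividing the regular residual by one minus the lever recovers it without a second fit.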

In general, Judd & McClelland do not recommend that the squared studentized deleted residual or the F that results from evaluating the outlier model be used as a formal statistical test unless one has external information questioning the validity or reliability of a particular observation (don't go on a witch hunt).

Four rules-of-thumb:

  1. Studentized deleted residuals with an absolute value less than 2 are not unusual.
  2. If the absolute value is greater than 2, it deserves a look.
  3. If the absolute value is greater than 4, then all alarm bells ought to sound.
  4. Accepting a model for which a studentized deleted residual exceeds 4 in absolute value could be quite misleading.
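
These rules of thumb are easy to encode. The function below is a hypothetical helper; the labels it returns are my own shorthand for the rules, not terminology from the text.

```python
def screen_residual(t):
    """Apply the rules of thumb to one studentized deleted residual."""
    size = abs(t)
    if size > 4:
        return "alarm"        # accepting the model could be quite misleading
    if size > 2:
        return "look"         # deserves a closer look
    return "not unusual"

for value in (1.2, -2.5, 4.63):
    print(value, screen_residual(value))
```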

Statlets output

If you solve the regression analysis using multiple regression and click the Residuals tab, both residuals and studentized deleted residuals will be calculated for any unusual data points. Below is the Statlets output for Exhibit 9.2.
Unusual residuals
----------------------------------------------------------------------
                             Predicted                     Studentized
Row                  Y               Y       Residuals       Residuals
----------------------------------------------------------------------
6                 86.0         72.6715         13.3285            4.63
----------------------------------------------------------------------


If you would like to work the regressions and do the model comparisons that are explained on pages 221-223 in Judd and McClelland, you can use the new Judd9-2 Exhibit data. The variable xi2 is the variable discussed on page 222, where all its values are set to zero except for the sixth observation. The variable xi22 is the variable discussed on page 223. Its value is set at 1 for the first case, and zero for each case afterward.
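
A sketch of the kind of model comparison involved, using made-up data rather than the Judd9-2 values. It relies on the fact that a dummy variable coded 1 for a single case fits that case exactly, so the augmented model's SSE equals the SSE of the line fitted to the remaining cases. `fit_sse` is a hypothetical helper.

```python
import math

def fit_sse(xs, ys):
    """SSE of the least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))

# Made-up data; the sixth case (index 5) is the suspected outlier.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 16.0, 14.2, 15.8, 18.1, 19.9]
k = 5
n = len(xs)

# Compact model: intercept and slope only.
sse_compact = fit_sse(xs, ys)

# Augmented model: adds a dummy that is 1 for case k and 0 elsewhere.
# The dummy absorbs case k, so its SSE is that of the other n - 1 cases.
sse_augmented = fit_sse(xs[:k] + xs[k + 1:], ys[:k] + ys[k + 1:])

# One extra parameter; the augmented model has 3 parameters in total.
f_stat = (sse_compact - sse_augmented) / (sse_augmented / (n - 3))
t_stat = math.sqrt(f_stat)  # magnitude of the studentized deleted residual

print(round(f_stat, 2), round(t_stat, 2))
```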

Would Omission of the data point Dramatically Change the Regression Equation?

Cook's D measures how much the fitted model changes when a specific observation is included in, versus excluded from, the analysis.

There are only informal guidelines for interpreting Cook's D.
  1. One recommendation is to consider values to be large which exceed 4/PAn.
  2. Another suggested rule is to consider any value greater than 1 or 2 as indicating that an observation requires a careful look.
  3. Finally, some researchers look for gaps between the D values.

In summary, Cook's D assesses the global impact of each observation on the parameter estimates and, equivalently, on the predictions for all other observations. Large values of D identify those observations with large impacts. We therefore can use Cook's D to answer this third question.
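
A minimal sketch of Cook's D for a model with an intercept and one predictor, computed two ways: from a standard closed form, and directly from the idea of refitting without the case and measuring how far all the predictions move. The data and helper names are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) minimizing SSE for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x, slope

def levers(xs):
    """Lever (hat value) of each case for an intercept plus one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return [1 / n + (x - mean_x) ** 2 / sxx for x in xs]

def cooks_d(xs, ys):
    """Cook's D of each case from the closed form."""
    n, p = len(xs), 2  # p counts the intercept and the slope
    a, b = fit_line(xs, ys)
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    mse = sum(e * e for e in residuals) / (n - p)
    return [e * e * h / (p * mse * (1 - h) ** 2)
            for e, h in zip(residuals, levers(xs))]

def cooks_d_by_deletion(xs, ys, i):
    """Cook's D of case i: refit without it, sum squared prediction shifts."""
    n, p = len(xs), 2
    a, b = fit_line(xs, ys)
    mse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (n - p)
    a_i, b_i = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
    shift = sum(((a + b * x) - (a_i + b_i * x)) ** 2 for x in xs)
    return shift / (p * mse)

# Made-up data near y = 2x; the last case is unusual in both x and y.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 20]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8, 20.0]
d = cooks_d(xs, ys)
print([round(v, 3) for v in d])
```

Sorting the values of `d` makes the third guideline (looking for gaps) easy to apply by eye.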

Statlets Output

The Statlets applet outputs a value called DFITS (see the leverage table above) that serves the same purpose as Cook's D. In the Statlets glossary, DFITS is defined as:
DFITS
       A statistic computed when fitting a multiple regression model to measure the change in
       each predicted value which would occur if a single data value was deleted. Large values
       correspond to points which have a big influence on the fitted model.
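
The reported DFITS value can be cross-checked against the other Statlets figures for row 6. The sketch assumes that Statlets follows the usual textbook definition, DFFITS = t * sqrt(h / (1 - h)), where t is the studentized deleted residual and h the leverage; the glossary entry does not give the formula, so this is an assumption.

```python
import math

# Values reported for row 6 in the Statlets leverage table.
leverage = 0.759815
dfits = 8.2275

# Assumed definition: DFFITS = t * sqrt(h / (1 - h)). Solving for t gives
# the studentized deleted residual implied by the two values above.
implied_t = dfits / math.sqrt(leverage / (1 - leverage))

print(round(implied_t, 2))
```

The implied value agrees, to two decimals, with the Studentized residual of 4.63 reported for row 6 in the unusual residuals table.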

The Statlets application does calculate Cook's D values, as shown in the Output tab below.

Judd & McClelland think that it is good practice to omit outliers from the analysis with the explicit admission in the report that there are some observations which are not understood. You can then report a good model for those observations which you do understand.

A good exhibit (9.9) on page 231 of the text shows what each type of outlier does to the regression analysis.

Statlets Predicted Plot

By clicking the Predicted Plot tab, Statlets produces a very valuable plot for evaluating outliers. The plot displays the observed values of the criterion variable versus the regression solution. Below is the predicted plot using HRANK to predict SAT scores.

This plot displays observed values of SAT versus values predicted by
the fitted model. The closer the points lie to the diagonal line,
the better the model at predicting the observed data. You should
look for various anomalies, such as increases in variability around
the line as the value of SAT increases (heteroscedasticity), or
individual data points which lie far away from the line (outliers).

Statlets Residuals Plot

Another valuable plot for detecting outliers is produced by clicking the Residuals Plot tab. Using the options button, you can change the type of residuals plotted, but the default, Studentized deleted residuals, is typically used. Below is the Residuals Plot for Exhibit 9-2 along with the interpretation.

This plot displays the Studentized residuals versus values of
HSRANK. Any non-random pattern could indicate that the selected
model does not adequately describe the observed data. In addition,
any values outside the range of -3 to +3 could well be outliers.

Influence Plots

While Statlets doesn't have a feature for drawing scatter plots in which the size of each dot is proportional to the influence the data point has on the regression equation, other statistical packages can be used to produce these plots. On the left is a SYSTAT influence plot produced using Exhibit 9-2.

Notice how the outlier on the far left of the figure is drawn using a larger dot than the others. This plot indicates that the value corresponding to the large dot is influencing the regression far more than the other values. The size of this dot is directly related to Cook's D.