
OUTLIER

It's an unfortunate fact of life that data are not well-behaved.
"Outliers" (unusual data values) show up in most research projects
involving data collection.

This is especially true in observational studies where data may naturally
have very unusual values, even if they come from reputable sources. Data
entry errors or rare events (such as readings from a thermometer left in
the sun, a change in accounting practice, or a subject who has a sudden
muscle spasm) - all these and many more are reasons for the existence of
outliers in a dataset.

Likely Sources of Outliers

Data errors. Data recording or entry errors should be the first thing
checked as a possible reason for outlying observations.  Use of a spreadsheet
program such as EXCEL for data entry can help improve input accuracy and
therefore reduce the occurrence of data recording errors.

"Rare" event syndrome. Another reason for outliers is the "rare" event
syndrome: extreme observations may occur that, for some legitimate reason,
do not fit within the typical range of other data values. Such unusual
observations might include

* a 70 degree January day in Oregon
* a 500 point rise/drop in a stock market index
* a high score on an aggressiveness scale for a troubled child

All these events may be relatively rare, but they still must be considered
part of the overall picture.

With large datasets, computer programs can be written to identify data
entry errors or extreme observations (SAS is a particularly good tool for
this purpose).


Why Are Outliers a Problem?

Developing techniques to look for outliers and understanding how they
impact data analysis are extremely important parts of a thorough analysis,
especially when statistical techniques are to be applied to the data.

For example, in the presence of outliers, any statistical test based on
sample means and variances can be distorted.  Regression coefficients
estimated by minimizing the Sum of Squares for Error (SSE) are very
sensitive to outliers.

There are several other problematic effects of outliers, including:

* bias or distortion of estimates
* inflated sums of squares, which make it unlikely that you will be able
  to partition sources of variation in the data into meaningful components
* distortion of p-values (statistical significance, or lack thereof, can
  be due to the presence of a few, or even one, unusual data value)
* faulty conclusions (it's quite possible to draw false conclusions if you
  haven't looked for indications that there was anything unusual in the
  data)

The following example may seem a bit extreme, but real data with this
feature actually exist. The results vividly demonstrate the potential
problems that lurk in the background due to unusual data values.

                                                       95% Confidence Interval
            Sorted Data    Median    Mean   Variance   for the Mean
Real Data   1 3 5 9  12       5       6.0       20.0   [  0.45 to 11.55]
Data with
Error       1 3 5 9 120       5      27.6     2676.8   [-36.63 to 91.83]


The first four data values across each row contain the same numbers.
However, in the second row, the fifth entry has a large discrepancy when
compared to the value in the first row. Note that in the presence of one
outlier, the median does not change in this example.

The median is robust (i.e., it usually does not vary greatly) in the
presence of a small number of outliers and is often the preferred summary
statistic for the "center" of a skewed distribution. Notice how just one
large outlier can greatly distort the mean, variance, and 95% confidence
interval for the mean.  Similar results apply to regression, analysis of
variance, or any technique that uses sums of squares in the calculations.
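
The figures in the table can be reproduced with a few lines of code. The following is a minimal sketch in Python (used here for illustration only; the article's own program further on is written in SAS). The helper name summarize is hypothetical, and the critical value 2.776 is the tabulated two-sided 95% t value for 4 degrees of freedom, hard-coded so that only the standard library is needed.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 4 degrees of freedom
# (n = 5 observations); hard-coded from a t table to avoid extra libraries.
T_CRIT_DF4 = 2.776

def summarize(values):
    """Median, mean, sample variance and 95% CI for the mean (n = 5 only)."""
    n = len(values)
    mean = statistics.mean(values)
    variance = statistics.variance(values)       # divides by n - 1
    half_width = T_CRIT_DF4 * math.sqrt(variance / n)
    return {
        "median": statistics.median(values),
        "mean": mean,
        "variance": variance,
        "ci": (mean - half_width, mean + half_width),
    }

real_data = [1, 3, 5, 9, 12]
data_with_error = [1, 3, 5, 9, 120]

for label, values in [("Real data", real_data),
                      ("Data with error", data_with_error)]:
    s = summarize(values)
    print(f"{label:16s} median={s['median']} mean={s['mean']:.1f} "
          f"variance={s['variance']:.1f} "
          f"CI=[{s['ci'][0]:.2f} to {s['ci'][1]:.2f}]")
```

Changing a single value from 12 to 120 leaves the median at 5 but moves the mean from 6.0 to 27.6 and widens the interval until it includes negative values.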


How to Detect Outliers

The "normal" distribution myth. For many statistical modeling purposes,
input data do not necessarily require a "normal" or symmetric, bell-shaped
distribution. (This feature applies primarily to residuals from a
statistical model -- a subject for future articles.)  Discrete data or
counts, by definition, will not usually look very "normal".

In fact, for data to be used in a linear regression model, the independent
or explanatory variables need not have a normal distribution.  It can be
demonstrated mathematically that normality is not required, nor even
desirable, for this type of data.  What is important is to check for data
values that lie well outside the range of the other data (called leverage
points) and that can have an undue influence on the results.  Your objective
should be to collect data with a distribution that allows you to make the
best inferences possible about the population under study.

Visual aids. Check the distributions of data values by levels of a
categorical variable, if available.  This procedure should always be one of
the first steps in data analysis and will quickly reveal the most obvious
outliers.

For continuous or interval data, a dotplot of a single variable or
multi-dimensional scatterplots are good methods to look for any outlying
observations.  A box plot is another very helpful tool, since it makes no
distributional assumptions nor does it require any prior estimate of a
mean or standard deviation.  Values that are extreme in relation to the
rest of the data are easily identified.

Another decision rule is to eliminate a certain percentage of the data at
the extremes in one or both tails, such as the top 1%.  One weakness of this
method is that the cut-off value is based only on an ordering of the data,
so one or more of the values closest to the 99% quantile may be eliminated
when they should have been kept, or kept when they should have been
eliminated.

These same visual aids (a dotplot for a single variable, scatterplots for
combinations of two variables) are also often good ways to examine how
severe the outlying observations actually are.

Univariate tests exist to check for the presence of outliers; however,
many of them are designed to check for the presence of only one outlier,
and they also make distributional assumptions which are often not relevant
(e.g. they assume a normal distribution when you have very skewed
non-negative data). They often require that a location (mean) or scale
(standard deviation) parameter be estimated from the data.  As shown
earlier, outliers greatly influence these estimates. This is one reason why
"eliminating data that exceed two or three standard deviations" may not be
a good, or even a reasonable, decision rule.
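
To see why the standard deviation rule can fail, it helps to apply it to the five-value example used earlier. The sketch below is in Python for illustration only and uses nothing beyond the standard library; the variable names are arbitrary.

```python
import statistics

data = [1, 3, 5, 9, 120]          # 120 is the suspect value

mean = statistics.mean(data)
sd = statistics.stdev(data)        # sample standard deviation (n - 1)

# Distance of each value from the mean, in standard-deviation units.
z_scores = [(x - mean) / sd for x in data]

# The "two standard deviations" rule: flag anything with |z| > 2.
flagged = [x for x, z in zip(data, z_scores) if abs(z) > 2]

print(f"mean={mean:.1f} sd={sd:.2f}")
print("z-scores:", [round(z, 2) for z in z_scores])
print("flagged by the 2-SD rule:", flagged)
```

The value 120 sits only about 1.79 standard deviations from the mean, because the outlier itself inflates both the mean and the standard deviation, so the rule flags nothing.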

It only requires basic computing skills to find the inter-quartile range
(IQR) and then use a multiple of it to define which values could be
considered outliers. One way to apply this approach is to use
PROC UNIVARIATE with SAS and save the order statistics available with its
OUTPUT statement.  The first quartile (q1), third quartile (q3), and
inter-quartile range (iqr) can be saved to an output data set or written
to macro variables (see example below).  In a subsequent DATA step you can
flag observations that lie outside the interval from q1-(1.5*iqr) to
q3+(1.5*iqr) as potential outliers and anything outside the interval from
q1-(3*iqr) to q3+(3*iqr) as problematic outliers.

Here is a sample SAS program to detect outliers with order statistics.


OPTIONS ls=78 ps=55 nocenter formdlim=' ';

* compute the quartiles and inter-quartile range of y;
PROC UNIVARIATE DATA=mydata NOprint;
VAR y;
OUTPUT OUT=qdata Q1=q1 Q3=q3 QRANGE=iqr;
RUN;

* copy the order statistics into macro variables;
DATA _null_;  SET qdata;
CALL SYMPUT("q1",q1);  CALL SYMPUT("q3",q3);  CALL SYMPUT("iqr",iqr);
RUN;

* save the outliers;
DATA outliers; SET mydata;
LENGTH severity $ 2;   /* wide enough to hold two asterisks */

IF (y <= (&q1 - 1.5*&iqr)) OR (y >= (&q3 + 1.5*&iqr)) THEN severity='*';
IF (y <= (&q1 -   3*&iqr)) OR (y >= (&q3 +   3*&iqr)) THEN severity='**';

IF severity='*' OR severity='**' THEN OUTPUT outliers;
RUN;

PROC PRINT DATA=outliers NOobs n;
VAR <id variables> y severity;
TITLE 'Data outliers for review';
RUN;
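
For readers without SAS, the same flagging logic can be sketched in Python. This is an illustrative translation, not part of the original program: the function name flag_outliers and the sample data are made up, and the quartile definition used by statistics.quantiles with method="inclusive" is an assumption that is not guaranteed to match the SAS default, so the fences can differ slightly between packages.

```python
import statistics

def flag_outliers(values, mild=1.5, severe=3.0):
    """Return (value, severity) pairs for values outside the IQR fences.

    severity is '*' beyond the 1.5*IQR fences and '**' beyond the 3*IQR
    fences, mirroring the two IF statements of the SAS DATA step.
    """
    # Quartile definitions differ between packages; 'inclusive' is one
    # common choice and is NOT guaranteed to match SAS's default.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    flagged = []
    for y in values:
        if y <= q1 - severe * iqr or y >= q3 + severe * iqr:
            flagged.append((y, "**"))
        elif y <= q1 - mild * iqr or y >= q3 + mild * iqr:
            flagged.append((y, "*"))
    return flagged

y = [2, 4, 5, 6, 7, 8, 9, 16, 40]    # made-up illustrative data
print(flag_outliers(y))
```

With these data the quartiles are 5 and 9, so the inner fences are -1 and 15 and the outer fences are -7 and 21; the value 16 is flagged as a potential outlier and 40 as a problematic one.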

Multivariate outliers can also lurk undetected in an analysis.
Univariate tests for outliers are not guaranteed to identify multivariate
outliers.  For example, two data values, called x1 and x2, may not be
considered univariate outliers when looked at individually as described
above.  However, their combination can lie on the periphery of the range
in two-dimensional space.  In this case the pair of values is called an
influential or leverage point, which can have a strong impact on the
computation of regression coefficients, for example.
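
A small numerical illustration may help here. The sketch below, in Python for illustration only, computes the squared Mahalanobis distance of each (x1, x2) pair from the centroid for the two-variable case. The data are made up, and the function name mahalanobis_sq is hypothetical; the last pair has an x1 value and an x2 value that are each well inside the range of the other observations, yet the combination is unusual.

```python
import statistics

def mahalanobis_sq(x1, x2):
    """Squared Mahalanobis distance of each (x1, x2) pair from the centroid."""
    n = len(x1)
    m1, m2 = statistics.mean(x1), statistics.mean(x2)
    d1 = [a - m1 for a in x1]
    d2 = [b - m2 for b in x2]
    # Sample covariance matrix [[s11, s12], [s12, s22]] (divisor n - 1).
    s11 = sum(a * a for a in d1) / (n - 1)
    s22 = sum(b * b for b in d2) / (n - 1)
    s12 = sum(a * b for a, b in zip(d1, d2)) / (n - 1)
    det = s11 * s22 - s12 * s12
    # d' S^-1 d written out for the 2x2 case.
    return [(s22 * a * a - 2 * s12 * a * b + s11 * b * b) / det
            for a, b in zip(d1, d2)]

# Made-up data: x2 tracks x1 closely, except for the last pair (2, 8).
x1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 2]
x2 = [1.2, 1.9, 3.1, 4.0, 5.2, 5.9, 7.1, 8.0, 9.1, 8.0]

dist_sq = mahalanobis_sq(x1, x2)
for a, b, d in zip(x1, x2, dist_sq):
    print(f"x1={a} x2={b} D^2={d:.2f}")
```

The last pair has by far the largest distance even though neither of its coordinates is extreme on its own, which is exactly the situation a one-variable-at-a-time check cannot detect.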


Outliers versus Influential Observations in Linear Regression

It is possible for an influential observation not to be an outlier, and
vice versa. Chatterjee and Hadi (1988) give the following
definitions:

Outlier: An observation for which the Studentized residual is large
relative to other observations in the data set.

High-Leverage Point: An observation with large leverage, far away from the
center of the points in the X space. May be regarded as an outlier in the
X space.

Influential Observation: An observation that, individually or jointly with
others, excessively influences the regression equation.

These definitions are within the context of linear regression. In
particular, there are at least two definitions of an outlier in regression
which the following two figures illustrate:

Y |
   |                                        . A
   |
   |
   |
   |
   |                     . .
   |            .  .     .
   |             .  .
   |           .    .   .
   |        .     .
   |         .
   +-------------------------------------------------- X


Figure 2. High leverage point that conforms to the linear model.

From the diagram, A *is* an influential point in the sense that summary
statistics will be stronger.  R-square will be much larger than when point
A is excluded from the calculations. Values of the hat matrix for X would
easily tell you this.  However, A is not an influential point in terms of
how it influences the estimated coefficients of the regression line.  The
slope of the line through those points would be about the same regardless
of whether point A was in the dataset or not.

Now whether one considers A to be an outlier or not depends on how one
defines an outlier.  The point A lies far away from the rest of the data,
which is an outlier in my book.  However, if you were to look at the
residual for A it would be quite small.  So statistics such as Mahalanobis
distance would spot this outlier (or an eyeball looking at a univariate or
bivariate plot!) but residuals (unstandardized, standardized,
studentized) would not.

In the plot below, point A is both an outlier and influential.

Y |
   |                                              . A
   |
   |
   |   .    .
   |     . .   .
   |      . . .
   |       .   .
   |         .
   |                    .   .
   |                  .    .  .
   |                    .    .
   +-------------------------------------------------- X

Figure 3. High leverage point that does not conform to the linear model.


Whether a point is considered an outlier depends on how far it lies
from the rest of the data, regardless of whether it conforms to a model
estimated from the rest of the data.  A point is influential if it doesn't
conform to the remainder of the data.  Another way to look at it is: do
your results change substantially when computed with and without
this observation?  If so, it is considered influential.  In the first plot
above, removing A would not substantially change the coefficient
estimates;  however, it would considerably change r^2 and the p-values from
significance tests.
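
The "with and without" comparison is easy to carry out numerically. The sketch below, in Python for illustration only, fits a straight line by ordinary least squares to a small made-up data set, first alone, then with a high-leverage point A that lies on the fitted line, and finally with a point A that does not. The function name fit_line and all data values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, r_squared)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    syy = sum((y - y_bar) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared

# Made-up bulk of the data; its least-squares line is y = 0.3 + 0.9x.
xs = [1, 2, 3, 4, 5]
ys = [1.5, 1.5, 3.0, 4.5, 4.5]

cases = {
    "without A": ([], []),
    "A on the line": ([20], [18.3]),      # high leverage, conforms
    "A off the line": ([20], [3.0]),      # high leverage, does not conform
}
results = {}
for label, (ax, ay) in cases.items():
    results[label] = fit_line(xs + ax, ys + ay)
    a, b, r2 = results[label]
    print(f"{label:15s} intercept={a:6.3f} slope={b:6.3f} R^2={r2:.3f}")
```

When A lies on the line, the slope stays at 0.9 while R-square rises from 0.90 to above 0.99; when A lies off the line, the slope collapses to nearly zero. The first case is high leverage without influence on the coefficients; the second is both.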


What Should You Do About Them?

Working with outliers in quantitative data can pose rather difficult
decisions. Neither ignoring them nor deleting them at will is a good solution.
If you do nothing, you may end up with a model that describes essentially
none of the data - neither the bulk of the data nor the outliers. Even
though your numbers may be perfectly legitimate, if they lie outside the
range of most of the data, they can cause potential computational
anomalies and resulting inference problems.

Accommodation. Accommodation of outliers uses techniques to mitigate their
harmful effects.  One of its strengths is that identification of outliers
does not need to precede accommodation.  These techniques can often be
used without prior determination that outliers exist.  However, keep in
mind that identification and accommodation do not compete; rather, they
reinforce each other.  A few possible approaches to accommodating outliers
are listed below.

Nonparametric Methods. One very effective way to work with data is to use
methods which are robust in the presence of outliers.  Nonparametric
statistical methods fall into this category and deserve to be applied more
widely to continuous or interval data than they currently are.  Simulation
studies have indicated that, when outliers are not a problem, their
ability to detect significant differences is only slightly lower than that
of the corresponding parametric methods. Various forms of robust regression
models and computer-intensive approaches also deserve attention.

Transformations. Data transformations are one way to soften the impact of
outliers, since the most commonly used transformations, square roots and
logarithms, shrink larger values to a much greater extent than they shrink
smaller values.  However, transformations may not fit into the theory of
the model, or they may affect its interpretation. Taking the log of a
variable does more than make a distribution less skewed; it changes the
relationship between the original variable and the other variables in your
model.  In addition, many transformations require non-negative data or
data that are greater than zero, so they are not always the answer.
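
The shrinking effect of a logarithm can be seen on the five-value example used earlier. This is a minimal Python sketch for illustration only; the choice of base 10 is arbitrary.

```python
import math
import statistics

data = [1, 3, 5, 9, 120]
logged = [math.log10(x) for x in data]     # requires strictly positive values

# On the raw scale the largest value is 13.3 times the second largest ...
print("raw:   ", data, "mean =", statistics.mean(data),
      "median =", statistics.median(data))
# ... on the log scale the two are only about 1.1 units apart.
print("log10: ", [round(v, 2) for v in logged],
      "mean =", round(statistics.mean(logged), 2),
      "median =", round(statistics.median(logged), 2))
```

On the raw scale the mean is pulled more than 20 units above the median; on the log scale the gap is under 0.2. The price is that every summary is now a statement about log10 of the variable, not about the variable itself.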

Deletion. Only as a last resort should outliers be deleted, and then only
if they are found to be errors that can't be corrected or lie so far
outside the range of the remainder of the data that they distort
statistical inferences.  When in doubt, you can report model results both
with and without the outliers to show how much they change.  Data
transformations and deletion are important tools but shouldn't be viewed
as a cure-all for computational problems associated with outliers.
Transformations and/or outlier elimination should be an informed choice,
not a routine task.


Summary

This article briefly deals with the problems of outliers, their detection,
and approaches to data analysis.  It's presented with the hope that
looking for unusual values will always be a regular part of your data
analysis, and that your research objectives and knowledge of your subject
matter will help you decide what to do with them once you find them.

Always apply exploratory data analysis techniques that look for both
univariate and multivariate outliers, and then evaluate how they affect
the results with and without transformations, accommodation, and deletion.
This will help you reach conclusions that are in line with your research
objectives.  A "common sense" approach is often the best solution.


Questions to ask yourself concerning potential outliers:

1. Are any of the values of the predictors unusual?

2. Is the response variable unusual (relative to the model of other data)?

3. Does the value of one response have a big impact on predictions of
other response variables?  That is, does it have a big impact on the
parameter estimates?



Dixon's Method for Detecting Outliers (suitable for small samples)

Dixon's Method has several variants (for lack of a better word). Some of
them are:

1. Single upper outlier x(n) in a normal sample with unknown sigma^2

[x(n)-x(n-1)]/[x(n)-x(1)]

This is effective when there is at most one outlying value; otherwise it is
vulnerable to masking effects.

2. 2-sided test for an extreme outlier in a normal sample with unknown sigma^2

max{[x(n)-x(n-1)]/[x(n)-x(1)], [x(2)-x(1)]/[x(n)-x(1)]}

This of course is the 2-sided form of #1 above.

3. Upper outlier pair x(n),x(n-1) in a normal sample with unknown sigma^2

[x(n)-x(n-2)]/[x(n)-x(2)]

Avoids possible masking of x(1). Can be used in the case of a single
outlier. Avoids masking of x(n-1) if there is more than one outlier.

There are other variants of this depending on how you set up the test.
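
As a worked illustration, the three ratios can be computed for the five-value example used earlier. The sketch below is in Python for illustration only. The function name dixon_ratios is hypothetical, and the two-sided variant is taken here, as an assumption, to be the larger of the upper and lower single-outlier ratios, both divided by the full range x(n)-x(1). The sketch computes the statistics only; judging significance requires published critical values for the relevant sample size, which are not reproduced here.

```python
def dixon_ratios(values):
    """Dixon-type ratios for the upper end of a sample (needs n >= 4)."""
    x = sorted(values)
    n = len(x)
    if n < 4:
        raise ValueError("need at least 4 observations")
    full_range = x[-1] - x[0]
    # Variant 1: single upper outlier x(n).
    upper = (x[-1] - x[-2]) / full_range
    # Variant 2: two-sided test for one extreme value at either end.
    two_sided = max(upper, (x[1] - x[0]) / full_range)
    # Variant 3: upper outlier pair x(n), x(n-1).
    upper_pair = (x[-1] - x[-3]) / (x[-1] - x[1])
    return {"upper": upper, "two_sided": two_sided, "upper_pair": upper_pair}

ratios = dixon_ratios([1, 3, 5, 9, 120])
for name, value in ratios.items():
    print(f"{name}: {value:.3f}")
```

For these data the single-outlier ratio is 111/119, about 0.933, and the pair ratio is 115/117, about 0.983; both are close to the maximum possible value of 1.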

There is a book in the Wiley series in probability and mathematical
applied statistics which covers all sorts of outlier tests. Also, consider
Outliers in Statistical Data by Barnett (Wiley).

One reference for Dixon's outlier test:

Dixon, W.J., "Processing data for outliers", Biometrics, vol. IX (1953),
pp. 74-89.

Dixon's tests are no longer recommended.  There are better choices
available today.


REFERENCES

ASTM Method E 178 on Dealing With Outlying Observations.

Barnett, V., and T. Lewis (1984). "Outliers in Statistical Data", 2nd
edition, Chichester: Wiley.

Barnett, V., and T. Lewis (1994). "Outliers in Statistical Data", 3rd
edition, New York: Wiley.

Beckman, R. J., and R. D. Cook (1983). "Outlier...s", Technometrics, vol.
25, pp. 119-149.

Blaedel, W. J., Meloche, V. W., Ramsay, J. A., "A comparison of criteria
for the rejection of measurements," J. Chem. Educ., December 1951,
643-647.

Chatterjee and Hadi (1988) Sensitivity Analysis in Linear Regression, New
York: Wiley.

Cook, R. D. (1977). "Detection of influential observations in linear
regression" Technometrics 19, 15-18.

Cook, R. D. and S. Weisberg (1982). "Residuals and Influence in
Regression". Chapman and Hall: New York.

Cook, D., and Weisberg, S., An Introduction to Regression Graphics, Wiley.

Dean, R. B., Dixon, W. J., "Simplified statistics for small numbers of
observations," J. Anal. Chem., 23, 1951, 636-638.

Dixon, W. J., "Analysis of extreme values," Ann. Math. Stat., 21, 1950,
488-506.

Dixon, W. J., "Ratios involving extreme values," Ann. Math. Stat., 22,
1951, 68-78.

Hadi, "A Modification of a Method for the Detection of Outliers in
Multivariate Samples", 1994, JRSSB 56:2, 393-396.

Hadi and Simonoff (1993), "Procedures for the Identification of Multiple
Outliers in Linear Models", JASA, 88:424, 1264-1272.

Hampel, Ronchetti, Rousseeuw, and Stahel, "Robust Statistics", John Wiley
& Sons, 1986.

Hawkins D. M. (1980) Identification of Outliers, Chapman and Hall, 1980.

Huber, "Robust Statistical Procedures", SIAM, 1977 (A new chapter was
added in 1996).

Hu Yuzhu, Smeyers-Verbeke, J., Massart, D. L., "Outlier detection in
calibration," Chemometrics and Intelligent Laboratory Systems, 9, 1990,
31-44.

Jones, M.C., and Sibson, R., What is Projection Pursuit, J. R. Statistical
Society, Series A, Vol. 150, Part 1 1987, pp. 1-36 (Outlier discussion by
Tukey, J.W.,on pg. 33).

Lavine, M. (1991) Problems in Extrapolation Illustrated With Space Shuttle
O-Ring Data, JASA, (86), 919-921.

Miller, J. N., "Outliers in experimental data and their treatment,"
Analyst, 118, May 1993, 455-461.

Mitschele, J., "Small sample statistics," J. Chem. Educ., 66 (6), June
1991, 470-473, and references.

Rosner's multiple outlier test, Technometrics 25 No 2, May 1983, 165-172.

Rousseeuw, P. J., and A. M. Leroy (1987). "Robust Regression and Outlier
Detection".  Wiley, New York.

Tietjen, G. L. and R. H. Moore (1972). "Some Grubbs-Type Statistics for
the Detection of Several Outliers", Technometrics, v14, (3), 583-597.

Weisberg, S. (1985). "Applied Linear Regression", 2nd ed. Wiley, New York.

Wilcox, Rand R. (1998) "How Many Discoveries have Been Lost by ignoring
Modern Statistical Methods?" American Psychologist, Vol 53, No. 3,
300-314.

Youden, W. J., as reported in column ("Out of the Editor's Basket") item
"The Best Two Out of Three?" in J. Chem. Educ., December 1949, 673-674.