The Bayesian Debate: Data Mining and Models

Author: unknown  Date: 2004-12-09

Many readers of this column will never have heard of Bayes' Theorem or Bayesian inference. The latter has sparked years of debate among statistical researchers and is evidently making a comeback in practice. The debate raises several issues that are not only interesting but also bear on pragmatic concerns that many readers should find useful.

Thomas Bayes was an 18th-century English minister who developed one of the basic principles of probability. It is a simple mathematical formula (which I will forgo writing down) that shows how the probability of a random event changes when partial information relevant to the event is obtained and taken into account. If I am in a casino playing roulette, there are 38 equally probable slots into which the ball can fall (1 through 36, 0 and 00). If I have a bet on number 10, my probability of winning is 1/38 (about 2.6 percent). Now suppose that the ball falls and initially I can't see the number, but I can see that it has landed in a red slot. (For this example, assume even numbers are red, odd numbers are black, and 0 and 00 are green.) Now there are only 18 slots into which the ball could have fallen. Since 10 is a red slot and is still in the running, my probability of winning is now 1/18 (about 5.6 percent). Knowing that the ball landed in a red slot has modified my probability of winning, more than doubling the chance of success.
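The roulette arithmetic can be checked with a few lines of Python. This is only an illustrative sketch: the helper names `is_red` and `bayes_posterior` are made up for this example, and the wheel coloring follows the simplified rule used in the text (even numbers red), not a real casino layout.

```python
from fractions import Fraction

# Simplified wheel from the text: 38 slots; even numbers are red,
# odd numbers are black, 0 and 00 are green.
slots = ["0", "00"] + [str(n) for n in range(1, 37)]

def is_red(slot):
    """Coloring rule assumed in the text (not a real wheel layout)."""
    return slot not in ("0", "00") and int(slot) % 2 == 0

def bayes_posterior(prior, likelihood, evidence):
    """Bayes' Theorem: P(A | B) = P(B | A) * P(A) / P(B)."""
    return likelihood * prior / evidence

prior = Fraction(1, len(slots))                                  # P(ball on 10)
likelihood = Fraction(1) if is_red("10") else Fraction(0)        # P(red | ball on 10)
evidence = Fraction(sum(is_red(s) for s in slots), len(slots))   # P(red)

posterior = bayes_posterior(prior, likelihood, evidence)         # P(ball on 10 | red)
print(prior, posterior)  # 1/38 1/18
```

Using exact fractions keeps the result identical to the hand calculation rather than a rounded decimal.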

Statistical inference is a way of quantifying knowledge about unknown quantities using observed data. (The non-Bayesian approach is sometimes called the "frequentist" approach, a terminology I will adopt.) Before explaining how the Bayesian and frequentist approaches differ, let me be clear about the aspects in which they do not differ.

  • Both approaches accept and use Bayes' Theorem as a critical component of the analysis.
  • Both approaches use models with unknown parameters to characterize the real world. For example, the probability of purchasing a luxury car might be modeled as:
    Pr(Luxury Car) = a + b*Income
    where "a" and "b" are unknown parameters.
  • Both approaches need to collect data observations as a basis for estimating the values of the unknown parameters.

The basic difference between the approaches lies in how they treat the unknown parameters.

The frequentist takes the unknown parameters as fixed values. While the researcher does not know the value of the parameters, some "true" values are taken to exist. Analysis proceeds by using statistics to determine the probability of observing the actual data under alternative values of the parameters. Inferences about the parameters are based upon which alternative values of the parameters best make the expected observations match the actual observations.

For example, suppose I watch a roulette table for 100 spins and observe 52 occurrences of "red." If the table is "fair" (so the probability of red is 18/38, about 47 percent), then the chance of observing 52 or more "reds" is about 20 percent. Something that happens one time in five is not particularly unlikely, so I would not question the fairness of the table.
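The "about 20 percent" figure is a binomial tail probability, and it can be reproduced as follows. This is a minimal sketch that assumes the spins are independent and that a fair wheel has 18 red slots out of 38; the function name `binomial_tail` is made up for this example.

```python
from math import comb

def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

p_red = 18 / 38                        # fair-table probability of red
tail = binomial_tail(100, 52, p_red)   # chance of 52 or more reds in 100 spins
print(f"P(52 or more reds in 100 spins) = {tail:.3f}")
```

A frequentist would compare this tail probability with a chosen threshold (often 5 percent) before questioning the fairness of the table.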

A Bayesian would agree that there is some true (but unknown) value of the parameters. However, rather than analyzing the probability of observing the data, the Bayesian treats the parameters as random and analyzes the probability distribution of the parameters themselves. That distribution represents the researcher's "beliefs" or knowledge about the actual values of the parameters.

For the Bayesian, analysis proceeds by using the data to produce a narrower, more concentrated distribution for the parameters. This is where Bayes' Theorem plays a central role. Just as we used the theorem earlier to update the probability that a roulette spin landed on number 10 once we knew the slot was red, the theorem is applied to the parameter distribution to narrow down the range of likely parameter values.

Recall, however, that Bayes' Theorem can only be used to modify an existing (or prior) probability distribution. Similarly, in order to use Bayesian analysis the researcher must specify a "prior" distribution for the parameters that captures the researcher's "going in" beliefs about the likely values for the parameters.
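To make the prior-to-posterior step concrete, here is a minimal sketch applied to the probability of red. It assumes a Beta prior, which is a common textbook choice for a probability parameter because the update has a closed form; the column itself does not name a distribution, and the function names are made up for this example.

```python
from math import sqrt

def update_beta(alpha, beta, successes, failures):
    """Beta(alpha, beta) prior + binomial data -> Beta posterior parameters."""
    return alpha + successes, beta + failures

def beta_mean(alpha, beta):
    return alpha / (alpha + beta)

def beta_sd(alpha, beta):
    total = alpha + beta
    return sqrt(alpha * beta / (total**2 * (total + 1)))

prior = (1.0, 1.0)  # flat prior: every value of P(red) equally plausible
posterior = update_beta(*prior, successes=52, failures=48)  # 52 reds in 100 spins

print("prior     mean, sd:", beta_mean(*prior), round(beta_sd(*prior), 3))
print("posterior mean, sd:", round(beta_mean(*posterior), 3), round(beta_sd(*posterior), 3))
```

The posterior standard deviation is much smaller than the prior one, which is the "narrowing" described above.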

This is where frequentists really start to object. Researchers, they say, should not be biasing the outcome of the analysis by bringing in their personal prior beliefs. To do so seems arbitrary and unscientific.

The Bayesian counterargument, as I understand it, is twofold.

First, they would argue that yes, at times the researcher should bring prior beliefs into the analysis, provided those beliefs are strongly held.

Second, in many Bayesian analyses the outcome is not really influenced by the specified prior distribution. Rather, the prior is spread very evenly across the possible values of the parameters, to reflect the "going in" uncertainty about these values. When a large amount of data is collected, the influence of the data on the analysis swamps any small influence of the prior.
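The "swamping" effect can be illustrated by comparing a flat prior with a deliberately opinionated one on a small and a large data set. This is a sketch under assumptions: the Beta priors, the specific prior parameters, and the spin counts are all invented for illustration and do not come from the column.

```python
def posterior_mean(prior_alpha, prior_beta, reds, spins):
    """Mean of the Beta posterior after observing `reds` in `spins` spins."""
    return (prior_alpha + reds) / (prior_alpha + prior_beta + spins)

flat_prior = (1, 1)      # no opinion about P(red)
strong_prior = (20, 5)   # hypothetical belief that red comes up about 80% of the time

# Hypothetical observations: (reds, spins)
datasets = {"small": (5, 10), "large": (4737, 10_000)}

gaps = {}
for name, (reds, spins) in datasets.items():
    flat = posterior_mean(*flat_prior, reds, spins)
    strong = posterior_mean(*strong_prior, reds, spins)
    gaps[name] = abs(strong - flat)
    print(f"{name}: flat={flat:.4f} strong={strong:.4f} gap={gaps[name]:.4f}")
```

With 10 spins the two priors give clearly different answers; with 10,000 spins the posterior means agree to within about a tenth of a percentage point.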

If we accept that Bayesian analysis doesn't have fatal methodological flaws, does it have any advantages over the more widely used frequentist approach? Possibly. While the importance of these points depends on the particular problem, there can be some significant advantages to Bayesian analysis.

  • As already mentioned, it incorporates the impact of prior beliefs when appropriate.
  • Analyzing the parameters as if they follow a probability distribution can provide more intuitive, easier-to-understand results. (For example, "There is a 95 percent probability that the parameter lies between 1.3 and 1.7.") It also makes it easy to incorporate the results into the decision-making process in a way that recognizes the uncertainty of the parameter estimates.
  • Recent advances in methodology have made estimation of some models much easier using Bayesian analysis.
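A statement such as "there is a 95 percent probability that the parameter lies between two values" can be read directly off a posterior distribution. The sketch below does this for the probability of red using a simple grid approximation. The flat prior, the grid size, the function name `credible_interval`, and the 52-in-100 data are assumptions chosen for illustration.

```python
def credible_interval(successes, failures, mass=0.95, grid_size=10_001):
    """Central credible interval for a probability, flat prior, grid approximation."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    # With a flat prior the unnormalized posterior is just the binomial likelihood.
    weights = [p**successes * (1 - p) ** failures for p in grid]
    total = sum(weights)
    tail = (1 - mass) / 2
    cumulative, lower, upper = 0.0, None, None
    for p, w in zip(grid, weights):
        cumulative += w / total
        if lower is None and cumulative >= tail:
            lower = p          # first grid point past the lower tail
        if cumulative >= 1 - tail:
            upper = p          # first grid point past the upper cutoff
            break
    return lower, upper

lower, upper = credible_interval(successes=52, failures=48)
print(f"95% credible interval for P(red): ({lower:.3f}, {upper:.3f})")
```

A grid is used here only to keep the example dependency-free; a statistics library's Beta quantile function would give the same interval more directly.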

In truth, this last reason is the only reason I developed an interest in Bayesian analysis. These techniques allow the estimation of certain complicated models in days, when it would have taken weeks under the frequentist approach, if it were possible at all.

My personal conclusions are as follows. First, neither the frequentist nor the Bayesian approach is "wrong." Second, the frequentist approach is what most researchers (at least in the U.S.) are used to. For most applications, this will be my approach. But, third, when I have a pragmatic reason to use Bayesian analysis I will not hesitate to do so.
