
KL divergence

Author: Internet works  Date: 2007-05-20

The Kullback-Leibler (KL) divergence is a common measure of the “distance” between two probability distributions. It is central to probability-based machine learning algorithms.
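
For reference, here is the standard definition, written with the KL(p,q) notation used in this post. The convention that the expectation is taken under the first argument is an assumption here, since the post itself does not spell it out:

```latex
\mathrm{KL}(p,q) = \sum_{x} p(x)\,\log\frac{p(x)}{q(x)}
\quad\text{(discrete case)},
\qquad
\mathrm{KL}(p,q) = \int p(x)\,\log\frac{p(x)}{q(x)}\,dx
\quad\text{(continuous case)}.
```

With this definition KL(p,q) is non-negative, and it is zero only when p and q agree (almost everywhere).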

For instance, when trying to approximate an intractable distribution p(x), we can try to minimise KL(p,q) (or KL(q,p)) with q belonging to a particular class of distributions (e.g. the exponential family).

(KL is used in variational methods and in message-passing algorithms for approximate inference.)
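
To get a feel for how the two directions differ, here is a minimal sketch. Everything in it is an assumption made for illustration: a bimodal mixture stands in for the "intractable" p, q is a single Gaussian, densities are discretised on a 1-D grid, and a brute-force grid search replaces a real optimiser. KL(p,q) is taken to be the integral of p log(p/q), and all names are hypothetical.

```python
import numpy as np

# Grid on which every density is discretised (an assumption of this sketch).
x = np.linspace(-10.0, 10.0, 2001)
dx = x[1] - x[0]

def gaussian(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def kl(p, q):
    """Riemann-sum approximation of KL(p,q) = integral of p * log(p / q)."""
    mask = p > 0
    return np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask]))) * dx

# Stand-in for the distribution we want to approximate: two separated bumps.
p = 0.5 * gaussian(x, -3.0, 1.0) + 0.5 * gaussian(x, 3.0, 1.0)

# Candidate parameters for the approximating family q = N(mu, sigma^2).
mus = np.linspace(-5.0, 5.0, 41)
sigmas = np.linspace(0.5, 5.0, 46)

def best_fit(objective):
    """Brute-force search for the (mu, sigma) minimising objective(q)."""
    scores = np.array([[objective(gaussian(x, m, s)) for s in sigmas] for m in mus])
    i, j = np.unravel_index(np.argmin(scores), scores.shape)
    return mus[i], sigmas[j]

mu_fwd, sigma_fwd = best_fit(lambda q: kl(p, q))  # minimise KL(p,q)
mu_rev, sigma_rev = best_fit(lambda q: kl(q, p))  # minimise KL(q,p)

print("KL(p,q) fit:", mu_fwd, sigma_fwd)
print("KL(q,p) fit:", mu_rev, sigma_rev)
```

In this toy setting, minimising KL(p,q) should give a wide Gaussian covering both bumps, while minimising KL(q,p) should lock onto one of them. Which direction is appropriate depends on the application.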

Another scenario is when we assume that a “true” distribution p generated some data, and we infer q from our prior and the data without knowing p. In this case KL(p,q) measures how close we are to the true p. This can help to derive learning bounds and estimator properties.


But why use this divergence (which isn’t a real distance)? Why not a more classical L^2 distance? Or chi-square?
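
To make "isn’t a real distance" concrete, here is a small sketch with hand-picked discrete distributions (the numbers are arbitrary choices for illustration). It checks two properties a metric would need, symmetry and the triangle inequality, taking KL(p,q) as the sum of p log(p/q):

```python
import math

def kl(p, q):
    """KL(p,q) in nats for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Symmetry: compare KL(p,q) with KL(q,p).
p = [0.5, 0.5]
q = [0.9, 0.1]
kl_pq = kl(p, q)
kl_qp = kl(q, p)
print("KL(p,q) =", kl_pq, " KL(q,p) =", kl_qp)

# Triangle inequality: going from a to c via b can be "shorter" than a to c.
a = [0.5, 0.5]
b = [0.25, 0.75]
c = [0.1, 0.9]
direct = kl(a, c)
detour = kl(a, b) + kl(b, c)
print("KL(a,c) =", direct, " KL(a,b) + KL(b,c) =", detour)
```

For these particular values the two directions disagree and the direct path is longer than the detour, so neither property holds in general.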

There are several leads:

  • Information geometry: KL is a special case of the delta-divergences (which come from the delta-connection). These divergences have the great advantage of being invariant under reparametrisation.
  • Information theory: KL(p,q) can be seen as the amount of information (in bits) that q is missing in order to specify p (the relative entropy). It is the average extra “surprise” of an incoming message drawn from p when you expected it to come from q.
  • Bayesian theory: KL minimisation can be “derived” from log-likelihood maximisation.
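
The information-theory reading in the list above can be checked on a toy example. The sketch below assumes a four-symbol alphabet and two hand-picked distributions (both hypothetical), and uses the standard identity that KL(p,q) in bits equals the cross-entropy of p under q minus the entropy of p, i.e. the extra bits per symbol paid for coding messages from p with a code designed for q:

```python
import math

def entropy_bits(p):
    """Average code length (bits/symbol) with a code matched to p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy_bits(p, q):
    """Average code length (bits/symbol) for symbols from p, code matched to q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_bits(p, q):
    """KL(p,q) measured in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]  # actual source of the messages
q = [0.25, 0.25, 0.25, 0.25]   # what the receiver expected

h_p = entropy_bits(p)            # 1.75 bits
h_pq = cross_entropy_bits(p, q)  # 2.0 bits
extra = kl_bits(p, q)            # 0.25 bits

print(h_p, h_pq, extra)
```

Here the mismatch costs a quarter of a bit per symbol on average.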

Invariance seems to be the most general requirement. Closeness between distributions should not depend on the parametrisation or on the base measure.
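
A quick numerical check of this invariance, as a sketch under stated assumptions: two Gaussians in a variable x, re-expressed in the variable y = exp(x) (so they become log-normals), with integrals approximated by a hand-written trapezoidal rule. The specific parameters and helper names are hypothetical. The squared L^2 distance is computed alongside for contrast.

```python
import numpy as np

def trapezoid(f, t):
    """Trapezoidal rule for samples f on a (possibly non-uniform) grid t."""
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(t)))

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Parametrisation 1: x on the real line.
x = np.linspace(-9.0, 11.0, 20001)
p_x = normal_pdf(x, 0.0, 1.0)
q_x = normal_pdf(x, 2.0, 1.0)

# Parametrisation 2: y = exp(x). Densities pick up the Jacobian factor 1/y.
y = np.exp(x)
p_y = p_x / y
q_y = q_x / y

# KL(p,q): the Jacobian cancels inside the log ratio.
kl_x = trapezoid(p_x * np.log(p_x / q_x), x)
kl_y = trapezoid(p_y * np.log(p_y / q_y), y)

# Squared L^2 distance: the Jacobian does not cancel.
l2_x = trapezoid((p_x - q_x) ** 2, x)
l2_y = trapezoid((p_y - q_y) ** 2, y)

print("KL  in x:", kl_x, " in y:", kl_y)
print("L^2 in x:", l2_x, " in y:", l2_y)
```

The two KL values should agree up to discretisation error, while the L^2 values depend on which variable was used to describe the same pair of distributions.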

Moreover, the delta-divergence point of view gives us a better understanding of the different approximate inference algorithms. Belief propagation, expectation propagation, variational Bayes, mean field, tree-reweighted belief propagation, power expectation propagation and generalised belief propagation are all unified by delta-divergences (this paper).

The information theory justification seems weaker to me because it takes place in a theory of communication, requiring an emitter, a channel and a receiver. However, it’s intuitive, and expressing the divergence in bits shows the parametrisation independence.
Finally, I’m still not sure of my derivation from log-likelihood maximisation, especially in the continuous case.
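
One standard way to sketch the link between log-likelihood maximisation and KL minimisation is the following. It is a sketch and not necessarily the derivation alluded to above. It assumes i.i.d. samples x_1, ..., x_N from p, a parametric model q_θ (the symbols θ, N and H are introduced here), and that the expectations involved are finite:

```latex
\frac{1}{N}\sum_{i=1}^{N}\log q_\theta(x_i)
\;\xrightarrow[N\to\infty]{}\;
\mathbb{E}_{p}\!\left[\log q_\theta(x)\right]
= \mathbb{E}_{p}\!\left[\log p(x)\right]
  - \mathbb{E}_{p}\!\left[\log\frac{p(x)}{q_\theta(x)}\right]
= -H(p) - \mathrm{KL}(p, q_\theta),
\qquad\text{so}\qquad
\arg\max_\theta \mathbb{E}_{p}\!\left[\log q_\theta(x)\right]
= \arg\min_\theta \mathrm{KL}(p, q_\theta).
```

Since H(p) does not depend on θ, maximising the average log-likelihood amounts, in the large-sample limit, to minimising KL(p, q_θ). In the continuous case the same steps go through with integrals and with H(p) read as the differential entropy, provided it is finite. What does not carry over directly is plugging in the empirical distribution of the samples for p, because KL between a set of point masses and a continuous density is not finite; the argument then has to go through the law of large numbers as above.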
