|
The Kullback-Leibler (KL) divergence is a common measure of the “distance” between two probability distributions. It is central to probability-based machine learning algorithms.
For instance, when trying to approximate an intractable distribution p(x), we can try to minimise KL(p,q) (or KL(q,p)) with q belonging to a particular class of distributions (e.g. an exponential family).
(KL is used in variational methods and in message-passing algorithms for approximate inference.)
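Since both KL(p,q) and KL(q,p) appear above, it may help to see that they are generally different numbers. The following is only a minimal sketch for discrete distributions, using the usual definition KL(p,q) = sum_i p_i log(p_i / q_i); the function name `kl_divergence` and the two example distributions are assumptions made for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(p, q) = sum_i p_i * log(p_i / q_i), in nats.

    Terms with p_i = 0 contribute nothing; if q_i = 0 where p_i > 0
    the divergence is infinite.
    """
    total = 0.0
    for a, b in zip(p, q):
        if a == 0.0:
            continue
        if b == 0.0:
            return math.inf
        total += a * math.log(a / b)
    return total

# Illustrative distributions (assumed, not from the text).
p = [0.7, 0.2, 0.1]
q = [1 / 3, 1 / 3, 1 / 3]

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
print("KL(p,q) =", kl_pq)
print("KL(q,p) =", kl_qp)
```

Both values are non-negative but they differ, which is one reason KL is called a divergence rather than a distance.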
Another scenario is when we assume that a “true” distribution p generated some data, and we infer q from our prior and the data without knowing p. In this case KL(p,q) measures how close we are to the true p. This can help to derive learning bounds and estimator properties.
But why use this divergence (which isn’t a true distance)? Why not a more classical L^2 distance, or chi-square?
There are several leads:
- Information geometry: KL is a special case of delta-divergences (coming from the delta-connection). These divergences have the great advantage of being invariant under reparametrisation.
- Information theory: KL(p,q) can be seen as the amount of extra information (in bits, when logarithms are taken in base 2) needed to specify p when one only has q (relative entropy). It is the average extra “surprise” of an incoming message drawn from p when you expected it to come from q.
- Bayesian theory: KL minimisation can be “derived” from log-likelihood maximisation.
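The link between log-likelihood maximisation and KL minimisation mentioned in the last item can be sketched as follows. This is only an outline under stated assumptions: the samples x_1, ..., x_N are i.i.d. draws from p, q_θ is a parametric family (the symbol θ is introduced here for illustration), and the entropy of p is finite.

```latex
\begin{align*}
\frac{1}{N}\sum_{i=1}^{N} \log q_\theta(x_i)
  &\xrightarrow[N\to\infty]{} \mathbb{E}_{p}\left[\log q_\theta(x)\right]
   && \text{(law of large numbers)} \\
  &= \mathbb{E}_{p}\left[\log p(x)\right]
     - \mathbb{E}_{p}\left[\log \frac{p(x)}{q_\theta(x)}\right] \\
  &= -H(p) - \mathrm{KL}(p, q_\theta).
\end{align*}
% H(p) does not depend on \theta, hence
\[
\arg\max_\theta \; \mathbb{E}_{p}\left[\log q_\theta(x)\right]
  \;=\; \arg\min_\theta \; \mathrm{KL}(p, q_\theta).
\]
```

In the discrete case the same identity holds exactly for finite N if p is replaced by the empirical distribution of the samples. In the continuous case the empirical distribution has no density, so the statement is only meaningful in the limit, with H(p) read as the differential entropy.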
Invariance seems to be the most general requirement. Closeness between distributions should not depend on the parametrisation or on the base measure.
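The invariance requirement can be checked numerically. The sketch below is only an illustration under assumptions: two Gaussian densities chosen arbitrarily, a simple change of units y = 10x as the reparametrisation, and plain trapezoidal integration on a grid. It compares KL with the L^2 distance before and after the change of variable.

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def trapezoid(values, grid):
    # Trapezoidal rule on a (possibly non-uniform) grid.
    return sum(0.5 * (values[i] + values[i + 1]) * (grid[i + 1] - grid[i])
               for i in range(len(grid) - 1))

def kl(p, q, grid):
    # KL(p, q) = integral of p * log(p / q), in nats.
    return trapezoid([a * math.log(a / b) for a, b in zip(p, q)], grid)

def l2(p, q, grid):
    # L^2 distance between the two densities.
    return math.sqrt(trapezoid([(a - b) ** 2 for a, b in zip(p, q)], grid))

# Densities in the original variable x (illustrative choice).
n = 4001
xs = [-12.0 + 24.0 * i / (n - 1) for i in range(n)]
p_x = [normal_pdf(x, 0.0, 1.0) for x in xs]
q_x = [normal_pdf(x, 1.0, 1.5) for x in xs]

# Reparametrise with y = scale * x: densities pick up the Jacobian 1 / scale.
scale = 10.0
ys = [scale * x for x in xs]
p_y = [v / scale for v in p_x]
q_y = [v / scale for v in q_x]

kl_x, kl_y = kl(p_x, q_x, xs), kl(p_y, q_y, ys)
l2_x, l2_y = l2(p_x, q_x, xs), l2(p_y, q_y, ys)

# Closed-form KL between two univariate Gaussians, for comparison.
kl_closed_form = math.log(1.5) + (1.0 + 1.0) / (2.0 * 1.5 ** 2) - 0.5

print("KL in x:", kl_x, " KL in y:", kl_y, " closed form:", kl_closed_form)
print("L2 in x:", l2_x, " L2 in y:", l2_y)
```

KL comes out the same in both parametrisations, because the Jacobian cancels inside the log ratio and against the change of measure, whereas the L^2 distance shrinks by a factor of sqrt(10). Any smooth monotone map could be used in place of the rescaling, as long as the densities are multiplied by the corresponding Jacobian.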
Moreover, the delta-divergence point of view gives us a better understanding of the different approximate inference algorithms. Belief propagation, expectation propagation, variational Bayes, mean field, tree-reweighted belief propagation, power expectation propagation and generalised belief propagation are all unified under delta-divergences (this paper).
The information theory justification seems weaker to me because it takes place within a theory of communication, requiring an emitter, a channel and a receiver. However, it is intuitive, and expressing the divergence in bits shows the parametrisation independence. Finally, I’m still not sure of my derivation from log-likelihood maximisation, especially in the continuous case.
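To make the “bits” reading concrete, here is a small sketch, assuming a discrete source p and a receiver who codes messages as if they came from q. The helper names and the two distributions are illustrative assumptions. It uses the standard identity KL(p,q) = (cross-entropy of p under q) minus (entropy of p), with base-2 logarithms.

```python
import math

def entropy_bits(p):
    # Average code length (bits) with a code that is optimal for p.
    return -sum(a * math.log2(a) for a in p if a > 0.0)

def cross_entropy_bits(p, q):
    # Average code length (bits) when messages come from p
    # but the code was designed for q.
    return -sum(a * math.log2(b) for a, b in zip(p, q) if a > 0.0)

def kl_bits(p, q):
    # Extra bits paid per message for using the wrong model q.
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0.0)

# Illustrative distributions (assumed, not from the text).
p = [0.5, 0.25, 0.125, 0.125]
q = [0.25, 0.25, 0.25, 0.25]

h = entropy_bits(p)            # 1.75 bits
ce = cross_entropy_bits(p, q)  # 2.0 bits
extra = kl_bits(p, q)          # 0.25 bits
print(h, ce, extra)
```

Here the receiver spends 2 bits per message on average where 1.75 would suffice, so KL(p,q) is 0.25 bits per message.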
|