Why Maximum Entropy?

Suppose that we have a set of $ n$ training samples $ {\bf x}_1, \; {\bf x}_2, \; \ldots , {\bf x}_n$ and a number of proposed PDFs $ p_m({\bf x})$, and would like to determine which is a ``better" fit to the data. We can compare the PDFs based on the average log-likelihood $ L_m=\frac{1}{n}\sum_{i=1}^n \log p_m({\bf x}_i).$ But likelihood comparison by itself is misleading, so we introduce another relevant quantity, the entropy of a distribution, defined as $ Q_m= -\int_{{\bf x}} \{ \log p_m({\bf x}) \}\; p_m({\bf x}) {\rm d}{\bf x},$ which is the negative of the theoretical (expected) value of $ L_m$. Entropy is a generalization of the concept of variance: distributions that spread the probability mass over a wider area have higher entropy, since the average height of the distribution is lower.
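As a concrete illustration (not part of the original development), both quantities can be computed directly for simple candidate PDFs. The minimal Python sketch below assumes SciPy; the training samples, the two Gaussian candidates, and all variable names are hypothetical choices made only for illustration.
\begin{verbatim}
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=50)   # hypothetical training samples

# Two illustrative candidate PDFs p_m (not taken from the text).
candidates = {"narrow": norm(0.0, 0.5), "wide": norm(0.0, 2.0)}

for name, p in candidates.items():
    L = np.mean(p.logpdf(x))      # average log-likelihood L_m over the samples
    Q = float(p.entropy())        # differential entropy Q_m (closed form here)
    # Q_m could also be estimated by Monte Carlo as -np.mean(p.logpdf(p.rvs(100000)))
    print(f"{name:6s}  L = {L:7.3f}   Q = {Q:7.3f}")
\end{verbatim}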
Figure: Comparison of entropy $ Q$ and average log-likelihood $ L$ for three distributions. The vertical lines are the locations of training samples.
\includegraphics[width=3.0in]{overtrain.eps}
The two concepts of $ Q$ and $ L$ are compared in Figure 3.1, in which we show three competing distributions: $ p_A({\bf x})$, $ p_B({\bf x})$, and $ p_C({\bf x})$. The vertical lines represent the locations of the $ n$ training samples, and $ L_m$ is the average of the log-PDF values of each distribution at these sample locations. Clearly

$\displaystyle L_A \ll L_C \ll L_B.$
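The figure's scenario can be reproduced numerically. The hedged sketch below constructs three hypothetical candidates in the same spirit: a broad but poorly placed $ p_A$, an equally broad but well-placed $ p_C$, and an over-adapted $ p_B$ built by putting a narrow kernel on every training sample. All numbers (sample size, means, widths, kernel bandwidth) are invented for illustration, and the entropies are estimated by Monte Carlo.
\begin{verbatim}
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x = rng.normal(loc=1.0, scale=1.0, size=10)   # hypothetical training samples

p_A = norm(-2.0, 1.5)   # broad, poorly placed
p_C = norm( 1.0, 1.5)   # same width as p_A, well placed => Q_A = Q_C

def logpdf_B(t, centers=x, h=0.05):
    """Over-adapted candidate: a narrow Gaussian kernel on every sample."""
    return np.log(np.mean(norm.pdf(np.atleast_1d(t)[:, None],
                                   loc=centers, scale=h), axis=1))

def sample_B(n, centers=x, h=0.05):
    return rng.choice(centers, size=n) + h * rng.standard_normal(n)

def mc_entropy(logpdf, sampler, n=200_000):
    return -np.mean(logpdf(sampler(n)))       # Q = -E[log p], samples drawn from p

models = {"A": (p_A.logpdf, lambda n: p_A.rvs(n, random_state=rng)),
          "C": (p_C.logpdf, lambda n: p_C.rvs(n, random_state=rng)),
          "B": (logpdf_B, sample_B)}
for name, (lp, s) in models.items():
    print(f"{name}:  L = {np.mean(lp(x)):6.3f}   Q = {mc_entropy(lp, s):6.3f}")
# Expected pattern: L_A << L_C << L_B, but Q_B is much lower than
# Q_A = Q_C (up to Monte Carlo error).
\end{verbatim}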

But choosing $ p_B({\bf x})$ is very risky because it appears to be over-adapted to the training samples. Clearly $ p_B({\bf x})$ has lower entropy, since most of its probability mass is concentrated where the training samples lie and the likelihood is high. It has therefore achieved a higher $ L$ at the cost of a lower $ Q$, a suspicious situation. On the other hand, $ Q_A=Q_C$, but $ L_C>L_A$. Therefore, $ p_C({\bf x})$ has achieved a higher $ L$ than $ p_A({\bf x})$ without suffering a lower $ Q$. Choosing $ p_C({\bf x})$ over $ p_A({\bf x})$ is not risky. Thus, if we always choose the reference hypothesis that achieves the highest entropy for a given $ T({\bf x})$ and $ g({\bf z})$, and we are careful to use data holdout (i.e., separation of training and testing data) for the estimation of $ g({\bf z})$, then we can be confident that any increase in $ L$ is due solely to the choice of better features. The features are, in short, more descriptive of the data because the ``unobserved" dimensions, as previously defined, are distributed in the broadest possible way (highest entropy). This highest-entropy distribution, which is found by maximizing the entropy of $ G({\bf x};H_0,T,g)$ over $ H_0$, is denoted by $ G^*({\bf x}; T,g)$. More precisely,

$\displaystyle H^*_0=\arg \max_{H_0} Q\left\{G({\bf x}; H_0,T,g)\right\},$

$\displaystyle G^*({\bf x}; T,g) \stackrel{\mbox{\tiny $\Delta$}}{=}G({\bf x}; H^*_0,T,g).$
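A minimal numerical sketch of this maximization follows, assuming the projected-PDF construction $ G({\bf x};H_0,T,g)=p({\bf x}\vert H_0)\,g(T({\bf x}))/p(T({\bf x})\vert H_0)$ from the earlier sections. The toy example is two-dimensional: $ H_0$ is a zero-mean Gaussian with unit variances and correlation $ \rho$, the feature is $ z=T({\bf x})=x_1$, and $ g(z)$ is a Gaussian presumed to have been estimated on held-out data. All of these choices, and the grid search over $ \rho$, are hypothetical; the point is only to show that $ Q\{G\}$ can be estimated by Monte Carlo and maximized over the free parameters of $ H_0$.
\begin{verbatim}
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(3)

# Hypothetical ingredients: feature z = T(x) = x[0], and g(z) assumed to have
# been estimated on held-out data as N(1, 0.5^2).  Neither is from the text.
g = norm(1.0, 0.5)

def log_G(x, rho):
    """log G(x; H_0, T, g) = log p(x|H_0) + log g(T(x)) - log p_z(T(x)|H_0)
    for H_0: x ~ N(0, Sigma(rho)) with unit variances and correlation rho,
    under which the feature z = x[0] has PDF p_z(z|H_0) = N(0, 1)."""
    Sigma = np.array([[1.0, rho], [rho, 1.0]])
    z = x[:, 0]
    return (multivariate_normal(np.zeros(2), Sigma).logpdf(x)
            + g.logpdf(z) - norm(0.0, 1.0).logpdf(z))

def sample_G(n, rho):
    """Sample from G: z ~ g, then x[1] | z ~ N(rho*z, 1 - rho^2),
    the conditional of the 'unobserved' coordinate under H_0."""
    z = g.rvs(size=n, random_state=rng)
    x2 = rho * z + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)
    return np.column_stack([z, x2])

def entropy_G(rho, n=200_000):
    """Monte Carlo estimate of Q{G} = -E_G[log G]."""
    return -np.mean(log_G(sample_G(n, rho), rho))

# Grid search for the maximum-entropy reference hypothesis H_0^*.
rhos = np.linspace(-0.9, 0.9, 19)
Q = [entropy_G(r) for r in rhos]
print("maximizing rho:", rhos[int(np.argmax(Q))])   # should land near rho = 0
\end{verbatim}
In this toy case the entropy works out to $ h(g)+\frac{1}{2}\log\{2\pi e(1-\rho^2)\}$, so the maximum-entropy reference hypothesis is the one that leaves the unobserved coordinate as spread out as possible ($ \rho=0$), mirroring the argument above.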
