Why Maximum Entropy?

The use of the maximum entropy criterion in PDF design is well established [23], and we provide an illustrative example here. Suppose that we have a set of $n$ training samples ${\bf x}_1, \; {\bf x}_2, \; \ldots , {\bf x}_n$ and a number of proposed PDFs $G_m({\bf x})$, and we would like to determine which is a ``better'' fit to the data. We can compare the PDFs on the basis of the average log-likelihood $L_m=\frac{1}{n}\sum_{i=1}^n \log G_m({\bf x}_i).$ But a likelihood comparison by itself can be misleading, because a PDF that concentrates its mass near the training samples can attain a high $L_m$ without generalizing, so we introduce another relevant quantity: the entropy of a distribution, defined as $Q_m= -\int_{{\bf x}} \{ \log G_m({\bf x}) \}\; G_m({\bf x}) \, {\rm d}{\bf x},$ which is the negative of the expected value of $L_m$ when the samples are drawn from $G_m({\bf x})$ itself. Entropy can be regarded as a generalization of the concept of variance: distributions that spread their probability mass over a wider region have higher entropy, since the average height of the distribution is lower.
\begin{figure}
\centering
\includegraphics[width=3.0in]{overtrain.eps}
\caption{Comparison of entropy $Q$ and average log-likelihood $L$ for three distributions. The vertical lines mark the locations of the training samples.}
\end{figure}
The two quantities $Q$ and $L$ are compared in Figure 3.1, in which we show three competing distributions: $G_A({\bf x})$, $G_B({\bf x})$, and $G_C({\bf x})$. The vertical lines mark the locations of the $n$ training samples, and $L_m$ is the average value of the log-PDF of each distribution evaluated at these samples. Clearly,

$\displaystyle L_A \ll L_C \ll L_B.$
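
As a minimal numerical sketch of this comparison (a hypothetical one-dimensional example of our own; the particular distributions, sample size, and kernel bandwidth below are illustrative choices, not taken from the figure), the quantities $L$ and $Q$ can be estimated with scipy as follows:

\begin{verbatim}
import numpy as np
from scipy.stats import norm, gaussian_kde

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=20)   # hypothetical 1-D training samples

# Three candidate PDFs, loosely playing the roles of G_A, G_C, G_B in the figure:
G_A = norm(loc=-2.0, scale=3.0)               # broad Gaussian, off-centre
G_C = norm(loc=x.mean(), scale=3.0)           # equally broad Gaussian, centred on the data
G_B = gaussian_kde(x, bw_method=0.05)         # narrow KDE: a spike at every training sample

# Average log-likelihood  L_m = (1/n) sum_i log G_m(x_i)
L_A, L_C, L_B = (np.mean(G.logpdf(x)) for G in (G_A, G_C, G_B))

# Entropy  Q_m = -E[log G_m]: closed form for the Gaussians, Monte Carlo for the KDE
Q_A, Q_C = G_A.entropy(), G_C.entropy()
z = G_B.resample(50_000)
Q_B = -np.mean(G_B.logpdf(z))

print(f"L_A={L_A:.2f}  L_C={L_C:.2f}  L_B={L_B:.2f}")
print(f"Q_A={Q_A:.2f}  Q_C={Q_C:.2f}  Q_B={Q_B:.2f}")
\end{verbatim}

Here the two broad Gaussians share the same entropy, while the spiky kernel-density estimate typically attains the highest average log-likelihood at the cost of a much lower entropy, mirroring the behavior of $G_B({\bf x})$ in the figure.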

But choosing $G_B({\bf x})$ is very risky because it appears to be over-adapted to the training samples. Clearly, $G_B({\bf x})$ has lower entropy, since it concentrates most of its probability mass in places of higher likelihood. It has therefore achieved a higher $L$ at the cost of a lower $Q$, a suspicious situation. On the other hand, $Q_A=Q_C$, but $L_C>L_A$: $G_C({\bf x})$ has achieved a higher $L$ than $G_A({\bf x})$ without suffering a lower $Q$, so choosing $G_C({\bf x})$ over $G_A({\bf x})$ is not risky.

Thus, if we always choose the reference hypothesis that achieves the highest entropy for a given $T({\bf x})$ and $g({\bf z})$, and we are careful to use data holdout (i.e., separation of training and testing data) when estimating $g({\bf z})$, then we can be confident that any increase in $L$ is due solely to the choice of better features. The features are, in short, more descriptive of the data because the ``unobserved'' dimensions, as previously defined, are distributed in the broadest possible way (highest entropy). This highest-entropy distribution, which is found by maximizing the entropy of $G({\bf x};H_0,T,g)$ over $H_0$, is denoted by $G^*({\bf x}; T,g)$. More precisely,

$\displaystyle H^*_0=\arg \max_{H_0} Q\left\{G({\bf x}; H_0,T,g)\right\},$

$\displaystyle G^*({\bf x}; T,g) \stackrel{\mbox{\tiny $\Delta$}}{=}G({\bf x}; H^*_0,T,g).$
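
As a schematic sketch of this selection rule (the candidate reference hypotheses and the routine supplying $Q\{G({\bf x};H_0,T,g)\}$ below are hypothetical placeholders, not part of the method described above), the choice of $H^*_0$ amounts to an argmax over the candidates:

\begin{verbatim}
# Schematic sketch of H0* = argmax_{H0} Q{G(x; H0, T, g)}.
# 'entropy_of_projected_pdf' is a hypothetical callable returning
# Q{G(x; H0, T, g)} for a given reference hypothesis H0; it is an
# assumption here, not something defined in the text above.

def select_reference_hypothesis(candidate_H0, entropy_of_projected_pdf):
    """Return the maximum-entropy reference hypothesis H0* and its entropy."""
    H0_star = max(candidate_H0, key=entropy_of_projected_pdf)
    return H0_star, entropy_of_projected_pdf(H0_star)

# Hypothetical usage: three candidate hypotheses whose projected-PDF
# entropies have (for illustration only) already been computed.
Q_values = {"H0_a": 2.52, "H0_b": 2.41, "H0_c": 1.97}
H0_star, Q_star = select_reference_hypothesis(Q_values, Q_values.get)
print(H0_star, Q_star)   # -> H0_a 2.52
\end{verbatim}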