Why Maximum Entropy?
Suppose that we have a set of training samples $\mathbf{x}_n$, $n=1,\ldots,N$,
and a number of proposed PDFs,
and would like to determine
which is a ``better'' fit to the data.
We can compare the PDFs based on the average log-likelihood
\[ L = \frac{1}{N} \sum_{n=1}^{N} \log p(\mathbf{x}_n). \]
But likelihood comparison by itself is misleading,
so we introduce another relevant quantity: the entropy of a distribution $p(\mathbf{x})$,
defined as
\[ H\{p\} = -\int p(\mathbf{x}) \log p(\mathbf{x}) \, d\mathbf{x}, \]
which is the negative of the theoretical (expected) value of $L$.
It is a generalization of the concept of variance.
Distributions that spread the probability mass over
a wider area have higher entropy since the average height of the
distribution is lower.
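As a concrete instance of this (an illustrative example not in the original), the differential entropy of a Gaussian has the closed form
\[ H\{\mathcal{N}(\mu,\sigma^2)\}
   = -\int \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
     \log\!\left( \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \right) dx
   = \frac{1}{2}\log\left(2\pi e \sigma^2\right), \]
which is monotonically increasing in the spread $\sigma$: doubling $\sigma$ raises the entropy by $\log 2$.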
Figure:
Comparison of entropy and average log-likelihood
for three distributions. The vertical lines are the locations of the training samples.

The two concepts of $L$ and $H$ are compared in Figure 3.1, in which
we show three competing distributions:
$p_1(\mathbf{x})$,
$p_2(\mathbf{x})$, and
$p_3(\mathbf{x})$.
The vertical lines represent the locations of the training samples.
$L$ is the average height of the log-PDF at these sample locations for each of the three distributions.
Clearly, $p_3$ attains the highest $L$.
But choosing $p_3$
is very risky because it appears to be
overadapted to the training samples.
Clearly, $p_3$
has lower entropy, since
most of its probability mass lies at places with higher
likelihood. It has therefore achieved a
higher $L$ at the cost of a lower $H$, a suspicious situation.
On the other hand, $L_2 > L_1$, but $H_2 \geq H_1$.
Therefore, $p_2$
has achieved a higher $L$ than $p_1$
without
suffering a lower $H$. Choosing $p_2$
over $p_1$
is not risky.
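To make the trade-off concrete, here is a small numerical sketch (not from the paper; the single-Gaussian and narrow-kernel-mixture candidates, the 20-sample training set, and all names are illustrative assumptions). The spiky candidate places a narrow bump on every training sample, so it wins on training log-likelihood, but a Monte Carlo estimate of its entropy shows the price paid in $H$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=20)   # training samples (illustrative)

def gauss(y, mu, sigma):
    """Gaussian PDF evaluated at y."""
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Broad candidate: one Gaussian fit by moments to the training set.
def p_broad(y):
    return gauss(y, x.mean(), x.std())

# Spiky candidate: equal-weight mixture of narrow bumps, one per sample.
SIG = 0.01
def p_spiky(y):
    return np.mean([gauss(y, xi, SIG) for xi in x], axis=0)

def avg_loglik(p, samples):
    """Average log-likelihood L over the given samples."""
    return np.mean(np.log(p(samples)))

def mc_entropy(p, sampler, n=50_000):
    """Monte Carlo estimate of H{p} = -E[log p], sampling from p itself."""
    y = sampler(n)
    return -np.mean(np.log(p(y)))

def sample_broad(n):
    return rng.normal(x.mean(), x.std(), size=n)

def sample_spiky(n):
    return rng.choice(x, size=n) + rng.normal(0.0, SIG, size=n)

L_broad = avg_loglik(p_broad, x)
L_spiky = avg_loglik(p_spiky, x)
H_broad = mc_entropy(p_broad, sample_broad)
H_spiky = mc_entropy(p_spiky, sample_spiky)

# The spiky PDF achieves higher L on the training data, but at the
# cost of a much lower entropy H -- the "suspicious situation" above.
print(f"L_broad={L_broad:.2f}  L_spiky={L_spiky:.2f}")
print(f"H_broad={H_broad:.2f}  H_spiky={H_spiky:.2f}")
```

Note that the spiky mixture's training-set advantage in $L$ would evaporate on held-out data, which is exactly why the text recommends data holdout below.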
Thus, if we always choose the reference hypothesis
that achieves the highest entropy for a given
feature transformation and feature PDF, and we are careful to use
data holdout (i.e., separation
of training and testing data) for estimation of
the feature PDF,
then we can be confident that any increase in $L$ is due solely to
the choice of better features.
The features are, in short, more descriptive of the data
because the ``unobserved'' dimensions,
as we previously defined, are distributed in the broadest
possible way (highest entropy).
This highest-entropy distribution, which is
found by maximizing the entropy of the projected PDF
over the reference hypothesis, is given its own notation.
More precisely,
Baggenstoss
20170519