## Why Maximum Entropy?

Suppose that we have a set of $K$ training samples $x_1, x_2, \ldots, x_K$ and a number of proposed PDFs, and would like to determine which is a "better" fit to the data. We can compare the PDFs based on the average log-likelihood

$$L = \frac{1}{K} \sum_{i=1}^{K} \log p(x_i).$$

But likelihood comparison by itself is misleading, so we introduce another relevant quantity: $Q$ is the entropy of a distribution, defined as

$$Q = -\int p(x) \log p(x) \, dx,$$

which is the negative of the theoretical value of $L$ (its expected value when the samples are drawn from $p(x)$ itself). It is a generalization of the concept of variance. Distributions that spread the probability mass over a wider area have higher entropy since the average height of the distribution is lower.

The two concepts, entropy $Q$ and average log-likelihood $L$, are compared in Figure 3.1, which shows three competing distributions: $p_1(x)$, $p_2(x)$, and $p_3(x)$. The vertical lines represent the locations of the training samples. $L$ is the average height of the log-PDF at these samples for each distribution. Clearly,

$$L_1 > L_2 > L_3.$$
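
The comparison can be reproduced numerically. The following is a minimal sketch, not the data behind Figure 3.1: the sample values, the bump width of 0.05, and the three candidate densities `p1`, `p2`, `p3` are all assumptions chosen only to imitate the situation described. The average log-likelihood is computed directly from the samples, and the entropy is approximated by a simple Riemann sum on a fixed grid.

```python
import math

def gauss_pdf(x, mean, sigma):
    """Univariate Gaussian density."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def avg_log_likelihood(pdf, samples):
    """Average log-PDF value over the samples."""
    return sum(math.log(pdf(x)) for x in samples) / len(samples)

def entropy(pdf, lo=-20.0, hi=20.0, n=40001):
    """Approximate -integral of p(x) log p(x) dx by a Riemann sum."""
    dx = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        p = pdf(lo + i * dx)
        if p > 0.0:  # p log p tends to 0 as p tends to 0
            total -= p * math.log(p) * dx
    return total

# Hypothetical training samples (illustrative values only).
samples = [-1.2, -0.4, 0.3, 0.9, 1.5]

def p1(x):
    # Over-adapted candidate: one narrow bump on every training sample.
    return sum(gauss_pdf(x, s, 0.05) for s in samples) / len(samples)

def p2(x):
    # Broad candidate centered near the samples.
    return gauss_pdf(x, 0.2, 1.0)

def p3(x):
    # Same shape as p2 (hence same entropy) but poorly located.
    return gauss_pdf(x, 2.0, 1.0)

L1, L2, L3 = (avg_log_likelihood(p, samples) for p in (p1, p2, p3))
Q1, Q2, Q3 = (entropy(p) for p in (p1, p2, p3))

for name, L, Q in (("p1", L1, Q1), ("p2", L2, Q2), ("p3", L3, Q3)):
    print(f"{name}: L = {L:7.3f}   Q = {Q:6.3f}")
```

Under these assumptions the narrow-bump density has the highest average log-likelihood and by far the lowest entropy, while the two broad densities share the same entropy and differ only in average log-likelihood.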

But choosing $p_1(x)$ is very risky because it seems to be over-adapted to the training samples. Clearly $p_1(x)$ has lower entropy, since most of its probability mass is at places with higher likelihood. Therefore, it has achieved higher average log-likelihood $L$ at the cost of lower entropy $Q$, a suspicious situation. On the other hand, writing $L_i$ and $Q_i$ for the average log-likelihood and entropy of $p_i(x)$, we have $L_2 > L_3$, but $Q_2 = Q_3$. Therefore, $p_2(x)$ has achieved higher $L$ than $p_3(x)$ without suffering lower $Q$. Choosing $p_2(x)$ over $p_3(x)$ is not risky.

Thus, if we always choose the reference hypothesis $H_0$ that achieves the highest entropy for a given feature transformation $T(x)$ and feature PDF $g(z)$, and we are careful to use data holdout (i.e. separation of training and testing data) for estimation of $g(z)$, then we can be confident that any increase in $L$ is due solely to the choice of better features. The features are, in short, more descriptive of the data because the "unobserved" dimensions, as we previously defined, are distributed in the broadest possible way (highest entropy). This highest-entropy distribution, which is found by maximizing the entropy of $G(x; H_0, T, g)$ over $H_0$, is denoted by $G^*(x; T, g)$. More precisely,

$$G^*(x; T, g) = G(x; H_0^*, T, g), \qquad H_0^* = \arg\max_{H_0} Q\{G(x; H_0, T, g)\}.$$
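
The role of data holdout can be illustrated with a small experiment. This is only a sketch under stated assumptions: the training and testing values are hypothetical, the over-adapted density is a mixture of narrow Gaussian bumps (width 0.05) placed on the training samples, and the broad density is a single Gaussian fitted to the training samples. It does not implement PDF projection; it shows only why the average log-likelihood should be measured on samples that were not used for estimation.

```python
import math

def gauss_pdf(x, mean, sigma):
    """Univariate Gaussian density."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def avg_log_likelihood(pdf, samples):
    """Average log-PDF value over the samples."""
    return sum(math.log(pdf(x)) for x in samples) / len(samples)

# Hypothetical data split (illustrative values only).
train = [-1.2, -0.4, 0.3, 0.9, 1.5]
test = [-0.8, 0.0, 0.6, 1.1]

def over_adapted(x):
    # Narrow bumps centered on the training samples.
    return sum(gauss_pdf(x, s, 0.05) for s in train) / len(train)

# Broad alternative: a single Gaussian fitted to the training samples.
mean = sum(train) / len(train)
sigma = math.sqrt(sum((s - mean) ** 2 for s in train) / len(train))

def broad(x):
    return gauss_pdf(x, mean, sigma)

L_train_over = avg_log_likelihood(over_adapted, train)
L_test_over = avg_log_likelihood(over_adapted, test)
L_train_broad = avg_log_likelihood(broad, train)
L_test_broad = avg_log_likelihood(broad, test)

print(f"over-adapted: train L = {L_train_over:8.3f}, test L = {L_test_over:8.3f}")
print(f"broad:        train L = {L_train_broad:8.3f}, test L = {L_test_broad:8.3f}")
```

With these values the over-adapted density wins on the training samples but loses badly on the held-out samples, whereas the broad density scores about the same on both sets.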

Baggenstoss 2017-05-19