Optimality Conditions of the Theorem

Theorem 1 shows that, provided we know the PDF under some reference hypothesis $H_0$ at both the input and output of the transformation $T({\bf x})$, then given an arbitrary PDF $g({\bf z})$ defined on ${\bf z}$, we can immediately find a PDF $G({\bf x})$ defined on ${\bf x}$ that generates $g({\bf z})$. While it is interesting that $G({\bf x})$ generates $g({\bf z})$, there are infinitely many PDFs that do so, and it is not yet clear that $G({\bf x})$ is the best choice. However, suppose we would like to use $G({\bf x})$ as an approximation to the PDF $p({\bf x}\vert H_1)$. Define

$\displaystyle \hat{p}({\bf x}\vert H_1) \stackrel{\mbox{\tiny$\Delta$}}{=} {p({\bf x}\vert H_0) \over p({\bf z}\vert H_0)}\; \hat{p}({\bf z}\vert H_1) \;\;\;\; \mbox{ where } \;\;\; {\bf z}=T({\bf x}).$ (2.6)
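
To make (2.6) concrete, the following sketch (a hypothetical Python illustration, not part of the theorem) evaluates the projected PDF for an assumed setup: $H_0$ taken as $N$ i.i.d. standard Gaussian samples, the energy feature $z=T({\bf x})=\sum_i x_i^2$ (so that $p(z\vert H_0)$ is chi-squared with $N$ degrees of freedom), and a stand-in gamma density for $\hat{p}(z\vert H_1)$; in practice the feature density would be estimated from $H_1$ training data.

\begin{verbatim}
import numpy as np
from scipy import stats

N = 16  # input dimension (assumed)

def log_p_x_H0(x):
    # log p(x|H0): N i.i.d. standard Gaussian samples (assumed reference)
    return np.sum(stats.norm.logpdf(x))

def T(x):
    # feature: total energy z = sum_i x_i^2
    return np.sum(x**2)

def log_p_z_H0(z):
    # under H0 the energy of N i.i.d. N(0,1) samples is chi-squared, N dof
    return stats.chi2.logpdf(z, df=N)

def log_p_hat_z_H1(z):
    # stand-in feature density estimate p_hat(z|H1); a gamma density is
    # assumed here in place of a fit to H1 training data
    return stats.gamma.logpdf(z, a=N, scale=2.0)

def log_p_hat_x_H1(x):
    # projected PDF, eq. (2.6):
    #   p_hat(x|H1) = [p(x|H0) / p(z|H0)] * p_hat(z|H1),  z = T(x)
    z = T(x)
    return log_p_x_H0(x) - log_p_z_H0(z) + log_p_hat_z_H1(z)

x = np.random.default_rng(0).normal(size=N)
print(log_p_hat_x_H1(x))
\end{verbatim}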

From Theorem 1, we see that (2.6) is a valid PDF. Furthermore, if $T({\bf x})$ is a sufficient statistic for $H_1$ vs $H_0$, the likelihood ratio is unchanged by the transformation:

$\displaystyle {p({\bf x}\vert H_1) \over p({\bf x}\vert H_0)}
= {p({\bf z}\vert H_1) \over p({\bf z}\vert H_0)}.$ (2.7)

Comparing (2.7) with the definition (2.6), we see that as $\hat{p}({\bf z}\vert H_1)\rightarrow p({\bf z}\vert H_1)$, we have

$\displaystyle \hat{p}({\bf x}\vert H_1)\rightarrow p({\bf x}\vert H_1),$

or that the PDF estimate $\hat{p}({\bf x}\vert H_1)$ approaches the true PDF $p({\bf x}\vert H_1)$.
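
The fact that (2.6) is a valid PDF can also be checked numerically: under $H_0$, the expected value of $\hat{p}({\bf x}\vert H_1)/p({\bf x}\vert H_0)$ reduces to $\int \hat{p}(z\vert H_1)\,dz = 1$, whatever feature density is plugged in. A minimal Monte Carlo sketch, reusing the assumed Gaussian $H_0$ and energy feature from above:

\begin{verbatim}
import numpy as np
from scipy import stats

N = 16
rng = np.random.default_rng(1)
x = rng.normal(size=(200_000, N))    # draws from the assumed H0
z = np.sum(x**2, axis=1)             # energy feature z = T(x)

# E_H0[ p_hat(z|H1) / p(z|H0) ] should be 1 whether or not the
# stand-in gamma density matches the true p(z|H1)
log_ratio = (stats.gamma.logpdf(z, a=N, scale=2.0)   # p_hat(z|H1)
             - stats.chi2.logpdf(z, df=N))           # p(z|H0)
print(np.mean(np.exp(log_ratio)))    # approximately 1.0
\end{verbatim}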

This result gives us guidance about how to choose not just the features, but also $H_0$. In short, to make the projected PDF as good an approximation to $p({\bf x}\vert H_1)$ as possible, choose $T({\bf x})$ and $H_0$ so that $T({\bf x})$ is an approximately sufficient statistic for the likelihood ratio test between $H_1$ and $H_0$. Note that the sufficiency condition is required for optimality, but is not necessary for (2.6) to be a valid PDF.

Here we can see the importance of the theorem: it provides a means of creating PDF approximations on the high-dimensional input data space, without a dimensionality penalty, using low-dimensional feature PDFs. It also provides a way to optimize the approximation by controlling both the reference hypothesis $H_0$ and the features themselves. This is the remarkable property of Theorem 1: the resulting function remains a PDF whether or not the features are sufficient statistics. Since sufficiency means optimality of the classifier, approximate sufficiency means approximate PDF estimation and approximately optimal classification.
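
As an illustration of how the projected PDFs might be used for classification, the sketch below (hypothetical; the feature choices and fitted feature densities are assumptions, not taken from the text) projects a different feature for each class through the common reference $H_0$ via (2.6), so the resulting log-PDFs are directly comparable on the common ${\bf x}$-space despite the different features:

\begin{verbatim}
import numpy as np
from scipy import stats

N = 16

def log_proj(x, T, log_p_z_H0, log_p_hat_z_H1):
    # log p_hat(x|H_j) via eq. (2.6), with H0 = i.i.d. N(0,1) (assumed)
    z = T(x)
    return (np.sum(stats.norm.logpdf(x))
            - log_p_z_H0(z) + log_p_hat_z_H1(z))

# Two hypothetical classes, each with its own scalar feature:
# class 1 uses the energy, class 2 uses the sample mean.
classes = [
    (lambda x: np.sum(x**2),
     lambda z: stats.chi2.logpdf(z, df=N),              # exact p(z|H0)
     lambda z: stats.gamma.logpdf(z, a=N, scale=2.0)),  # assumed H1 fit
    (lambda x: np.mean(x),
     lambda z: stats.norm.logpdf(z, scale=1/np.sqrt(N)),  # mean ~ N(0,1/N)
     lambda z: stats.norm.logpdf(z, loc=0.5, scale=1/np.sqrt(N))),
]

x = np.random.default_rng(2).normal(size=N) + 0.5     # test vector
scores = [log_proj(x, *c) for c in classes]
print("decide class", int(np.argmax(scores)) + 1)
\end{verbatim}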