Optimality Conditions of the Theorem

Theorem 1 shows that, provided we know the PDF under some reference hypothesis $H_0$ at both the input and output of the transformation $T({\bf x})$, then given an arbitrary PDF $g({\bf z})$ defined on ${\bf z}$, we can immediately find a PDF $G({\bf x})$ defined on ${\bf x}$ that generates $g({\bf z})$. While it is interesting that $G({\bf x})$ generates $g({\bf z})$, there are infinitely many PDFs that do so, and it is not yet clear that $G({\bf x})$ is the best choice. However, suppose we would like to use $G({\bf x})$ as an approximation to the PDF $p({\bf x}\vert H_1)$. Define

$\displaystyle \hat{p}({\bf x}\vert H_1) \stackrel{\mbox{\tiny$\Delta$}}{=} {p({\bf x}\vert H_0) \over p({\bf z}\vert H_0)}\; \hat{p}({\bf z}\vert H_1) \;\;\;\; \mbox{ where } \;\;\; {\bf z}=T({\bf x}).$ (2.6)
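
To make (2.6) concrete, the following sketch (a hypothetical Python illustration, not part of the theorem) evaluates the projected PDF for an assumed setup: $H_0$ taken as $N$ i.i.d. standard Gaussian samples, the energy feature $z=T({\bf x})=\sum_i x_i^2$ (so that $p(z\vert H_0)$ is chi-squared with $N$ degrees of freedom), and a stand-in gamma density for $\hat{p}(z\vert H_1)$; in practice the feature density would be estimated from $H_1$ training data.

\begin{verbatim}
import numpy as np
from scipy import stats

N = 16  # input dimension (assumed)

def log_p_x_H0(x):
    # log p(x|H0): N i.i.d. standard Gaussian samples (assumed reference)
    return np.sum(stats.norm.logpdf(x))

def T(x):
    # feature: total energy z = sum_i x_i^2
    return np.sum(x**2)

def log_p_z_H0(z):
    # under H0 the energy of N i.i.d. N(0,1) samples is chi-squared, N dof
    return stats.chi2.logpdf(z, df=N)

def log_p_hat_z_H1(z):
    # stand-in feature density estimate p_hat(z|H1); a gamma density is
    # assumed here in place of a fit to H1 training data
    return stats.gamma.logpdf(z, a=N, scale=2.0)

def log_p_hat_x_H1(x):
    # projected PDF, eq. (2.6):
    #   p_hat(x|H1) = [p(x|H0) / p(z|H0)] * p_hat(z|H1),  z = T(x)
    z = T(x)
    return log_p_x_H0(x) - log_p_z_H0(z) + log_p_hat_z_H1(z)

x = np.random.default_rng(0).normal(size=N)
print(log_p_hat_x_H1(x))
\end{verbatim}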

From Theorem 1, we see that (2.6) is a valid PDF. Furthermore, if $T({\bf x})$ is a sufficient statistic for $H_1$ vs $H_0$, the likelihood ratio is unchanged by the transformation:

$\displaystyle {p({\bf x}\vert H_1) \over p({\bf x}\vert H_0)}
= {p({\bf z}\vert H_1) \over p({\bf z}\vert H_0)}.$ (2.7)

Comparing (2.7) with the definition (2.6), we see that as $\hat{p}({\bf z}\vert H_1)\rightarrow p({\bf z}\vert H_1)$, we have

$\displaystyle \hat{p}({\bf x}\vert H_1)\rightarrow p({\bf x}\vert H_1),$

or that the PDF estimate $\hat{p}({\bf x}\vert H_1)$ approaches the true PDF $p({\bf x}\vert H_1)$.
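
The fact that (2.6) is a valid PDF can also be checked numerically: under $H_0$, the expected value of $\hat{p}({\bf x}\vert H_1)/p({\bf x}\vert H_0)$ reduces to $\int \hat{p}(z\vert H_1)\,dz = 1$, whatever feature density is plugged in. A minimal Monte Carlo sketch, reusing the assumed Gaussian $H_0$ and energy feature from above:

\begin{verbatim}
import numpy as np
from scipy import stats

N = 16
rng = np.random.default_rng(1)
x = rng.normal(size=(200_000, N))    # draws from the assumed H0
z = np.sum(x**2, axis=1)             # energy feature z = T(x)

# E_H0[ p_hat(z|H1) / p(z|H0) ] should be 1 whether or not the
# stand-in gamma density matches the true p(z|H1)
log_ratio = (stats.gamma.logpdf(z, a=N, scale=2.0)   # p_hat(z|H1)
             - stats.chi2.logpdf(z, df=N))           # p(z|H0)
print(np.mean(np.exp(log_ratio)))    # approximately 1.0
\end{verbatim}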

This result gives us guidance about how to choose not just the features, but also $H_0$. In short, to make the projected PDF as good an approximation to $p({\bf x}\vert H_1)$ as possible, choose $T({\bf x})$ and $H_0$ so that $T({\bf x})$ is an approximately sufficient statistic for the likelihood ratio test between $H_1$ and $H_0$. Note that the sufficiency condition is required for optimality, but is not necessary for (2.6) to be a valid PDF.

Here we can see the importance of the theorem: it provides a means of creating PDF approximations on the high-dimensional input data space, without a dimensionality penalty, using low-dimensional feature PDFs. It also provides a way to optimize the approximation by controlling both the reference hypothesis $H_0$ and the features themselves. This is the remarkable property of Theorem 1: the resulting function remains a PDF whether or not the features are sufficient statistics. Since sufficiency means optimality of the classifier, approximate sufficiency means approximate PDF estimation and approximately optimal classification.
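
As an illustration of how the projected PDFs might be used for classification, the sketch below (hypothetical; the feature choices and fitted feature densities are assumptions, not taken from the text) projects a different feature for each class through the common reference $H_0$ via (2.6), so the resulting log-PDFs are directly comparable on the common ${\bf x}$-space despite the different features:

\begin{verbatim}
import numpy as np
from scipy import stats

N = 16

def log_proj(x, T, log_p_z_H0, log_p_hat_z_H1):
    # log p_hat(x|H_j) via eq. (2.6), with H0 = i.i.d. N(0,1) (assumed)
    z = T(x)
    return (np.sum(stats.norm.logpdf(x))
            - log_p_z_H0(z) + log_p_hat_z_H1(z))

# Two hypothetical classes, each with its own scalar feature:
# class 1 uses the energy, class 2 uses the sample mean.
classes = [
    (lambda x: np.sum(x**2),
     lambda z: stats.chi2.logpdf(z, df=N),              # exact p(z|H0)
     lambda z: stats.gamma.logpdf(z, a=N, scale=2.0)),  # assumed H1 fit
    (lambda x: np.mean(x),
     lambda z: stats.norm.logpdf(z, scale=1/np.sqrt(N)),  # mean ~ N(0,1/N)
     lambda z: stats.norm.logpdf(z, loc=0.5, scale=1/np.sqrt(N))),
]

x = np.random.default_rng(2).normal(size=N) + 0.5     # test vector
scores = [log_proj(x, *c) for c in classes]
print("decide class", int(np.argmax(scores)) + 1)
\end{verbatim}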