Optimality Conditions of the Theorem

Theorem 1 shows that, provided we know the PDF under some reference hypothesis $ H_0$ at both the input and output of the transformation $ T({\bf x})$, then given an arbitrary PDF $ g({\bf z})$ defined on $ {\bf z}$, we can immediately find a PDF $ G({\bf x})$ defined on $ {\bf x}$ that generates $ g({\bf z})$. While it is interesting that $ G({\bf x})$ generates $ g({\bf z})$, there are infinitely many PDFs on $ {\bf x}$ that do so, and it is not yet clear that $ G({\bf x})$ is the best choice. However, suppose we would like to use $ G({\bf x})$ as an approximation to the PDF $ p({\bf x}\vert H_1)$. Define

$\displaystyle \hat{p}({\bf x}\vert H_1) \stackrel{\mbox{\tiny$\Delta$}}{=}{p({\bf x}\vert H_0) \over p({\bf z}\vert H_0)} \; \hat{p}({\bf z}\vert H_1) \;\;\;\; \mbox{ where } \;\;\; {\bf z}=T({\bf x}).$ (2.6)
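As a concrete sanity check of (2.6), the following sketch numerically verifies that the projected function integrates to one. The specific choices here are illustrative assumptions, not from the text: $ H_0$ is taken as ${\bf x}\sim{\cal N}(0,I_2)$, the feature is $ z=T({\bf x})=x_1+x_2$ (so $ z\vert H_0\sim{\cal N}(0,2)$), and $ \hat{p}(z\vert H_1)$ is an arbitrary Laplace density.

```python
import numpy as np
from scipy import stats

# Assumed reference hypothesis H0: x ~ N(0, I_2) on R^2 (illustrative choice)
# Feature: z = T(x) = x1 + x2, hence z | H0 ~ N(0, 2)
# Arbitrary feature PDF g(z) = p_hat(z|H1): a Laplace density (any valid PDF works)

def projected_pdf(x1, x2):
    z = x1 + x2
    p_x_H0 = stats.norm.pdf(x1) * stats.norm.pdf(x2)       # p(x|H0)
    p_z_H0 = stats.norm.pdf(z, scale=np.sqrt(2.0))         # p(z|H0)
    g_z = stats.laplace.pdf(z)                             # p_hat(z|H1)
    return p_x_H0 / p_z_H0 * g_z                           # equation (2.6)

# Riemann-sum integration of the projected function over a large box
t = np.linspace(-8.0, 8.0, 1601)
dx = t[1] - t[0]
X1, X2 = np.meshgrid(t, t)
total = projected_pdf(X1, X2).sum() * dx * dx
print(f"integral of projected PDF = {total:.4f}")
```

The integral comes out numerically equal to one, as Theorem 1 guarantees, even though the Laplace density was chosen with no regard to sufficiency.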

From Theorem 1, we see that (2.6) is a PDF. Furthermore, if $ T({\bf x})$ is a sufficient statistic for $ H_1$ vs $ H_0$, the likelihood ratio factors through the feature:

$\displaystyle {p({\bf x}\vert H_1) \over p({\bf x}\vert H_0)} = {p({\bf z}\vert H_1) \over p({\bf z}\vert H_0)},$ (2.7)

thus, comparing (2.7) with (2.6), as $ \hat{p}({\bf z}\vert H_1)\rightarrow p({\bf z}\vert H_1)$, we have

$\displaystyle \hat{p}({\bf x}\vert H_1)\rightarrow p({\bf x}\vert H_1),$

or that the PDF estimate $ \hat{p}({\bf x}\vert H_1)$ approaches the true PDF $ p({\bf x}\vert H_1)$.
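When $ T({\bf x})$ is exactly sufficient and $ \hat{p}({\bf z}\vert H_1)=p({\bf z}\vert H_1)$, the projection recovers the true PDF exactly. The sketch below checks this for one assumed, purely illustrative setup: $ H_0$ is $ N$ iid samples of $ {\cal N}(0,1)$, $ H_1$ is iid $ {\cal N}(\mu,1)$, and $ T({\bf x})=\bar{x}$ (the sample mean), which is sufficient for this pair of hypotheses.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N, mu = 4, 1.0   # illustrative sample size and H1 mean

def projected(x):
    # Equation (2.6) with the true feature PDF p(z|H1) plugged in.
    z = x.mean()                                            # z = T(x), sufficient here
    p_x_H0 = stats.norm.pdf(x).prod()                       # p(x|H0): iid N(0,1)
    p_z_H0 = stats.norm.pdf(z, scale=1/np.sqrt(N))          # z|H0 ~ N(0, 1/N)
    p_z_H1 = stats.norm.pdf(z, loc=mu, scale=1/np.sqrt(N))  # z|H1 ~ N(mu, 1/N)
    return p_x_H0 / p_z_H0 * p_z_H1

def true_pdf(x):
    return stats.norm.pdf(x, loc=mu).prod()                 # p(x|H1): iid N(mu,1)

x = rng.normal(size=N)
print(projected(x), true_pdf(x))   # the two values agree
```

Repeating the experiment with a non-sufficient feature (say, $ z=x_1$ alone) still yields a valid PDF from (2.6), but one that no longer matches $ p({\bf x}\vert H_1)$, which is exactly the distinction drawn above.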

This result gives us guidance about how to choose not just the features, but also $ H_0$. In short, to make the projected PDF as good an approximation to $ p({\bf x}\vert H_1)$ as possible, choose $ T({\bf x})$ and $ H_0$ so that $ T({\bf x})$ is an approximately sufficient statistic for the likelihood ratio test between $ H_1$ and $ H_0$. Note that the sufficiency condition is required for optimality, but it is not necessary for (2.6) to be a valid PDF.

Here we see the importance of the theorem: it provides a means of creating PDF approximations on the high-dimensional input data space, without a dimensionality penalty, using low-dimensional feature PDFs. It also provides a way to optimize the approximation by controlling both the reference hypothesis $ H_0$ and the features themselves. This is the remarkable property of Theorem 1: the resulting function remains a PDF whether or not the features are sufficient statistics. Since sufficiency implies optimality of the classifier, approximate sufficiency implies approximate PDF accuracy and approximate optimality.

Baggenstoss 2017-05-19