Maximum likelihood and PDF Projection

We have stated that when we use a floating reference hypothesis, we prefer to choose the reference hypothesis such that the numerator of the J-function is a maximum. Since we often have parametric forms for the PDFs, this amounts to finding the ML estimates of the parameters. If there are a small number of features, all of the features are ML estimators for parameters of the PDF, and there is sufficient data to guarantee that the ML estimators fall in the asymptotic (large data) region, then the floating hypothesis approach is equivalent to an existing approach based on classical asymptotic ML theory. We will derive the well-known asymptotic result using (2.15).

Two well-known results from asymptotic theory [17] are the following.

  1. Subject to certain regularity conditions (large amount of data, a PDF that depends on a finite number of parameters and is differentiable, etc.), the PDF $p_x({\bf x};\mbox{\boldmath$\theta$}^*)$ may be approximated by

    $\displaystyle p_x({\bf x};\mbox{\boldmath$\theta$}^*) \simeq p_x({\bf x}; \hat{\mbox{\boldmath$\theta$}}) \; \exp\left\{ -\frac{1}{2} (\mbox{\boldmath$\theta$}^*- \hat{\mbox{\boldmath$\theta$}})' \, {\bf I}(\hat{\mbox{\boldmath$\theta$}}) \, (\mbox{\boldmath$\theta$}^*- \hat{\mbox{\boldmath$\theta$}})\right\},$ (2.18)

    where $\mbox{\boldmath$\theta$}^*$ is an arbitrary value of the parameter, $\hat{\mbox{\boldmath$\theta$}}$ is the maximum likelihood estimate (MLE) of $\mbox{\boldmath$\theta$}$, and ${\bf I}(\mbox{\boldmath$\theta$})$ is the Fisher information matrix (FIM) [17]. The components of the FIM for PDF parameters $\theta_{k},\theta_{l}$ are given by

    $\displaystyle {\bf I}_{\theta_k,\theta_l}(\mbox{\boldmath$\theta$}) = -{\bf E}\left(\frac{\partial^{2} \ln p_x({\bf x}; \mbox{\boldmath$\theta$})}{\partial\theta_{k}\,\partial\theta_{l}}\right). $

    The approximation is valid only for $\mbox{\boldmath$\theta$}^*$ in the vicinity of the MLE (and the true value).

  2. The MLE $\hat{\mbox{\boldmath$\theta$}}$ is approximately Gaussian with mean equal to the true value $\mbox{\boldmath$\theta$}$ and covariance equal to ${\bf I}^{-1}(\mbox{\boldmath$\theta$})$, or

    $\displaystyle p_\theta(\hat{\mbox{\boldmath$\theta$}};\mbox{\boldmath$\theta$}) \simeq (2\pi)^{-D/2} \left\vert {\bf I}(\hat{\mbox{\boldmath$\theta$}}) \right\vert^{\frac{1}{2}} \exp\left\{ -\frac{1}{2} (\mbox{\boldmath$\theta$}- \hat{\mbox{\boldmath$\theta$}})' \, {\bf I}(\hat{\mbox{\boldmath$\theta$}}) \, (\mbox{\boldmath$\theta$}- \hat{\mbox{\boldmath$\theta$}})\right\},$ (2.19)

    where $D$ is the dimension of $\mbox{\boldmath$\theta$}$. Note that we use $\hat{\mbox{\boldmath$\theta$}}$ in place of $\mbox{\boldmath$\theta$}$, which is unknown, when evaluating the FIM. This is allowed because ${\bf I}^{-1}(\mbox{\boldmath$\theta$})$ has only a weak dependence on $\mbox{\boldmath$\theta$}$. The approximation is valid only for $\mbox{\boldmath$\theta$}$ in the vicinity of the MLE. A numerical illustration of this result is sketched below.
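
The following minimal sketch illustrates result 2 numerically. It is only an illustration (not part of the software accompanying this text) and uses a hypothetical one-parameter example: the rate $\lambda$ of $N$ i.i.d. exponential samples, for which $\hat{\lambda}=1/\bar{x}$ and ${\bf I}(\lambda)=N/\lambda^2$.

\begin{verbatim}
# Monte Carlo illustration of asymptotic result 2 (illustrative sketch only):
# for N i.i.d. Exp(lambda) samples the MLE is lambda_hat = 1/xbar and the FIM
# is I(lambda) = N/lambda^2, so lambda_hat should be approximately Gaussian
# with variance I^{-1}(lambda) = lambda^2/N.
import numpy as np

rng = np.random.default_rng(0)
lam, N, trials = 2.0, 500, 20000

x = rng.exponential(scale=1.0/lam, size=(trials, N))   # independent data records
lam_hat = 1.0/x.mean(axis=1)                            # MLE for each record

print("empirical mean of MLE:", lam_hat.mean())         # close to lam
print("empirical var of MLE :", lam_hat.var())          # close to lam^2/N
print("CR bound lam^2/N     :", lam**2/N)
\end{verbatim}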

To apply equation (2.15), $\hat{\mbox{\boldmath$\theta$}}$ takes the place of ${\bf z}$ and $H_0({\bf z})$ is the hypothesis that $\hat{\mbox{\boldmath$\theta$}}$ is the true value of $\mbox{\boldmath$\theta$}$. We substitute (2.18) for $p_x({\bf x}\vert H_0({\bf z}))$ and (2.19) for $p_z({\bf z}\vert H_0({\bf z}))$. Under the stated conditions (evaluation at $\mbox{\boldmath$\theta$}^* = \mbox{\boldmath$\theta$}= \hat{\mbox{\boldmath$\theta$}}$), the exponential terms in approximations (2.18), (2.19) become 1. Using these approximations, we arrive at

$\displaystyle \hat{p}_x({\bf x}\vert H_1) = \left[ \frac{ p_x({\bf x}; \hat{\mbox{\boldmath$\theta$}}) }{ (2\pi)^{-D/2} \left\vert {\bf I}(\hat{\mbox{\boldmath$\theta$}}) \right\vert^{\frac{1}{2}} } \right] \; \hat{p}_\theta(\hat{\mbox{\boldmath$\theta$}}\vert H_1),$ (2.20)

which agrees with the PDF approximation from asymptotic theory [18], [19]. Equation (2.20) is very useful for integrating ML estimators into class-specific classifiers, and we will give examples of its use. The first term (in brackets) is the J-function.
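
For reference, written out with both exponential terms equal to one, the bracketed J-function term is simply

$\displaystyle \frac{p_x({\bf x}\vert H_0({\bf z}))}{p_z({\bf z}\vert H_0({\bf z}))} = \frac{p_x({\bf x}; \hat{\mbox{\boldmath$\theta$}})}{(2\pi)^{-D/2} \left\vert {\bf I}(\hat{\mbox{\boldmath$\theta$}}) \right\vert^{\frac{1}{2}}} = (2\pi)^{D/2} \left\vert {\bf I}(\hat{\mbox{\boldmath$\theta$}}) \right\vert^{-\frac{1}{2}} \, p_x({\bf x}; \hat{\mbox{\boldmath$\theta$}}).$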

To compare equations (2.15) and (2.20), we note that both carry an implied sufficiency requirement, for ${\bf z}$ and $\hat{\mbox{\boldmath$\theta$}}$, respectively. Specifically, $H_0({\bf z})$ must remain in the ROS of ${\bf z}$, while $\hat{\mbox{\boldmath$\theta$}}$ must be asymptotically sufficient for $\mbox{\boldmath$\theta$}$. However, (2.15) is more general: (2.20) is valid only when all of the features are ML estimators and only holds asymptotically for large data records, with the implication that $\hat{\mbox{\boldmath$\theta$}}$ tends to Gaussian, while (2.15) carries no such implication. This is particularly important in upstream processing, where there has not yet been significant data reduction and asymptotic results do not apply. Using (2.15), we can make simple adjustments to the reference hypothesis (such as controlling variance) to better match the data and avoid the PDF tails, as long as we are certain that we remain in the ROS of ${\bf z}$.

Example 5   We revisit examples 3 and 4, this time using the ML approach. Note that $\hat{\mu}$ and $\hat{\sigma}^2$ are the ML estimates of the mean and variance [15]. It is instructive to derive the CR bound for this problem (Section 17.5). Taking the log of (2.14),

$\displaystyle \log p({\bf x}; \mu,\sigma^2) = -\frac{N}{2} \log (2\pi\sigma^2) -\frac{1}{2\sigma^2} \sum_{i=1}^N (x_i-\mu)^2 .$ (2.21)

We require the first derivatives

$\displaystyle \frac{\partial}{\partial \mu} \log p({\bf x}; \mu,\sigma^2) = \frac{1}{\sigma^2} \sum_{i=1}^N (x_i-\mu),$

$\displaystyle \frac{\partial}{\partial \sigma^2} \log p({\bf x}; \mu,\sigma^2) = -\frac{N}{2\sigma^2} +\frac{1}{2\sigma^4} \sum_{i=1}^N (x_i-\mu)^2 .$

Taking second derivatives,

$\displaystyle \frac{\partial^2}{\partial \mu^2} \log p({\bf x}; \mu,\sigma^2) = -\frac{N}{\sigma^2},$

$\displaystyle \frac{\partial^2}{\partial \mu\,\partial \sigma^2} \log p({\bf x}; \mu,\sigma^2) = -\frac{1}{\sigma^4} \sum_{i=1}^N (x_i-\mu) ,$

$\displaystyle \frac{\partial^2}{\partial (\sigma^2)^2} \log p({\bf x}; \mu,\sigma^2) = \frac{N}{2\sigma^4} -\frac{1}{\sigma^6} \sum_{i=1}^N (x_i-\mu)^2 .$

The next step is to take the negative expectation $-{\cal E}(\;\cdot\;)$ of the above. Using ${\cal E}(x_i)=\mu$ and ${\cal E}\left[(x_i-\mu)^2\right]=\sigma^2$,

$\displaystyle I(\mu,\mu) = \frac{N}{\sigma^2},$

$\displaystyle I(\mu,\sigma^2) = 0,$

$\displaystyle I(\sigma^2,\sigma^2) = -\frac{N}{2\sigma^4} +\frac{1}{\sigma^6} N\sigma^2 =
\frac{N}{2\sigma^4}.$
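
As a quick check on this expectation step, the following symbolic sketch (illustrative only; symbol names are arbitrary) differentiates the log-density of a single observation and applies $-N{\cal E}(\cdot)$ using ${\cal E}(x)=\mu$ and ${\cal E}(x^2)=\mu^2+\sigma^2$.

\begin{verbatim}
# Symbolic check of the FIM entries (illustrative sketch, not part of the
# distributed software).  Differentiate the log-density of one observation
# x ~ N(mu, sigma2), then take -N*E(.) using E[x] = mu, E[x^2] = mu^2 + sigma2.
import sympy as sp

x = sp.Symbol('x', real=True)
mu = sp.Symbol('mu', real=True)
s2, N = sp.symbols('sigma2 N', positive=True)

logp = -sp.Rational(1, 2)*sp.log(2*sp.pi*s2) - (x - mu)**2/(2*s2)

def fim_entry(a, b):
    d2 = sp.expand(sp.diff(logp, a, b))                 # second partial derivative
    expected = d2.subs(x**2, mu**2 + s2).subs(x, mu)    # apply the moments of x
    return sp.simplify(-N*expected)                     # -N * E(d2)

print(fim_entry(mu, mu))   # N/sigma2
print(fim_entry(mu, s2))   # 0
print(fim_entry(s2, s2))   # N/(2*sigma2**2)
\end{verbatim}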

Finally, the FIM for this problem is given by

$\displaystyle {\bf I}(\hat{\mu}, \hat{\sigma}^2) =
\left[ \begin{array}{cc}
\frac{N}{\sigma^2} & 0\\
0 & \frac{N}{2\sigma^4}
\end{array} \right],
$

whose inverse is the CR bound

$\displaystyle {\bf C}(\hat{\mu}, \hat{\sigma}^2) =
\left[ \begin{array}{cc}
\frac{\sigma^2}{N} & 0\\
0 & \frac{2\sigma^4}{N}
\end{array} \right].
$

Note the close relationship to the CLT approach used in example 4. There is essentially no difference aside from the variance of $z_1$, which is $\frac{2\sigma^4}{N}$ in the CR bound analysis but $\frac{2\sigma^4}{N-1}$ in the CLT example. Whenever the ML approach can be used, it is, in fact, asymptotically the same as the CLT approach as $N$ becomes large.
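
The following short Monte Carlo sketch (again only an illustration with arbitrary parameter values, not the software accompanying this text) confirms that the spread of the ML estimates approaches the CR bound, and that the distinction between $\frac{2\sigma^4}{N}$ and $\frac{2\sigma^4}{N-1}$ is negligible for moderate $N$.

\begin{verbatim}
# Monte Carlo check of the CR bound for (mu_hat, sigma2_hat); illustrative
# sketch only, with arbitrary parameter choices.
import numpy as np

rng = np.random.default_rng(1)
mu, sigma2, N, trials = 1.0, 2.0, 200, 50000

x = rng.normal(mu, np.sqrt(sigma2), size=(trials, N))
mu_hat = x.mean(axis=1)                  # ML estimate of the mean
sigma2_hat = x.var(axis=1, ddof=0)       # ML estimate of the variance (1/N form)

print("var(mu_hat)    :", mu_hat.var(),     "  CR bound:", sigma2/N)
print("var(sigma2_hat):", sigma2_hat.var(), "  CR bound:", 2*sigma2**2/N,
      "  CLT value:", 2*sigma2**2/(N - 1))
\end{verbatim}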

We let the floating reference hypothesis be $H_0({\bf z}): \{\mu=\hat{\mu},\; \sigma^2=\hat{\sigma}^2\}$, or in other words that the true values of $\mu$ and $\sigma^2$ are equal to the ML estimates. We have

$\displaystyle \log p({\bf x}\vert H_0({\bf z})) = -\frac{N}{2}\log(2\pi \hat{\sigma}^2) -\frac{1}{2\hat{\sigma}^2} \sum_{i=1}^N ( x_i-\hat{\mu})^2 .$ (2.22)

Note that

$\displaystyle \sum_{i=1}^N ( x_i-\hat{\mu})^2 = N\hat{\sigma}^2,$

leaving

$\displaystyle \log p({\bf x}\vert H_0({\bf z})) = -\frac{N}{2}\log(2\pi \hat{\sigma}^2) -\frac{N}{2}.$ (2.23)

For $p({\bf z}\vert H_0({\bf z}))$, we have (see the denominator of equation 2.20) that

$\displaystyle \log p({\bf z}\vert H_0({\bf z})) = -\frac{D}{2}\log(2\pi) + \frac{1}{2} \log \left\vert {\bf I}(\hat{\mu}, \hat{\sigma}^2) \right\vert, $

where $D=2$.

We therefore have that

$\displaystyle \log p({\bf z}\vert H_0({\bf z})) = -\log(2\pi)-\frac{1}{2}\log\left\{ \frac{\hat{\sigma}^2}{N} \cdot \frac{2\hat{\sigma}^4}{N} \right\} = -\log(2\pi)-\frac{1}{2}\log\left(\frac{2\hat{\sigma}^6}{N^2} \right). $

We compared the J-function computed from the above equations with the J-function from the fixed reference hypothesis (example 1); the two were in close agreement. For details, see Figure 2.5 and software/test_mv_ml.m.
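
For illustration, a minimal sketch of the ML-approximation side of that comparison (with arbitrary data parameters; it is not the distributed software/test_mv_ml.m) combines equation (2.23) with the expression for $\log p({\bf z}\vert H_0({\bf z}))$ above.

\begin{verbatim}
# Log J-function of the ML approximation for the Gaussian mean/variance
# example (illustrative sketch; arbitrary data, not software/test_mv_ml.m).
import numpy as np

rng = np.random.default_rng(2)
N = 64
x = rng.normal(1.0, 1.5, size=N)     # one data record

mu_hat = x.mean()                    # ML estimate of the mean
sigma2_hat = x.var(ddof=0)           # ML estimate of the variance

# log p(x | H0(z)), equation (2.23)
log_px_H0 = -N/2*np.log(2*np.pi*sigma2_hat) - N/2

# log p(z | H0(z)) = -log(2*pi) - (1/2) log( 2*sigma2_hat^3 / N^2 )
log_pz_H0 = -np.log(2*np.pi) - 0.5*np.log(2*sigma2_hat**3/N**2)

log_J = log_px_H0 - log_pz_H0        # log of the J-function (bracketed term)
print("log J-function (ML approximation):", log_J)
\end{verbatim}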

Figure 2.5: Comparison of J-function from exact solution (example 1) with ML approximation.
\includegraphics[width=4.2in,height=3.9in, clip]{test_mv_ml.eps}

Another example of the use of the ML method is provided in section 5.2.8.