The DAF integral.

The desired integral is

$\displaystyle K_T = \int_{{\bf x}_1} \int_{{\bf x}_2} \cdots \int_{{\bf x}_T} L({\bf Y}) \; {\rm d} {\bf x}_1 \; {\rm d} {\bf x}_2 \cdots {\rm d} {\bf x}_T.$ (15.2)

We first expand

$\displaystyle L({\bf Y}) = \sum_{{\bf q} \in {\cal \bf Q}} \; p({\bf q}) \; L({\bf Y}\vert {\bf q}),$

where ${\bf q}=\{i,j,k,l \ldots p,q,r\}$ is a particular length-$(T{-}1)$ Markov state sequence with a priori probability

$\displaystyle p({\bf q})=\pi_i \; A_{i,j} \; A_{j,k} \cdots \; A_{p,q} \; A_{q,r}.$

In what follows, the indexes $\{i,j,k,l \ldots p,q,r\}$ will always stand for the assumed states emitting ${\bf y}_2, {\bf y}_3, {\bf y}_4, {\bf y}_5 \ldots {\bf y}_{T-2}, {\bf y}_{T-1}, {\bf y}_T$, respectively. Using conditional independence,

$\displaystyle L({\bf Y}\vert {\bf q})=b_i({\bf y}_2)\; b_j({\bf y}_3) \cdots \; b_q({\bf y}_{T-1}) \; b_r({\bf y}_T).$ (15.3)

Thus, we have

$\displaystyle L({\bf Y}) = \sum_{i=1}^N \; \sum_{j=1}^N \cdots \sum_{r=1}^N \; \pi_i \; A_{i,j} \; A_{j,k} \cdots A_{q,r} \;\; b_i({\bf y}_2)\; b_j({\bf y}_3) \cdots b_q({\bf y}_{T-1}) \; b_r({\bf y}_T).$

For tractability, we assume the state observation PDFs $b_k({\bf y})$ are Gaussian. This assumption does not limit this discussion since an HMM with Gaussian mixture state PDFs can be represented as an HMM with Gaussian state PDFs by expanding the individual mixture kernels as separate Markov states. We assume a special form for the means and covariances of $b_k({\bf y})$:

$\displaystyle \mbox{\boldmath$\mu$}_k = \left[ \begin{array}{l} \mbox{\boldmath$\mu$}_k^a \\ \mbox{\boldmath$\mu$}_k^b \end{array}\right], \qquad {\bf\Sigma}_k = \left[ \begin{array}{ll} {\bf\Sigma}_k^{aa} & {\bf\Sigma}_k^{ab} \\ {\bf\Sigma}_k^{ba} & {\bf\Sigma}_k^{bb} \end{array}\right],$ (15.4)

where superscripts $a$ and $b$ refer to the partitions of ${\bf y}_t$ corresponding to ${\bf x}_{t-1}$ and ${\bf x}_t$, respectively (thus, $a$ and $b$ are in order of increasing time). Note that the marginal PDFs are easily found; for example, $b_k^b({\bf x})$ has mean $\mbox{\boldmath$\mu$}_k^b$ and covariance ${\bf\Sigma}_k^{bb}$.
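To make the partitioned form (15.4) concrete, the following minimal numpy sketch builds an illustrative $\mbox{\boldmath$\mu$}_k$ and ${\bf\Sigma}_k$ and reads off the $b$-marginal; the dimension $D$ and the random parameter values are assumptions for illustration, not values from the text.

\begin{verbatim}
import numpy as np

D = 2                             # per-frame feature dimension (illustrative)
rng = np.random.default_rng(0)

# y_t = [x_{t-1}; x_t] has dimension 2D; build the partitioned parameters
# of b_k as in (15.4).  Any SPD matrix serves as an example covariance.
mu_k = rng.normal(size=2 * D)                 # [mu_k^a; mu_k^b]
M = rng.normal(size=(2 * D, 2 * D))
Sigma_k = M @ M.T + np.eye(2 * D)

mu_a, mu_b = mu_k[:D], mu_k[D:]
S_aa, S_ab = Sigma_k[:D, :D], Sigma_k[:D, D:]
S_ba, S_bb = Sigma_k[D:, :D], Sigma_k[D:, D:]

# The marginal b_k^b(x) is Gaussian with mean mu_k^b, covariance Sigma_k^bb.
print(mu_b, S_bb, sep="\n")
\end{verbatim}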

The only term in (15.3) that depends on ${\bf x}_1$ is $b_i({\bf y}_2)$, which, integrated over ${\bf x}_1$, gives

$\displaystyle {\displaystyle \int_{{\bf x}_1}} b_i({\bf y}_2) \; {\rm d}{\bf x}_1 = b_i^b({\bf x}_2),$

so

$\displaystyle {\displaystyle \int_{{\bf x}_1}} \; L({\bf Y}\vert {\bf q}) = b_i^b({\bf x}_2) \; b_j({\bf y}_3) \; b_k({\bf y}_4) \cdots \; b_q({\bf y}_{T-1}) \; b_r({\bf y}_T).$ (15.5)

We now proceed to integrate (15.5) over ${\bf x}_2$. The only terms that depend on ${\bf x}_2$ are $b_i^b({\bf x}_2)$ and $b_j({\bf y}_3)$. We have

\begin{displaymath}\begin{array}{l}
{\displaystyle \int_{{\bf x}_1, {\bf x}_2}} L({\bf Y}\vert {\bf q}) = \left[ {\displaystyle \int_{{\bf x}_2}} b_i^b({\bf x}_2) \; b_j({\bf y}_3) \; {\rm d}{\bf x}_2 \right] \; b_k({\bf y}_4) \cdots \; b_q({\bf y}_{T-1}) \; b_r({\bf y}_{T}).
\end{array}\end{displaymath} (15.6)

Writing $b_j({\bf y}_3) = b_j({\bf x}_2 \vert {\bf x}_3) \; b_j^b({\bf x}_3)$ and using (15.4) with the standard identities for the conditional Gaussian distribution,

$\displaystyle b_j({\bf x}_2 \vert {\bf x}_3) = {\cal N}\left( {\bf x}_2-\mbox{\boldmath$\mu$}_c({\bf x}_3) , {\bf\Sigma}_c\right),$

where

$\displaystyle \mbox{\boldmath$\mu$}_c({\bf x}_3) = \mbox{\boldmath$\mu$}_j^a + {\bf\Sigma}_j^{ab} \; ({\bf\Sigma}_j^{bb})^{-1} \; \left( {\bf x}_3 - \mbox{\boldmath$\mu$}_j^b \right)$

and

$\displaystyle {\bf\Sigma}_c = {\bf\Sigma}_j^{aa} - {\bf\Sigma}_j^{ab} \; ({\bf\Sigma}_j^{bb})^{-1} \; {\bf\Sigma}_j^{ab^\prime}.$
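As a sketch of this conditioning step (the function name and argument layout are mine; the partition blocks follow (15.4)):

\begin{verbatim}
import numpy as np

def cond_params(mu_a, mu_b, S_aa, S_ab, S_bb, x3):
    """Mean and covariance of b_j(x_2 | x_3) for the partitioned Gaussian:
       mu_c(x_3) = mu_j^a + S^ab (S^bb)^{-1} (x_3 - mu_j^b)
       Sigma_c   = S^aa - S^ab (S^bb)^{-1} S^ab'
    """
    A = S_ab @ np.linalg.inv(S_bb)      # the matrix called A_j below
    mu_c = mu_a + A @ (x3 - mu_b)
    Sigma_c = S_aa - A @ S_ab.T
    return mu_c, Sigma_c
\end{verbatim}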

Then, using the standard identity for the product of two Gaussians,

$\displaystyle \begin{array}{l}
{\cal N}({\bf x}_2-\mbox{\boldmath$\mu$}_i^b,{\bf\Sigma}_i^{bb}) \; {\cal N}({\bf x}_2-\mbox{\boldmath$\mu$}_c({\bf x}_3),{\bf\Sigma}_c) = \\[1ex]
\qquad {\cal N}({\bf x}_2-\mbox{\boldmath$\mu$}^d,{\bf\Sigma}^d) \; {\cal N}\left(\mbox{\boldmath$\mu$}_i^b-\mbox{\boldmath$\mu$}_c({\bf x}_3),{\bf\Sigma}_i^{bb}+{\bf\Sigma}_c\right),
\end{array}$

where

$\displaystyle {\bf\Sigma}^d = \left( ({\bf\Sigma}_i^{bb})^{-1} + {\bf\Sigma}_c^{-1} \right)^{-1},$

$\displaystyle \mbox{\boldmath$\mu$}^d = {\bf\Sigma}^d \left( ({\bf\Sigma}_i^{bb})^{-1} \mbox{\boldmath$\mu$}_i^b + {\bf\Sigma}_c^{-1} \mbox{\boldmath$\mu$}_c({\bf x}_3) \right).$
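A sketch of this product identity, with a one-dimensional numerical spot-check (the helper names are mine; as in the text, ${\cal N}({\bf d},{\bf\Sigma})$ denotes the zero-mean Gaussian density evaluated at ${\bf d}$):

\begin{verbatim}
import numpy as np

def gauss(d, S):
    """N(d, S): zero-mean Gaussian density evaluated at d."""
    k = len(d)
    return np.exp(-0.5 * d @ np.linalg.solve(S, d)) / \
           np.sqrt((2 * np.pi) ** k * np.linalg.det(S))

def gauss_product(mu1, S1, mu2, S2):
    """N(x-mu1,S1) N(x-mu2,S2) = N(mu1-mu2, S1+S2) N(x-mu_d, S_d)."""
    S_d = np.linalg.inv(np.linalg.inv(S1) + np.linalg.inv(S2))
    mu_d = S_d @ (np.linalg.solve(S1, mu1) + np.linalg.solve(S2, mu2))
    return mu_d, S_d, gauss(mu1 - mu2, S1 + S2)

mu1, S1 = np.array([0.3]), np.array([[1.2]])
mu2, S2 = np.array([-0.5]), np.array([[0.7]])
x = np.array([0.9])
mu_d, S_d, scale = gauss_product(mu1, S1, mu2, S2)
assert np.isclose(gauss(x - mu1, S1) * gauss(x - mu2, S2),
                  scale * gauss(x - mu_d, S_d))
\end{verbatim}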

Since ${\cal N}({\bf x}_2-\mbox{\boldmath$\mu$}^d,{\bf\Sigma}^d)$ integrates to one, integrating over ${\bf x}_2$ leaves us with

$\displaystyle {\displaystyle \int_{{\bf x}_2}} b_i^b({\bf x}_2) \; b_j({\bf x}_2 \vert {\bf x}_3) \; {\rm d}{\bf x}_2 = {\cal N}\left(\mbox{\boldmath$\mu$}_i^b-\mbox{\boldmath$\mu$}_c({\bf x}_3),{\bf\Sigma}_i^{bb}+{\bf\Sigma}_c\right).$

We can convert this into a density of ${\bf x}_3$ using the fact that for any invertible matrix ${\bf A}$,

$\displaystyle {\cal N}({\bf x},{\bf\Sigma}) = { {\cal N}({\bf A}^{-1} {\bf x},\; {\bf A}^{-1} {\bf\Sigma}{\bf A}^{-1\prime} ) \over \vert{\rm det}({\bf A})\vert}.$ (15.7)
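Identity (15.7) is easy to verify numerically; a small sketch (the random ${\bf A}$ and ${\bf\Sigma}$ are illustrative):

\begin{verbatim}
import numpy as np

def gauss(d, S):
    k = len(d)
    return np.exp(-0.5 * d @ np.linalg.solve(S, d)) / \
           np.sqrt((2 * np.pi) ** k * np.linalg.det(S))

rng = np.random.default_rng(1)
k = 3
x = rng.normal(size=k)
M = rng.normal(size=(k, k))
Sigma = M @ M.T + np.eye(k)          # SPD covariance
A = rng.normal(size=(k, k))          # invertible with probability one

Ainv = np.linalg.inv(A)
lhs = gauss(x, Sigma)
rhs = gauss(Ainv @ x, Ainv @ Sigma @ Ainv.T) / abs(np.linalg.det(A))
assert np.isclose(lhs, rhs)          # identity (15.7)
\end{verbatim}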

Define ${\bf A}_j = {\bf\Sigma}_j^{ab} \; ({\bf\Sigma}_j^{bb})^{-1}$, so that $\mbox{\boldmath$\mu$}_c({\bf x}_3) = \mbox{\boldmath$\mu$}_j^a + {\bf A}_j \left( {\bf x}_3 - \mbox{\boldmath$\mu$}_j^b \right)$. Applying (15.7) with ${\bf A} = {\bf A}_j$, we have

$\displaystyle \begin{array}{l}
{\cal N}\left(\mbox{\boldmath$\mu$}_i^b-\mbox{\boldmath$\mu$}_c({\bf x}_3), {\bf\Sigma}_i^{bb}+{\bf\Sigma}_c\right) = \\[1ex]
\qquad { {\cal N}\left( {\bf x}_3 - \mbox{\boldmath$\mu$}_j^b - {\bf A}_j^{-1} \left( \mbox{\boldmath$\mu$}_i^b - \mbox{\boldmath$\mu$}_j^a \right),\; {\bf A}_j^{-1} ({\bf\Sigma}_i^{bb}+{\bf\Sigma}_j^{aa}) {\bf A}_j^{-1\prime} - {\bf\Sigma}_j^{bb} \right) \over \vert{\rm det}({\bf A}_j)\vert}.
\end{array}$

So we have

\begin{displaymath}\begin{array}{l}
{\displaystyle \int_{{\bf x}_1, {\bf x}_2}} L({\bf Y}\vert {\bf q}) = { {\cal N}\left({\bf x}_3 - \hat{\mbox{\boldmath$\mu$}},\hat{{\bf\Sigma}}\right) \over \vert{\rm det}({\bf A}_j)\vert } \; b_j^b({\bf x}_3) \; b_k({\bf y}_4) \cdots \; b_q({\bf y}_{T-1})\; b_r({\bf y}_{T}),
\end{array}\end{displaymath} (15.8)

where

\begin{displaymath}\begin{array}{l}
\hat{\mbox{\boldmath$\mu$}} = \mbox{\boldmath$\mu$}_j^b + {\bf A}_j^{-1} \left( \mbox{\boldmath$\mu$}_i^b - \mbox{\boldmath$\mu$}_j^a \right), \\[1ex]
\hat{{\bf\Sigma}} = {\bf A}_j^{-1} \, ({\bf\Sigma}_i^{bb} + {\bf\Sigma}_j^{aa}) \, {\bf A}_j^{-1\prime} - {\bf\Sigma}_j^{bb}.
\end{array}\end{displaymath} (15.9)

We now proceed to integrate over ${\bf x}_3$. We re-write the product ${\cal N}\left({\bf x}_3 - \hat{\mbox{\boldmath $\mu$}},\hat{{\bf\Sigma}}\right) \; b_j^b({\bf x}_3)$ as

$\displaystyle \begin{array}{l}
{\cal N}\left({\bf x}_3 - \hat{\mbox{\boldmath$\mu$}},\hat{{\bf\Sigma}}\right) \; {\cal N}\left({\bf x}_3 - \mbox{\boldmath$\mu$}_j^b, {\bf\Sigma}_j^{bb}\right) = \\[1ex]
\qquad {\cal N}\left(\hat{\mbox{\boldmath$\mu$}} - \mbox{\boldmath$\mu$}_j^b, \hat{{\bf\Sigma}} + {\bf\Sigma}_j^{bb}\right) \; {\cal N}\left({\bf x}_3 - \hat{\hat{\mbox{\boldmath$\mu$}}},\hat{\hat{{\bf\Sigma}}}\right),
\end{array}$

where

\begin{displaymath}\begin{array}{l}
\hat{\hat{{\bf\Sigma}}}=\left[ \hat{{\bf\Sigma}}^{-1} + ({\bf\Sigma}_j^{bb})^{-1} \right]^{-1}, \\[1ex]
\hat{\hat{\mbox{\boldmath$\mu$}}} = \hat{\hat{{\bf\Sigma}}} \left[ \hat{{\bf\Sigma}}^{-1} \, \hat{\mbox{\boldmath$\mu$}} + ({\bf\Sigma}_j^{bb})^{-1} \mbox{\boldmath$\mu$}_j^b \right].
\end{array}\end{displaymath} (15.10)
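The two updates (15.9) and (15.10), which reappear below as (15.13) and (15.14), can be packaged as functions; a minimal sketch (names mine, partition blocks as in (15.4)):

\begin{verbatim}
import numpy as np

def hat_update(mu, Sigma, mu_a, mu_b, S_aa, S_bb, A):
    """(15.9)/(15.13): mu_hat = mu^b + A^{-1}(mu - mu^a),
       S_hat = A^{-1}(Sigma + S^aa) A^{-1}' - S^bb, with A = S^ab (S^bb)^{-1}."""
    Ainv = np.linalg.inv(A)
    mu_hat = mu_b + Ainv @ (mu - mu_a)
    S_hat = Ainv @ (Sigma + S_aa) @ Ainv.T - S_bb
    return mu_hat, S_hat

def hathat_update(mu_hat, S_hat, mu_b, S_bb):
    """(15.10)/(15.14): fuse N(. - mu_hat, S_hat) with the marginal b^b."""
    S_hh = np.linalg.inv(np.linalg.inv(S_hat) + np.linalg.inv(S_bb))
    mu_hh = S_hh @ (np.linalg.solve(S_hat, mu_hat)
                    + np.linalg.solve(S_bb, mu_b))
    return mu_hh, S_hh
\end{verbatim}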

Collecting results and integrating over ${\bf x}_3$,

\begin{displaymath}\begin{array}{l}
{\displaystyle \int_{{\bf x}_1,{\bf x}_2,{\bf x}_3}} L({\bf Y}\vert {\bf q}) = \\[1ex]
\qquad \left\{ { {\cal N}\left(\hat{\mbox{\boldmath$\mu$}} - \mbox{\boldmath$\mu$}_j^b, \hat{{\bf\Sigma}} + {\bf\Sigma}_j^{bb}\right) \over \vert{\rm det}({\bf A}_j)\vert } \; {\displaystyle \int_{{\bf x}_3}} {\cal N}\left({\bf x}_3 - \hat{\hat{\mbox{\boldmath$\mu$}}},\hat{\hat{{\bf\Sigma}}}\right) \, b_k({\bf x}_3 \vert {\bf x}_4) \; {\rm d}{\bf x}_3 \right\} \; b_k^b({\bf x}_4)
\; b_l({\bf y}_5) \cdots.
\end{array}\end{displaymath} (15.11)

Define

\begin{displaymath}
\begin{array}{l}
Q({\bf x}_{t+1} \ldots {\bf x}_T; \{j,k,l \ldots p,q,r\}, \mbox{\boldmath$\mu$}, {\bf\Sigma}) \stackrel{\mbox{\tiny$\Delta$}}{=} \\[1ex]
\qquad \left[ {\displaystyle \int_{{\bf x}_t}} {\cal N}({\bf x}_t - \mbox{\boldmath$\mu$}, {\bf\Sigma}) \; b_j({\bf y}_{t+1}) \; {\rm d}{\bf x}_t \right] \; b_k({\bf y}_{t+2}) \; b_l({\bf y}_{t+3}) \cdots b_q({\bf y}_{T-1}) \; b_r({\bf y}_{T}),
\end{array}\end{displaymath}

then we may re-write (15.6) and (15.11) as

$\displaystyle {\displaystyle \int_{{\bf x}_1, {\bf x}_2}} \; L({\bf Y}\vert {\bf q}) = Q({\bf x}_{3} \ldots {\bf x}_T; \{j,k \ldots p,q,r\}, \mbox{\boldmath$\mu$}_i^b, {\bf\Sigma}_i^{bb})$

and

\begin{displaymath}\begin{array}{l}
{\displaystyle \int_{{\bf x}_1, {\bf x}_2, {\bf x}_3}} L({\bf Y}\vert {\bf q}) = { {\cal N}\left(\hat{\mbox{\boldmath$\mu$}} - \mbox{\boldmath$\mu$}_j^b, \hat{{\bf\Sigma}} + {\bf\Sigma}_j^{bb}\right) \over \vert{\rm det}({\bf A}_j)\vert } \; Q({\bf x}_{4} \ldots {\bf x}_T; \{k,l \ldots p,q,r\}, \hat{\hat{\mbox{\boldmath$\mu$}}}, \hat{\hat{{\bf\Sigma}}}).
\end{array}\end{displaymath} (15.12)

Comparing the above equations, we can see a recursion. Because we have previously tied the indexes $\{i,j,k,l \ldots p,q,r\}$ to fixed observation times, to write the recursion in general form we define the free indexes $m,n$ representing the assumed Markov states emitting ${\bf y}_{t+1}$ and ${\bf y}_{t+2}$, respectively, for arbitrary $t$. The recursion is

\begin{displaymath}
\begin{array}{l}
{\displaystyle \int_{{\bf x}_{t+1}}} \; Q({\bf x}_{t+1} \ldots {\bf x}_T; \{m,n \ldots p,q,r\}, \mbox{\boldmath$\mu$}, {\bf\Sigma}) = \\[1ex]
\qquad { {\cal N}\left(\hat{\mbox{\boldmath$\mu$}} - \mbox{\boldmath$\mu$}_m^b, \hat{{\bf\Sigma}} + {\bf\Sigma}_m^{bb}\right) \over \vert{\rm det}({\bf A}_m)\vert } \; Q({\bf x}_{t+2} \ldots {\bf x}_T; \{n \ldots p,q,r\}, \hat{\hat{\mbox{\boldmath$\mu$}}}, \hat{\hat{{\bf\Sigma}}}),
\end{array}\end{displaymath}

where

\begin{displaymath}\begin{array}{l}
\hat{\mbox{\boldmath$\mu$}} = \mbox{\boldmath$\mu$}_m^b + {\bf A}_m^{-1} \left( \mbox{\boldmath$\mu$} - \mbox{\boldmath$\mu$}_m^a \right), \\[1ex]
\hat{{\bf\Sigma}} = {\bf A}_m^{-1} \, ({\bf\Sigma} + {\bf\Sigma}_m^{aa}) \, {\bf A}_m^{-1\prime} - {\bf\Sigma}_m^{bb}
\end{array}\end{displaymath} (15.13)

and

\begin{displaymath}\begin{array}{l}
\hat{\hat{{\bf\Sigma}}}=\left[ \hat{{\bf\Sigma}}^{-1} + ({\bf\Sigma}_m^{bb})^{-1} \right]^{-1}, \\[1ex]
\hat{\hat{\mbox{\boldmath$\mu$}}} = \hat{\hat{{\bf\Sigma}}} \left[ \hat{{\bf\Sigma}}^{-1} \, \hat{\mbox{\boldmath$\mu$}} + ({\bf\Sigma}_m^{bb})^{-1} \mbox{\boldmath$\mu$}_m^b \right].
\end{array}\end{displaymath} (15.14)

The recursion starts by integrating (15.12) over ${\bf x}_4$ and terminates, once the state list is exhausted, with $Q(\;\cdot\; ; \{\,\}, \mbox{\boldmath$\mu$}, {\bf\Sigma}) \stackrel{\mbox{\tiny$\Delta$}}{=} 1.$ It can be seen that the full integral
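Assembled into code, the recursion for a single sequence ${\bf q}$ might look like the sketch below. The states structure (one dictionary of partition blocks per state) and all names are mine; the block is self-contained, redefining the small gauss helper from the earlier sketches.

\begin{verbatim}
import numpy as np

def gauss(d, S):
    k = len(d)
    return np.exp(-0.5 * d @ np.linalg.solve(S, d)) / \
           np.sqrt((2 * np.pi) ** k * np.linalg.det(S))

def K_of_q(q, states):
    """K(q) per (15.15).  q = (i, j, ..., r) indexes the states emitting
    y_2 ... y_T; states[m] holds mu_a, mu_b, S_aa, S_ab, S_bb of b_m."""
    init = states[q[0]]
    mu, Sigma = init["mu_b"], init["S_bb"]        # start from b_i^b
    K = 1.0
    for m in q[1:]:
        s = states[m]
        A = s["S_ab"] @ np.linalg.inv(s["S_bb"])
        Ainv = np.linalg.inv(A)
        mu_hat = s["mu_b"] + Ainv @ (mu - s["mu_a"])           # (15.13)
        S_hat = Ainv @ (Sigma + s["S_aa"]) @ Ainv.T - s["S_bb"]
        K *= gauss(mu_hat - s["mu_b"], S_hat + s["S_bb"]) \
             / abs(np.linalg.det(A))                           # factor of (15.15)
        # (15.14): carry the double-hat parameters into the next step
        S_hh = np.linalg.inv(np.linalg.inv(S_hat)
                             + np.linalg.inv(s["S_bb"]))
        mu = S_hh @ (np.linalg.solve(S_hat, mu_hat)
                     + np.linalg.solve(s["S_bb"], s["mu_b"]))
        Sigma = S_hh
    return K
\end{verbatim}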

$\displaystyle K({\bf q}) = {\displaystyle \int_{{\bf x}_1}} \; {\displaystyle \int_{{\bf x}_2}} \; \cdots \; {\displaystyle \int_{{\bf x}_T}} \; L({\bf Y}\vert {\bf q})$

is obtained by the product

$\displaystyle K({\bf q}) = \prod_{m = j,k, \ldots p,q,r} \; { {\cal N}\left(\hat{\mbox{\boldmath$\mu$}} - \mbox{\boldmath$\mu$}_m^b, \hat{{\bf\Sigma}}+{\bf\Sigma}_m^{bb}\right) \over \vert{\rm det}({\bf A}_m)\vert},$ (15.15)

where in each factor $\hat{\mbox{\boldmath$\mu$}}$ and $\hat{{\bf\Sigma}}$ are computed from the running values of $\mbox{\boldmath$\mu$}$ and ${\bf\Sigma}$ by (15.13), which are in turn advanced to the next factor through (15.14).

Finally, the desired integral (15.2) is given by

$\displaystyle K_T = \sum_{{\bf q} \in {\cal \bf Q}} \; p({\bf q}) \; K({\bf q}).$ (15.16)
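A brute-force sketch of (15.16) for small models (it enumerates all of ${\cal \bf Q}$, so it is only feasible for small $N$ and $T$; K_of_q is the sketch above, and pi, A are the HMM initial and transition probability arrays, names mine):

\begin{verbatim}
import numpy as np
from itertools import product

def K_T(T, pi, A, states):
    """(15.16): K_T = sum over q in Q of p(q) K(q), with q the states
    emitting y_2 ... y_T and p(q) = pi_i A_ij ... A_qr."""
    N = len(pi)
    total = 0.0
    for q in product(range(N), repeat=T - 1):
        p_q = pi[q[0]]
        for a, b in zip(q[:-1], q[1:]):
            p_q *= A[a, b]
        total += p_q * K_of_q(q, states)
    return total
\end{verbatim}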

Since the number of elements in ${\cal \bf Q}$ grows exponentially with $T$ (as $N^{T-1}$), so does the computation. However, the terms in (15.15) converge to a limiting distribution: the ratio

$\displaystyle {Q({\bf x}_{t+1} \ldots {\bf x}_T; \{j,k \ldots n\}, \hat{\hat{\mbox{\boldmath$\mu$}}}, \hat{\hat{{\bf\Sigma}}}) \over Q({\bf x}_{t+1} \ldots {\bf x}_T; \{i,j,k,l \ldots p,q,r\}, \mbox{\boldmath$\mu$}, {\bf\Sigma}) }$

quickly converges to a constant $C$. This convergence is related to the property of limiting distributions for Markov chains [79] and is fortunate because $L({\bf Y})$ need only be calculated for a few values of $T$; the constant $C$ can then be stored and used to extrapolate.

We tested the expression for $K_T$ by comparing it with the numerically integrated PDF. We created samples of ${\bf X}$ by selecting the first $D$ MFCC coefficients extracted from some arbitrary samples of speech data and trained an HMM on the corresponding samples of ${\bf Y}$. With the HMM parameters held fixed, we evaluated $L_y({\bf D}({\bf X}))$ using the forward procedure on a fine grid spanning the $DT$-dimensional space of ${\bf X}$. In theory, the integral equals 1.0 for $T=2$ since in this case ${\bf X}$ and ${\bf Y}$ are equivalent. For $D=1$, we were able to carry out the numerical integration up to $T=5$; for $D=2$, only up to $T=3$. Table 15.1 compares $K_T$ from equation (15.16) with the numerical integration as a function of $T$; note the close agreement. The accuracy was limited by the grid spacing used in the numerical integration, which had to remain coarse to keep the computation time manageable. The ratio $K_T/K_{T-1}$ converges quite rapidly, so the values $K_T$ can be extrapolated to much higher $T$ with no additional calculations.
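For instance, once the ratio has converged, later values follow from a geometric extrapolation; a sketch using the last $D=1$ row of Table 15.1 as the anchor:

\begin{verbatim}
# Geometric extrapolation of K_T once K_T / K_{T-1} has converged.
# Anchor values taken from the D=1, T=13 row of Table 15.1.
K13, C = 0.000314, 0.492529

def K_extrapolated(T, T0=13, K0=K13):
    return K0 * C ** (T - T0)

print(K_extrapolated(20))    # estimate of K_20, no further integration
\end{verbatim}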

Table 15.1: Comparison of the numerically integrated likelihood function with equation (15.16) over feature dimension $D$ and length $T$. The number of Markov states was $N=2$. Dashes indicate cases where the numerical integration was not feasible.

\begin{tabular}{ccccc}
$D$ & $T$ & Numerical result & $K_T$ & $K_T/K_{T-1}$ \\
\hline
1 & 2 & 0.999999 & 1.000000 & 1 \\
1 & 3 & 0.412523 & 0.412307 & 0.412307 \\
1 & 4 & 0.191555 & 0.191275 & 0.463914 \\
1 & 5 & 0.092301 & 0.092048 & 0.481233 \\
1 & 6 & -- & 0.044915 & 0.487951 \\
1 & 7 & -- & 0.022039 & 0.490682 \\
1 & 8 & -- & 0.010839 & 0.491809 \\
1 & 9 & -- & 0.005335 & 0.492204 \\
1 & 10 & -- & 0.002628 & 0.492442 \\
1 & 11 & -- & 0.001294 & 0.492506 \\
1 & 12 & -- & 0.000637 & 0.492526 \\
1 & 13 & -- & 0.000314 & 0.492529 \\
2 & 2 & 0.99999 & 1.000000 & 1 \\
2 & 3 & 0.06426 & 0.063756 & 0.063756 \\
\end{tabular}