Chain Rule

Most useful signal processing occurs in several stages. Up to now, we have considered the transformation only as a whole, rather than as individual stages. Assuming that the transformation ${\bf z}=T({\bf x})$ can be broken into parts, for example,

$\displaystyle {\bf y}=T_y({\bf x}), \; \; {\bf w}=T_w({\bf y}), \; \; {\bf z}=T_z({\bf w}),$

equation (2.2) takes on the chain-rule form:

$\displaystyle G({\bf x}) =
\left[ \frac{ p({\bf x}\vert H_{0x})} {p({\bf y}\vert H_{0x})} \right] \;
\left[ \frac{ p({\bf y}\vert H_{0y})} {p({\bf w}\vert H_{0y})} \right] \;
\left[ \frac{ p({\bf w}\vert H_{0w})} {p({\bf z}\vert H_{0w})} \right] \; g({\bf z}),$ (2.9)

where $H_{0x}, H_{0y}, H_{0w}$ are reference hypotheses used at each stage. The J-function of $T({\bf x})$ is the product of the three stage-wise J-functions. To understand the importance of the chain rule, consider how we would cope without it. We would need to solve for $p({\bf z}\vert H_0)$, the PDF of ${\bf z}$ under the assumption that ${\bf x}$ is distributed according to the reference hypothesis $H_0$. But at each stage, the distribution of the output feature becomes more and more intractable, so at the end of a long signal processing chain we would be unable to derive $p({\bf z}\vert H_0)$. Estimating $p({\bf z}\vert H_0)$ from data is equally futile: $p({\bf z}\vert H_0)$ is generally evaluated in the far tails of the distribution, where it can realistically be represented only in log form, by a closed-form expression.
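The need for a log-domain, closed-form representation is easy to illustrate numerically. The following sketch (the Gaussian reference and the tail point are invented for illustration) evaluates a closed-form log-PDF far out in the tail, where the linear-domain value underflows double precision entirely:

```python
import math

# Closed-form log-density of a normal PDF, the kind of expression
# each stage's reference hypothesis would supply.
def log_normal_pdf(x, mean=0.0, var=1.0):
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

# A point far out in the tail, as typically arises at the end of a chain.
x = 40.0

# Linear-domain evaluation underflows to exactly 0.0 in double precision
# (the true value is about exp(-800.9), far below the smallest double).
print(math.exp(log_normal_pdf(x)))   # 0.0

# The log-domain value is perfectly representable, so products of
# J-function ratios are accumulated as sums of log-PDFs.
print(log_normal_pdf(x))             # about -800.92
```

This is why each factor in (2.9) is carried as a difference of log-PDFs rather than a ratio of linear-domain densities.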

On the other hand, using the chain rule, we can “re-start” the process by assuming a suitable canonical form for $H_0$ at the start of each stage. As we mentioned, and will shortly see, for MaxEnt, this canonical $H_0$ will be decided by the choice of feature transformation.

Incidentally, (2.9) can be used for the calculation of $p({\bf z}\vert H_{0x})$, which is otherwise intractable. Let ${\bf z}=T({\bf x})$ be the combined transformation of the chain. Then, $G({\bf x})$ above is also equal to $G({\bf x}; H_{0x},T,g)$. Using (2.9) and (2.2), we have

$\displaystyle p({\bf z}\vert H_{0x})=
\frac{ p({\bf x}\vert H_{0x})} {J({\bf x};H_{0x},T_y)\; J({\bf y};H_{0y},T_w)\; J({\bf w};H_{0w},T_z)}
= \left[ \frac{ p({\bf y}\vert H_{0x})} {p({\bf y}\vert H_{0y})} \right]
\left[ \frac{ p({\bf w}\vert H_{0y})} {p({\bf w}\vert H_{0w})} \right]
p({\bf z}\vert H_{0w}).$ (2.10)

The implementation of (2.10) requires that we can solve for the PDF of the feature at the output of each stage under the reference hypothesis proposed at the input of the stage.
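To make this concrete, here is a sketch of (2.10) on a toy three-stage chain (the chain and all reference distributions are invented for illustration): ${\bf x}\in\mathbb{R}^2$ with $H_{0x}$ standard normal; $y = x_1 + x_2$ with reference $H_{0y} = N(0,1)$; $w = y^2$ with reference $H_{0w} = \mathrm{Exp}(1)$; $z = \sqrt{w}$. Each stage's output PDF under the reference proposed at that stage's input is available in closed form, and for this simple chain the result can be checked against the exact density of $z = \vert x_1+x_2\vert$, a folded normal with variance 2:

```python
import math

def log_normal(v, var):
    # Closed-form log-density of N(0, var) at v.
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * v * v / var

def p_z_given_H0x(x1, x2):
    """Evaluate p(z|H0x) by the stage-wise product in (2.10)."""
    y = x1 + x2        # stage 1: under H0x = N(0, I), p(y|H0x) = N(0, 2)
    w = y * y          # stage 2: under H0y = N(0, 1), p(w|H0y) = chi-squared(1)
    z = math.sqrt(w)   # stage 3: under H0w = Exp(1),  p(z|H0w) = 2 z exp(-z^2)
    log_p = log_normal(y, 2.0)                                  # p(y|H0x)
    log_p += (-0.5 * math.log(2.0 * math.pi * w) - 0.5 * w) \
             - log_normal(y, 1.0)                               # p(w|H0y) / p(y|H0y)
    log_p += (math.log(2.0 * z) - z * z) - (-w)                 # p(z|H0w) / p(w|H0w)
    return math.exp(log_p)

def folded(z):
    # Exact density of z = |x1 + x2| under H0x: folded normal, variance 2.
    return 2.0 / math.sqrt(4.0 * math.pi) * math.exp(-z * z / 4.0)

for x1, x2 in [(0.3, 0.5), (-1.2, 0.4), (2.0, 1.5)]:
    assert abs(p_z_given_H0x(x1, x2) - folded(abs(x1 + x2))) < 1e-12
```

Note that only the three stage-wise output PDFs are needed, exactly as the text states; the intractable direct derivation of $p({\bf z}\vert H_{0x})$ is never performed.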