Chain Rule
Most useful signal processing occurs in several stages.
Up to now, we have considered only the transformation as a whole
rather than its individual stages. Assuming that the transformation
can be broken into parts, for example
$\mathbf{x} \to \mathbf{y} \to \mathbf{w} \to \mathbf{z}$,
equation (2.2) takes on the chain-rule form:

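$$G(\mathbf{x}; H_0, T, g) = \frac{p(\mathbf{x}|H_0(\mathbf{x}))}{p(\mathbf{y}|H_0(\mathbf{x}))} \, \frac{p(\mathbf{y}|H_0'(\mathbf{y}))}{p(\mathbf{w}|H_0'(\mathbf{y}))} \, \frac{p(\mathbf{w}|H_0''(\mathbf{w}))}{p(\mathbf{z}|H_0''(\mathbf{w}))} \, g(\mathbf{z}), \tag{2.9}$$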
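where $H_0(\mathbf{x})$, $H_0'(\mathbf{y})$, and $H_0''(\mathbf{w})$
are the reference hypotheses
used at each stage. The J-function of the overall transformation
from $\mathbf{x}$ to $\mathbf{z}$ is the product of the three
stage-wise J-functions.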
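Because the stage-wise J-functions multiply, their logarithms add. As a small worked restatement, write the chain as $\mathbf{x} \to \mathbf{y} \to \mathbf{w} \to \mathbf{z}$ with output PDF $g(\mathbf{z})$, and let $J_1$, $J_2$, $J_3$ be shorthand for the three stage-wise J-functions (these labels, and the name $G$ for the resulting PDF of $\mathbf{x}$, are introduced here only for illustration). Then:

```latex
\log G(\mathbf{x}) = \log J_1(\mathbf{x}) + \log J_2(\mathbf{y}) + \log J_3(\mathbf{w}) + \log g(\mathbf{z}),
\qquad
\log J_1(\mathbf{x}) = \log p(\mathbf{x}|H_0(\mathbf{x})) - \log p(\mathbf{y}|H_0(\mathbf{x})),
```

with $\log J_2$ and $\log J_3$ defined in the same way from the reference hypotheses of the second and third stages. Each term can be evaluated separately, stage by stage.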
To understand the importance of the chain rule,
consider how we would cope without it. We would need to
solve for
$p(\mathbf{z}|H_0)$, the PDF of $\mathbf{z}$ under the assumption that
$\mathbf{x}$ had PDF $p(\mathbf{x}|H_0)$. But at each stage, the distribution
of the output feature becomes more and more intractable.
Thus, at the end of a long signal processing
chain, we would be unable to derive
$p(\mathbf{z}|H_0)$.
Estimating
$p(\mathbf{z}|H_0)$ from samples is futile, since it
is generally evaluated in the far tails of the distribution,
where it can realistically be represented only in log form
and by a closed-form expression.
On the other hand, using the chain rule, we can "restart"
the process by assuming a suitable canonical form for
the reference hypothesis at the start of each stage.
As we mentioned, and will shortly see, for MaxEnt this canonical form is
determined by the choice of feature transformation.
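Incidentally, (2.9) can be used for the exact calculation of
$p(\mathbf{z}|H_0(\mathbf{x}))$.
Let $\mathbf{z} = T(\mathbf{x})$
be the combined transformation of the chain.
Then, by (2.2), the projected PDF
above is also equal to
$\frac{p(\mathbf{x}|H_0(\mathbf{x}))}{p(\mathbf{z}|H_0(\mathbf{x}))} \, g(\mathbf{z})$.
Equating this with (2.9) and cancelling the common factors
$p(\mathbf{x}|H_0(\mathbf{x}))$ and $g(\mathbf{z})$,
we have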

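$$p(\mathbf{z}|H_0(\mathbf{x})) = \frac{p(\mathbf{y}|H_0(\mathbf{x}))}{p(\mathbf{y}|H_0'(\mathbf{y}))} \, \frac{p(\mathbf{w}|H_0'(\mathbf{y}))}{p(\mathbf{w}|H_0''(\mathbf{w}))} \, p(\mathbf{z}|H_0''(\mathbf{w})). \tag{2.10}$$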
This is an exact relationship.
Its implementation only requires that we can solve for the PDF of the
feature at the output of each stage under the reference hypothesis proposed
at the input of the stage.
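The stage-by-stage calculation can be checked numerically on a toy chain. The following is a minimal sketch, not the author's implementation: the three stages (pairwise energies, block sums, total sum), the reference hypotheses (unit-variance Gaussian input, unit exponential, and a Gamma with an arbitrary scale `theta`), and all function names are assumptions chosen so that every PDF is known in closed form. The example also assumes that every stage is built from sums, so that the ratios of reference PDFs depend on the data only through the final feature. Everything is kept in log form, as the text recommends.

```python
import math
import random


def log_gamma_pdf(v, shape, scale):
    """Log PDF of a Gamma(shape, scale) random variable evaluated at v > 0."""
    return ((shape - 1.0) * math.log(v) - v / scale
            - math.lgamma(shape) - shape * math.log(scale))


def log_iid_gamma(values, shape, scale):
    """Joint log PDF of independent Gamma(shape, scale) variables."""
    return sum(log_gamma_pdf(v, shape, scale) for v in values)


def chain_log_pdf_z(x, block=2, theta=3.0):
    """Return (z, log PDF of z under the input reference hypothesis),
    accumulated stage by stage as a sum of log ratios.
    Requires len(x) to be a multiple of 2 * block."""
    # Stage 1: energies of consecutive pairs of input samples.
    y = [x[i] ** 2 + x[i + 1] ** 2 for i in range(0, len(x), 2)]
    # Stage 2: sums over blocks of `block` energies.
    w = [sum(y[i:i + block]) for i in range(0, len(y), block)]
    # Stage 3: total sum.
    z = sum(w)

    # Input reference (assumed): x i.i.d. N(0, 1), so each y_j is
    # chi-square with 2 degrees of freedom, i.e. Gamma(1, scale=2).
    log_p_y_ref_x = log_iid_gamma(y, 1.0, 2.0)
    # Stage-2 reference (assumed): y i.i.d. unit exponential.
    log_p_y_ref_y = log_iid_gamma(y, 1.0, 1.0)
    # Under that reference each block sum w_l is Gamma(block, scale=1).
    log_p_w_ref_y = log_iid_gamma(w, float(block), 1.0)
    # Stage-3 reference (assumed): w i.i.d. Gamma(block, scale=theta).
    log_p_w_ref_w = log_iid_gamma(w, float(block), theta)
    # Under that reference z is Gamma(block * len(w), scale=theta).
    log_p_z_ref_w = log_gamma_pdf(z, float(block * len(w)), theta)

    log_p_z = ((log_p_y_ref_x - log_p_y_ref_y)
               + (log_p_w_ref_y - log_p_w_ref_w)
               + log_p_z_ref_w)
    return z, log_p_z


def direct_log_pdf_z(z, n):
    """Known answer for this toy chain: the sum of n squared N(0, 1)
    samples is chi-square with n degrees of freedom, i.e. Gamma(n/2, 2)."""
    return log_gamma_pdf(z, n / 2.0, 2.0)


random.seed(0)
N = 16
x = [random.gauss(0.0, 1.0) for _ in range(N)]
z, log_chain = chain_log_pdf_z(x)
log_direct = direct_log_pdf_z(z, N)
print(f"z = {z:.4f}  chain = {log_chain:.6f}  direct = {log_direct:.6f}")

# The same code works far in the tail, where a histogram estimate would fail.
x_tail = [10.0 * v for v in x]
z_tail, log_chain_tail = chain_log_pdf_z(x_tail)
log_direct_tail = direct_log_pdf_z(z_tail, N)
print(f"tail: chain = {log_chain_tail:.6f}  direct = {log_direct_tail:.6f}")
```

In this toy case the direct answer is available only because the combined feature happens to be chi-square distributed; the stage-wise route needs only the PDF of each stage's output under the reference hypothesis proposed at that stage's input. The value of `theta` cancels out of the result, which illustrates that the stage-3 reference can be chosen for convenience.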
Baggenstoss
20170519