Interpretation of the J-function
The J-function is a measure of the ability of the
features to describe the input data.
Mathematically, the J-function is equal to
the manifold density (2.5) [3].
The manifold density comes into play when we generate
samples from the projected PDF. For maximum entropy
PDF projection (see Chapter 3), the manifold density is
the uniform density (see Section 3.3).
The manifold is the set of input data values
that map to a given feature value. So, if the features
are very descriptive and accurately capture
the peculiarities of the given data sample,
the set of possible input data values shrinks,
increasing the manifold density.
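As a toy illustration of this idea (the discrete input space and the two feature maps below are my own constructions, not from the text), consider a finite grid of inputs. The discrete analogue of the manifold is the preimage of a feature value, and a more descriptive feature yields a smaller preimage, hence a larger value of 1/|M(z)|:

```python
import itertools

# Toy discrete input space: x in {0,1,2,3}^3 (an assumption for
# illustration, not the continuous setting of the text).
space = list(itertools.product(range(4), repeat=3))

def manifold_size(feature, x):
    """Number of inputs mapping to the same feature value as x
    (discrete analogue of the manifold volume)."""
    z = feature(x)
    return sum(1 for y in space if feature(y) == z)

coarse = lambda x: sum(x)            # coarse feature: many collisions
fine   = lambda x: tuple(sorted(x))  # finer feature: fewer collisions

x = (1, 2, 3)
print(manifold_size(coarse, x))  # larger preimage
print(manifold_size(fine, x))    # smaller preimage -> larger 1/|M(z)|
```

The finer feature pins down the input more precisely, so its preimage is smaller, mirroring the claim that descriptive features shrink the manifold and increase the manifold density.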
Another interpretation, based on asymptotic
maximum likelihood (ML) theory, starts by
assuming that there exists some parametric model
$p(\mathbf{x};\boldsymbol{\theta})$ such that the features are the
maximum likelihood estimates of the parameters,
$\mathbf{z}=\hat{\boldsymbol{\theta}}(\mathbf{x})$.
The J-function for ML, given in (2.27), is dominated by
the numerator, which is the likelihood function of the data
evaluated at $\hat{\boldsymbol{\theta}}$.
Thus, the J-function can be interpreted as
a quantitative measure of how well the parametric
model describes the raw data. The better the features, the better this
notional parametric model. Interestingly, because the J-function
can be computed without
actually implementing the ML estimator, this information
is available without needing to know the parametric form
or to maximize the likelihood!
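As a toy illustration of this point (the Gaussian model and the particular features are my assumptions, not from the text): if the features happen to be the sample mean and the biased sample variance of i.i.d. Gaussian data, they coincide exactly with the ML estimates of (mu, sigma^2), so the likelihood at the ML point is available in closed form and no optimizer is ever run:

```python
import math

# Hedged sketch: for i.i.d. Gaussian data, the sample mean and the
# biased sample variance ARE the ML estimates of (mu, sigma^2), so the
# log-likelihood at the ML point reduces to a closed-form expression.
def log_likelihood_at_ml(x):
    n = len(x)
    mu_hat = sum(x) / n                              # ML estimate of mu
    var_hat = sum((v - mu_hat) ** 2 for v in x) / n  # ML estimate of sigma^2
    # log p(x; mu_hat, var_hat) = -n/2 * [log(2*pi*var_hat) + 1]
    return -0.5 * n * (math.log(2 * math.pi * var_hat) + 1)

data = [1.0, 2.0, 2.5, 3.0, 4.5]
print(log_likelihood_at_ml(data))
```

The point mirrors the text: the dominant (numerator) term of the J-function is obtained directly from the feature values, without ever searching the parameter space.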
Naturally, there are situations where this information
is detrimental to classification, specifically when the data contains
nuisance information or interference.
There are workarounds that significantly
improve classification performance, for example the
class-specific feature mixture ([20], Section II.B).
There is also a geometric interpretation that lends insight
into the general approach. For the uniform reference
hypothesis ($p(\mathbf{x}|H_0) = 1$),
(2.2) reduces to $J(\mathbf{x}) = 1/p(\mathbf{z}|H_0)$,
but $p(\mathbf{z}|H_0)$ is just the volume of the manifold (2.4),
$|\mathcal{M}(\mathbf{z})|$. Therefore,
$J(\mathbf{x}) = 1/|\mathcal{M}(\mathbf{z})|$.
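A quick Monte Carlo sanity check of this relation (the feature map z = x1 + x2 on the unit square is my own toy choice): under a uniform reference on [0,1]^2, the density of z is triangular, p(z) = z for z <= 1, so 1/p(z|H0) should be near 2 at z = 0.5.

```python
import random

# Sketch: estimate p(z|H0) under a uniform reference on [0,1]^2 for the
# toy feature z = x1 + x2, by counting uniform samples whose feature
# lands in a narrow bin around z; then J ~ 1/p(z|H0).
random.seed(0)

def j_function_estimate(z, n=200_000, half_width=0.01):
    hits = sum(1 for _ in range(n)
               if abs(random.random() + random.random() - z) < half_width)
    p_z = hits / (n * 2 * half_width)  # density estimate at z
    return 1.0 / p_z                   # J = 1 / p(z|H0)

# Exact density of a sum of two uniforms is triangular (p(z) = z for
# z <= 1), so the estimate should be near 1/0.5 = 2 at z = 0.5.
print(j_function_estimate(0.5))
```

Smaller manifolds (feature values whose back-projected region is small) give larger J, consistent with the geometric picture above.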
This touches on the idea of information maximization [21].
In a general sense, maximizing the information throughput
of a transformation comes down to finding feature transformations with small manifolds
for the given feature values, something that one could call the back-projected volume.
This interpretation is implicit in the statement by Nadal:
“A large amount of information is obtained if one
can discriminate finely the input signal” [22].
From the point of view of unsupervised learning,
when data from multiple classes is present, it makes sense that
minimizing the average back-projected volume better separates the data
in the output distribution. This is because a given feature value maps back
to a smaller region in the input space, which is less likely to contain data of
more than one class.
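This last point can be sketched on a discrete toy problem (the grid, class labels, and feature maps below are my own illustrative assumptions): the back-projected region of a coarse feature tends to contain data from both classes, while a feature aligned with the class structure maps back to a single-class region.

```python
# Toy illustration: inputs on a 6x6 grid, class determined by the first
# coordinate.  We compare which classes appear in the preimage
# ("back-projected region") of one feature value under two feature maps.
space = [(a, b) for a in range(6) for b in range(6)]
label = lambda x: 0 if x[0] < 3 else 1   # class depends on first coordinate

coarse = lambda x: (x[0] + x[1]) % 3     # mixes the two class regions
fine   = lambda x: x[0]                  # preserves the class coordinate

def classes_in_preimage(feature, z):
    """Set of class labels occurring in the preimage of feature value z."""
    return {label(x) for x in space if feature(x) == z}

print(classes_in_preimage(coarse, 0))  # both classes in one preimage
print(classes_in_preimage(fine, 0))    # a single class
```

The coarse feature's preimage straddles the class boundary, whereas the fine feature's small back-projected region contains only one class, matching the separation argument above.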