Determining the number of modes.

As we have stated, training can start with a large number of modes or with just one. If the number of modes is too high, modes are pruned as their mixing weights $\alpha_i$ fall. If the number of modes is too low, modes are split by software/gmix_kurt.m. Once the number of modes settles and the likelihood stops increasing, convergence is declared.
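To make this outer loop concrete, the following is a minimal sketch of EM training with weight-based pruning for a diagonal-covariance Gaussian mixture. The function name `em_with_pruning`, the threshold `min_eff`, and the fixed variance floor of 1e-6 (standing in for the conditioning of section 13.2.4) are illustrative assumptions, not the toolbox implementation; mode splitting, which the toolbox performs with software/gmix_kurt.m, is only marked by a comment.

```matlab
function [w, mu, v] = em_with_pruning(x, M, min_eff, maxit)
  % x       : N-by-P data matrix
  % M       : starting number of modes (at most about N/(4P))
  % min_eff : minimum effective sample count alpha_i*N before a mode is pruned
  % maxit   : iteration cap
  [N, P] = size(x);
  mu = x(randperm(N, M), :);          % seed means on random data points
  v  = repmat(var(x), M, 1);          % diagonal variances, one row per mode
  w  = ones(M, 1) / M;                % mixing weights alpha_i
  L_old = -Inf;
  for it = 1:maxit
    % E-step: log responsibilities under diagonal Gaussians
    logp = zeros(N, M);
    for i = 1:M
      q = sum((x - mu(i,:)).^2 ./ v(i,:), 2);
      logp(:,i) = log(w(i)) - 0.5*(q + sum(log(2*pi*v(i,:))));
    end
    mx  = max(logp, [], 2);
    lse = mx + log(sum(exp(logp - mx), 2));    % stable log-sum-exp
    r   = exp(logp - lse);                     % N-by-M responsibilities
    L   = sum(lse);                            % total log-likelihood
    n   = sum(r, 1)';                          % effective counts alpha_i*N
    % Prune modes whose effective sample count has fallen too low
    keep = n >= min_eff;
    if ~all(keep)
      mu = mu(keep,:);  v = v(keep,:);
      w  = w(keep) / sum(w(keep));  M = sum(keep);
      continue;                                % redo E-step with fewer modes
    end
    % M-step: update weights, means, and variances
    w = n / N;
    for i = 1:M
      mu(i,:) = r(:,i)' * x / n(i);
      v(i,:)  = r(:,i)' * (x - mu(i,:)).^2 / n(i) + 1e-6;  % crude conditioning
    end
    % (A mode-splitting test, as in software/gmix_kurt.m, would go here.)
    if L - L_old < 1e-4 * abs(L), break; end   % likelihood has leveled off
    L_old = L;
  end
end
```

Note that the pruning test compares the effective sample count $\alpha_i N$ against a floor, anticipating the measure of a mode's "value" discussed below.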

The maximum number of modes to start with is about $N/(4P)$, where $P$ is the data dimension and $N$ is the number of samples. If all the modes “share” the data equally, this allots $4P$ samples per mode, a bare minimum. Over-specifying the number of modes is generally not a problem, since the covariance estimates are stabilized by the conditioning discussed in section 13.2.4, and as long as the training data can support the number of modes chosen, the approximation is good.

The mixing weight $\alpha_i$ of a mode multiplied by the number of input data samples $N$ gives the number of samples effectively used to estimate that mode's parameters; this product is a simple measure of the “value” of each mode. As long as it is high enough, the mode is estimated accurately. If $\alpha_i$ falls too low, the mode is eliminated or combined with another. With a combination of covariance constraints, pruning, merging, and mode splitting, a good PDF approximation can be obtained reliably.
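As a brief numeric illustration of these two rules (the numbers are invented for the example, not taken from the text):

```matlab
N = 2000; P = 5;               % sample count and dimension (illustrative)
Mmax  = floor(N / (4*P))       % at most 100 starting modes, 4P = 20 samples each
alpha = 0.004;                 % mixing weight of a weak mode
n_eff = alpha * N              % only 8 effective samples, well below 4P:
                               % this mode should be pruned or merged
```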