Sampling and the feature bottleneck.

To avoid the problems associated with PDF estimation in high dimensions (time-series, images, etc.), practitioners of generative methods often prefer to begin with a dimension-reducing feature extraction. But in problems with many and varied data classes, it is difficult to find a single low-dimensional feature vector that contains all the necessary information. This can be called the feature bottleneck. Aside from the potential information loss, using features greatly limits the usefulness of generative models. All generative models can, at least in theory, be used to generate random samples. Generating random features has little value, since the synthetic features have no obvious interpretation outside of the classifier itself. If robust generative models could instead be created on the input data space, then synthetic data in the form of time-series, images, etc., could be interpreted or visualized intuitively, and processed by alternative means in meaningful ways. Uses of sampling include:
  1. Creating realistic simulated data for experiments.
  2. Validation of the distribution $ p({\bf x})$, by observing the quality and suitability of the generated samples.
  3. Monte Carlo integration. The integral $ I = \int_{{\bf x}} h({\bf x}) p({\bf x}) {\rm d} {\bf x}$ is approximated using $ I \simeq \frac{1}{m} \sum_{i=1}^m h({\bf x}_i), $ for large $ m$, where the samples $ {\bf x}_i$ are drawn from $ p({\bf x})$ (a numerical sketch follows this list).
  4. Creating hybrid generative/discriminative models. Let $ y({\bf x})$ be a decision rule taking the value 1 (accept) or 0 (reject). We create the hybrid generative model $ h({\bf x})=\frac{1}{c} \; y({\bf x})\; p({\bf x}),$ where the normalizing constant $ c$ is obtained by Monte Carlo integration, $ c = \frac{1}{m} \sum_{i=1}^m y({\bf x}_i), $ where $ {\bf x}_i$ are samples of $ p({\bf x})$; $ c$ is thus the fraction of the samples accepted by $ y({\bf x})$. This model is a true generative model with the qualities of both the discriminative and generative components (see the sketch after this list).
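
As an illustration of item 3, here is a minimal Monte Carlo integration sketch in Python. The choices of $ p({\bf x})$ (a 2-D standard normal) and $ h({\bf x})$ (the squared norm) are hypothetical stand-ins, used only to make the example concrete and checkable.

    import numpy as np

    rng = np.random.default_rng(0)

    def h(x):
        # Illustrative integrand: squared Euclidean norm of each sample.
        return np.sum(x ** 2, axis=1)

    # Draw m samples x_i from p(x); here p(x) is a 2-D standard normal.
    m = 100_000
    samples = rng.standard_normal((m, 2))

    # I ~= (1/m) * sum_i h(x_i); for this p and h the exact value is 2.
    I_hat = np.mean(h(samples))
    print(f"Monte Carlo estimate: {I_hat:.4f} (exact value: 2.0)")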
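
Similarly, a sketch of item 4, under the same assumed $ p({\bf x})$: sampling from the hybrid model $ h({\bf x})$ amounts to rejection sampling, keeping only the samples of $ p({\bf x})$ that $ y({\bf x})$ accepts, while $ c$ is estimated as the acceptance rate. The decision rule used here (accept inside the unit disk) is again a hypothetical choice for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    def y(x):
        # Decision rule: 1 (accept) inside the unit disk, 0 (reject) outside.
        return (np.sum(x ** 2, axis=1) <= 1.0).astype(float)

    m = 100_000
    samples = rng.standard_normal((m, 2))   # x_i drawn from p(x)
    accept = y(samples)

    # c = (1/m) * sum_i y(x_i): the acceptance probability under p(x).
    c = np.mean(accept)

    # Samples of h(x) = (1/c) y(x) p(x) are exactly the accepted x_i.
    hybrid_samples = samples[accept == 1.0]
    print(f"c ~= {c:.4f}; kept {len(hybrid_samples)} hybrid samples")

For this choice of $ p$ and $ y$, the exact value is $ c = 1 - e^{-1/2} \approx 0.3935$, which the estimate approaches for large $ m$.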
