Cepstral Analysis

Cepstral analysis is a widely used means of extracting meaningful information from the spectrum. The most widely used variant is the MEL frequency cepstral coefficients (MFCC) [32]. MFCC computation generally consists of these steps:

1. Discrete Fourier transform (DFT), followed by calculating the magnitude or squared magnitude of the bins to produce a raw spectral vector.
2. Energy binning by inner product of the raw spectral vector with a bank of positive-valued spectral band functions.
3. Log (taking the logarithm of the binned band energies).
4. (Optional) Discrete cosine transform (DCT), with optional truncation (elimination of some highest-frequency bins).
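The steps above can be sketched with base MATLAB/Octave functions only. This is a minimal illustration, not the toolkit's implementation: the frame length N, the band count K, and the block-shaped, linearly spaced band matrix A are assumptions chosen for brevity, and no J-function bookkeeping is done.

```matlab
% Sketch of the four MFCC-style steps using only base MATLAB/Octave.
% The band matrix is a hypothetical stand-in (block-shaped, linearly
% spaced), not the band functions built by the toolkit modules.
N = 64;                          % frame length (assumed even)
K = 8;                           % number of spectral bands (assumed)
x = randn(N,1);                  % N-by-1 input data vector

% Step 1: DFT, then squared magnitude of bins 0..N/2
X = fft(x);
M = N/2 + 1;                     % number of non-redundant bins
y = abs(X(1:M)).^2;              % raw spectral vector, M-by-1

% Step 2: energy binning, w = A'*y with non-negative band functions
edges = round(linspace(0, M, K+1));
A = zeros(M,K);
for k = 1:K
    A(edges(k)+1:edges(k+1), k) = 1;   % block-shaped band k
end
w = A' * y;                      % K-by-1 band energies

% Step 3: logarithm of the band energies
u = log(w);

% Step 4 (optional): orthonormal DCT-II written as an explicit matrix
n = 0:K-1;
C = sqrt(2/K) * cos(pi * (n') * (2*n + 1) / (2*K));
C(1,:) = C(1,:) / sqrt(2);       % scale first row so that C*C' = I
z = C * u;                       % K-by-1 cepstral features
```

The explicit matrix C is used here only so that the sketch does not depend on a toolbox; it follows the same orthonormal DCT-II definition as MATLAB's dct.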

There are diverse implementations. Either magnitude or magnitude-squared can be used prior to energy binning. As there is little difference, and we have thoroughly analyzed and implemented the magnitude-squared method, we will stick to that. The bank of spectral band functions can have band centers and bandwidths that are either linearly or non-linearly spaced. The DCT can be truncated, non-truncated, or even left out completely. From an information-content point of view, the DCT/truncation step does nothing except perhaps smoothing, and smoothing can equally be accomplished by choosing smoother or wider spectral band functions. The DCT mainly serves to produce largely independent output features, which may allow the use of diagonal covariance matrices in the PDF estimation step.

Rather than truncating the DCT coefficients, we believe it is better to use fewer but fatter spectral band functions, then keep all the DCT bins. This produces "tighter" features from an information point of view, as we will see in Section 11.1. The first (zero-frequency) DCT coefficient is often left out in practice. As leaving this out disturbs the energy statistic (Section 3.2.2), we prefer to keep it. If the practitioner wishes to ignore the information in the zero-frequency DCT coefficient, this can be handled by assigning a fixed prior distribution to it in the PDF estimation step. This will have the same effect in the classifier as eliminating it altogether.

The factors with the most influence are the number, shape, and spacing of the spectral band functions. The cepstral features can be produced (excepting the last DCT step) using the following module chain (assume x is the $N\times 1$ input data vector):

  [y,jout]=module_dftmsq(x,0);
  [w,jout]=module_mel_XXX(y,jout,...);
  [u,jout]=module_log(w,jout);
  z=dct(u);

These four lines correspond one-to-one with the four steps above. Note that the DCT step requires no modification of the J-function, since it is an invertible transformation with Jacobian $\vert J\vert=1$. Above, module_mel_XXX can be either module_mel_bank, which is described in Sections 5.2.6 and 5.2.3, and in Section 5.3.7 in the context of sampling inversion, or module_mel_ml, which is described in Section 5.2.9. These modules have options to produce linear or MEL-spaced spectral band functions. But even when module_mel_bank and module_mel_ml are set up to produce the same bank of band functions, they produce different features. Let ${\bf x}$ be the raw spectral vector, and ${\bf A}$ be the matrix of spectral band functions. As we have explained in Section 5.2.9, the first produces the closed-form feature ${\bf z}={\bf A}^\prime {\bf x}$, and the second iteratively seeks the ${\bf z}$ that approximates ${\bf x}$ through ${\bf x}\sim {\bf A} {\bf z}$. We'll compare these in Section 11.1.
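The unit-Jacobian property of the DCT can be verified in one line, assuming the orthonormal DCT-II scaling (the scaling used by MATLAB's dct). The symbol ${\bf C}$ for the $K\times K$ DCT matrix and ${\bf u}$ for its input are notation introduced here for illustration.

```latex
{\bf z} = {\bf C}\,{\bf u}, \qquad {\bf C}\,{\bf C}^\prime = {\bf I}
\;\Longrightarrow\; \det({\bf C})^2 = \det({\bf C}\,{\bf C}^\prime) = 1
\;\Longrightarrow\; \vert J\vert = \vert\det({\bf C})\vert = 1, \qquad \log\vert J\vert = 0 .
```

With a DCT that is not scaled to be orthonormal, the determinant would be a constant other than 1, and the J-function would need a corresponding constant correction.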

The band spacing and shape of the spectral band functions are controlled by the two variables fs and type. The variable fs is the assumed sample rate used to set up the spacing and center frequencies of the MEL band functions. The variable type controls whether the bands are spaced on the MEL scale or a linear scale, and whether the band shapes are triangular, Hanning-shaped, or rectangular, as follows:

% type | shape      | spacing | end bins
% ---------------------------------------
% 0      hanning      MEL       Half
% 1      triangular   MEL       Half
% 2      hanning      linear    Half
% 3      triangular   linear    Half
% 4      ACF          ---       Half
% 5      block        MEL       Half
% 6      block        linear    Half

The ACF selection results in the calculation of the autocorrelation function (ACF) when applied to the magnitude-squared DFT bins. Note that the end bin values are set to "half", so the sum of the band functions at the zero and Nyquist frequency bins will be 1/2, while it will be 1 at all other bins. This preserves in the features the total energy at the DFT input, i.e. it controls the energy statistic.
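The effect of the "half" end bins can be checked numerically. The sketch below is an illustration under stated assumptions: the triangular, linearly spaced bands built with interp1 are a hypothetical stand-in for the toolkit's band functions, and N and K are arbitrary. It shows that the band energies then sum to a fixed multiple, N/2, of the input energy.

```matlab
% Check that halving the end bins ties the sum of the band energies to
% the input energy. The bands are a hypothetical stand-in (triangular,
% linearly spaced), not the toolkit's band functions.
N = 64;  K = 6;                  % frame length (even) and band count (assumed)
x = randn(N,1);
X = fft(x);
M = N/2 + 1;                     % bins 0..N/2 (zero frequency to Nyquist)
y = abs(X(1:M)).^2;              % magnitude-squared DFT bins

% Hat functions on linearly spaced centers: each row of A sums to 1
centers = linspace(1, M, K)';
A = interp1(centers, eye(K), (1:M)');   % M-by-K, non-negative

% Set the end bins to "half": the rows for bin 0 and Nyquist now sum to 1/2
A(1,:) = A(1,:) / 2;
A(M,:) = A(M,:) / 2;

w = A' * y;                      % band energies

% Bins 1..N/2-1 appear twice in the full spectrum and the end bins once,
% so sum(w) is half of sum(abs(X).^2), i.e. (N/2)*sum(x.^2) by Parseval.
lhs = sum(w);
rhs = (N/2) * sum(x.^2);
fprintf('relative difference: %g\n', abs(lhs - rhs) / rhs);
```

Without the two halving lines, the zero and Nyquist bins would be counted twice relative to the other bins, and the sum of the band energies would no longer be a function of the input energy alone.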

If we are going to truncate the DCT from $K$ down to $D$ coefficients, we model it as a matrix multiplication ${\bf z}={\bf A}^\prime {\bf x}$, where ${\bf A}$ is the truncation matrix A=eye(K,D), along with the Gaussian reduction in Section 4.4, i.e. use software/module_lin_gauss.m. Thus, let z be the DCT output above. To truncate, we continue with:

  A=eye(K,D);
  [zt,jout]=module_lin_gauss(z,jout,A);

Alternatively, we can build the matrix A to do both the DCT and truncation. Thus, let u be the output of the log operation above. We continue with:

  A=idct(eye(K,D));
  [z,jout]=module_lin_gauss(u,jout,A);

At this point, the last, i.e. the $(D+1)$-th, element of ${\bf z}$ is the energy statistic. It makes sense to take the log of that:

  [z,jout]=module_log(z,jout,D+1);

If the information content in the energy statistic is not wanted, it can be handled like the zero-frequency DCT bin above. Thus, in summary, if a truncated DCT is wanted (we don't recommend it, for reasons we will explain in Section 11.1), then use:

  [y,jout]=module_dftmsq(x,0);
  [w,jout]=module_mel_XXX(y,jout,...);
  [u,jout]=module_log(w,jout);
  A=idct(eye(K,D));
  [z,jout]=module_lin_gauss(u,jout,A);
  [z,jout]=module_log(z,jout,D+1);
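The combined matrix A=idct(eye(K,D)) used above can be checked against a DCT followed by truncation. This sketch tests only the linear map ${\bf A}^\prime {\bf u}$; it does not reproduce module_lin_gauss, its appended energy statistic, or its J-function bookkeeping. The sizes K and D are example values, and dct/idct are assumed to be available as in the text.

```matlab
% Check that A = idct(eye(K,D)) performs the DCT and the truncation in
% a single product. Only the linear map is checked here.
K = 8;  D = 5;                   % example sizes (assumed)
u = randn(K,1);                  % stands in for the log band energies

z_full  = dct(u);                % full K-point DCT
A       = idct(eye(K,D));        % K-by-D: inverse DCT of the first D unit vectors
z_trunc = A' * u;                % D-by-1

fprintf('max abs difference: %g\n', max(abs(z_trunc - z_full(1:D))));
```

Because the DCT is orthonormal, its inverse is its transpose, so the columns of A are the first D DCT basis vectors and ${\bf A}^\prime {\bf u}$ returns the first D DCT coefficients.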
