Features

We extracted features using mel-frequency cepstral coefficient (MFCC) analysis [32] with Hanning-weighted frames overlapped by 2/3. For the TIMIT data, which is sampled at 16 kHz, we first downsampled to 12 kHz and then used 288-sample windows (24 milliseconds). For the office sounds data, which is sampled at 32 kHz, we used 288-sample windows (18 milliseconds). For both data sets, we used 24 Hanning-shaped mel bands (including the zero and Nyquist bands) and no DCT truncation, producing a 24-dimensional feature vector.
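As an illustration of the framing step only, the following minimal sketch computes the hop size implied by a 288-sample window with 2/3 overlap and cuts a signal into Hanning-weighted frames. The function names `frame_params` and `hanning_frames` are hypothetical, NumPy's symmetric `np.hanning` window is an assumption (the exact window variant is not specified above), and the mel filterbank and DCT stages are omitted.

```python
import numpy as np

def frame_params(window_len, sample_rate_hz, overlap=2 / 3):
    """Return (hop in samples, window in ms, hop in ms).

    Hypothetical helper: with 2/3 overlap, each frame advances by
    one third of the window length.
    """
    hop = round(window_len * (1 - overlap))
    window_ms = 1000.0 * window_len / sample_rate_hz
    hop_ms = 1000.0 * hop / sample_rate_hz
    return hop, window_ms, hop_ms

def hanning_frames(signal, window_len=288, overlap=2 / 3):
    """Cut a 1-D signal into overlapping Hanning-weighted frames.

    Assumes len(signal) >= window_len; trailing samples that do not
    fill a complete frame are dropped (no padding).
    """
    signal = np.asarray(signal, dtype=float)
    hop = round(window_len * (1 - overlap))
    n_frames = 1 + (len(signal) - window_len) // hop
    # Row i holds the sample indices of frame i.
    idx = np.arange(window_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx] * np.hanning(window_len)

if __name__ == "__main__":
    # TIMIT setting after downsampling: 288-sample windows at 12 kHz.
    hop, window_ms, hop_ms = frame_params(288, 12000)
    print(hop, window_ms, hop_ms)
    frames = hanning_frames(np.ones(12000))  # one second of dummy audio
    print(frames.shape)
```

For the TIMIT setting, this gives a 96-sample hop, so successive 24 ms frames start 8 ms apart.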