Data Labeling

To get started, identify a small number of signal classes (see the definition in Section 14.3.1), then mark regions of the data as examples of each class. The labeling can be done manually or automatically.

Manual Labeling. In manual labeling, we choose a set of signal class (subclass) names:

       subclassnames={'sil'  'S'  'PCL' 'P'  'KCL' 'K' 'TCL' 'T' 'OO1' 'OO2' 'L'  };

then use a data labeling tool to segment the data. The input to the tool is a cell array of time-series (events). The output must be a cell array with one entry per event, where each event entry is itself a cell array of segments. Each segment is a cell array containing (a) a pair of integers giving the first and last sample of the segment, and (b) the signal class name. For example, suppose event one has two segments, 'firsthalf' and 'secondhalf', and event two has three segments, 'front', 'middle', and 'back'. We would have:

   % first event
   Idx{1}={  { [ 1 100] , 'firsthalf' } ,
             { [ 101 200] , 'secondhalf' } } ;
   % second event
   Idx{2}={  { [ 1 100] , 'front' } ,
             { [ 101 200] , 'middle' } , 
             { [ 201 300] , 'back' }   };
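
As a reading aid, the nested format above can be traversed with two loops, one over events and one over segments. The following is only a minimal sketch: the loop variable names (ev, s, seg, range, name, nseg) are illustrative and are not part of the labeling tool.

```matlab
% Idx in the manual-labeling format described above
Idx = {};
Idx{1} = { { [1 100],   'firsthalf'  }, ...
           { [101 200], 'secondhalf' } };
Idx{2} = { { [1 100],   'front'  }, ...
           { [101 200], 'middle' }, ...
           { [201 300], 'back'   } };

% Walk every event and every segment, printing sample range and label
nseg = 0;
for ev = 1:numel(Idx)
    for s = 1:numel(Idx{ev})
        seg   = Idx{ev}{s};   % one segment: { [first last], 'name' }
        range = seg{1};       % first and last sample of the segment
        name  = seg{2};       % signal class name
        fprintf('event %d: samples %d-%d -> %s\n', ...
                ev, range(1), range(2), name);
        nseg = nseg + 1;      % running count of segments seen
    end
end
```

Indexing with numel rather than size means the loops work whether the segments of an event are stored as a row or as a column of cells.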

The function software/data_label.m implements such a labeling tool. For example:

    subclassnames={'sil'  'S'  'PCL' 'P'  'KCL' 'K' 'TCL' 'T' 'OO1' 'OO2' 'L'  };
    Idx={}; % start with no segments (pass an existing Idx to add to it)
    fs=32000; % sample rate
    NFFT=64; % FFT size for spectrogram display
    Idx=data_label(X,fs,NFFT,subclassnames,Idx);

Once the program is started, press 'h' for a list of shortcut keys. The program can be run again to add segments to the cell array Idx.

Auto-segmentation and clustering. To label the data automatically, the data is first divided into segments of consistent spectral character by a general-purpose time-series segmentation algorithm [58]. Figure 14.5 illustrates the segmentation of a time-series from the "Office Sounds" database.

**Figure 14.5:** Illustration of segmenting a time-series. Top: time-series, with box drawn around the data of each segment. Bottom: spectrogram.

Once the data has been segmented, as in Figure 14.5, we automatically determine a set of ncluster signal classes by clustering the spectral content of the segments. To obtain a common feature space that is largely invariant to segment size, we use cepstral coefficients (the DCT of the log of the magnitude-squared DFT), always keeping 25 coefficients regardless of segment size. If the segment size is less than 48, we zero-pad to 25 coefficients. The collection of 25-dimensional cepstral vectors from all segments is then clustered into ncluster clusters using K-means, and each segment is classified as one of the ncluster signal classes. Let the data X and the variables K, Ns, Ps, fs be defined as above. The segmentation and spectral clustering are implemented by the functions software/segment_data.m and software/cluster_data.m:

    ncluster=4;
    Seg=segment_data(X,K,Ns,Ps,fs,0);
    [Idx,Mu,wts]=cluster_data(X,Seg,ncluster,fs,0);

    % give the subclasses arbitrary names
    subclassnames=cell(1,ncluster); % clear any names left over from manual labeling
    for i=1:ncluster, subclassnames{i}=sprintf('Cls%d',i); end;
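
To make the feature description concrete, the following sketch computes a fixed-length cepstral vector for a single segment. It is not the code in software/cluster_data.m. It assumes that only the non-negative-frequency DFT bins are used and that the DCT is an unnormalized DCT-II, neither of which is stated above. The names x, ncep, logP, and cep are illustrative.

```matlab
ncep = 25;                     % number of cepstral coefficients kept
x    = randn(40,1);            % hypothetical short segment (fewer than 48 samples)

N    = numel(x);
P    = abs(fft(x)).^2;         % magnitude-squared DFT
P    = P(1:floor(N/2)+1);      % non-negative frequencies only (assumption)
logP = log(P + eps);           % log spectrum; eps guards against log(0)

% DCT-II written out explicitly so that no toolbox is required
M    = numel(logP);
n    = (0:M-1)';               % input index (column)
k    = 0:M-1;                  % output index (row)
C    = cos(pi*(n+0.5)*k/M);    % M-by-M cosine basis, entry (n,k)
c    = C' * logP;              % unnormalized DCT-II of the log spectrum

% Force a fixed length: truncate long vectors, zero-pad short ones
cep  = zeros(ncep,1);
m    = min(M, ncep);
cep(1:m) = c(1:m);
```

Under these assumptions, a 40-sample segment gives 21 spectral bins, so the last four entries of cep are zero padding. A segment of 48 samples gives exactly 25 bins, which would be consistent with the threshold of 48 mentioned above.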

Running software/cluster_data.m more than once produces different results because the clustering is randomly initialized. Note that the Idx variable returned by software/cluster_data.m has a different format from the output of the manual labeling tool; both formats are supported. The index array for each event is a matrix with three columns: start sample, end sample, and cluster number. In this format, the data_label example shown above would instead look like:

   % first event
   Idx{1}=  [   1 100  1;
              101 200  2 ];
   % second event
   Idx{2}=  [   1 100  3;
              101 200  4;
              201 300  5 ];
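
The relationship between the two formats can be sketched as a conversion from the cell-array form to the three-column form. This is an illustration only, not a function shipped with the software. The variable names IdxCell, IdxMat, and names are hypothetical, and the class number is assumed to be the position of the class name in the list of names.

```matlab
% Manual-labeling format (nested cell arrays), as in the earlier example
IdxCell{1} = { { [1 100], 'firsthalf' }, { [101 200], 'secondhalf' } };
IdxCell{2} = { { [1 100], 'front' }, { [101 200], 'middle' }, ...
               { [201 300], 'back' } };

% Class names; the position in this list becomes the class number
names = {'firsthalf','secondhalf','front','middle','back'};

IdxMat = cell(size(IdxCell));
for ev = 1:numel(IdxCell)
    nseg = numel(IdxCell{ev});
    IdxMat{ev} = zeros(nseg,3);          % columns: start, end, class number
    for s = 1:nseg
        range = IdxCell{ev}{s}{1};       % [first last] sample
        cls   = find(strcmp(names, IdxCell{ev}{s}{2}));  % name -> number
        IdxMat{ev}(s,:) = [range(1) range(2) cls];
    end
end
```

If a segment name is missing from names, find returns an empty result and the row assignment fails, so any name list used this way should cover every label that appears in the data.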