Illustrative Example 2: Using Manual Segmentation: Analysis of Speech

In the following example, we use manual segmentation to analyze human speech at a fine time scale, focusing on the subtle differences between the words “spool”, “skool” (school), and “stool”. The data can be found at http://class-specific.com/csf/spoolskool.tgz. The experiment models the sounds “P”, “K”, and “T”. Each of these sounds is a plosive, which consists of a closure (a full cut-off of airflow) followed by a short burst leading into the following vowel. Specifically, the experiment tries to locate the bursts. The script software/skptool.m runs the example. The data set includes recordings of me speaking the words “stool”, “spool”, and “skool” (school), 12 times each.

The script software/skptool.m mostly follows the method of software/mrhmm_test1.m with manual labeling. We take the strategy of training on all the data together in one batch, with forced alignment using the following set of phonetic units (subclasses):

   subclassnames={'sil'  'S'  'PCL' 'P'  'KCL' 'K' 'TCL' 'T' 'OO1' 'OO2' 'L'  };
The state transition matrix, which is illustrated in the bottom-right pane of Figure 14.9, is highly structured because the phonetic units occur in a fixed sequence. In all cases, the word begins with silence (sil) followed by S, then transitions into one of the closures PCL, KCL, or TCL, each of which is followed directly by its respective plosive P, K, or T, and ends with OO1, OO2, L (this is my own nomenclature).

The experiment leaves one of the 12 utterances of each class out during training (specified by the variable itest); the held-out utterances are then segmented using the Viterbi algorithm. The result is scored by checking whether the Viterbi algorithm detects the plosive states P, K, T at the same times as the ground-truth labels. An example is shown in Figure 14.9, where a test sample of “stool” is classified using Viterbi. The burst “T” is located to within a few samples, something that would be impossible with standard speech processing based on, for example, MFCC features.

Run the script once with init=1 to save the features to a file. Then run it with init=0 for each value of itest: 1, 2, ..., 12. When testing, the intersection between the labeled plosives and the Viterbi segmentation is determined, and the variable ndet counts the number of intersecting base segments. The best results were obtained with kbonus=5: 1562 of the 1758 labeled plosive base segments (about 89%) were correctly identified, and there were no class errors.
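
The fixed ordering of the phonetic units can be written down as a table of allowed transitions. The following Python sketch is illustrative only and is not taken from software/skptool.m (which is MATLAB). The names `forward`, `allowed`, and `is_valid_path` are hypothetical, and the self-transitions are an assumption, since the text specifies only the order of the units.

```python
# Phonetic units (subclasses), in the order listed in the text.
subclassnames = ['sil', 'S', 'PCL', 'P', 'KCL', 'K', 'TCL', 'T', 'OO1', 'OO2', 'L']

# Forward transitions implied by the text:
# sil -> S -> one closure -> its plosive -> OO1 -> OO2 -> L.
forward = {
    'sil': ['S'],
    'S': ['PCL', 'KCL', 'TCL'],
    'PCL': ['P'], 'KCL': ['K'], 'TCL': ['T'],
    'P': ['OO1'], 'K': ['OO1'], 'T': ['OO1'],
    'OO1': ['OO2'],
    'OO2': ['L'],
    'L': [],
}

index = {name: i for i, name in enumerate(subclassnames)}
n = len(subclassnames)

# allowed[i][j] is True if unit i may be followed by unit j.
# Assumption (not stated in the text): every unit may also repeat itself.
allowed = [[False] * n for _ in range(n)]
for src, dsts in forward.items():
    allowed[index[src]][index[src]] = True
    for dst in dsts:
        allowed[index[src]][index[dst]] = True


def is_valid_path(units):
    """Return True if the sequence starts at 'sil', ends at 'L',
    and uses only allowed transitions."""
    if not units or units[0] != 'sil' or units[-1] != 'L':
        return False
    return all(allowed[index[a]][index[b]] for a, b in zip(units, units[1:]))


# A "stool"-like path is valid; pairing the T closure with the K burst is not.
print(is_valid_path(['sil', 'S', 'TCL', 'T', 'OO1', 'OO2', 'L']))
print(is_valid_path(['sil', 'S', 'TCL', 'K', 'OO1', 'OO2', 'L']))
```

Because each closure leads only to its own plosive, a decoded path commits to a word class as soon as it leaves S, which is why the table is so sparse.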
Figure 14.9: MR-HMM testing of a sample of class “stool” using the Viterbi algorithm (image file: mrhmm_test1d.eps).
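
The intersection scoring used in this example can be sketched as follows. This is a minimal Python illustration, not the code in software/skptool.m: it assumes that both the ground-truth labels and the Viterbi output are available as one unit name per base segment, and the function name `score_plosives` and the counter `nwrong` (one possible reading of a "class error") are hypothetical. Only the counter name ndet comes from the text.

```python
PLOSIVES = {'P', 'K', 'T'}


def score_plosives(truth, decoded):
    """Count base segments where a labeled plosive coincides with the
    same decoded plosive.

    truth, decoded: equal-length lists with one unit name per base segment.
    Returns (ndet, nlabeled, nwrong):
      ndet     - labeled plosive segments decoded as the same plosive
      nlabeled - all labeled plosive segments
      nwrong   - labeled plosive segments decoded as a different plosive
    """
    if len(truth) != len(decoded):
        raise ValueError("truth and decoded must have the same length")
    ndet = nlabeled = nwrong = 0
    for t, d in zip(truth, decoded):
        if t in PLOSIVES:
            nlabeled += 1
            if d == t:
                ndet += 1
            elif d in PLOSIVES:
                nwrong += 1
    return ndet, nlabeled, nwrong


# Toy labels, made up for illustration (not data from the experiment).
truth = ['sil', 'S', 'S', 'TCL', 'TCL', 'T', 'T', 'T', 'OO1', 'OO2', 'L']
decoded = ['sil', 'S', 'S', 'TCL', 'TCL', 'TCL', 'T', 'T', 'OO1', 'OO2', 'L']
ndet, nlabeled, nwrong = score_plosives(truth, decoded)
print(ndet, nlabeled, nwrong)
```

In this toy case the decoded burst starts one base segment late, so two of the three labeled segments intersect. Summing ndet and nlabeled over all held-out utterances gives a ratio of the kind reported above.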