The script software/skptool.m mostly follows the method of software/mrhmm_test1.m with manual labeling. The strategy is to train on all the data together in one batch, with forced alignment, using the following set of phonetic units (subclasses):
subclassnames={'sil' 'S' 'PCL' 'P' 'KCL' 'K' 'TCL' 'T' 'OO1' 'OO2' 'L'};

The state transition matrix, which is illustrated in the bottom-right pane of Figure 14.9, is highly structured because the phonetic units occur in a fixed sequence. In all cases, the word begins with silence (sil) followed by S, then moves into one of the closures PCL, KCL, or TCL, each of which is followed directly by its respective plosive P, K, or T, and ends with OO1, OO2, L (this is my own nomenclature).

The experiment holds out one of the 12 utterances of each class during training (selected by the variable itest); the held-out utterances are then segmented using the Viterbi algorithm. The result is scored by checking whether the Viterbi algorithm detects the plosive states P, K, T at the same time as the ground-truth labels. An example is shown in Figure 14.9, where a test sample of "stool" is segmented using Viterbi. The burst "T" is located to within a few samples, something that would be impossible with standard speech processing based on, for example, MFCC features.

Run the script once with init=1 to save the features to a file. Then run it with init=0 for each value of itest: 1, 2, ..., 12. When testing, the intersection between the labeled plosives and the Viterbi segmentation is determined, and the variable ndet counts the number of intersecting base segments. The best results were obtained with kbonus=5: 1562 of the 1758 labeled base segments of the plosives (about 89%) were properly identified, and there were no class errors.
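To make the structure of the transition matrix concrete, the sequence of phonetic units described above can be written as a mask of allowed transitions. This is only a minimal sketch: the variable names pairs and allowed are hypothetical, the self-loops are an assumption (each unit is taken to be able to persist for more than one segment), and none of it is taken from software/skptool.m.

```matlab
% Sketch only: allowed-transition mask for the phonetic units listed above.
% The names "pairs" and "allowed" and the self-loops are assumptions made
% for illustration; they do not come from software/skptool.m.
subclassnames = {'sil' 'S' 'PCL' 'P' 'KCL' 'K' 'TCL' 'T' 'OO1' 'OO2' 'L'};
n = numel(subclassnames);

% Forward transitions described in the text, as {from, to} pairs.
pairs = {'sil','S'; 'S','PCL'; 'S','KCL'; 'S','TCL'; ...
         'PCL','P'; 'KCL','K'; 'TCL','T'; ...
         'P','OO1'; 'K','OO1'; 'T','OO1'; ...
         'OO1','OO2'; 'OO2','L'};

allowed = logical(eye(n));   % assumption: every unit may follow itself
for k = 1:size(pairs, 1)
    from = find(strcmp(subclassnames, pairs{k, 1}));
    to   = find(strcmp(subclassnames, pairs{k, 2}));
    allowed(from, to) = true;
end

% Print the permitted successors of each unit as a quick visual check.
for i = 1:n
    fprintf('%-4s -> %s\n', subclassnames{i}, ...
            strjoin(subclassnames(allowed(i, :)), ' '));
end
```

Normalizing each row of such a mask would give one possible initial transition matrix. Entries that start at zero remain zero under Baum-Welch re-estimation, which is one way a structure like this can be preserved through training.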