Informatica 30 (2006) 373-384

Isolated Words Recognition System Based on Hybrid Approach DTW/GHMM

E-Hocine Bourouba, Mouldi Bedda and Rafik Djemili
Department of Electronics, Faculty of Engineering, University of Annaba, Algeria
Automatic and Signals Laboratory
E-mail: {Bourouba2004, mouldi_bedda, djemili_r}@yahoo.fr

Keywords: speech recognition, hidden Markov models, dynamic time warping, hybrid system

Received: February 11, 2005

In this paper, we present a new hybrid approach for isolated spoken word recognition using hidden Markov models (HMM) combined with dynamic time warping (DTW). HMMs have been shown to be robust in speech recognition systems. We propose to extend the HMM method by combining it with the DTW algorithm in order to combine the advantages of these two powerful pattern recognition techniques. In this work we carry out a comparative evaluation between traditional continuous hidden Markov models (GHMM) and the new DTW/GHMM approach. This approach integrates a prototype (word reference template) for each word into the training phase of the hybrid system. An iterative algorithm based on the conventional DTW algorithm and on an averaging technique is used to determine the best prototype during the training phase in order to increase model discrimination. The test phase is identical for the GHMM and DTW/GHMM methods. We evaluate the performance of each system on several different test sets and observe that the new approach gives the best results in all cases.

Povzetek: This article describes hybrid methods for word recognition.

1 Introduction

Automatic speech recognition has been an active research topic for more than four decades. With the advent of digital computing and signal processing, the problem of speech recognition was clearly posed and thoroughly studied. These developments were complemented by an increased awareness of the advantages of conversational systems. The range of possible applications is wide and includes voice-controlled appliances, fully featured speech-to-text software, automation of operator-assisted services, and voice recognition aids for the handicapped.

Different approaches to speech recognition have been adopted. They can be divided mainly into three trends, namely dynamic time warping (DTW), hidden Markov models (HMM), and artificial neural networks (ANN). The introduction of HMMs for speech has made a strong impact and has enabled great progress during the last few years. However, much remains to be accomplished in this area in order to improve their quality, in particular the reinforcement of the discrimination between different models, which seems very promising.

In the 1990s, a fourth technique, called the hybrid approach, was introduced. The combination of multiple methods produces a more precise final result because it exploits the advantages of each one. This combination seems to constitute an interesting approach to speech recognition. Most new speech recognition systems are now based on the hybrid HMM/ANN approach: the HMM has a great capacity to handle events in time, while the ANN excels at the classification of static patterns.

The main solutions suggested to compensate for the lack of discrimination in Markov models act on the model training phase. An alternative approach consists in introducing discrimination locally in the model's definition. Among existing methods, the use of ANNs as discriminating probability estimators has proven to be efficient; nevertheless, it is costly and difficult to put into practice.
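As background on how such HMM/ANN hybrids typically use the network, the sketch below converts ANN state posteriors into scaled likelihoods by dividing by the state priors, so that they can stand in for the HMM emission probabilities. This is a generic illustration and not the method developed in this paper; the function name and the numerical values are purely illustrative.

```python
import numpy as np

# In HMM/ANN hybrids the network estimates state posteriors P(q_j | o_t);
# by Bayes' rule, dividing by the state priors P(q_j) yields values
# proportional to the likelihoods p(o_t | q_j) used as HMM emission scores.
def scaled_likelihoods(posteriors: np.ndarray, priors: np.ndarray) -> np.ndarray:
    # posteriors: (T, N) network outputs per frame; priors: (N,) state priors
    return posteriors / priors

# Toy example with made-up numbers:
post = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.6, 0.3]])
prior = np.array([0.5, 0.3, 0.2])
print(scaled_likelihoods(post, prior))
```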
Techniques that reinforce the discrimination between models by re-estimating the model parameters according to an ANN discriminating criterion are complex and do not guarantee convergence of the learning procedure. The approach we propose relies on the principle that global discrimination between Markov models can be obtained from a discrimination of the models' training sequences, achieved by transforming the representation space using time alignment. Thus, we have developed an iterative algorithm that extracts, from the training set, the prototype that best favours discrimination among the data classes. The underlying criterion can be summarized as follows: after aligning the sequences of each class on its prototype, each class should be as compact as possible, and the classes should be as dispersed from one another as possible.

The work presented in this paper is an alternative hybrid approach, DTW/HMM, for speech recognition using hidden Markov models together with the DTW algorithm. The goal of this work is to apply DTW to remedy the lack of discrimination in Markov models. The basic idea is that even though DTW has proven successful in modeling the temporal structure of the speech signal, it is not capable of assimilating the wide variety of speaker-dependent spectral pattern variations; on the other hand, training HMMs to recognize spoken words is not discriminative. Combining the strong time-alignment capabilities of DTW with the flexible learning function of the HMM is therefore expected to lead to an advanced recognition model suitable for isolated word recognition problems. The new DTW/GHMM approach is introduced, evaluated and compared with the traditional GHMM approach for an isolated word recognition system. Both approaches apply the same principles of feature extraction and time-sequence modeling; the principal difference lies in the architecture used in the training phase.

The rest of the paper is organized as follows. In the next section, we introduce the acoustic modeling used in our experiments. In Sections 3 and 4, we discuss some aspects of GHMM and DTW. Section 5 then presents existing hybrid systems. In Section 6, we discuss in more detail the hybrid approach with the iterative algorithm based on the DTW technique. The next section presents the experiments that examine the performance of GHMM and DTW/GHMM on Arabic and French isolated words. Finally, the last section gives a summary and conclusions.

2 Feature extraction

In this phase the speech signal is converted into a stream of feature vectors that contain only the information about the utterance that is important for its correct recognition. An important property of feature extraction is the suppression of information irrelevant to correct classification, such as information about the speaker (e.g. the fundamental frequency) and about the transmission channel (e.g. the characteristics of the microphone). Feature measurements of speech signals are typically extracted using one of the following spectral analysis techniques: Mel-frequency filter-bank analysis, LPC analysis or discrete Fourier transform analysis. Currently the most popular features are the Mel-frequency cepstral coefficients (MFCC) [3].

2.1 MFCC Analysis

The Mel-frequency cepstral coefficients are extracted from the speech signal as shown in the block diagram of Figure 1. The speech signal is pre-emphasized, framed and then windowed, usually with a Hamming window. Mel-spaced filter banks are then used to obtain the Mel spectrum, the natural logarithm is taken to move to the cepstral domain, and the discrete cosine transform is finally computed to obtain the MFCCs. Figure 2 shows the Mel-spaced filter banks that are used to obtain the Mel spectrum.

C_k = Σ_{i=1}^{N} log(E_i) · cos[ k (i − 1/2) π / N ]    (1)

where E_i is the output energy of the i-th of the N Mel-spaced filters.

The following acronyms are used in the block diagram:
- W: frame blocking and windowing
- FFT: fast Fourier transform
- LOG: natural logarithm
- DCT: discrete cosine transform

Figure 1: Mel-scale cepstral feature analysis.

Figure 2: Mel-spaced filter banks.
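The block diagram of Figure 1 can be summarised by the following sketch of the MFCC front end. The frame length and shift (20 ms / 10 ms), the pre-emphasis coefficient (0.9375) and the Hamming window are those given in the subsections that follow; the sampling rate, FFT size and number of Mel filters are assumed values chosen only for illustration, as is the implementation itself.

```python
import numpy as np

def mfcc(signal, fs=16000, n_filters=24, n_ceps=12, n_fft=512, alpha=0.9375):
    # 1. Pre-emphasis: H(z) = 1 - alpha * z^-1
    x = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # 2. Frame blocking (20 ms frames every 10 ms) and Hamming windowing;
    #    the signal is assumed to be at least one frame (20 ms) long.
    flen, fshift = int(0.020 * fs), int(0.010 * fs)
    n_frames = 1 + (len(x) - flen) // fshift
    win = np.hamming(flen)
    frames = np.stack([x[t * fshift:t * fshift + flen] * win for t in range(n_frames)])

    # 3. Magnitude spectrum of each frame
    spec = np.abs(np.fft.rfft(frames, n_fft))

    # 4. Triangular Mel-spaced filter bank (Figure 2)
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = np.floor((n_fft + 1) * imel(np.linspace(mel(0), mel(fs / 2), n_filters + 2)) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for j in range(n_filters):
        lo, mid, hi = edges[j], edges[j + 1], edges[j + 2]
        fbank[j, lo:mid] = np.linspace(0.0, 1.0, mid - lo, endpoint=False)
        fbank[j, mid:hi] = np.linspace(1.0, 0.0, hi - mid, endpoint=False)

    # 5. Log filter-bank energies, then the DCT of equation (1):
    #    C_k = sum_{i=1..N} log(E_i) * cos(pi * k * (i - 1/2) / N)
    log_e = np.log(np.maximum(spec @ fbank.T, 1e-10))
    i = np.arange(1, n_filters + 1)
    basis = np.cos(np.pi * np.outer(np.arange(1, n_ceps + 1), i - 0.5) / n_filters)
    return log_e @ basis.T   # one n_ceps-dimensional MFCC vector per 20 ms frame
```

Each frame thus yields one cepstral vector, and an utterance is represented by the resulting sequence of feature vectors.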
2.1.1 Pre-emphasis

In general, the digitized speech waveform has a high dynamic range. In order to reduce this range, pre-emphasis is applied. By pre-emphasis [1] we mean the application of a high-pass filter, usually a first-order FIR filter of the form H(z) = 1 − a·z^{-1}. The pre-emphasis is implemented either as a fixed-coefficient filter or as an adaptive one, in which the coefficient a is adjusted over time according to the autocorrelation values of the speech. The pre-emphasizer has a spectral flattening effect, which renders the signal less susceptible to finite-precision effects (such as overflow and underflow) in any subsequent processing. The value selected for a in our work is 0.9375.

2.1.2 Frame blocking

Since the vocal tract moves mechanically slowly, speech can be assumed to be a random process with slowly varying properties [1]. Hence, the speech is divided into overlapping frames of 20 ms taken every 10 ms. The speech signal is assumed to be stationary over each frame, a property that will prove useful in the following steps.

Figure 3: Frame blocking step.

2.1.3 Windowing

To minimize the discontinuity of the signal at the beginning and end of each frame, we window each frame [1]. The windowing tapers the signal to zero at the beginning and end of the frame. A typical window is the Hamming window of the form:

w(n) = 0.54 − 0.46·cos(2πn / (N − 1)),   0 ≤ n ≤ N − 1

3 Hidden Markov models

A hidden Markov model is defined in terms of a set of states S = {S_i} and of the observation string produced as a result of emitting a vector O_t at each successive transition from a state S_i to a state S_j. O_t is of dimension d and, in the discrete case, takes its values from a library of M symbols. The state transition probability distribution between states S_i and S_j is A = {a_ij}, and the observation probability distribution of emitting a vector O_t in state S_j is given by B = {b_j(O_t)}. The probability distribution of the initial state is π = {π_i}.
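To make these quantities concrete, the following sketch (not taken from the paper) shows one simple way to hold the parameter set (A, B, π) and to evaluate the probability of an observation sequence with the forward recursion. Single diagonal-covariance Gaussian emission densities are assumed here purely for illustration of a continuous-density model; the class name, initialisation and dimensions are likewise illustrative assumptions.

```python
import numpy as np

class GaussianHMM:
    def __init__(self, n_states: int, dim: int):
        self.A = np.full((n_states, n_states), 1.0 / n_states)  # a_ij = P(S_j | S_i)
        self.pi = np.full(n_states, 1.0 / n_states)             # initial state probabilities
        self.means = np.zeros((n_states, dim))                  # Gaussian means per state
        self.vars = np.ones((n_states, dim))                    # diagonal variances per state

    def emission(self, o: np.ndarray) -> np.ndarray:
        """b_j(o) for every state j, for one observation vector o of shape (dim,)."""
        diff = o - self.means
        log_b = -0.5 * (np.sum(diff ** 2 / self.vars, axis=1)
                        + np.sum(np.log(2 * np.pi * self.vars), axis=1))
        return np.exp(log_b)

    def likelihood(self, O: np.ndarray) -> float:
        """P(O | model) by the forward algorithm, for O of shape (T, dim).
        (In practice the recursion is carried out with scaling or in the log domain.)"""
        alpha = self.pi * self.emission(O[0])
        for t in range(1, len(O)):
            alpha = (alpha @ self.A) * self.emission(O[t])
        return float(alpha.sum())
```

In an isolated word recogniser of the kind considered here, one such model is typically trained per word, and recognition selects the word whose model yields the highest likelihood for the observed feature-vector sequence.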