Informatica 30 (2006) 373-384

Isolated Words Recognition System Based on Hybrid Approach DTW/GHMM

E-Hocine Bourouba, Mouldi Bedda and Rafik Djemili
Department of Electronics, Faculty of Engineering, University of Annaba, Algeria
Automatic and Signals Laboratory
E-mail: {Bourouba2004, mouldi_bedda, djemili_r}@yahoo.fr

Keywords: speech recognition, hidden Markov models, dynamic time warping, hybrid system

Received: February 11, 2005

In this paper, we present a new hybrid approach for isolated spoken word recognition using hidden Markov models (HMM) combined with dynamic time warping (DTW). HMMs have been shown to be robust in speech recognition systems. We propose to extend the HMM method by combining it with the DTW algorithm in order to combine the advantages of these two powerful pattern recognition techniques. In this work we carry out a comparative evaluation between traditional continuous hidden Markov models (GHMM) and the new DTW/GHMM approach. This approach integrates a prototype (word reference template) for each word into the training phase of the hybrid system. An iterative algorithm based on the conventional DTW algorithm and on an averaging technique is used to determine the best prototype during the training phase in order to increase model discrimination. The test phase is identical for the GHMM and DTW/GHMM methods. We evaluate the performance of each system on several different test sets and observe that the new approach gives the best results in all cases.

Povzetek: This article describes hybrid methods for word recognition.

1 Introduction

Automatic speech recognition has been an active research topic for more than four decades. With the advent of digital computing and signal processing, the problem of speech recognition was clearly posed and thoroughly studied. These developments were complemented by an increased awareness of the advantages of conversational systems. The range of possible applications is wide and includes voice-controlled appliances, fully featured speech-to-text software, automation of operator-assisted services, and voice recognition aids for the handicapped.

Different approaches to speech recognition have been adopted. They can be divided mainly into three trends, namely dynamic time warping (DTW), hidden Markov models (HMM), and artificial neural networks (ANN). The introduction of HMMs for speech has made a strong impact and has enabled great progress during the last few years. However, much remains to be accomplished in this area in order to improve their quality, in particular the reinforcement of the discrimination between different models, which seems very promising.

In the 1990s, a fourth technique, called the hybrid approach, was introduced. The combination of multiple methods produces a more precise final result because it exploits the advantages of each one. This combination seems to constitute an interesting approach to speech recognition. Most new speech recognition systems are now based on the hybrid HMM/ANN approach: the HMM has a great capacity to handle events in time, while the ANN excels at the classification of static patterns.

The main solutions suggested to compensate for the lack of discrimination in Markov models act on the model training phase. An alternative approach consists in introducing discrimination locally in the model's definition. Among existing methods, the use of ANNs as discriminating probability estimators has proven to be efficient; nevertheless, it is costly and difficult to put into practice.
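As background on how such HMM/ANN hybrids typically use the network, the sketch below converts ANN state posteriors into scaled likelihoods by dividing by the state priors, so that they can stand in for the HMM emission probabilities. This is a generic illustration and not the method developed in this paper; the function name and the numerical values are purely illustrative.

```python
import numpy as np

# In HMM/ANN hybrids the network estimates state posteriors P(q_j | o_t);
# by Bayes' rule, dividing by the state priors P(q_j) yields values
# proportional to the likelihoods p(o_t | q_j) used as HMM emission scores.
def scaled_likelihoods(posteriors: np.ndarray, priors: np.ndarray) -> np.ndarray:
    # posteriors: (T, N) network outputs per frame; priors: (N,) state priors
    return posteriors / priors

# Toy example with made-up numbers:
post = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.6, 0.3]])
prior = np.array([0.5, 0.3, 0.2])
print(scaled_likelihoods(post, prior))
```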
Techniques that reinforce the discrimination between models by re-estimating the model parameters according to an ANN discriminating criterion are complex and do not guarantee convergence of the learning procedure. The approach we propose relies on the principle that global discrimination between Markov models can be obtained from a discrimination of the models' training sequences, achieved by transforming the representation space using time alignment. Thus, we have developed an iterative algorithm that extracts, from the training set, the prototype that best favours discrimination among the data classes. The underlying criterion can be summarized as follows: after aligning the sequences of each class on its prototype, each class should be as compact as possible, and the classes should be as dispersed from one another as possible.

The work presented in this paper is an alternative hybrid approach, DTW/HMM, for speech recognition using hidden Markov models together with the DTW algorithm. The goal of this work is to apply DTW to remedy the lack of discrimination in Markov models. The basic idea is that even though DTW has proven successful in modeling the temporal structure of the speech signal, it is not capable of assimilating the wide variety of speaker-dependent spectral pattern variations; on the other hand, training HMMs to recognize spoken words is not discriminative. Combining the strong time-alignment capabilities of DTW with the flexible learning function of the HMM is therefore expected to lead to an advanced recognition model suitable for isolated word recognition problems. The new DTW/GHMM approach is introduced, evaluated and compared with the traditional GHMM approach for an isolated word recognition system. Both approaches apply the same principles of feature extraction and time-sequence modeling; the principal difference lies in the architecture used in the training phase.

The rest of the paper is organized as follows. In the next section, we introduce the acoustic modeling used in our experiments. In Sections 3 and 4, we discuss some aspects of GHMM and DTW. Section 5 then presents existing hybrid systems. In Section 6, we discuss in more detail the hybrid approach with the iterative algorithm based on the DTW technique. The next section presents the experiments that examine the performance of GHMM and DTW/GHMM on Arabic and French isolated words. Finally, the last section gives a summary and conclusions.

2 Feature extraction

In this phase the speech signal is converted into a stream of feature vectors that contain only the information about the utterance that is important for its correct recognition. An important property of feature extraction is the suppression of information irrelevant to correct classification, such as information about the speaker (e.g. the fundamental frequency) and about the transmission channel (e.g. the characteristics of the microphone). Feature measurements of speech signals are typically extracted using one of the following spectral analysis techniques: Mel-frequency filter-bank analysis, LPC analysis or discrete Fourier transform analysis. Currently the most popular features are the Mel-frequency cepstral coefficients (MFCC) [3].

2.1 MFCC Analysis

The Mel-frequency cepstral coefficients are extracted from the speech signal as shown in the block diagram of Figure 1. The speech signal is pre-emphasized, framed and then windowed, usually with a Hamming window. Mel-spaced filter banks are then used to obtain the Mel spectrum, the natural logarithm is taken to move to the cepstral domain, and the discrete cosine transform is finally computed to obtain the MFCCs. Figure 2 shows the Mel-spaced filter banks that are used to obtain the Mel spectrum.

C_k = Σ_{i=1}^{N} log(E_i) · cos[ k (i − 1/2) π / N ]    (1)

where E_i is the output energy of the i-th of the N Mel-spaced filters.

The following acronyms are used in the block diagram:
- W: frame blocking and windowing
- FFT: fast Fourier transform
- LOG: natural logarithm
- DCT: discrete cosine transform

Figure 1: Mel-scale cepstral feature analysis.

Figure 2: Mel-spaced filter banks.
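The block diagram of Figure 1 can be summarised by the following sketch of the MFCC front end. The frame length and shift (20 ms / 10 ms), the pre-emphasis coefficient (0.9375) and the Hamming window are those given in the subsections that follow; the sampling rate, FFT size and number of Mel filters are assumed values chosen only for illustration, as is the implementation itself.

```python
import numpy as np

def mfcc(signal, fs=16000, n_filters=24, n_ceps=12, n_fft=512, alpha=0.9375):
    # 1. Pre-emphasis: H(z) = 1 - alpha * z^-1
    x = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # 2. Frame blocking (20 ms frames every 10 ms) and Hamming windowing;
    #    the signal is assumed to be at least one frame (20 ms) long.
    flen, fshift = int(0.020 * fs), int(0.010 * fs)
    n_frames = 1 + (len(x) - flen) // fshift
    win = np.hamming(flen)
    frames = np.stack([x[t * fshift:t * fshift + flen] * win for t in range(n_frames)])

    # 3. Magnitude spectrum of each frame
    spec = np.abs(np.fft.rfft(frames, n_fft))

    # 4. Triangular Mel-spaced filter bank (Figure 2)
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = np.floor((n_fft + 1) * imel(np.linspace(mel(0), mel(fs / 2), n_filters + 2)) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for j in range(n_filters):
        lo, mid, hi = edges[j], edges[j + 1], edges[j + 2]
        fbank[j, lo:mid] = np.linspace(0.0, 1.0, mid - lo, endpoint=False)
        fbank[j, mid:hi] = np.linspace(1.0, 0.0, hi - mid, endpoint=False)

    # 5. Log filter-bank energies, then the DCT of equation (1):
    #    C_k = sum_{i=1..N} log(E_i) * cos(pi * k * (i - 1/2) / N)
    log_e = np.log(np.maximum(spec @ fbank.T, 1e-10))
    i = np.arange(1, n_filters + 1)
    basis = np.cos(np.pi * np.outer(np.arange(1, n_ceps + 1), i - 0.5) / n_filters)
    return log_e @ basis.T   # one n_ceps-dimensional MFCC vector per 20 ms frame
```

Each frame thus yields one cepstral vector, and an utterance is represented by the resulting sequence of feature vectors.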
2.1.1 Pre-emphasis

In general, the digitized speech waveform has a high dynamic range. In order to reduce this range, pre-emphasis is applied. By pre-emphasis [1] we mean the application of a high-pass filter, usually a first-order FIR filter of the form H(z) = 1 − a·z^{-1}. The pre-emphasis is implemented either as a fixed-coefficient filter or as an adaptive one, in which the coefficient a is adjusted over time according to the autocorrelation values of the speech. The pre-emphasizer has a spectral flattening effect, which renders the signal less susceptible to finite-precision effects (such as overflow and underflow) in any subsequent processing. The value selected for a in our work is 0.9375.

2.1.2 Frame blocking

Since the vocal tract moves mechanically slowly, speech can be assumed to be a random process with slowly varying properties [1]. Hence, the speech is divided into overlapping frames of 20 ms taken every 10 ms. The speech signal is assumed to be stationary over each frame, a property that will prove useful in the following steps.

Figure 3: Frame blocking step.

2.1.3 Windowing

To minimize the discontinuity of the signal at the beginning and end of each frame, we window each frame [1]. The windowing tapers the signal to zero at the beginning and end of the frame. A typical window is the Hamming window of the form:

w(n) = 0.54 − 0.46·cos(2πn / (N − 1)),   0 ≤ n ≤ N − 1

3 Hidden Markov models

A hidden Markov model is defined in terms of a set of states S = {S_i} and of the observation string produced as a result of emitting a vector O_t at each successive transition from a state S_i to a state S_j. O_t is of dimension d and, in the discrete case, takes its values from a library of M symbols. The state transition probability distribution between states S_i and S_j is A = {a_ij}, and the observation probability distribution of emitting a vector O_t in state S_j is given by B = {b_j(O_t)}. The probability distribution of the initial state is π = {π_i}.
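To make these quantities concrete, the following sketch (not taken from the paper) shows one simple way to hold the parameter set (A, B, π) and to evaluate the probability of an observation sequence with the forward recursion. Single diagonal-covariance Gaussian emission densities are assumed here purely for illustration of a continuous-density model; the class name, initialisation and dimensions are likewise illustrative assumptions.

```python
import numpy as np

class GaussianHMM:
    def __init__(self, n_states: int, dim: int):
        self.A = np.full((n_states, n_states), 1.0 / n_states)  # a_ij = P(S_j | S_i)
        self.pi = np.full(n_states, 1.0 / n_states)             # initial state probabilities
        self.means = np.zeros((n_states, dim))                  # Gaussian means per state
        self.vars = np.ones((n_states, dim))                    # diagonal variances per state

    def emission(self, o: np.ndarray) -> np.ndarray:
        """b_j(o) for every state j, for one observation vector o of shape (dim,)."""
        diff = o - self.means
        log_b = -0.5 * (np.sum(diff ** 2 / self.vars, axis=1)
                        + np.sum(np.log(2 * np.pi * self.vars), axis=1))
        return np.exp(log_b)

    def likelihood(self, O: np.ndarray) -> float:
        """P(O | model) by the forward algorithm, for O of shape (T, dim).
        (In practice the recursion is carried out with scaling or in the log domain.)"""
        alpha = self.pi * self.emission(O[0])
        for t in range(1, len(O)):
            alpha = (alpha @ self.A) * self.emission(O[t])
        return float(alpha.sum())
```

In an isolated word recogniser of the kind considered here, one such model is typically trained per word, and recognition selects the word whose model yields the highest likelihood for the observed feature-vector sequence.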