MAN-MACHINE COMMUNICATION: SPEAKER-INDEPENDENT SPEECH RECOGNITION

INFORMATICA 1/88
UDK 681.3:534.44

Zdravko Kačič, Bogomir Horvat, Štefan Greif
Faculty of Technical Sciences, Maribor

Abstract. With a proper selection of feature description methods, sufficient accuracy of speaker-independent speech recognition should be achievable. The speech signal features are described with three sets of features (the set of descriptive features, the set of selected features, and the set of characteristic features). The feature description methods are described with three sets of maps (the set of descriptive feature maps, the set of selected feature maps, and the set of characteristic feature maps). As an example, two feature description methods are analysed: the zero-crossing method (variants a and b) and the method of formant frequencies energy classes (variants a and b). It has been shown that the Fourier transformation is a more suitable map of descriptive features than a measurement of the interval length between two successive zero-crossings of the signal. The mapping rule in variant b of the method of formant frequencies energy classes was a more suitable map of selected features than the mapping rule in variant a. With these more suitable maps the smallest feature overlapping, and consequently a better average recognition accuracy (greater than 92.5 %), has been achieved.

Keywords. Speech recognition, independent speaker, recognition base element, set of features, set of maps, recognition accuracy, feature description, feature overlapping.

1. Introduction

In spite of the fast development of computer technology, digital signal processing theory, phonetics, linguistics and artificial intelligence, the solution of the problem of man-machine communication on the basis of speaker-independent speech recognition remains entirely a task for the future.
To solve this problem, very good knowledge of all the above-mentioned fields is required. At present, commercial speech recognition systems successfully recognize a large vocabulary only in the case of isolated word recognition, and they are mostly speaker-dependent [10]. In systems which recognize connected or even continuous speech the vocabulary is much smaller.

Systems which recognize a speaker-independent speech signal represent a special problem. Even in systems which recognize isolated words the size of the vocabulary decreases (to about 40 words). Of course, the same recognition accuracy as in speaker-dependent systems is required. Today speaker-independent continuous speech recognition systems exist only as prototypes, and their vocabulary is not greater than 10 words [10]. The 'comple [...]

2. Description of recognition base element features

We shall try to describe the feature extraction process by means of three sets of speech signal features and three sets of maps. The three sets of features are: the set of all descriptive features, the set of all selected features and the set of all characteristic features. Each of these sets is associated with one of the following sets of maps: the set of descriptive feature maps, the set of selected feature maps and the set of characteristic feature maps.

This division of speech signal features has been adopted in order to estimate the suitability of a single feature extraction method for use in the feature extraction process. An analytic evaluation of the importance of a single feature description, and with it a definition of an 'optimum' description, might also be possible.

Let us now briefly describe the individual sets of features and the sets of maps.

A) Sets of features

a) Set of the recognition base elements articulation - A

$$A = \{A_1, A_2, \ldots, A_n, \ldots, A_N\} \qquad (1)$$

N - the number of different recognition base elements

$$A_n = \{A_{n1}, A_{n2}, \ldots, A_{nm}, \ldots\} \qquad (2)$$

$A_{nm}$ - the m-th articulation of the n-th recognition base element

$$A_{nm} = \{a_{nm1}, a_{nm2}, \ldots, a_{nml}, \ldots, a_{nmL}\} \qquad (3)$$

L - the number of windows of the m-th articulation
$a_{nml}$ - the l-th window of the m-th articulation of the n-th recognition base element

$$a_{nml} = \{a_{nml1}, a_{nml2}, \ldots, a_{nmlu}, \ldots, a_{nmlU}\} \qquad (4)$$

U - the number of elements in the l-th window

b) Set of all descriptive features - D

$$D = \{D^{1}, D^{2}, \ldots, D^{i}, \ldots\} \qquad (5)$$

$D^{i}$ - the set of descriptive features described by the i-th description

$$D^{i} = \{D^{i}_{1}, D^{i}_{2}, \ldots, D^{i}_{n}, \ldots, D^{i}_{N}\} \qquad (6)$$

$D^{i}_{n}$ - the set of descriptive features of the n-th recognition base element described by the i-th description

$$D^{i}_{n} = \{D^{i1}_{n}, D^{i2}_{n}, \ldots, D^{im}_{n}, \ldots\} \qquad (7)$$

$D^{im}_{n}$ - the set of descriptive features of the m-th articulation of the n-th recognition base element

$$D^{im}_{n} = \{D^{im}_{n1}, D^{im}_{n2}, \ldots, D^{im}_{nl}, \ldots, D^{im}_{nL}\} \qquad (8)$$

L - the number of windows of the m-th articulation of the n-th recognition base element
$D^{im}_{nl}$ - the set of descriptive features of the l-th window of the m-th articulation of the n-th recognition base element

$$D^{im}_{nl} = \{d^{im}_{nl1}, d^{im}_{nl2}, \ldots, d^{im}_{nlk}, \ldots, d^{im}_{nlK}\} \qquad (9)$$

K - the number of descriptive features

c) Set of all selected features - S

$$S = \{S^{1}, S^{2}, \ldots, S^{j}, \ldots\} \qquad (10)$$

$S^{j}$ - the set of selected features defined by the j-th description

$$S^{j} = \{S^{j}_{1}, S^{j}_{2}, \ldots, S^{j}_{n}, \ldots, S^{j}_{N}\} \qquad (11)$$

N - the number of different recognition base elements
$S^{j}_{n}$ - the set of selected features of the n-th recognition base element defined by the j-th description

$$S^{j}_{n} = \{S^{j}_{n1}, S^{j}_{n2}, \ldots, S^{j}_{np}, \ldots\} \qquad (12)$$

$S^{j}_{np}$ - the p-th selected feature of the n-th recognition base element defined by the j-th description

$$S^{j}_{np} = \{s^{j}_{np1}, s^{j}_{np2}, \ldots, s^{j}_{npr}, \ldots, s^{j}_{npR}\} \qquad (13)$$

R - the number of elements of the p-th selected feature

d) Set of all characteristic features - C

$$C = \{C^{1}, C^{2}, \ldots, C^{u}, \ldots\} \qquad (14)$$

$C^{u}$ - the set of characteristic features defined by the u-th description

$$C^{u} = \{C^{u}_{1}, C^{u}_{2}, \ldots, C^{u}_{n}, \ldots, C^{u}_{N}\} \qquad (15)$$

N - the number of different recognition base elements
$C^{u}_{n}$ - the characteristic feature of the n-th recognition base element defined by the u-th description

$$C^{u}_{n} = \{c^{u}_{n1}, c^{u}_{n2}, \ldots, c^{u}_{nv}, \ldots, c^{u}_{nV}\} \qquad (16)$$

V - the number of elements of the characteristic feature

B) Sets of maps

1) Set of descriptive feature maps - $F_D$ - the elements of this set map the set of recognition base elements articulation into the set of descriptive features

$$F_D = \{f_{D1}, f_{D2}, \ldots, f_{Di}, \ldots\} \qquad (17)$$

$$f_{Di} : A \to D^{i} \qquad (18)$$

The map $f_{Di} : A \to D^{i}$ is surjective.

2) Set of selected feature maps - $G_S$ - the elements of this set map the set of descriptive features into the set of selected features

$$G_S = \{g_{S1}, g_{S2}, \ldots, g_{Sj}, \ldots\} \qquad (19)$$

$$g_{Sj} : D^{i} \to S^{j} \qquad (20)$$

The map $g_{Sj} : D^{i} \to S^{j}$ is surjective.

3) Set of characteristic feature maps - $G_C$ - the elements of this set map the set of descriptive features into the set of characteristic features

$$G_C = \{g_{C1}, g_{C2}, \ldots, g_{Cu}, \ldots\} \qquad (21)$$

$$g_{Cu} : D^{i} \to C^{u} \qquad (22)$$

The map $g_{Cu} : D^{i} \to C^{u}$ is surjective. [...] is not an empty set. That means that the elements of the characteristic features set $C^{u}$ are disjoint sets. This is not valid for the elements of the selected features set $S^{j}$.

If $f_{Di} : A \to D^{i}$ and $g_{Cu} : D^{i} \to C^{u}$ are maps, then we may compose $f_{Di}$ and $g_{Cu}$ to obtain a map $g_{Cu} \circ f_{Di} : A \to C^{u}$. We shall define such maps, which map the set of recognition base elements articulation into the set of characteristic features.

3. An example of the analysis of feature extraction methods

Considering the maps and sets mentioned above, two feature extraction methods shall be analysed as an example. The first one is the so-called zero-crossing method (a time-domain method) and the second one is the method of formant frequencies energy classes (a frequency-domain method).

a. Zero-crossing method (ZC)

There are various variants of the zero-crossing method [5]. Almost all of them share the mapping rule of the descriptive features, i.e. measuring the time between two successive zero-crossings of a signal. The individual variants 'evaluate' these intervals in different ways. We shall briefly describe two of them.

Elements of the descriptive features set are defined as:

$$d_k = j \, T_s, \qquad k = 1, 2, \ldots, K \qquad (23)$$

where
$T_s$ is the time between two successive samples
j is the number of samples with equal sign
$d_k$ is the length of the k-th interval
K is the number of intervals

In this way, the set of descriptive features $D^{im}_{nl}$ is composed of subsets which contain the lengths of the intervals between two successive zero-crossings.

$$D^{im}_{nl} = \{d^{im}_{nl1}, \ldots, d^{im}_{nlk}, \ldots, d^{im}_{nlK}\} \qquad (24)$$

Variant a (ZCa)

Elements of the selected features set $S^{j}_{np}$ are defined as:

$$s_{np}(\tau_j, \tau_{j+1}) = \frac{d(\tau_j, \tau_{j+1})}{P} \qquad (25)$$

where $d(\tau_j, \tau_{j+1})$ is the number of intervals in the time class $(\tau_j, \tau_{j+1})$. The value of P is defined by

$$P = \sum_{j} d(\tau_j, \tau_{j+1}) \qquad (26)$$

where the sum runs over the time classes and K is the number of all intervals.

The subset $S^{j}_{np}$ of the selected features set $S^{j}_{n}$ is composed of elements which represent the portion of interval lengths in particular time classes.

$$S^{j}_{np} = \{s_{np}(\tau_1, \tau_2), \ldots, s_{np}(\tau_j, \tau_{j+1}), \ldots\} \qquad (27)$$

Variant b (ZCb)

In the second variant, elements of the selected features set $S^{j}_{np}$ are defined as follows:

$$s_{np}(\tau_j, \tau_{j+1}) = \frac{n(\tau_j, \tau_{j+1}) \cdot (\tau_j + \tau_{j+1})/2}{W \cdot (\tau_{j+1} - \tau_j)} \qquad (28)$$

where
$n(\tau_j, \tau_{j+1})$ is the number of intervals in the time class $(\tau_j, \tau_{j+1})$
W is the window width
$\tau_j, \tau_{j+1}$ are the boundary values of the j-th time class.

By means of the factors $(\tau_j + \tau_{j+1})/2$ and $(\tau_{j+1} - \tau_j)$ a better evaluation of high and low frequency components should be achieved.

$$S^{j}_{np} = \{s_{np}(\tau_1, \tau_2), \ldots, s_{np}(\tau_j, \tau_{j+1}), \ldots\} \qquad (29)$$

b. Method of formant frequencies energy classes (FFEC)

Like the zero-crossing method, this method has various variants as well. All variants use the discrete Fourier transformation as the mapping rule of the descriptive features [6,9]:

$$G(u) = \sum_{s=0}^{U-1} a(s) \, \exp(-j 2 \pi s u / U) \qquad (30)$$

where a(s) is the s-th sample of the window and U is the number of elements in the window.

The subset $D^{im}_{nl}$ of the descriptive features set $D^{im}_{n}$ is composed of frequency samples.
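As a minimal sketch of the mapping rule in Eq. (30), the following Python fragment evaluates the discrete Fourier transformation of one window directly from its definition and returns the power of each frequency sample. The function names, the window length of 32 samples and the cosine test window are assumptions made for illustration only; the paper does not give the window length or any implementation details.

```python
import cmath
import math


def dft(window):
    """Return the DFT samples G(u), u = 0..U-1, of one window of real samples."""
    U = len(window)
    return [
        sum(window[s] * cmath.exp(-2j * math.pi * s * u / U) for s in range(U))
        for u in range(U)
    ]


def power_spectrum(window):
    """Return |G(u)|^2 for the frequency samples u = 0..U/2."""
    G = dft(window)
    return [abs(g) ** 2 for g in G[: len(window) // 2 + 1]]


# Hypothetical test window: a cosine with exactly 4 periods in 32 samples.
U = 32
window = [math.cos(2 * math.pi * 4 * s / U) for s in range(U)]
spectrum = power_spectrum(window)

# Index of the strongest frequency sample.
peak = max(range(len(spectrum)), key=lambda u: spectrum[u])
print(peak, round(spectrum[peak], 3))
```

For this test window all the energy falls into the frequency sample u = 4. The direct summation costs on the order of U squared operations per window; a fast Fourier transform would normally be used instead, but the direct form stays closest to Eq. (30).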
$$D^{im}_{nl} = \{G(1), G(2), \ldots, G(u), \ldots, G(K)\} \qquad (31)$$

Variant a (FFECa)

To define the elements of the selected features set $S^{j}_{np}$ the following prescription has been used:

$$s_{np}(m) = \frac{\sum_{u = f_m / R^{*}}^{f_{m+1} / R^{*}} \log G^{2}(u)}{\sum_{u=1}^{K} \log G^{2}(u)}, \qquad m = 1, 2, \ldots, R \qquad (32)$$

where
G(u) is the u-th element in the descriptive features set $D^{im}_{nl}$
$s_{np}(m)$ is the m-th element in the selected features set $S^{j}_{np}$
$R^{*}$ is the resolution factor of the DFT
K is the number of all elements in the descriptive features set $D^{im}_{nl}$
R is the number of elements in the selected features set $S^{j}_{np}$
$f_m, f_{m+1}$ are the boundary values of the m-th formant frequencies class

The subset of the selected features set $S^{j}_{np}$ is composed of elements which represent a portion of maximum frequency components in a single formant frequencies class.

$$S^{j}_{np} = \{s_{np}(1), \ldots, s_{np}(m), \ldots, s_{np}(R)\} \qquad (33)$$

Variant b (FFECb)

This variant defines the elements of the selected features set $S^{j}_{np}$ as follows:

$$s_{np}(j) = \frac{\log G^{2}_{max}(j)}{\sum_{m=1}^{M} \log G^{2}_{max}(m)} \qquad (34)$$

where
$G_{max}(j)$ is the maximum frequency component in the j-th formant frequencies class
M is the number of maximum components of all classes and the number of elements in the selected features set $S^{j}_{np}$

Table 1a: Zero-crossing method, recognized as [%]

             variant a                          variant b
vowel     A     E     I     O     U          A     E     I     O     U
A       96.4   1.8   0.0   1.8   0.0       97.3   0.9   0.0   1.8   0.0
E        0.0  71.8  13.7   5.5   9.0        0.9  78.2  10.0   7.3   3.6
I        0.0   5.4  90.9   0.0   3.7        0.0   3.7  94.5   0.0   1.8
O        6.4  24.5   0.9  62.7   5.5        7.3  22.7   0.0  63.6   6.4
U        1.9  11.8  11.6  10.9  63.6        2.7  11.8  17.3  10.9  57.3

Table 1b: Method of formant frequencies energy classes, recognized as [%]

             variant a                          variant b
vowel     A     E     I     O     U          A     E     I     O     U
A       83.6   1.8   0.0   8.2   6.4       97.3   0.0   0.0   2.7   0.0
E        6.3  70.0  16.4   2.7   4.6        0.0  92.8   5.4   0.9   0.9
I        3.6  16.6  69.0   3.6   7.2        0.0   4.5  92.8   0.0   2.7
O       13.8   4.5   1.8  58.1  21.8        1.8   0.0   0.0  92.8   5.4
U        5.5   7.2   1.8  10.1  75.1        0.0   0.9   0.0  10.0  89.1

Table 1a-b: Experimental results of the recognition of five isolated Slovene vowels

5. Efficiency of feature extraction methods

We shall now try to estimate the efficiency of single maps, or better, their 'convenience' for use in the base elements recognition process, on the basis of the recognition results.

By using the mapping rules of the zero-crossing method (variant a), a somewhat better recognition accuracy was achieved only for the vowel /a/ (96.4 %), and to a lesser degree for the vowel /i/. For the vowels /e/, /o/ and /u/ a rather worse recognition accuracy was achieved. Variant b of the zero-crossing method showed slightly better recognition results, but the vowel recognition error rate was much the same as for variant a.

The reason for the worse recognition accuracy of the zero-crossing method should be sought in the map of descriptive features. In this method (for both variants) the measurement of interval length was used as the mapping rule for the descriptive features. However, this 'function' is 'incapable' of 'ignoring' phase changes between particular frequency components in a signal. In other words, it is a phase-dependent function.

The human ear is insensitive to phase changes in a speech signal [4], whereas this is not true for the 'simple' measurement of interval length. Two signals with equal frequency components and with different phases sound the same. However, they can be mapped into very different subsets of descriptive features if the measurement of the interval length between two successive zero-crossings of the signal is used as the mapping rule. This is of great importance for phase changes at low frequencies (the first two formants), which usually have the greatest amplitude and as such a greater influence on the zero-crossing rate.

Fig. 1a shows the first three elements of the feature vector formed by the zero-crossing method (variant a) and by the method of formant frequencies energy classes (variant b) for all articulations of the vowel /e/. They 'describe' low frequencies in the frequency spectrum. Fig. 1b represents the last three elements of the feature vector for all articulations of the vowel /e/, for both methods. They describe high frequencies in the frequency spectrum.

It can be noticed that the dispersion of the first three elements of the feature vector formed by the ZCa method (marked by *) is much greater than the dispersion of the feature vector elements formed by the FFECb method (marked by .).

Fig. 1a-b: Distribution of the first three a) and the last three b) elements of the feature vector, for vowel /e/, formed by ZCa (*) and FFECb (.) methods.

A rather smaller dispersion can be seen for the last three elements of the feature vector formed by the ZCa method. In both cases the dispersion of the feature vector elements formed by the FFECb method is very similar.

From the above, the importance of phase changes between single frequency components in the frequency spectrum must be noticed, first of all for the low frequencies present in a speech signal of an independent speaker. This is also indicated by the recognition results for the vowels /o/ and /u/, for which the first formant is dominant.

From Fig. 2a it can also be seen that the feature description of recognition base elements with the measurement of interval length as the mapping rule of descriptive features was less successful than with the Fourier transformation. This is evident from the dispersion rate of single feature vector elements, which is greater than that for the other two methods. This is particularly true for the second and the third element of the feature vector (they describe first of all the first formant).

A comparison of the recognition results for the variants FFECa and FFECb (see Table 1b) and a consideration of the dispersion rates of the vector elements for both variants (Fig. 2b-c) indicate that the common normalized energy of single formant frequencies classes, as calculated by variant a, was a worse criterion than the ratio of the normalized energy of maximum components. This might indicate that the common energy content of single formant frequencies classes for some recognition elements changes with an independent speaker. It was reflected as an increase of dispersion for almost all elements of the feature vector (Fig. 2b). This means a worse recognition accuracy (Table 1b).

Fig. 2a-c: Histograms of the feature vector elements, for vowel /e/, formed by ZCa a), FFECa b) and FFECb c) methods.

A better recognition accuracy and the smallest dispersion of feature vector elements were achieved when the mapping rule of the FFECb method was used. The mapping rule of the selected features for this variant 'enables selection' of frequency components. In each class only the maximum component was chosen. In this way only the energy of the maximum component for a particular class was described. However, because ten formant frequencies classes were defined, not all of these are maximum frequency components of formants. With this variant the best average recognition accuracy was achieved, greater than 92.5 %.

6. Conclusion

In speaker-independent speech recognition, feature maps should be defined such that the 'differences' in speech features which appear in the case of an independent speaker are expressed as little as possible. That means that functions should be defined for which feature overlapping is as small as possible. This should be valid for maps of descriptive features (e.g. measurement of interval length versus discrete Fourier transformation) as well as for maps of selected features (e.g. variant a versus variant b of the FFEC method).
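The difference between the two descriptive feature maps named above can be illustrated with a small numerical sketch. It builds two test signals with identical frequency components that differ only in the phase of the higher component, and compares the interval lengths between successive zero-crossings (with the sampling time set to one sample) against the DFT magnitudes. The test signals, their amplitudes, the window length of 256 samples and all function names are assumptions chosen for illustration; they are not taken from the experiments in this paper.

```python
import cmath
import math


def interval_lengths(signal, T_s=1.0):
    """Lengths of the intervals between successive zero-crossings (samples times T_s)."""
    crossings = [
        i for i in range(1, len(signal))
        if (signal[i - 1] >= 0) != (signal[i] >= 0)
    ]
    return [(b - a) * T_s for a, b in zip(crossings, crossings[1:])]


def dft_magnitudes(signal):
    """Magnitudes |G(u)| of the DFT for u = 0..U/2."""
    U = len(signal)
    return [
        abs(sum(signal[s] * cmath.exp(-2j * math.pi * s * u / U) for s in range(U)))
        for u in range(U // 2 + 1)
    ]


def two_tone(phase, U=256):
    """Hypothetical test signal: components at DFT bins 2 and 6.

    Only the phase of the second component is varied; the half-sample
    offset avoids samples that are exactly zero.
    """
    return [
        math.sin(2 * math.pi * 2 * (s + 0.5) / U)
        + 0.9 * math.sin(2 * math.pi * 6 * (s + 0.5) / U + phase)
        for s in range(U)
    ]


x_a = two_tone(0.0)
x_b = two_tone(math.pi / 2)

zc_a, zc_b = interval_lengths(x_a), interval_lengths(x_b)
mag_a, mag_b = dft_magnitudes(x_a), dft_magnitudes(x_b)
max_diff = max(abs(p - q) for p, q in zip(mag_a, mag_b))

print("intervals, phase 0:   ", zc_a)
print("intervals, phase pi/2:", zc_b)
print("largest magnitude difference:", max_diff)
```

With these assumed signals the first one yields two equal intervals of 64 samples, while the second one yields a longer list containing several short intervals, although the DFT magnitudes of the two signals agree to within rounding error. This is the phase dependence of the interval-length measurement discussed in Section 5.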
The mapping rules discussed in our paper showed that the discrete Fourier transformation as the mapping rule for the descriptive feature maps, and variant b of the FFEC method as the mapping rule for the selected feature maps, gave the best recognition results. With these methods the smallest feature overlapping and consequently the best average recognition accuracy has been achieved, i.e. more than 92.5 %.

References

[1] L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, Prentice-Hall, Englewood Cliffs, NJ, 1978.
[2] A.H. Seidman and I. Flores, Handbook of Computers and Computing, Van Nostrand Reinhold Company, New York, 1984.
[3] R. De Mori and C.Y. Suen, New Systems and Architectures for Automatic Speech Recognition and Synthesis, Springer-Verlag, Berlin, 1985, Chap. 1, pp. 1-72.
[4] James C. Anderson, "Improved zero-crossing method enhances digital speech", EDN Magazine, vol. 27, no. 20, October 13, 1982, pp. 171-174.
[5] R.J. Niederjohn and P.F. Castelaz, "Zero-crossing analysis methods and their use for automatic speech recognition", Proc. IEEE Computer Society Workshop on Pattern Recognition and Artificial Intelligence, 1978, pp. 274-281.
[6] F. Fallside and W.A. Woods, Computer Speech Processing, Prentice-Hall, Englewood Cliffs, NJ, 1985.
[7] J.C. Simon, Spoken Language Generation and Understanding, D. Reidel Publishing Company, 1980, pp. 129-145.
[8] R.J. Senter, Analysis of Data, Scott, Foresman and Company, Illinois, 1969.
[9] I.H. Witten, "Digital storage and analysis of speech", Wireless World, November 1981, pp. 44-48.
[10] P. Wallich, "Putting speech recognizers to work", IEEE Spectrum, April 1987, pp. 55-57.
[11] Z. Kačič, Š. Greif and B. Horvat, "Uspešnost metod opisovanja skupnih značilnosti osnovnih elementov govornega signala", Elektrotehniški vestnik, Vol. 53 (1986), No. 3, pp. 121-129.