MAN - MACHINE COMMUNICATION: 
SPEAKER - INDEPENDENT SPEECH 
RECOGNITION 
INFORMATICA 1/88
UDK 681.3:534.44
Zdravko Kačič 
Bogomir Horvat 
Štefan Greif 
Faculty of Technical Sciences, Maribor
Abstract. With a proper selection of feature description methods a
sufficient accuracy of speaker-independent speech recognition should
be achieved. The speech signal features are described with three sets
of features (the set of descriptive features, the set of selected
features, and the set of characteristic features). The feature
description methods are described with three sets of maps (the set of
descriptive feature maps, the set of selected feature maps, and the
set of characteristic feature maps). As an example two feature
description methods are decomposed: the zero-crossing method
(variants a and b) and the method of formant frequencies energy
classes (variants a and b). It has been shown that the Fourier
transformation as a map of descriptive features was more convenient
than a measurement of the interval length between two successive
zero-crossings of the signal. The mapping rule in variant b of the
method of formant frequencies energy classes was a more convenient
map of selected features than the mapping rule in variant a. With
these more convenient maps the smallest feature overlapping and
consequently a better average recognition accuracy (greater than
92.5 %) has been achieved.
Keywords. Speech recognition, independent speaker, recognition base
element, set of features, set of maps, recognition accuracy, feature
description, feature overlapping.
1. Introduction 
In spite of the fast development of computer technology, digital
signal processing theory, phonetics, linguistics and artificial
intelligence, the solution of the problem of man-machine
communication on the basis of speaker-independent speech recognition
remains entirely a task for the future.
To solve this problem a very good knowledge of all the above
mentioned fields will be required.
Nowadays commercial speech recognition systems successfully
recognize a large vocabulary of words only in the case of isolated
word recognition, and they are mostly speaker dependent [10]. In
systems which recognize connected speech or even continuous speech
the vocabulary is much smaller.
A special problem is represented by systems which recognize the
speaker-independent speech signal.
In systems which recognize isolated words the size of the vocabulary
already decreases (to about 40 words). Of course, the same
recognition accuracy as in speaker-dependent systems is required.
Today speaker-independent continuous speech recognition systems
exist as prototypes only and their vocabulary is not greater than 10
words [10].
The 'complexity', and first of all the great 'heterogeneity', of the
speaker-independent speech signal represents one of the major
obstacles to solving this problem more successfully.
The speech signal can be recognized on the basis of the so-called
recognition base elements (words, syllables, phonemes, etc.).
This paper describes some problems which appear in the process of
speaker-independent speech recognition on the basis of phoneme
recognition, and indicates ways to solve them.
'Features overlapping' of different recognition base elements (i.e.
when features are described by feature extraction methods and
presented in n-dimensional space) and a great dispersion of features
of the same base element (i.e. when spoken by an independent speaker)
represent a great problem in the speaker-independent speech
recognition process.
Features overlapping mostly means a recognition error when the
classification is made.
Speech signal characteristics can be described by various feature
extraction methods [1,3,6,7].
Differences in speech signal features of the same recognition base
element (for an independent speaker) are due to the speaker's age,
sex, psychophysical condition, etc. [1,2,6].
Different feature extraction methods describe speech features in
different ways. Consequently, the rate of features overlapping is
different and depends upon the method which has been used.
To achieve a high recognition accuracy of recognition base elements,
the features overlapping of different base elements should be as
small as possible.
So the proper selection (definition) of a feature extraction method
for particular groups of recognition base elements is an important
condition for a good recognition accuracy.
In the next sections the feature extraction process of recognition
base elements, with the definition of some mapping rules and feature
sets, is described, and the basic ideas are presented by decomposing
two feature extraction methods: the zero-crossing method (variants a
and b) and the method of formant frequencies energy classes
(variants a and b).
2. Description of recognition base element features
We shall try to describe the feature extraction process by means of
three sets of speech signal features and three sets of maps.
The three sets of features are: the set of all descriptive features,
the set of all selected features and the set of all characteristic
features. Each of these sets is obtained by means of one of the
following sets of maps: the set of descriptive features maps, the set
of selected features maps and the set of characteristic features
maps.
Such a distribution of speech signal features has been assumed in
order to estimate the convenience of a single feature extraction
method which shall be used in the feature extraction process.
Analytic evaluation of the importance of a single feature
description, and with it a definition of an 'optimum' description,
might also be possible.
Let us now briefly describe the single sets of features and the sets
of maps.
A) Sets of features
a) Set of the recognition base elements articulation - A
$A = \{A_1, A_2, \ldots, A_n, \ldots, A_N\}$, (1)
N - the number of different recognition base elements
$A_n = \{A_{n1}, A_{n2}, \ldots, A_{nm}, \ldots\}$, (2)
$A_{nm}$ - the m-th articulation of the n-th recognition base
element
$A_{nm} = \{a_{nm1}, a_{nm2}, \ldots, a_{nml}, \ldots, a_{nmL}\}$, (3)
L - the number of windows of the m-th articulation
$a_{nml}$ - the l-th window of the m-th articulation of the n-th
recognition base element
$a_{nml} = \{a_{nml1}, a_{nml2}, \ldots, a_{nmlu}, \ldots, a_{nmlU}\}$, (4)
U - the number of elements in the l-th window
b) Set of all descriptive features - D
$D = \{D^1, D^2, \ldots, D^i, \ldots\}$, (5)
$D^i$ - the set of descriptive features described by the i-th
description
$D^i = \{D^i_1, D^i_2, \ldots, D^i_n, \ldots, D^i_N\}$, (6)
$D^i_n$ - the set of descriptive features of the n-th recognition
base element described by the i-th description
$D^i_n = \{D^i_{n1}, D^i_{n2}, \ldots, D^i_{nm}, \ldots\}$, (7)
$D^i_{nm}$ - the set of descriptive features of the m-th
articulation of the n-th recognition base element
$D^i_{nm} = \{D^i_{nm1}, D^i_{nm2}, \ldots, D^i_{nml}, \ldots, D^i_{nmL}\}$, (8)
L - the number of windows of the m-th articulation of the n-th
recognition base element
$D^i_{nml}$ - the set of descriptive features of the l-th window of
the m-th articulation of the n-th recognition base element
$D^i_{nml} = \{d^i_{nml1}, d^i_{nml2}, \ldots, d^i_{nmlk}, \ldots, d^i_{nmlK}\}$, (9)
K - the number of descriptive features
c) Set of all selected features - S
$S = \{S^1, S^2, \ldots, S^j, \ldots\}$, (10)
$S^j$ - the set of selected features defined by the j-th
description
$S^j = \{S^j_1, S^j_2, \ldots, S^j_n, \ldots, S^j_N\}$, (11)
N - the number of different recognition base elements
$S^j_n$ - the set of selected features of the n-th recognition base
element defined by the j-th description
$S^j_n = \{S^j_{n1}, S^j_{n2}, \ldots, S^j_{np}, \ldots\}$, (12)
$S^j_{np}$ - the p-th selected feature of the n-th recognition base
element defined by the j-th description
$S^j_{np} = \{s^j_{np1}, s^j_{np2}, \ldots, s^j_{npr}, \ldots, s^j_{npR}\}$, (13)
R - the number of elements of the p-th selected feature
d) Set of all characteristic features - C
$C = \{C^1, C^2, \ldots, C^u, \ldots\}$, (14)
$C^u$ - the set of characteristic features defined by the u-th
description
$C^u = \{C^u_1, C^u_2, \ldots, C^u_n, \ldots, C^u_N\}$, (15)
N - the number of different recognition base elements
$C^u_n$ - the characteristic feature of the n-th recognition base
element defined by the u-th description
$C^u_n = \{c^u_{n1}, c^u_{n2}, \ldots, c^u_{nv}, \ldots, c^u_{nV}\}$, (16)
V - the number of elements of the characteristic feature
B) Sets of maps
1) Set of descriptive feature maps - $F_D$
- elements of the set map the set of recognition base elements
articulation into the set of descriptive features
$F_D = \{f_{D1}, f_{D2}, \ldots, f_{Di}, \ldots\}$, (17)
$f_{Di}: A \to D^i$ (18)
The map $f_{Di}: A \to D^i$ is surjective.
2) Set of selected feature maps - $G_S$
- elements of the set map the set of descriptive features into the
set of selected features
$G_S = \{g_{S1}, g_{S2}, \ldots, g_{Sj}, \ldots\}$, (19)
$g_{Sj}: D^i \to S^j$ (20)
The map $g_{Sj}: D^i \to S^j$ is surjective.
3) Set of characteristic feature maps - $G_C$
- elements of the set map the set of descriptive features into the
set of characteristic features
$G_C = \{g_{C1}, g_{C2}, \ldots, g_{Cu}, \ldots\}$, (21)
$g_{Cu}: D^i \to C^u$ (22)
The map $g_{Cu}: D^i \to C^u$ is surjective.
The map $g_{Cu}$ maps the set of descriptive features $D^i$ into the
set of characteristic features $C^u$ so that all pairwise
intersections of the elements of the set $C^u$ are empty. The
pairwise intersections of the elements of the selected features set
$S^j$ are not empty.
That means that the elements of the characteristic features set
$C^u$ are disjoint sets. This is not valid for the elements of the
selected features set $S^j$.
If $f_{Di}: A \to D^i$ and $g_{Cu}: D^i \to C^u$ are maps, then we
may compose $f_{Di}$ and $g_{Cu}$ to obtain a map
$g_{Cu} \circ f_{Di}: A \to C^u$.
We shall define such maps, which map the set of recognition base
elements articulation into the set of characteristic features.
3. An example of feature extraction method decomposition
Considering the maps and sets mentioned above, two feature
extraction methods shall be decomposed as an example. The first one
is the so-called zero-crossing method (a method from the time
domain) and the second one is the method of formant frequencies
energy classes (frequency domain).
a. Zero-crossing method (ZC)
There are various variants of the zero-crossing method [5].
Almost all have in common the mapping rule of descriptive features,
i.e. measuring the time between two successive zero-crossings of a
signal.
Each variant 'evaluates' these intervals in a different way.
We shall briefly describe two of them.
Elements of the descriptive features set are defined as:
$d_k = j \, T_s, \quad k = 1, 2, \ldots, K$, (23)
where
$T_s$ is the time between two successive samples
j is the number of samples with equal sign
$d_k$ is the length of the k-th interval
K is the number of intervals
In this way, the set of descriptive features $D^i_{nm}$ is composed
of subsets which contain the lengths of intervals between two
successive zero-crossings:
$D^i_{nml} = \{d^i_{nml1}, d^i_{nml2}, \ldots, d^i_{nmlK}\}$, (24)
Variant a (ZCa)
Elements of the selected features set $S^a_{np}$ are defined as:
$s^a_{np}(\tau_j, \tau_{j+1}) = \dfrac{d(\tau_j, \tau_{j+1})}{P}$, (25)
where
$d(\tau_j, \tau_{j+1})$ is the number of intervals in the time class
$(\tau_j, \tau_{j+1})$.
The value of P is defined by
$P = \sum_{j} d(\tau_j, \tau_{j+1})$, (26)
where the sum runs over all time classes and K is the number of all
intervals.
The subset $S^a_{np}$ of the selected features set $S^a_n$ is
composed of elements which represent the portion of interval lengths
in particular time classes:
$S^a_{np} = \{s^a_{np1}, s^a_{np2}, \ldots, s^a_{npR}\}$, (27)
Variant b (ZCb)
Secondly, elements of the selected features set $S^b_{np}$ are
defined as follows:
$s^b_{np}(\tau_j, \tau_{j+1}) = \dfrac{n(\tau_j, \tau_{j+1}) \cdot (\tau_j + \tau_{j+1})/2}{W \cdot (\tau_{j+1} - \tau_j)}$, (28)
where
$n(\tau_j, \tau_{j+1})$ is the number of intervals in the time class
$(\tau_j, \tau_{j+1})$
W is the window width
$\tau_j, \tau_{j+1}$ are the boundary values of the j-th time class.
By means of the factors $(\tau_j + \tau_{j+1})/2$ and
$(\tau_{j+1} - \tau_j)$ a better evaluation of high and low
frequency components should be achieved.
$S^b_{np} = \{s^b_{np1}, s^b_{np2}, \ldots, s^b_{npR}\}$, (29)
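As a minimal sketch (not the authors' implementation), the zero-crossing description of equations (23), (25) and (28) can be written in Python with NumPy. The function names, the time-class boundaries in `edges`, the test tone, and the decision to discard the two runs cut off by the window edges are assumptions made for illustration. The normalisation P is taken here as the total count over all time classes.

```python
import numpy as np

def zero_crossing_intervals(window, fs):
    """Interval lengths d_k = j * T_s between successive zero-crossings."""
    negative = np.signbit(np.asarray(window, dtype=float))
    # Indices at which the sign differs from the previous sample.
    starts = np.flatnonzero(negative[1:] != negative[:-1]) + 1
    bounds = np.concatenate(([0], starts, [negative.size]))
    run_lengths = np.diff(bounds)      # j: number of samples with equal sign
    # Assumption: the first and last runs are truncated by the window, so drop them.
    return run_lengths[1:-1] / fs      # T_s = 1 / fs

def zca_features(d, edges):
    """Variant a: share of intervals falling into each time class."""
    counts, _ = np.histogram(d, bins=edges)    # d(tau_j, tau_j+1)
    total = counts.sum()                       # P (assumed: total count)
    return counts / total if total else np.zeros(counts.size)

def zcb_features(d, edges, window_width):
    """Variant b: counts weighted by class centre, divided by W and class width."""
    counts, _ = np.histogram(d, bins=edges)    # n(tau_j, tau_j+1)
    centres = (edges[:-1] + edges[1:]) / 2
    widths = np.diff(edges)
    return counts * centres / (window_width * widths)

fs = 10_000                                    # 10 kHz sampling rate (Section 4)
width = 0.020                                  # window width W = 20 ms (Section 4)
t = np.arange(int(round(fs * width))) / fs
tone = np.sin(2 * np.pi * 500 * t + 0.3)       # hypothetical 500 Hz test tone
edges = np.array([0.0, 0.5e-3, 1.5e-3, 3.0e-3])  # hypothetical time classes in seconds

d = zero_crossing_intervals(tone, fs)
zca = zca_features(d, edges)
zcb = zcb_features(d, edges, width)
```

For the 500 Hz tone every interval is half a period (1 ms), so all intervals fall into the middle time class.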
b. Method of formant frequencies energy classes (FFEC)
Like the zero-crossing method, this method has various variants as
well.
All variants use the discrete Fourier transformation as the mapping
rule of the descriptive features [6,9]:
$G(s) = \sum_{u=0}^{U-1} g(u) \exp(-j 2 \pi s u / U)$, (30)
The subset $D^i_{nml}$ of the descriptive features set $D^i_{nm}$ is
composed of frequency samples:
$D^i_{nml} = \{G_{nml1}, G_{nml2}, \ldots, G_{nmlK}\}$, (31)
Variant a (FFECa)
To define elements of the selected features set $S^a_{np}$ the
following prescription has been used:
$s^a_{np}(r) = \dfrac{\sum_{u=f_r/R^*}^{f_{r+1}/R^*} \log G^2(u)}{\sum_{u=1}^{K} \log G^2(u)}, \quad r = 1, 2, \ldots, R$, (32)
where
$G(u)$ is the u-th element in the descriptive features set
$D^i_{nml}$
$s^a_{np}(r)$ is the r-th element in the selected features set
$S^a_{np}$
$R^*$ is the resolution factor of the DFT
K is the number of all elements in the descriptive features set
$D^i_{nml}$
R is the number of elements in the selected features set $S^a_{np}$
$f_r, f_{r+1}$ are the boundary values of the r-th formant
frequencies class
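A minimal sketch of equation (32) in Python with NumPy follows. It is an illustration under stated assumptions, not the authors' code: the resolution factor R* is taken to be the DFT bin spacing in Hz, the classes are treated as half-open so that no bin is counted twice, a small offset guards against log(0), and the ten uniformly spaced class boundaries and the random test window are hypothetical.

```python
import numpy as np

def ffeca_features(window, fs, class_edges_hz):
    """Share of the summed log energy that falls into each frequency class."""
    x = np.asarray(window, dtype=float)
    power = np.abs(np.fft.rfft(x)) ** 2        # G^2(u) from the DFT of eq. (30)
    log_power = np.log(power + 1e-12)          # small offset avoids log(0)
    hz_per_bin = fs / x.size                   # assumed meaning of R*
    bins = np.rint(np.asarray(class_edges_hz) / hz_per_bin).astype(int)
    total = log_power[1:].sum()                # denominator: u = 1 .. K
    # Assumption: classes are half-open [f_r, f_r+1) so that they do not overlap.
    sums = [log_power[lo:hi].sum() for lo, hi in zip(bins[:-1], bins[1:])]
    return np.array(sums) / total

fs = 10_000                                    # 10 kHz sampling rate (Section 4)
rng = np.random.default_rng(0)
window = rng.standard_normal(200)              # hypothetical 20 ms window
edges = np.linspace(50.0, 5050.0, 11)          # ten hypothetical classes, 500 Hz wide
features = ffeca_features(window, fs, edges)
```

Because the ten hypothetical classes cover bins 1 to K without gaps, the features sum to one. Note that the logarithm of a component with energy below one is negative, so in practice the scaling of the signal influences these ratios.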
The selected features set $S^a_{np}$ is composed of elements which
represent parts of the common energy in particular formant
frequencies classes:
$S^a_{np} = \{s^a_{np1}, s^a_{np2}, \ldots, s^a_{npR}\}$, (33)
Variant b (FFECb)
This variant defines elements of the selected features set
$S^b_{np}$ as follows:
$s^b_{np}(j) = \dfrac{\log G^2_{max}(j)}{\sum_{m=1}^{M} \log G^2_{max}(m)}$, (34)
where
$G_{max}(j)$ is the maximum frequency component in the j-th formant
frequencies class
M is the number of maximum components of all classes and the number
of elements in the selected features set $S^b_{np}$
The subset $S^b_{np}$ of the selected features set is composed of
elements which represent a portion of the maximum frequency
components in a single formant frequencies class:
$S^b_{np} = \{s^b_{np1}, s^b_{np2}, \ldots, s^b_{npM}\}$, (35)
From this the question arises how the efficiency, or better the
'convenience', of a single map should be estimated in order to be
used in the base element recognition process.
For this purpose, the recognition results obtained by the decomposed
feature extraction methods mentioned above are presented in the next
sections.
4. Experimental results of isolated Slovene vowels recognition
The recognition experiment was carried out on the five isolated
Slovene vowels (/a/, /e/, /i/, /o/ and /u/).
One hundred and ten articulations of each vowel, pronounced by 110
different speakers, were collected. All articulations were recorded
in a studio environment.
Speakers were of different age categories. The female to male ratio
was 3/7.
The speech signal was passed through a band-pass filter (600 Hz -
3.4 kHz) and sampled at 10 kHz with a 12 bit A/D converter.
The time window width (W) was limited to 20 ms.
Because of such a great number of different speakers we may presume
that the recognition results (see Table 1) are the recognition
results of an independent speaker.
Elements of the selected features set (for a single method) were
combined into the feature vector and the number of vector elements
was limited to ten:
$Z_{nm} = [z(1), z(2), \ldots, z(10)]$, (36)
Classification was made on the basis of multivariate normal
distributions with equal covariances [7].
Recognition results for single methods are given in Table 1a-b.
a) Zero-crossing method, recognized as [%]

            variant a (ZCa)                  variant b (ZCb)
spoken    A     E     I     O     U       A     E     I     O     U
  A     96.4   1.8   0.0   1.8   0.0    97.3   0.9   0.0   1.8   0.0
  E      0.0  71.8  13.7   5.5   9.0     0.9  78.2  10.0   7.3   3.6
  I      0.0   5.4  90.9   0.0   3.7     0.0   3.7  94.5   0.0   1.8
  O      6.4  24.5   0.9  62.7   5.5     7.3  22.7   0.0  63.6   6.4
  U      1.9  11.8  11.6  10.9  63.6     2.7  11.8  17.3  10.9  57.3

b) Method of formant frequencies energy classes, recognized as [%]

            variant a (FFECa)                variant b (FFECb)
spoken    A     E     I     O     U       A     E     I     O     U
  A     83.6   1.8   0.0   8.2   6.4    97.3   0.0   0.0   2.7   0.0
  E      6.3  70.0  16.1   2.7   1.6     0.0  92.8   5.4   0.9   0.9
  I      3.6  16.6  69.0   3.6   7.2     0.0   4.5  92.8   0.0   2.7
  O     13.8   4.5   1.8  58.1  21.8     1.8   0.0   0.0  92.8   5.4
  U      5.5   7.2   1.8  10.1  75.1     0.0   0.9   0.0  10.0  89.1

Table 1a-b: Experimental results of five isolated Slovene vowels
recognition
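The classification rule named in Section 4, multivariate normal distributions with equal covariances, reduces for equal prior probabilities to assigning a feature vector to the class whose mean is nearest in Mahalanobis distance. The following is a minimal sketch under those assumptions (equal priors, pooled covariance estimate). The function names and the synthetic two-class data are illustrative only and are not the vowel data of this paper.

```python
import numpy as np

def fit_equal_covariance_gaussians(X, y):
    """Estimate class means and the inverse of one pooled covariance matrix."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    centred = X - means[np.searchsorted(classes, y)]   # subtract own class mean
    pooled = centred.T @ centred / (len(X) - len(classes))
    return classes, means, np.linalg.inv(pooled)

def classify(z, classes, means, inv_cov):
    """Return the class with the smallest squared Mahalanobis distance to z."""
    diff = means - z
    dist2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    return classes[np.argmin(dist2)]

# Synthetic, well separated two-class example (hypothetical data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(5.0, 1.0, (50, 2))])
y = np.repeat([0, 1], 50)
model = fit_equal_covariance_gaussians(X, y)
label_near_origin = classify(np.array([0.2, -0.1]), *model)
label_near_five = classify(np.array([4.8, 5.3]), *model)
```

With ten-element feature vectors and five vowel classes the same two functions apply unchanged; only the shapes of `X` and `y` differ.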
5. Efficiency of feature extraction methods
We shall now try to estimate the efficiency of single maps, or
better, their 'convenience' for use in the base elements recognition
process, on the basis of the recognition results.
By using the map rules of the zero-crossing method (variant a) a
somewhat better recognition accuracy was achieved only for the vowel
/a/ (96.4 %), and a lower one for the vowel /i/. For the vowels /e/,
/o/ and /u/ a rather worse recognition accuracy was achieved.
Variant b of the zero-crossing method showed slightly better
recognition results, but the rate of vowel recognition errors was
much the same as with variant a.
The reason for the worse recognition accuracy of the zero-crossing
method should be sought in the map of descriptive features that was
used.
In this method (for both variants) the measurement of interval
lengths was used as the mapping rule for the descriptive features.
However, this 'function' is 'incapable' of 'ignoring' phase changes
between particular frequency components in a signal.
In other words, it is a phase dependent function.
The human ear is insensitive to phase changes in a speech signal
[4], whereas this is not true for the 'simple' measurement of
interval lengths.
Two signals with equal frequency components and with different
phases sound the same. However, they can be mapped into very
different subsets of descriptive features if the measurement of the
interval length between two successive zero-crossings of the signal
is used as the mapping rule.
This is of great importance for phase changes at low frequencies
(the first two formants), which usually have the greatest amplitude
and as such a greater influence on the zero-crossing rate.
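The phase dependence can be illustrated numerically. In the sketch below two hypothetical test signals contain the same two components (500 Hz and 1500 Hz, chosen only for illustration) with identical amplitudes, and only the phase of the upper component differs. The signal parameters and the helper name are assumptions; the 10 kHz sampling rate and the 20 ms window are those of Section 4.

```python
import numpy as np

def zero_crossing_intervals(window, fs):
    """Lengths (in seconds) of the runs of equal-sign samples inside the window."""
    negative = np.signbit(window)
    starts = np.flatnonzero(negative[1:] != negative[:-1]) + 1
    bounds = np.concatenate(([0], starts, [negative.size]))
    # Assumption: the first and last runs are cut by the window edges and are dropped.
    return np.diff(bounds)[1:-1] / fs

fs = 10_000
t = np.arange(200) / fs                    # one 20 ms window
w = 2 * np.pi * 500                        # fundamental at 500 Hz
x_a = np.cos(w * t + 0.1) + 0.9 * np.cos(3 * w * t + 0.3)
x_b = np.cos(w * t + 0.1) + 0.9 * np.cos(3 * w * t + 0.3 + np.pi / 2)

# Magnitude spectra are compared bin by bin; phases are ignored.
same_spectrum = np.allclose(np.abs(np.fft.rfft(x_a)), np.abs(np.fft.rfft(x_b)))
lengths_a = np.unique(zero_crossing_intervals(x_a, fs))
lengths_b = np.unique(zero_crossing_intervals(x_b, fs))
```

Here `same_spectrum` is true, while `lengths_a` and `lengths_b` contain different interval lengths: the DFT magnitude describes both signals identically, the zero-crossing description does not.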
Fig. 1a shows the first three elements of the feature vector formed
by the zero-crossing method (variant a) and the method of formant
frequencies energy classes (variant b) for all articulations of the
vowel /e/. They 'describe' low frequencies in the frequency
spectrum. Fig. 1b represents the last three elements of the feature
vector for all articulations of the vowel /e/, for both methods.
They describe high frequencies in the frequency spectrum.
It can be noticed that the dispersion of the first three elements of
the feature vector formed by the ZCa method (marked by *) is much
greater than the dispersion of the feature vector elements formed by
the FFECb method (marked by .).
Fig. 1a-b: Distribution of the first three a) and the last three b)
elements of the feature vector, for vowel /e/, formed by the ZCa (*)
and FFECb (.) methods.
A rather smaller dispersion can be seen for the last three elements
of the feature vector formed by the ZCa method.
In both cases the dispersion of the feature vector elements formed
by the FFECb method is very similar.
From the above, the importance of phase changes between single
frequency components in the frequency spectrum must be noted, first
of all for the low frequencies present in a speech signal of an
independent speaker.
This fact is also indicated by the recognition results for the
vowels /o/ and /u/, for which first of all the first formant is
dominant.
From Fig. 2a it can also be seen that the feature description of
recognition base elements with the measurement of interval length as
the mapping rule of descriptive features was less successful than
with the Fourier transformation. This is evident from the dispersion
rate of single feature vector elements, which is greater than the
one for the other two methods. This is particularly true for the
second and the third element of the feature vector (they describe
first of all the first formant).
Comparison of the recognition results for variants FFECa and FFECb
(see Table 1b) and consideration of the dispersion rates of vector
elements for both variants (Fig. 2b-c) indicate that the common
normalized energy of single formant frequencies classes, calculated
by variant a, was a worse criterion than the ratio of normalized
energy of maximum components. This might point out that the common
energy content of single formant frequencies classes for some
recognition elements changes with an independent speaker. It was
reflected as an increase of dispersion for almost all elements of
the feature vector (Fig. 2b). This means a worse recognition
accuracy (Table 1b).
Fig. 2a-c: Histograms of the feature vector elements, for vowel /e/,
formed by the ZCa a), FFECa b) and FFECb c) methods.
A better recognition accuracy and the smallest dispersion of feature
vector elements were achieved when the mapping rule of the FFECb
method was used.
The mapping rule of the selected features for this variant 'enables
selection' of frequency components. In each class only the maximum
component was chosen. In this way only the energy of the maximum
component of a particular class was described. But because ten
formant frequencies classes were defined, these are not all maximum
frequency components of formants.
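The selection step described above can be sketched as follows: within each formant frequencies class only the largest spectral component is kept, and its log energy is normalised by the sum over all classes, following equation (34). This is an illustrative sketch rather than the authors' code. The ten uniformly spaced class boundaries, the half-open class limits, the test signal, and the function name are assumptions.

```python
import numpy as np

def ffecb_features(window, fs, class_edges_hz):
    """Normalised log energy of the maximum component in each frequency class."""
    x = np.asarray(window, dtype=float)
    power = np.abs(np.fft.rfft(x)) ** 2                 # squared DFT magnitudes
    bins = np.rint(np.asarray(class_edges_hz) * x.size / fs).astype(int)
    # Assumption: classes are half-open [lo, hi); the offset avoids log(0).
    log_max = np.array([np.log(power[lo:hi].max() + 1e-12)
                        for lo, hi in zip(bins[:-1], bins[1:])])
    return log_max / log_max.sum()

fs = 10_000                                            # 10 kHz (Section 4)
t = np.arange(200) / fs                                # one 20 ms window
rng = np.random.default_rng(0)
# Hypothetical input: a 1 kHz tone in white noise.
window = 5.0 * np.sin(2 * np.pi * 1000 * t) + rng.standard_normal(t.size)
edges = np.linspace(50.0, 5050.0, 11)                  # ten hypothetical classes
features = ffecb_features(window, fs, edges)
```

The largest element of `features` belongs to the second class (550 Hz to 1050 Hz), which contains the 1 kHz tone. Keeping one component per class makes the feature insensitive to how the remaining energy of that class is distributed.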
With this variant the best average recognition accuracy was
achieved: greater than 92.5 %.
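The average follows directly from the diagonal of Table 1b for variant FFECb, because every vowel was represented by the same number of articulations (110). A short check, with the variable names chosen here for illustration:

```python
# Correctly recognized share [%] per vowel for variant FFECb (diagonal of Table 1b).
ffecb_correct = {"a": 97.3, "e": 92.8, "i": 92.8, "o": 92.8, "u": 89.1}

# Equal numbers of articulations per vowel, so the plain mean is the average accuracy.
average_accuracy = sum(ffecb_correct.values()) / len(ffecb_correct)
```

The mean is 92.96 %, which agrees with the stated value of more than 92.5 %.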
6. Conclusion 
In speaker-independent speech recognition such feature maps should
be defined that the 'differences' in speech features appearing in
the case of an independent speaker are expressed as little as
possible. That means that such functions should be defined for which
the features overlapping is as small as possible.
This should be valid for maps of descriptive features (e.g.
measurement of interval length versus discrete Fourier
transformation) as well as for maps of selected features (e.g.
variant a versus variant b of the FFEC method).
The mapping rules discussed in our paper showed that the discrete
Fourier transformation as the mapping rule for the descriptive
features maps and variant b of the FFEC method as the mapping rule
for the selected features maps gave the best recognition results.
With the above mentioned methods the smallest features overlapping
and consequently the best average recognition accuracy has been
achieved, i.e. more than 92.5 %.
References 
[1] L.R. Rabiner and R.W. Schafer, Digital Processing of Speech
Signals, Prentice-Hall, Englewood Cliffs, NJ, 1978.
[2] A.H. Seidman and I. Flores, Handbook of Computers and Computing,
Van Nostrand Reinhold Company, New York, 1984.
[3] R. De Mori and C.Y. Suen, New Systems and Architectures for
Automatic Speech Recognition and Synthesis, Springer-Verlag, Berlin,
1985, Chap. 1, pp. 1-72.
[4] James C. Anderson, "Improved zero-crossing method enhances
digital speech", EDN Magazine, vol. 27, No. 20, October 13, 1982,
pp. 171-174.
[5] R.J. Niederjohn and P.F. Castelaz, "Zero-crossing analysis
methods and their use for automatic speech recognition", Proc. IEEE
Computer Society Workshop on Pattern Recognition and Artificial
Intelligence, 1978, pp. 274-281.
[6] F. Fallside and W.A. Woods, Computer Speech Processing,
Prentice-Hall, Englewood Cliffs, NJ, 1985.
[7] J.C. Simon, Spoken Language Generation and Understanding,
D. Reidel Publishing Company, 1980, pp. 129-145.
[8] R.J. Senter, Analysis of Data, Scott, Foresman and Company,
Illinois, 1969.
[9] I.H. Witten, "Digital storage and analysis of speech", Wireless
World, November 1981, pp. 44-48.
[10] P. Wallich, "Putting speech recognizers to work", IEEE
Spectrum, April 1987, pp. 55-57.
[11] Z. Kačič, Š. Greif and B. Horvat, "Uspešnost metod opisovanja
skupnih značilnosti osnovnih elementov govornega signala",
Elektrotehniški vestnik, Vol. 53 (1986), No. 3, pp. 121-129.