AN EFFICIENT UNIT-SELECTION METHOD FOR EMBEDDED CONCATENATIVE SPEECH SYNTHESIS

Jerneja Žganec Gros, Mario Žganec
Alpineon, Ljubljana, Slovenia

Key words: text-to-speech synthesis, embedded speech synthesis, unit-selection methods

Abstract: This paper presents a method for selecting speech units for polyphone concatenative speech synthesis, in which simplifying the graph path-search procedures accelerates the unit-selection procedure with minimal effect on speech quality. The selected speech units are still optimal; only the concatenation costs on which the selection is based are determined less accurately. Owing to its low processing-power and memory-footprint requirements, the method is suitable for use in embedded speech synthesizers.

1. Introduction

Polyphone or corpus-based concatenative speech synthesis systems usually use extensive speech corpora containing tens of hours of recorded, segmented, and labeled speech, occupying several gigabytes of memory. In such a corpus, each basic speech unit, or each speech segment constituting a specific series of basic speech units or polyphones, occurs repeatedly in various contexts and with different prosodic characteristics /1/. Limitations in the computational processing power and memory footprint of embedded systems affect the design of the unit-selection process /2/.

The selection of speech units is the part of concatenative or corpus-based speech synthesis that exerts the greatest influence on the speed of the entire speech synthesis process. It is necessary to find a favorable compromise between the size of the speech corpus and the computational complexity of the unit-selection procedure /1/. If the unit-selection procedure is heavily simplified and thus very fast, units can be selected from a larger speech corpus in the same amount of time. Oversimplifying the procedure can, however, result in the selection of inappropriate speech units and therefore reduce the speech quality despite the larger corpus. In contrast, a complex unit-selection procedure can ensure an optimal unit selection, but because of time restrictions it can only be performed on a small speech corpus.

The paper is structured as follows. In Section 2, unit selection in polyphone concatenative speech synthesis is introduced as a graph-search problem and an overview of unit-selection methods is presented. The unit-selection procedure with which we succeeded in accelerating the selection without significantly affecting the speech quality is presented in Section 3. This is achieved by simplifying the calculation of the concatenation cost and thus creating the conditions for a specific structure of the algorithm for finding the optimal path in the graph.
The evaluation of the speed and the speech quality of the proposed unit-selection procedures is presented in Section 4.

2. Unit-selection in polyphone concatenative speech synthesis

The task of unit-selection procedures is to find the most appropriate speech units in the corpus such that they produce a maximum-quality signal when merged. The input data that the unit-selection procedure receives from the language-processing modules of the speech synthesizer are sequences of phonemes to be pronounced, together with prosodic parameters for the pronunciation of each phoneme. These parameters specify the fundamental frequency and the duration of each phoneme. The output data that the unit-selection procedure must convey to the module for concatenating speech segments into a speech signal are sequences of specific fragments from the speech corpus, called polyphones, i.e. the speech units that the concatenation module will have to merge. These sequences can also be equipped with prosodic parameters for each fragment, which enables the concatenation module to transform the original prosodic parameters from the corpus so that they resemble the desired prosodic parameters as closely as possible.

2.1. Search graph for finding the optimal sequence of speech units

The problem of finding the optimal sequence of recorded units for quality speech signal synthesis can be presented as finding an optimal path in a graph. This kind of presentation clearly demonstrates the problem of selecting speech units and, at the same time, enables the use of well-established procedures for solving it. Each vertex of the graph represents a basic speech unit from the speech corpus. The basic speech segments may be allophones, diphones, triphones, or any other basic speech unit. The graph is divided into individual levels. The first level contains the initial vertices, that is, all basic speech units in the speech corpus that correspond to the first basic speech unit in the input sequence that needs to be synthesized. An edge between two vertices indicates that the basic speech units represented by the connected vertices can be merged. When speech units are merged, the unit of the higher-level vertex chronologically follows the unit of the lower-level vertex; this is why the edges between the vertices are directed. The vertices are interconnected such that each level-n vertex is connected to all level-(n+1) vertices.

In this kind of graph, finding the optimal speech unit sequence can be defined as finding the optimal path between any initial vertex (first level of the graph) and any final vertex (last level of the graph), whereby the edges between the graph's vertices determine the possible paths. To start searching for the best path in the graph, criteria expressing the final goal of the speech unit selection must be defined as numeric relations between the data represented in the graph. The final goal of the speech unit selection is the maximum possible intelligibility and naturalness of the synthetic speech. In general, the following criteria have been implemented to make the speech as intelligible and natural as possible:
- the smallest possible number of speech unit concatenations,
- the smallest possible discontinuity of the concatenated units at the point of concatenation,
- the best fit between the concatenated units' prosodic features and the desired speech prosody.
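To make the graph structure concrete, the following is a minimal sketch, assuming a corpus index keyed by phoneme label; the names Candidate and build_search_graph are illustrative and do not come from the system described in this paper.

```python
# Minimal sketch of the levelled search graph described above (cf. Fig. 1).
# Assumption: the speech corpus has been indexed offline by phoneme label.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Candidate:
    unit_id: int      # position of the basic speech unit in the corpus
    phoneme: str      # phoneme (or polyphone) label of the unit
    f0: float         # mean fundamental frequency of the recorded unit
    duration: float   # duration of the recorded unit

def build_search_graph(target_phonemes: List[str],
                       corpus_index: Dict[str, List[Candidate]]) -> List[List[Candidate]]:
    """Level k of the graph holds every corpus occurrence of the k-th
    target phoneme; directed edges implicitly connect every level-k
    vertex to every level-(k+1) vertex."""
    return [corpus_index[ph] for ph in target_phonemes]
```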
Fig. 1. Structure of the graph for finding the optimal speech unit sequence; E_1^i are the initial-level vertices and E_N^i are the final-level vertices of the graph.

The first two criteria are evaluated by defining a concatenation cost for every edge between the vertices in the graph, whereas the last criterion is evaluated by defining a cost of fit of prosodic features for every vertex. The cost of an individual path in the graph equals the sum of the costs of the vertices through which the path runs plus the sum of the costs of all the edges the path contains. The optimal path in the graph is the path with the lowest cost.

2.2. The cost of fit of prosodic features

The cost of fit of prosodic features expresses the similarity or difference between the prosodic features of a specific speech unit from the speech corpus and the desired prosodic features of the part of the speech signal that the speech unit is to form. The required prosodic features can be determined as in /3/ and /4/. The cost of fit of prosodic features usually consists of the weighted result of comparing the speech unit's duration with its desired duration, and of the weighted result of comparing the speech unit's fundamental frequency profile with the desired fundamental frequency profile. In most cases, the ratio in which the unit's duration and its fundamental frequency profile influence the cost is determined experimentally. In order to find the optimal speech unit sequence in the graph, the cost of fit of prosodic features must be calculated for each vertex. Although speech corpora can be very extensive, the calculation of this cost does not constitute a numeric obstacle in finding the optimal path in the graph.

2.3. Concatenation cost

A speech signal is formed by merging or concatenating speech units from the pre-recorded speech corpus. During the process of merging, audible speech signal discontinuities can occur. We try to evaluate the influence of signal discontinuity on the speech quality through the concatenation cost.

There are several possible approaches to evaluating the influence of concatenation on the speech quality. The simplest method is to define the cost as "0" for concatenating speech units that directly follow one another in the speech corpus, and to define the cost as "1" for all other speech unit combinations. The use of the cost "0" for units that directly follow one another in the speech corpus is logical because they are already linked together and therefore no merging is necessary. With the cost "1" for units that do not follow one another in the speech corpus, all concatenations are evaluated equally, regardless of the characteristics of the units being merged. With this kind of concatenation cost, the procedure for finding the optimal speech unit sequence selects the sequence with the smallest number of mergers, regardless of the type of speech units.

A better evaluation of the influence of concatenation on speech quality is achieved if the cost of speech unit merging depends on the allophones that are concatenated. As in the previous approach, the cost "0" is defined for the merging of speech units that directly follow one another in the speech corpus.
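As a rough illustration, the two simplest cost terms discussed so far could be written as follows; the weights W_DUR and W_F0 and the candidate fields are placeholders for the experimentally determined values mentioned in the text, not values from the system described here.

```python
# Hedged sketch of the vertex cost (prosodic fit) and the simplest edge
# cost (0/1 concatenation scheme) described above.
W_DUR, W_F0 = 1.0, 1.0   # placeholder weights, normally tuned experimentally

def prosodic_fit_cost(cand, target):
    """Weighted mismatch between a candidate unit's prosody and the
    desired (duration, f0) requested by the language-processing modules."""
    target_duration, target_f0 = target
    return (W_DUR * abs(cand.duration - target_duration)
            + W_F0 * abs(cand.f0 - target_f0))

def simple_concatenation_cost(prev_cand, next_cand):
    """Simplest scheme from the text: cost 0 when the two units are
    adjacent in the speech corpus (no real join is needed), cost 1 otherwise."""
    return 0.0 if next_cand.unit_id == prev_cand.unit_id + 1 else 1.0
```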
The most accurate evaluation of the influence of concatenation on speech quality is achieved by taking into account the phonological features of both merged units when calculating the concatenation cost. Here, differences in the fundamental frequency, the formant frequencies, the amplitude, the noise factor, the spectral features of the noise, and so on can be taken into account. However, the use of a large number of parameters requires the determination of a large number of weights evaluating the influence of the difference in every parameter on the cost of merging. Determining these weights can be very time-consuming and often involves long-term experiments, empirical solutions, and suppositions. A major deficiency of this method of determining the cost of merging is its numeric complexity. Because concatenation costs are determined individually for every pair of basic speech units in the speech corpus, they cannot feasibly be calculated in advance.

To solve this problem, we propose a compromise solution that is considerably faster and nonetheless partly takes into account the phonological features of the concatenated speech units: the concatenation cost is determined in advance for individual groups of basic speech units from the speech corpus. In this approach, all the basic speech units in a speech corpus are classified into groups on the basis of their phonological features, such that the speech units within an individual group phonologically resemble one another as closely as possible. This is achieved by using clustering techniques. The concatenation costs are calculated in advance for all group combinations and saved.

2.4. Related work

The optimal path in the graph can be reliably determined by graph traversal, whereby all the possible paths in the graph are examined and the best one among them is selected. The number of possible paths between any initial and final vertex of the graph depends on the number of graph levels and on the number of occurrences of the basic speech units in the speech corpus. Considering that a recording of a speech unit can occur several thousand times in the speech corpus and that input sequences can consist of dozens of basic speech units, it becomes clear that the number of possible paths in the graph is very large. Therefore not all of the possible paths in the graph are investigated; instead, various procedures are used to simplify and accelerate the search. Some procedures preserve the optimality of the solution, whereas others sacrifice optimality for the sake of faster operation.

The optimal sequence of speech units is determined by minimizing a cost that reflects the decrease in the quality of the synthesized speech due to spectral differences, differences in the phonetic environment, and the mutual merging of speech units. The system that was among the first to use the selection of speech units of variable length was the ATR ν-talk system /5/. In addition to all the parameters used up until then, Hirokawa also suggested the use of prosodic differences in selecting the optimal sequence of speech units /6/. In this approach, synthesized speech is created by concatenating the selected speech units and changing their prosodic features if necessary. The use of information on prosody in the speech unit selection was proposed by Campbell /7/, /8/. The procedure for minimizing the sum of both costs employs a search based on dynamic programming or one of its derivatives, such as A*.
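A minimal sketch of such a dynamic-programming (Viterbi-style) search over the levelled graph is given below; the interface, in which the vertex (target) and edge (join) costs are passed in as callables, is an assumption made for illustration only and is not taken from any of the cited systems.

```python
# Viterbi-style sketch: minimize the sum of vertex costs (prosodic fit)
# and edge costs (concatenation) over a levelled graph such as Fig. 1.
from typing import Callable, List, Sequence

def viterbi_select(levels: Sequence[Sequence],    # candidate lists, one per level
                   targets: Sequence,             # desired prosody per level
                   target_cost: Callable,         # (candidate, target) -> float
                   join_cost: Callable) -> List:  # (prev, next) -> float
    """Return the lowest-cost candidate sequence, one unit per level."""
    # trellis[k][i] = (accumulated cost, index of best predecessor)
    trellis = [[(target_cost(c, targets[0]), None) for c in levels[0]]]
    for k in range(1, len(levels)):
        column = []
        for c in levels[k]:
            # cheapest predecessor plus the edge (concatenation) cost
            best_cost, best_prev = min(
                (trellis[k - 1][j][0] + join_cost(p, c), j)
                for j, p in enumerate(levels[k - 1]))
            column.append((best_cost + target_cost(c, targets[k]), best_prev))
        trellis.append(column)
    # trace back from the cheapest final vertex
    idx = min(range(len(trellis[-1])), key=lambda j: trellis[-1][j][0])
    path = []
    for k in range(len(levels) - 1, -1, -1):
        path.append(levels[k][idx])
        idx = trellis[k][idx][1]
    return list(reversed(path))
```

The two callables could be, for instance, the prosodic-fit and concatenation costs sketched in Sections 2.2 and 2.3, or any of the more elaborate cost schemes discussed above.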
Basic speech units or phonemes are usually used as the basic search units. Existing systems that synthesize speech by concatenating speech segments from an extensive speech corpus use this procedure most frequently. The CHATR speech synthesis system was developed on the basis of these methods /9/. As the number of parameters used in finding or selecting speech segments increases, the speech corpus has to be sufficiently large. With a sufficiently extensive speech corpus, speech segments that resemble the required input prosodic parameters can be selected from it; in this case it is not necessary to change the prosodic features before merging the selected speech segments /10/. Many recent studies deal with improving the search procedures and with defining the parameters taken into account when calculating the segment costs /11/, /12/. Modeling the functions for calculating the costs is a complex issue. In the selection of speech segments, search procedures can use additional labeling of segments of various lengths that marks the critical parts where concatenation could result in distortion of the final speech signal /13/. Another approach to speech unit selection is the use of static modeling: FSM /14/, DCD /15/, GRM /16/, /17/, and Bulyko /18/.

3. The speech unit selection method with a simplified cost of merging

This section proposes a new, simplified speech unit selection method that is very fast and thus appropriate for implementing a concatenative speech synthesizer in embedded systems. The basic simplification in this method is that the cost of merging two speech segments depends only on the phonemes that are joined at the concatenation point. If merging is carried out at the centers of phonemes, as in diphone synthesis, a merging cost is defined for each phoneme at its center. If merging is carried out at the phoneme boundaries, the merging cost must be defined for all the sequences of two phonemes that can occur in speech. These merging costs can be defined in advance and are not calculated during synthesis. In addition to these merging costs, it is assumed that the cost of merging equals "0" if the segments being merged directly follow one another in the speech corpus, regardless of the phonemes joined at the concatenation point.

A search graph is constructed for the sequence of speech segments S_1, S_2, ..., S_N to be synthesized. At level k of the graph, which corresponds to the speech segment S_k, q_k vertices are located; at level k+1, which corresponds to the speech segment S_(k+1), q_(k+1) vertices are located; and so forth. Every vertex E_k^i (1 ≤ i ≤ q_k) ...
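The simplified merging cost described at the beginning of this section could be realized, for example, as a small precomputed lookup table. The following is a minimal sketch under that assumption; MERGE_COST and the candidate fields are illustrative stand-ins, not the paper's actual data structures.

```python
# Sketch of the simplified merging cost: the cost depends only on the
# pair of phonemes joined at the concatenation point and is read from a
# table filled offline (a single-phoneme key would be used instead for
# mid-phoneme joins, as in diphone synthesis).
MERGE_COST = {}   # e.g. {('a', 'n'): 0.3, ...}, defined in advance

def simplified_concatenation_cost(prev_cand, next_cand):
    # segments that directly follow one another in the corpus need no join
    if next_cand.unit_id == prev_cand.unit_id + 1:
        return 0.0
    # otherwise the cost depends only on the phonemes at the join point
    return MERGE_COST.get((prev_cand.phoneme, next_cand.phoneme), 1.0)
```

Because the table is indexed only by phoneme identities, it can be computed once offline and looked up in constant time during synthesis, which keeps both the memory footprint and the per-join computation small; a cost function of this form can be plugged directly into a dynamic-programming search such as the one sketched in Section 2.4.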