New parameters have been added to the terminal analog model, so that it is now possible to simulate most human voices and to replicate an utterance without noticeable loss of quality. Interestingly, some voices are easier to model than others. Despite this progress, speech quality is not natural enough for all text-to-speech applications. One main reason for the limited success of formant-based synthesis is incomplete phonetic knowledge; the transfer of knowledge from phonetics to speech technology has not been an easy process. Another reason is that efforts using formant synthesis have rarely explored control methods other than explicit rule-based description.
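The terminal analog idea can be sketched in a few lines: a periodic source filtered through a cascade of second-order resonators, one per formant. The formant frequencies and bandwidths below are illustrative values only, not taken from any published synthesizer.

```python
import math

def resonator_coeffs(freq_hz, bw_hz, fs=10000):
    """Coefficients for a second-order digital resonator with unity gain at DC."""
    r = math.exp(-math.pi * bw_hz / fs)
    b = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / fs)
    c = -r * r
    a = 1.0 - b - c
    return a, b, c

def synthesize_vowel(formants, n_samples=2000, f0=100, fs=10000):
    """Impulse-train source passed through cascaded formant resonators."""
    period = int(fs / f0)
    signal = [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]
    for freq, bw in formants:
        a, b, c = resonator_coeffs(freq, bw, fs)
        out = []
        for n in range(n_samples):
            y = a * signal[n]
            if n >= 1:
                y += b * out[n - 1]
            if n >= 2:
                y += c * out[n - 2]
            out.append(y)
        signal = out  # cascade: output of one resonator feeds the next
    return signal

# Roughly /a/-like formant frequencies and bandwidths in Hz (illustrative)
samples = synthesize_vowel([(700, 90), (1200, 110), (2600, 160)])
```

A real terminal analog adds many more control parameters (source spectrum, nasal branches, amplitude controls); the cascade of resonators is only the skeleton of the model.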
Models of segmental coarticulation and other phonetic factors are an important part of a text-to-speech system. The control part of a synthesis system calculates the parameter values at each time frame. Two main types of approach can be distinguished: rule-based methods, which use an explicit formulation of existing knowledge, and library-based methods, which replace rules with a collection of segment combinations. Each approach has its advantages. If the data are coded in terms of targets and slopes, we need methods to calculate the parameter tracks. The efforts of Holmes et al. (1964) and the filtered square wave approach of Liljencrants (1969) are classical examples in this context.
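A minimal sketch of the control computation, assuming the data are coded as (time, value) targets joined by straight-line slopes; production systems smooth the tracks (for instance with a filter in the spirit of Liljencrants' filtered square wave) rather than interpolating linearly.

```python
def parameter_track(targets, frame_rate_hz=200):
    """Expand (time_s, value) targets into one value per frame by linear
    interpolation between successive targets."""
    n_frames = int(targets[-1][0] * frame_rate_hz) + 1
    track = []
    for i in range(n_frames):
        t = i / frame_rate_hz
        for (t0, v0), (t1, v1) in zip(targets, targets[1:]):
            if t0 <= t <= t1:
                w = (t - t0) / (t1 - t0) if t1 > t0 else 0.0
                track.append(v0 + w * (v1 - v0))
                break
        else:
            track.append(targets[-1][1])  # past the last target: hold its value
    return track

# A hypothetical formant contour: onset, peak, and offset targets (Hz)
track = parameter_track([(0.0, 500.0), (0.1, 1500.0), (0.2, 1200.0)])
```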
Synthesis systems based on coding have as long a history as the vocoder. The underlying philosophy is that natural speech is analyzed and stored in such a way that it can be assembled into new utterances. Synthesizers such as the systems from AT&T Bell Labs (Olive, 1977, 1990; Olive and Liberman, 1985), Nippon Telegraph and Telephone (NTT) (Hakoda et al., 1990; Nakajima and Hamada, 1988) and ATR Interpreting Telephone Research Laboratories (ATR) (Sagisaka, 1988; Sagisaka et al., 1992) are based on the source-filter technique, where the filter is represented in terms of linear predictive coding (LPC) or equivalent parameters. This filter is excited by a source model that can be of the same kind as the one used in terminal analog systems. The source must be able to handle all types of sounds: voiced and unvoiced vowels and consonants.
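The source-filter scheme can be illustrated with a toy all-pole (LPC-style) synthesizer: a voiced impulse train or an unvoiced noise source drives a recursive filter. The coefficients and gain below are arbitrary illustrative values, not taken from any of the systems cited above.

```python
import random

def lpc_synthesize(coeffs, gain, n_samples, f0=None, fs=8000, seed=0):
    """Drive an all-pole (LPC-style) filter with an impulse train when
    voiced (f0 given) or white noise when unvoiced."""
    rng = random.Random(seed)
    if f0:
        period = int(fs / f0)
        source = [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]
    else:
        source = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
    out = []
    for n in range(n_samples):
        y = gain * source[n]
        for k, a in enumerate(coeffs, start=1):  # recursive (all-pole) part
            if n - k >= 0:
                y += a * out[n - k]
        out.append(y)
    return out

# The same filter handles both source types, as the text requires
voiced = lpc_synthesize([1.3, -0.8], gain=0.5, n_samples=800, f0=100)
unvoiced = lpc_synthesize([1.3, -0.8], gain=0.5, n_samples=800)
```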
One of the major problems in concatenative synthesis is to make the best selection of units and to describe how to combine them. Two major factors create problems: distortion because of spectral discontinuity at the connecting points, and distortion because of the limited size of the unit set. Systems using elements of different lengths, depending on the target phoneme and its function, have been explored by several research groups. Olive (1990) describes a method for concatenating "acoustic inventory elements" of different sizes. The system developed at ATR is likewise based on nonuniform units (Sagisaka et al., 1992).
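The spectral-discontinuity problem can be made concrete with a toy join cost: the distance between the parameter frames that meet at a connecting point. The greedy selection below is a simplification; practical systems search over whole unit sequences with dynamic programming.

```python
def join_cost(unit_a, unit_b):
    """Euclidean distance between the frames meeting at the join:
    a simple proxy for spectral discontinuity."""
    last, first = unit_a[-1], unit_b[0]
    return sum((x - y) ** 2 for x, y in zip(last, first)) ** 0.5

def select_units(candidate_pools):
    """Pick, left to right, the candidate whose join to the previously
    chosen unit is cheapest (greedy; real systems use dynamic programming)."""
    chosen = [candidate_pools[0][0]]
    total = 0.0
    for pool in candidate_pools[1:]:
        best = min(pool, key=lambda u: join_cost(chosen[-1], u))
        total += join_cost(chosen[-1], best)
        chosen.append(best)
    return chosen, total

# Units are lists of parameter frames (here 2-dimensional, hypothetical values)
pools = [
    [[(1.0, 2.0), (1.1, 2.1)]],                            # one candidate for slot 1
    [[(3.0, 4.0), (3.2, 4.1)], [(1.1, 2.0), (2.0, 3.0)]],  # two candidates for slot 2
]
units, cost = select_units(pools)
```

The second slot's second candidate wins because its first frame nearly matches the previous unit's last frame, which is exactly the discontinuity criterion described above.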
In simple terms, one can send an SMS to a fixed-line telephone, and the SMS text will be pronounced.
Notably, this synthesizer is the only one that has been developed specifically for the Ukrainian language.
Special methods to generate a unit inventory have been proposed by the research group at NTT in Japan (Hakoda et al., 1990; Nakajima and Hamada, 1988). The synthesis allophones are selected with the help of the context-oriented clustering (COC) method. COC searches for the phoneme sequences of different sizes that best describe each phoneme's realization.
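A toy analogue of context-oriented clustering, under the simplifying assumption that contexts come in just three fixed widths (phoneme alone, phoneme with left neighbour, full triphone) and that a context class is kept only when it has enough observations; the published COC method is considerably more elaborate.

```python
from collections import defaultdict

def coc_clusters(tokens, min_count=2):
    """Group observed realizations of phonemes by contexts of three widths,
    keeping a context class only when it has at least min_count tokens."""
    clusters = {}
    for width in ("mono", "left", "tri"):
        groups = defaultdict(list)
        for left, ph, right, realization in tokens:
            key = {"mono": (ph,), "left": (left, ph), "tri": (left, ph, right)}[width]
            groups[key].append(realization)
        for key, reals in groups.items():
            if len(reals) >= min_count:
                clusters[key] = reals
    return clusters

def lookup(clusters, left, ph, right):
    """Prefer the narrowest context class that survived clustering."""
    for key in ((left, ph, right), (left, ph), (ph,)):
        if key in clusters:
            return key, clusters[key]
    return None

# Hypothetical tokens: (left neighbour, phoneme, right neighbour, realization label)
tokens = [
    ("s", "a", "t", "a-real-1"),
    ("s", "a", "t", "a-real-2"),
    ("k", "a", "n", "a-real-3"),
]
clusters = coc_clusters(tokens)
```

A context seen often enough ("s a t") is served by its own class; a rare one ("k a n") falls back to the context-free class for /a/.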
We are currently working in the following directions:
- speech recognition for portable devices;
- speaker-independent recognition;
- recognition for very large vocabularies;
- keyword recognition;
- speech recognition over the telephone.
The efforts of Tetjana Lyudovyk and Mykola Sazhok, members of the Department, resulted in the creation of a Ukrainian speech synthesizer.
The context-oriented clustering approach is a good illustration of a current trend in speech synthesis: automatic methods based on databases. These studies consider much wider phonetic contexts than before. (It might be appropriate to remind the reader of similar trends in speech recognition.) One cannot take into account all possible coarticulation effects simply by increasing the number of units: at some point the total number becomes too high, or some units are based on very few observations. In that case, normalizing the data before the actual unit is chosen may be a good solution, and the system effectively becomes a rule-based one. The rules, however, can be trained automatically from data, much as in speech recognition (Philips et al., 1991).
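Training rules from data can be illustrated with a deliberately simple scheme: each context's "rule" is its mean deviation from the phoneme's overall mean value, and unseen contexts fall back to that mean. This is an illustration of the idea of data-driven rules, not the method of Philips et al.; all numbers are hypothetical.

```python
from collections import defaultdict

def train_context_rules(observations):
    """From (context, phoneme, value) observations, derive additive rules:
    each context's rule is its mean deviation from the phoneme's mean."""
    by_phoneme = defaultdict(list)
    by_context = defaultdict(list)
    for ctx, ph, value in observations:
        by_phoneme[ph].append(value)
        by_context[(ctx, ph)].append(value)
    means = {ph: sum(vs) / len(vs) for ph, vs in by_phoneme.items()}
    rules = {key: sum(vs) / len(vs) - means[key[1]]
             for key, vs in by_context.items()}
    return means, rules

def apply_rule(means, rules, ctx, ph):
    """Unseen contexts fall back to the phoneme's overall mean."""
    return means[ph] + rules.get((ctx, ph), 0.0)

# Hypothetical formant measurements (Hz) for /a/ in two contexts
obs = [("nasal", "a", 900.0), ("nasal", "a", 880.0),
       ("stop", "a", 1000.0), ("stop", "a", 1020.0)]
means, rules = train_context_rules(obs)
```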
It is not an easy task to place different synthesis methods into unique classes. Some of the common "labels" are often used to characterize a complete system rather than the underlying models. A rule-based system using waveform coding is a perfectly possible combination, as is speech coding using a terminal analog, or a rule-based diphone system using an articulatory model. In the following pages, synthesis models will be described from two different perspectives: the sound-generating part and the control part of the system.