There has been widespread controversy over the quality and suitability of these two structures. Good results are difficult to achieve with only one basic method, so some efforts have been made to improve and combine these basic models. In 1980 Dennis Klatt (Klatt 1980) proposed a more complex formant synthesizer which incorporated both the cascade and parallel synthesizers with additional resonances and anti-resonances for nasalized sounds, a sixth formant for high-frequency noise, a bypass path to give a flat transfer function, and a radiation characteristic. The system used a quite complex excitation model which was controlled by 39 parameters updated every 5 ms. The quality of the Klatt formant synthesizer was very promising and the model has been incorporated into several present TTS systems, such as MITalk, DECtalk, Prose-2000, and Klattalk (Donovan 1996). Parallel and cascade structures can also be combined in several other ways. One solution is to use the so-called PARCAS (Parallel-Cascade) model introduced and patented by Laine (1982) for the SYNTE3 speech synthesizer for Finnish. In the model, presented in Figure 5.3, the transfer function of the uniform vocal tract is modeled with two partial transfer functions, each including every second formant of the transfer function. Coefficients k1, k2, and k3 are constant and chosen to balance the formant amplitudes in the neutral vowel to keep the gains of parallel branches constant for all sounds (Laine 1982).
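
The difference between the cascade and parallel structures can be illustrated with a small sketch. It assumes the standard two-pole digital resonator, y[n] = a·x[n] + b·y[n-1] + c·y[n-2], as the building block for each formant. The function names, the sampling rate, and the formant frequencies, bandwidths, and gains are illustrative assumptions and are not taken from any of the systems mentioned above. Nasal branches, the bypass path, and the radiation characteristic are omitted.

```python
import math


def resonator_coeffs(freq_hz, bw_hz, fs_hz):
    """Coefficients of a two-pole resonator y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / fs_hz
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalizes the gain to 1 at 0 Hz
    return a, b, c


def resonate(signal, freq_hz, bw_hz, fs_hz):
    """Filter a list of samples through one formant resonator."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, fs_hz)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def cascade(signal, formants, fs_hz):
    """Cascade structure: the output of each resonator feeds the next one."""
    for freq_hz, bw_hz in formants:
        signal = resonate(signal, freq_hz, bw_hz, fs_hz)
    return signal


def parallel(signal, formants, gains, fs_hz):
    """Parallel structure: every resonator sees the same input; outputs are
    weighted by an individual gain and summed."""
    branches = [resonate(signal, f, bw, fs_hz) for f, bw in formants]
    return [
        sum(g * branch[n] for g, branch in zip(gains, branches))
        for n in range(len(signal))
    ]


# Illustrative values only (assumptions): three formants as (frequency, bandwidth) in Hz.
FS_HZ = 10000
FORMANTS = [(500, 60), (1500, 90), (2500, 150)]
GAINS = [0.5, 0.3, 0.2]
```

In this sketch the cascade structure needs no per-formant amplitude controls, while the parallel structure requires one gain per branch. That trade-off is what combined models such as the ones described above try to balance.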
Klatt, D. H. (1979a). "Synthesis by Rule of Segmental Durations in English Sentences," in , edited by B. Lindblom and S. Öhman (Academic, New York), pp. 287-300.
Klatt, D. H. (1979b). "Synthesis by Rule of Consonant-Vowel Syllables," Speech Communication Group Working Papers 3, MIT, Cambridge, MA, pp. 93-104.
Klatt, D. H. (1980). "Software for a Cascade/Parallel Formant Synthesizer," J. Acoust. Soc. Am. 67, 971-995.
Klatt, D. H. (1981). "A Text-to-Speech Conversion System," Proc. AFIPS Office Automation Conference, pp. 51-61.
Klatt, D. H. (1982a). "The Klattalk Text-to-Speech System," Proc. Int. Conf. Acoust. Speech Signal Process. ICASSP-82, 1589-1592.
Klatt, D. H. (1982b). "A Strategy for the Perceptual Interpretation of Durational Cues," Speech Communication Group Working Papers 1, MIT, Cambridge, MA, pp. 83-91.
Klatt, D. H. (1982c). "Prediction of Perceived Phonetic Distance from Critical-Band Spectra: A First Step," Proc. Int. Conf. Acoust. Speech Signal Process. ICASSP-82, 1278-1281.
Klatt, D. H. (1986a). "Representation of the First Formant in Speech Recognition and in Models of the Auditory Periphery," in Proc. Montreal Satellite Symposium on Speech Recognition, edited by P. Mermelstein, Twelfth Int. Cong. Acoustics, Toronto, Canada, pp. 5-7.
Klatt, D. H. (1986b). "Detailed Spectral Analysis of a Female Voice," J. Acoust. Soc. Am. Suppl. 1 80, S97.
Klatt, D. H. (1987a). "How Klattalk became DECtalk: An Academic's Experiences in the Business World," Speech Tech. 87, 293-294.
Klatt, D. H., and Aoki, C. (1984). "Synthesis by Rule of Japanese," J. Acoust. Soc. Am. Suppl. 1 76, S2.
Klatt, D. H., and Shipman, D. W. (1982). "Letter-to-Phoneme Rules: A Semi-Automatic Discovery Procedure," J. Acoust. Soc. Am. Suppl. 1 72, S48.
Koenig, W. H., Dunn, H. K., and Lacey, L. Y. (1946). "The Sound Spectrograph," J. Acoust. Soc. Am. 18, 19-49.
Kucera, H., and Francis, W. N. (1967). (Brown U.P., Providence, RI).
Kurzweil, R. (1976). "The Kurzweil Reading Machine: A Technical Overview," in , edited by M. R. Redden and W. Schwandt (American Association for the Advancement of Science, Report 76-R-11, Washington, DC), pp. 3-11.
Labov, W. (1986). "Sources of Inherent Variation in the Speech Process," in , edited by J. Perkell and D. H. Klatt (Erlbaum, Hillsdale, NJ), pp. 402-425.
Ladd, D. R. (1983). "Phonological Features of Intonational Peaks," Language 59, 721-759.
Ladefoged, P. (1973). "The Features of the Larynx," J. Phonetics 1, 73-83.
Lamel, L., and Zue, V. (1984). "Properties of Consonant Sequences within Words and across Word Boundaries," Proc. Int. Conf. Acoust. Speech Signal Process. ICASSP-84, 42.3.1-43.2.4.
Lawrence, W. (1953). "The Synthesis of Speech from Signals which have a Low Information Rate," in , edited by W. Jackson (Butterworths, London, England), pp. 460-469. [Ed: Reprinted in Flanagan and Rabiner, 1973.]
Lee, D., and Lochovsky, F. (1983). "Voice Response Systems," ACM Computing Surveys 15, 351-374.
Lee, F. F. (1969). "Reading Machine: From Text to Speech," IEEE Trans. Audio Electroacoust. AU-17, 275-282.
Lehiste, I. (1959). "An Acoustic-Phonetic Study of Internal Open Juncture," Suppl. to Phonetica. 5, 1-55.
Lehiste, I. (1962). "Acoustical Characteristics of Selected English Consonants," Univ. Michigan Speech Research Lab. Report 9, 1-219.
Lehiste, I. (1964). "Juncture," in , edited by E. Zwirner and W. Bethge (Karger, Basel, Switzerland), pp. 172-200.
Lehiste, I. (1967). (MIT Press, Cambridge, MA).
Lehiste, I. (1970). (MIT Press, Cambridge, MA).
Lehiste, I. (1975a). "Some Factors Affecting the Duration of Syllabic Nuclei in English," , edited by G. Drachman (Verlag Gunter, Narr), pp. 81-104.
Lehiste, I. (1975b). "The Phonetic Structure of Paragraphs," in , edited by A. Cohen and S. Nooteboom (Springer, Heidelberg, Germany), pp. 195-206.
Lehiste, I. (1977). "Isochrony Reconsidered," J. Phonetics 5, 253-263.
Lehiste, I., Olive, J. P., and Streeter, L. A. (1976). "The Role of Duration in Disambiguating Syntactically Ambiguous Sentences," J. Acoust. Soc. Am. 60, 1199-1202.
Lehiste, I., and Peterson, G. E. (1959). "Linguistic Considerations in the Study of Speech Intelligibility," J. Acoust. Soc. Am. 31, 280-287.
Lehiste, I., and Peterson, G. E. (1961). "Some Basic Considerations in the Analysis of Intonation," J. Acoust. Soc. Am. 33, 419-425.
Liberman, A. M., Cooper, F. S., Shankweiler, D. P., and Studdert-Kennedy, M. (1967). "Perception of the Speech Code," Psychol. Rev. 74, 431-461.
Liberman, A. M., Delattre, P., and Cooper, F. S. (1958). "Some Cues for the Distinction between Voiced and Voiceless Stops in Initial Position," Lang. Speech 1, 153-167.
Liberman, A. M., Delattre, P., Cooper, F. S., and Gerstman, L. J. (1954). "The Role of Consonant-Vowel Transitions in the Perception of the Stop and Nasal Consonants," Psychol. Monogr. 68, 1-13.
Liberman, A. M., Ingemann, F., Lisker, L., Delattre, P., and Cooper, F. (1959). "Minimal Rules for Synthesizing Speech," J. Acoust. Soc. Am. 31, 1490-1499.
Liberman, A. M., and Mattingly, I. G. (1985). "The Motor Theory of Speech Perception Revisited," Cognition 21, 1-36.
Liberman, M. Y. (1979). "Phonemic Transcription, Stress, and Segment Durations for Spelled Proper Names," J. Acoust. Soc. Am. Suppl. 1 64, S163.
Lieberman, P. (1967). (MIT Press, Cambridge, MA).
Liljencrants, J. (1969). "Speech Synthesizer Control by Smoothed Step Functions," Speech Transmission Laboratory, Royal Institute of Technology, Stockholm, Sweden QPSR-4, 43-50.
Liljencrants, J. (1985). "Speech Synthesis with a Reflection-Type Line Analog," unpublished Ph.D. thesis, Dept. Speech Commun. and Musical Acoust., Royal Inst. of Tech., Stockholm, Sweden.
Lindblom, B. (1963). "Spectrographic Study of Vowel Reduction," J. Acoust. Soc. Am. 35, 1773-1781.
Lisker, L. (1957). "Minimal Cues for Separating /wrly/ in Intervocalic Position," Word 13, 256-267.
Lisker, L. (1978). "Rapid vs. Rabid: A Catalog of Acoustic Features that may Cue the Distinction," Status Report on Speech Research 54, Haskins Laboratories, New Haven, CT, pp. 127-132.
Lisker, L., and Abramson, A. S. (1967). "Some Effects of Context on Voice Onset Time in English Stops," Lang. Speech 10, 1-28.
Logan, J. S., and Pisoni, D. B. (1986). "Preference Judgements Comparing Different Synthetic Voices," J. Acoust. Soc. Am. Suppl. 1 79, S24.
Logan, J. S., Pisoni, D. B., and Greene, B. G. (1986). "Measuring the Segmental Intelligibility of Synthetic Speech: Results from Eight Text-to-Speech Systems," submitted to J. Acoust. Soc. Am.
Lucassen, J. M., and Mercer, R. L. (1984). "An Information Theoretic Approach to the Automatic Determination of Phonemic Base Forms," Proc. Int. Conf. Acoust. Speech Signal Process. ICASSP-84, 42.5.1-42.5.4.
Luce, P., Feustel, T., and Pisoni, D. (1983). "Capacity Demands in Short-Term Memory for Synthetic and Natural Speech," Human Factors 25, 17-31.
MacNeilage, P. F., and DeClerk, J. L. (1969). "On the Motor Control of Coarticulation of CVC Syllables," J. Acoust. Soc. Am. 45, 1217-1233.
Maeda, S. (1974). "A Characterization of Fundamental Frequency Contours of Speech," Research Laboratory of Electronics QPR 114, MIT, Cambridge, MA, pp. 193-211.
Maeda, S. (1987). "On the Generation of Sounds in Stop Consonants," Speech Communication Group Working Papers V, MIT, Cambridge, MA, pp. 1-14.
Magnusson, L., Blomberg, M., Carlson, R., Elenius, K., and Granström, B. (1984). "Swedish Speech Researchers Team Up with Electronic Venture Capitalists," Speech Technol. 2, 15-24.
Makhoul, J. (1973). "Spectral Analysis of Speech by Linear Prediction," IEEE Trans. Audio Electroacoust. AU-21, 140-148.
Malecot, A. (1956). "Acoustic Cues for Nasal Consonants: An Experimental Study Involving a Tape-Splicing Technique," Language 32, 274-284.
Malme, C. I. (1959). "Detectability of Small Irregularities in a Broadband Noise Spectrum," Research Lab. of Electronics Q.P.R. 52, Mass. Inst. Tech., pp. 139-141.
Manous, L. M., Pisoni, D. B., Dedina, M. J., and Nusbaum, H. C. (1985). "Comprehension of Natural and Synthetic Speech Using a Sentence Verification Task," Speech Research Laboratory Progress Report 11, Indiana University, Bloomington, IN, pp. 33-58.
where the "rules" may specify the cases in which the current abbreviation is converted, e.g., whether it is accepted in capitalized form or with a period or colon. The preceding and following information may also contain the accepted forms of the surrounding text, such as numbers, spaces, and character properties (vowel/consonant, capitalized, etc.).
Sometimes special modes, especially for numbers, are used to make this stage more accurate, for example a math mode for mathematical expressions and a date mode for dates. Specific rules are also needed for e-mail messages, where the header information needs special attention.
Analysis for correct pronunciation from written text has also been one of the most challenging tasks in the speech synthesis field, especially with some telephony applications where almost all words are common names or street addresses. One method is to store as many names as possible in a specific pronunciation table. Due to the number of existing names, this is quite unreasonable. A rule-based system with an exception dictionary for words that fail with the letter-to-phoneme rules may therefore be a much more reasonable approach (Belhoula et al. 1993). This approach is also suitable for normal pronunciation analysis. With morphemic analysis, a word can be divided into several independent parts which are considered the minimal meaningful subparts of words, such as prefix, root, and affix. About 12 000 morphemes are needed to cover 95 percent of English (Allen et al. 1987). However, morphemic analysis may fail with word pairs such as heal/health or sign/signal (Klatt 1987).
Another perhaps relatively good approach to the pronunciation problem is a method called pronunciation by analogy, where a novel word is recognized as parts of known words and the part pronunciations are combined to produce the pronunciation of the new word (Gaved 1993).
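
The rule-based approach with an exception dictionary described above can be sketched as follows. This is a toy illustration under stated assumptions: the dictionary entries, the ARPAbet-like phoneme symbols, and the handful of letter-to-phoneme rules are all hypothetical and far smaller than anything a real system would use. The words are borrowed from the heal/health and sign/signal pairs mentioned in the text.

```python
# Hypothetical exception dictionary: words the toy rules below get wrong.
EXCEPTIONS = {
    "sign": "S AY N",
    "health": "HH EH L TH",
}

# Toy letter-to-phoneme rules, multi-letter patterns listed first so they
# are tried before single letters. Purely illustrative.
RULES = [
    ("sh", "SH"), ("th", "TH"), ("ea", "IY"),
    ("a", "AE"), ("e", "EH"), ("g", "G"), ("h", "HH"), ("i", "IH"),
    ("l", "L"), ("n", "N"), ("s", "S"), ("t", "T"),
]


def letter_to_phoneme(word):
    """Apply the toy rules left to right, taking the first rule that matches."""
    phones = []
    i = 0
    while i < len(word):
        for letters, phone in RULES:
            if word.startswith(letters, i):
                phones.append(phone)
                i += len(letters)
                break
        else:
            i += 1  # skip letters the toy rules do not cover
    return " ".join(phones)


def pronounce(word):
    """Consult the exception dictionary first, then fall back to the rules."""
    key = word.lower()
    if key in EXCEPTIONS:
        return EXCEPTIONS[key]
    return letter_to_phoneme(key)
```

With these assumed rules, "heal" is handled by the rules alone, while "health" and "sign" are only correct because the exception dictionary overrides the rule output.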
In some situations, such as the speech markup languages described later in Chapter 7, information about correct pronunciation may be given separately.
Prosodic or suprasegmental features consist of pitch, duration, and stress over time. With good control of these, gender, age, emotions, and other features in speech can be well modeled. However, almost everything seems to have an effect on the prosodic features of natural speech, which makes accurate modeling very difficult. Prosodic features can be divided into several levels such as syllable, word, or phrase level. For example, at word level vowels are more intense than consonants. At phrase level correct prosody is more difficult to produce than at word level.
The pitch pattern or fundamental frequency over a sentence (intonation) in natural speech is a combination of many factors. The pitch contour depends on the meaning of the sentence. For example, in normal speech the pitch slightly decreases toward the end of the sentence, and when the sentence is in question form, the pitch pattern rises toward the end of the sentence. At the end of a sentence there may also be a continuation rise which indicates that there is more speech to come. A rise or fall in fundamental frequency can also indicate a stressed syllable (Klatt 1987, Donovan 1996). Finally, the pitch contour is also affected by the gender, physical and emotional state, and attitude of the speaker.
The duration or time characteristics can also be investigated at several levels, from phoneme (segmental) durations to sentence-level timing, speaking rate, and rhythm. The segmental duration is determined by a set of rules. Usually some inherent duration for a phoneme is modified by rules between maximum and minimum durations. For example, consonants in non-word-initial position are shortened, emphasized words are significantly lengthened, or a stressed vowel or sonorant preceded by a voiceless plosive is lengthened (Klatt 1987, Allen et al. 1987).
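
The idea of modifying an inherent duration by rules between minimum and maximum values can be sketched in code. This is a minimal sketch assuming one common formulation, in which each rule contributes a percentage that scales only the part of the inherent duration above the minimum. The class name, the clamp to a maximum, and all numeric values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    inherent_ms: float  # inherent duration of the phoneme
    min_ms: float       # duration may not be compressed below this
    max_ms: float       # assumed upper bound on lengthening


def apply_rules(segment, percents):
    """Combine rule percentages multiplicatively and apply them to the
    compressible part of the duration (inherent minus minimum)."""
    factor = 1.0
    for percent in percents:
        factor *= percent / 100.0
    duration = segment.min_ms + (segment.inherent_ms - segment.min_ms) * factor
    # Keep the result between the minimum and maximum durations.
    return max(segment.min_ms, min(segment.max_ms, duration))


# Hypothetical example: a consonant shortened to 85 % by a
# "non-word-initial position" rule.
consonant = Segment(inherent_ms=100.0, min_ms=40.0, max_ms=200.0)
shortened = apply_rules(consonant, [85])
```

With no rules applied the function returns the inherent duration, and no combination of shortening rules can push the result below the minimum.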
In general, the phoneme duration differs due to neighboring phonemes. At sentence level, the speech rate, rhythm, and correct placing of pauses for correct phrase boundaries are important. For example, a missing phrase boundary just makes speech sound rushed, which is not as bad as an extra boundary, which can be confusing (Donovan 1996). With some methods to control duration or fundamental frequency, such as the PSOLA method, the manipulation of one feature affects the other (Kortekaas et al. 1997).
The intensity pattern is perceived as the loudness of speech over time. At syllable level vowels are usually more intense than consonants, and at phrase level syllables at the end of an utterance can become weaker in intensity. The intensity pattern in speech is highly related to fundamental frequency. The intensity of a voiced sound goes up in proportion to fundamental frequency (Klatt 1987).
The speaker's feelings and emotional state affect speech in many ways, and the proper implementation of these features in synthesized speech may increase the quality considerably. With text-to-speech systems this is rather difficult because written text usually contains no information about these features. However, this kind of information may be provided to a synthesizer with some specific control characters or character strings. These methods are described later in Chapter 7. The users of speech synthesizers may also need to express their feelings in "real-time". For example, deafened people cannot express their feelings when communicating with a speech synthesizer through a telephone line. Emotions may also be controlled by specific software that controls synthesizer parameters. Such a system is, for example, HAMLET (Helpful Automatic Machine for Language and Emotional Talk), which drives the commercial DECtalk synthesizer (Abadjieva et al. 1993, Murray et al. 1996).
This section briefly introduces how some basic emotional states affect voice characteristics.
The voice parameters affected by emotions are usually categorized into three main types (Abadjieva et al. 1993, Murray et al. 1993).
The number of possible emotions is very large, but there are five discrete emotional states which are commonly referred to as the primary or basic emotions; the others are altered or mixed forms of these (Abadjieva et al. 1993). These are anger, happiness, sadness, fear, and disgust. The secondary emotional states are, for example, whispering, shouting, grief, and tiredness.
Anger in speech causes increased intensity with dynamic changes (Scherer 1996). The voice is very breathy and has tense articulation with abrupt changes. The average pitch pattern is higher and there is a strong downward inflection at the end of the sentence. The pitch range and its variations are also wider than in normal speech, and the average speech rate is a little faster.
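
The prosodic shifts described above (higher average pitch, wider pitch range, slightly faster rate, increased intensity) could be expressed as a set of adjustments applied to neutral synthesizer settings. The sketch below is only an illustration of that idea for a high-arousal emotion such as anger. The parameter names and every numeric value are hypothetical, and it does not reproduce the parameter set of HAMLET, DECtalk, or any other system.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Prosody:
    mean_f0_hz: float    # average fundamental frequency
    f0_range_hz: float   # width of the pitch range
    rate_wpm: float      # speaking rate in words per minute
    intensity_db: float  # overall intensity


# Hypothetical adjustment profile: only the direction of each change is
# taken from the description above, the magnitudes are made up.
EXAMPLE_PROFILE = {
    "mean_f0_scale": 1.2,     # higher average pitch
    "f0_range_scale": 1.5,    # wider pitch range
    "rate_scale": 1.1,        # a little faster
    "intensity_gain_db": 6.0, # increased intensity
}


def apply_emotion(neutral, profile):
    """Return a new Prosody with the profile's adjustments applied."""
    return replace(
        neutral,
        mean_f0_hz=neutral.mean_f0_hz * profile["mean_f0_scale"],
        f0_range_hz=neutral.f0_range_hz * profile["f0_range_scale"],
        rate_wpm=neutral.rate_wpm * profile["rate_scale"],
        intensity_db=neutral.intensity_db + profile["intensity_gain_db"],
    )


neutral = Prosody(mean_f0_hz=120.0, f0_range_hz=40.0, rate_wpm=160.0, intensity_db=60.0)
aroused = apply_emotion(neutral, EXAMPLE_PROFILE)
```

Keeping the profile separate from the neutral settings means the same adjustment can be applied to different voices, which is one plausible way to organize emotion control on top of an existing synthesizer.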
An analysis-by-synthesis approach using the Klatt formant synthesizer was applied to study 24 tokens of the vowel /a/ spoken by males and females with severe voice disorders.