An Introduction to Text-to-Speech Synthesis

If you are developing a commercial or industrial software product, you can license the SoftVoice text-to-speech system for inclusion. Licensing of the SoftVoice TTS engine can be done in a number of ways, including (but not limited to):
- A per-unit royalty with large-volume discounts, or
- A yearly subscription, or
- A single, one-time fee.

For information on licensing the SoftVoice TTS engine, or for general questions, please contact us for details.

Computer speech synthesis has reached a high level of performance, with increasingly sophisticated models of linguistic structure, low error rates in text analysis, and high intelligibility in synthesis from phonemic input. Mass market applications are beginning to appear. However, the results are still not good enough for the ubiquitous application that such technology will eventually have. A number of alternative directions of current research aim at the ultimate goal of fully natural synthetic speech. One especially promising trend is the systematic optimization of large synthesis systems with respect to formal criteria of evaluation. Speech recognition has progressed rapidly in the past decade through such approaches, and it seems likely that their application in synthesis will produce similar improvements.

Speech Recognition is the process by which a computer maps an acoustic speech signal to text.

A Short Introduction to Text-to-Speech Synthesis

Speech synthesis, also known as text-to-speech (TTS), is the process of converting text to speech. SAPI, Microsoft's Speech API, is one programming interface through which applications access it.

The largest scale of commercial activity has been of types 1 and 2, which might be called stored voice. This includes telecommunication intercepts, Texas Instruments' Speak 'N Spell toy, voice-mail prompts, and so forth. Much classical speech synthesis research was of type 5 or 6. Several of the best current systems, and what some consider to be the most promising areas of research, are of types 3 and 4, techniques that are sometimes called

My-own-voice lets end users keep speaking and communicating, not only by using speech synthesis as a voice companion but also by using their own synthetically re-created voice, which helps to fully maintain the user's identity. My-own-voice is already available in up to 10 languages, and new languages will be added regularly.

Individuals who have already lost their voice can ask a family member, a close relative, or a friend to donate their voice, so that the end user can speak with a voice that sounds familiar and unique.

The pronunciation of a word may also vary due to contextual effects. This is easy to see when comparing the phrases "the end" and "the beginning": the pronunciation of "the" depends on the initial phoneme of the following word. Compound words are also problematic. For example, the characters 'th' in mother and hothouse are pronounced differently. Some sounds may also be either voiced or unvoiced depending on context. For example, the phoneme /s/ in the word dogs is voiced, but it is unvoiced in the word cats (Allen et al. 1987).
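The contextual effect on the English article "the" can be sketched as a small rule. This is only an illustration, not how any particular TTS engine works: the function names are hypothetical, and the code assumes that the first letter of the next word is a usable stand-in for its first phoneme, which fails for words such as "hour" or "university".

```python
# Toy rule: choose a pronunciation of "the" from the word that follows it.
# ASSUMPTION: spelling approximates the first phoneme (wrong for "hour", "university").
VOWEL_LETTERS = set("aeiou")


def pronounce_the(next_word):
    """Return a rough phonemic form of 'the' before next_word."""
    first = next_word[:1].lower()
    return "/ði/" if first in VOWEL_LETTERS else "/ðə/"


def transcribe_articles(phrase):
    """Replace each 'the' with a context-dependent form; leave other words as-is."""
    words = phrase.split()
    out = []
    for i, word in enumerate(words):
        if word.lower() == "the" and i + 1 < len(words):
            out.append(pronounce_the(words[i + 1]))
        else:
            out.append(word)
    return out
```

A real system would make this decision after letter-to-sound conversion, when the actual first phoneme of the following word is known.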

Finding the correct pronunciation for proper names, especially when they are borrowed from other languages, is usually one of the most difficult tasks for any TTS system. Some common names, such as Nice and Begin, are ambiguous in capitalized contexts, including sentence-initial position, titles, and single text. For example, the sentence "Nice is a nice place" is very problematic because the word Nice may be pronounced as /niis/ or /nais/. Some names and places also have special pronunciations, such as Leicester and Arkansas. For correct pronunciation, these kinds of words may be included in a specific exception dictionary. Unfortunately, there is clearly no way to build a database of all proper names in the world.
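An exception dictionary of this kind is typically consulted before any general letter-to-sound rules. The sketch below shows that lookup order only. The dictionary contents are rough illustrative transcriptions, and `letter_to_sound` is a hypothetical placeholder that merely spells the word out, standing in for a real rule set.

```python
# ASSUMPTION: rough illustrative transcriptions, not taken from any real lexicon.
EXCEPTIONS = {
    "leicester": "/ˈlɛstər/",
    "arkansas": "/ˈɑːrkənsɔː/",
}


def letter_to_sound(word):
    """Hypothetical fallback: spell the word letter by letter."""
    return "/" + " ".join(word.lower()) + "/"


def pronounce(word):
    """Look the word up in the exception dictionary, else fall back to rules."""
    key = word.lower()
    if key in EXCEPTIONS:
        return EXCEPTIONS[key]
    return letter_to_sound(word)
```

Because no dictionary can cover every proper name, the quality of the fallback path matters as much as the size of the exception list.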

Finding correct intonation, stress, and duration from written text is probably the most challenging problem for years to come. Together, these features are called prosodic or suprasegmental features, and they may be considered the melody, rhythm, and emphasis of speech at the perceptual level. Intonation refers to how the pitch pattern, or fundamental frequency, changes during speech. The prosody of continuous speech depends on many separate aspects, such as the meaning of the sentence and the speaker's characteristics and emotions. The prosodic dependencies are shown in Figure 4.1. Unfortunately, written text usually contains very little information about these features, and some of them change dynamically during speech. However, this information may be given to a speech synthesizer with specific control characters.

The second task is to find the correct pronunciation for different contexts in the text. Some words, called homographs, cause perhaps the most difficult problems in TTS systems. Homographs are spelled the same way but differ in meaning and usually in pronunciation (e.g. fair, lives). The word lives, for example, is pronounced differently in the sentences "Three lives were lost" and "One lives to eat". Some words, e.g. lead, have different pronunciations when used as a verb or a noun, and between two noun senses (He followed her lead / He covered the hull with lead). With these kinds of words, some semantic information is necessary to achieve correct pronunciation.
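A minimal sketch of context-based disambiguation for the homograph lives follows. It assumes a single heuristic: a preceding plural quantifier or determiner signals the noun reading, and anything else signals the verb. The word list, the function names, and the transcriptions /laivz/ and /livz/ are all illustrative assumptions, and a practical system would rely on part-of-speech tagging and semantic analysis.

```python
# ASSUMPTION: a tiny hand-made word list stands in for real part-of-speech tagging.
PLURAL_DETERMINERS = {
    "two", "three", "four", "five", "many", "several",
    "their", "our", "these", "those", "all",
}

# Illustrative transcriptions for one homograph, keyed by part of speech.
HOMOGRAPHS = {"lives": {"noun": "/laivz/", "verb": "/livz/"}}


def guess_pos(words, i):
    """Guess noun vs. verb for words[i] from the word immediately before it."""
    prev = words[i - 1].lower() if i > 0 else ""
    return "noun" if prev in PLURAL_DETERMINERS else "verb"


def pronounce_homographs(sentence):
    """Replace known homographs with a context-dependent transcription."""
    words = sentence.split()
    result = []
    for i, word in enumerate(words):
        entry = HOMOGRAPHS.get(word.lower())
        result.append(entry[guess_pos(words, i)] if entry else word)
    return result
```

With this heuristic, "Three lives were lost" selects the noun form and "One lives to eat" selects the verb form, since "one" cannot quantify a plural noun. Cases like the two noun senses of lead cannot be separated this way and need semantic information.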

Timing at the sentence level, or correctly grouping words into phrases, is difficult because prosodic phrasing is not always marked in text by punctuation, and phrasal accentuation is almost never marked (Santen et al. 1997). If there are no breath pauses in speech, or if they are in the wrong places, the speech may sound very unnatural, or the meaning of the sentence may even be misunderstood. For example, the input string "John says Peter is a liar" can be spoken in two different ways, giving two different meanings: "John says: Peter is a liar" or "John, says Peter, is a liar". In the first sentence Peter is the liar, and in the second the liar is John.
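The dependence of pause placement on punctuation can be illustrated with a small sketch that turns punctuation marks into pause markers. The `<pause>` token and the function name are hypothetical, and the code assumes that every punctuation mark maps to exactly one pause, which is a simplification of real prosodic phrasing.

```python
import re

PAUSE = "<pause>"  # hypothetical marker for a breath pause


def phrase_breaks(text):
    """Split text into words, inserting a pause marker at each punctuation mark."""
    tokens = re.findall(r"[A-Za-z']+|[,:;.!?]", text)
    out = []
    for token in tokens:
        if token in ",:;.!?":
            # Avoid leading or doubled pauses.
            if out and out[-1] != PAUSE:
                out.append(PAUSE)
        else:
            out.append(token)
    return out
```

Applied to the three variants of the example sentence, the unpunctuated string yields no pauses at all, while the two punctuated versions yield pauses in different positions. That is exactly the information a synthesizer lacks when the text is unmarked.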

There are many methods to produce speech sounds after text and prosodic analysis. All these methods have some benefits and problems of their own.

4. a message drawn from unrestricted digital text, including anything from electronic mail to on-line newspapers to patent or legal texts, novels, or cookbooks;

