Early electronic speech synthesizers sounded robotic and were often barely intelligible. The quality of synthesized speech has steadily improved, but output from contemporary speech synthesis systems is still clearly distinguishable from actual human speech.
As improving cost-performance ratios make speech synthesizers cheaper and more widely accessible, more people will benefit from the use of text-to-speech programs.6
The first computer-based speech synthesis systems were created in the late 1950s, and the first complete text-to-speech system was completed in 1968. In 1961, physicist John Larry Kelly, Jr. and his colleague Louis Gerstman7 used an IBM 704 computer to synthesize speech, an event among the most prominent in the history of Bell Labs. Kelly's voice recorder synthesizer (vocoder) recreated the song "Daisy Bell", with musical accompaniment from Max Mathews. Coincidentally, Arthur C. Clarke was visiting his friend and colleague John Pierce at the Bell Labs Murray Hill facility. Clarke was so impressed by the demonstration that he used it in the climactic scene of his screenplay for his novel 2001: A Space Odyssey,8 where the HAL 9000 computer sings the same song as it is being put to sleep by astronaut David Bowman. Despite the success of purely electronic speech synthesis, research is still being conducted into mechanical speech synthesizers.9
The most important qualities of a speech synthesis system are naturalness and intelligibility. Naturalness describes how closely the output sounds like human speech, while intelligibility is the ease with which the output is understood. The ideal speech synthesizer is both natural and intelligible, and speech synthesis systems usually try to maximize both characteristics.
The two primary technologies for generating synthetic speech waveforms are concatenative synthesis and formant synthesis. Each technology has strengths and weaknesses, and the intended uses of a synthesis system will typically determine which approach is used.
Concatenative synthesis is based on the concatenation (or stringing together) of segments of recorded speech. Generally, concatenative synthesis produces the most natural-sounding synthesized speech. However, differences between natural variations in speech and the nature of the automated techniques for segmenting the waveforms sometimes result in audible glitches in the output. There are three main sub-types of concatenative synthesis.
In the 1930s, Bell Labs developed the Vocoder, a keyboard-operated electronic speech analyzer and synthesizer that was said to be clearly intelligible. Homer Dudley refined this device into the VODER, which he exhibited at the 1939 New York World's Fair.
Over a short period, say 25 milliseconds, a speech signal can be approximated by specifying three parameters: (1) the selection of either a periodic or random noise excitation, (2) the frequency of the periodic wave (if used), and (3) the coefficients of the digital filter used to mimic the vocal tract response. Continuous speech can then be synthesized by continually updating these three parameters about 40 times a second. This approach was responsible for one of the early commercial successes of DSP: the Speak & Spell, a widely marketed electronic learning aid for children. The sound quality of this type of speech synthesis is poor, sounding very mechanical and not quite human. However, it requires a very low data rate, typically only a few kbits/sec.
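The three-parameter source-filter loop described above can be sketched in a few lines of code. The sketch below is illustrative only: the sample rate, frame length, and the two-coefficient resonator are assumptions chosen for simplicity, not parameters of any particular synthesizer.

```python
import random

SAMPLE_RATE = 8000      # samples per second (assumed)
FRAME_SAMPLES = 200     # one 25 ms frame at 8 kHz

def synthesize_frame(voiced, pitch_hz, filter_coeffs, state):
    """Generate one frame of the source-filter model:
    an excitation signal driven through an all-pole vocal-tract filter."""
    out = []
    period = int(SAMPLE_RATE / pitch_hz) if voiced else 0
    for n in range(FRAME_SAMPLES):
        if voiced:
            # Periodic excitation: an impulse train at the pitch frequency
            excitation = 1.0 if n % period == 0 else 0.0
        else:
            # Unvoiced excitation: low-level random noise
            excitation = random.uniform(-0.1, 0.1)
        # All-pole filter: y[n] = x[n] + sum_k a_k * y[n-k]
        y = excitation + sum(a * s for a, s in zip(filter_coeffs, state))
        state = [y] + state[:-1]
        out.append(y)
    return out, state

# A vowel-like frame: 100 Hz pitch through a stable two-pole resonator.
# Continuous speech would call this ~40 times a second with new parameters.
state = [0.0, 0.0]
frame, state = synthesize_frame(True, 100.0, [1.3, -0.9], state)
```

Updating `voiced`, `pitch_hz`, and `filter_coeffs` every frame is what lets so few numbers stand in for the full waveform.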
Unit selection synthesis uses large databases of recorded speech. During database creation, each recorded utterance is segmented into some or all of the following: individual phones, diphones, half-phones, syllables, morphemes, words, phrases, and sentences. Typically, the division into segments is done using a specially modified speech recognizer set to a "forced alignment" mode, with some manual correction afterward using visual representations such as the waveform and spectrogram.10 An index of the units in the speech database is then created based on the segmentation and acoustic parameters such as the fundamental frequency, duration, position in the syllable, and neighboring phones. At run time, the desired target utterance is created by determining the best chain of candidate units from the database (unit selection). This process is typically achieved using a specially weighted decision tree.
Unit selection provides the greatest naturalness, because it applies only a small amount of digital signal processing (DSP) to the recorded speech. DSP often makes recorded speech sound less natural, although some systems use a small amount of signal processing at the point of concatenation to smooth the waveform. The output from the best unit-selection systems is often indistinguishable from real human voices, especially in contexts for which the TTS system has been tuned. However, maximum naturalness typically requires unit-selection speech databases to be very large, in some systems ranging into the gigabytes of recorded data and representing dozens of hours of speech.11 Also, unit-selection algorithms have been known to select segments that result in less than ideal synthesis (e.g. minor words become unclear) even when a better choice exists in the database.12
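The search for the "best chain of candidate units" can be illustrated with a toy dynamic-programming sketch. Real systems balance many acoustic features; here a single number per unit (standing in for, say, pitch) is an assumption made to keep the example small, and the cost weights are arbitrary.

```python
def select_units(targets, candidates, join_weight=1.0):
    """Pick one candidate unit per target position, minimizing
    target cost (distance to the desired feature) plus concatenation
    cost (mismatch between adjacent chosen units), Viterbi-style.

    targets: list of desired feature values, one per position.
    candidates: per position, a list of recorded-unit feature values.
    Returns the minimum-cost sequence of candidate indices."""
    n = len(targets)
    # cost[i][j]: best total cost ending with candidate j at position i
    cost = [[abs(c - targets[0]) for c in candidates[0]]]
    back = []
    for i in range(1, n):
        row, brow = [], []
        for c in candidates[i]:
            target_cost = abs(c - targets[i])
            # Best predecessor: accumulated cost + join mismatch
            best_k = min(range(len(candidates[i - 1])),
                         key=lambda k: cost[-1][k]
                         + join_weight * abs(candidates[i - 1][k] - c))
            row.append(cost[-1][best_k]
                       + join_weight * abs(candidates[i - 1][best_k] - c)
                       + target_cost)
            brow.append(best_k)
        cost.append(row)
        back.append(brow)
    # Backtrack from the cheapest final unit
    j = min(range(len(cost[-1])), key=lambda k: cost[-1][k])
    path = [j]
    for brow in reversed(back):
        j = brow[j]
        path.append(j)
    return list(reversed(path))

# The chain 95 -> 108 -> 121 wins: close to the targets and to each other.
print(select_units([100, 110, 120], [[95, 130], [108, 90], [121, 100]]))
# prints [0, 0, 0]
```

The failure mode noted above corresponds to this cost function ranking a poor unit ahead of a better one; production systems tune the weights and features to make that rare.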
This is also the basis for the linear predictive coding (LPC) method of speech compression. Digitally recorded human speech is broken into short segments, and each is characterized according to the three parameters of the model. This typically requires about a dozen bytes per segment, or 2 to 6 kbytes/sec. The segment information is transmitted or stored as needed, and then reconstructed with the speech synthesizer.
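To make the "dozen bytes per segment" concrete, each segment can be packed as a compact binary record holding the three model parameters. The layout below (one voicing flag, a 16-bit pitch, ten filter coefficients quantized to one signed byte each) is a hypothetical encoding, not a real codec's format.

```python
import struct

# "<?H10b": little-endian bool (1 byte) + unsigned short pitch (2 bytes)
# + 10 signed-byte filter coefficients = 13 bytes per 25 ms segment.
SEGMENT_FORMAT = "<?H10b"

def encode_segment(voiced, pitch_hz, coeffs):
    """Pack one segment's parameters into a compact binary record."""
    assert len(coeffs) == 10
    # Quantize coefficients to 1/64 steps, clamped to a signed byte
    quantized = [max(-128, min(127, int(c * 64))) for c in coeffs]
    return struct.pack(SEGMENT_FORMAT, voiced, int(pitch_hz), *quantized)

def decode_segment(blob):
    """Recover the (approximate) parameters for the synthesizer."""
    voiced, pitch, *q = struct.unpack(SEGMENT_FORMAT, blob)
    return voiced, float(pitch), [v / 64 for v in q]

blob = encode_segment(True, 120.0, [0.5, -0.3] + [0.0] * 8)
print(len(blob))  # prints 13
```

At 40 segments per second this works out to roughly half a kilobyte per second, which is why parametric coding achieves such low data rates compared with storing the raw waveform.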