A group file contains the basic parameters, the diphone index, the signal (original waveform or LPC residual), LPC coefficients, and the pitch marks. It is all you need for a run-time synthesizer. Various compression mechanisms are supported to allow smaller databases if desired. A full English LPC plus residual database at 8k ulaw is about 3 megabytes, while a full 16 bit version at 16k is about 8 megabytes.

The standard method for diphone resynthesis in the released system is residual excited LPC. The actual method of resynthesis isn't important to the database format, but if residual LPC synthesis is to be used then it is necessary to make the LPC coefficient files and their corresponding residuals.
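To make the idea of LPC coefficients and their corresponding residual concrete, here is a minimal sketch in Python with NumPy. It is not the released system's implementation: the function names (`lpc_coefficients`, `lpc_residual`, `lpc_resynthesize`), the single-frame processing, and the test signal are all assumptions made for illustration. It uses the autocorrelation method with the Levinson-Durbin recursion, inverse-filters the frame to obtain the residual, and then drives the all-pole filter with that residual to recover the original frame.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Autocorrelation-method LPC via Levinson-Durbin (illustrative helper).

    Returns a with a[0] == 1, so the inverse filter is
    A(z) = sum_k a[k] z^-k.
    """
    n = len(frame)
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]  # prediction error energy, shrinks at every step
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err  # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a

def lpc_residual(frame, a):
    """Inverse filter: e[n] = sum_k a[k] * x[n-k], zero initial state."""
    return np.convolve(frame, a)[:len(frame)]

def lpc_resynthesize(residual, a):
    """All-pole synthesis: x[n] = e[n] - sum_{k>=1} a[k] * x[n-k]."""
    order = len(a) - 1
    out = np.zeros(len(residual))
    for n in range(len(residual)):
        acc = residual[n]
        for k in range(1, min(order, n) + 1):
            acc -= a[k] * out[n - k]
        out[n] = acc
    return out

# Hypothetical test frame: two sinusoids plus a little noise.
rng = np.random.default_rng(0)
t = np.arange(400)
frame = (np.sin(2 * np.pi * 0.05 * t)
         + 0.5 * np.sin(2 * np.pi * 0.12 * t)
         + 0.1 * rng.standard_normal(len(t)))

a = lpc_coefficients(frame, order=10)
res = lpc_residual(frame, a)
recon = lpc_resynthesize(res, a)
```

Because the residual is exactly what the predictor failed to capture, exciting the synthesis filter with it reproduces the frame up to floating-point error. A real system would process many frames, typically pitch-synchronously, and carry filter state across frame boundaries; this sketch omits both.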

I have seen a few high-quality diphone synthesizers with small voice sample libraries. The papers I have read about speech synthesis, and about those smaller synthesizers specifically, say they used LPC (linear predictive coding) to make the voice sample library much smaller, and that LPC also gives them easier pitch control when assembling speech from voice samples.

A hybrid time domain and LPC approach to speech pitch control is developed. This approach uses a low order LPC analysis and residual excitation to alter pitch period length during voiced speech. This approach differs from standard residual excited LPC in that LP reconstruction is applied only during voiced segments. Listening tests were used to compare PSOLA and the hybrid method under conditions of increasing or decreasing F0 in natural speech tokens. The natural speech was recorded by two talkers, a female adult and female child. Results suggest that, while the overall performance of the two methods is similar, the methods differ in their effectiveness with direction of F0 shift and over talkers.

Keywords: Speech, diphone synthesis, pitch modification.

I. Introduction

Control of intonation and timing is difficult in diphone concatenation synthesis, especially when one aim of the synthesis is to capture and present the voice characteristics of a specific talker. One method for c...
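As a rough illustration of what altering pitch period length in the residual can look like, the sketch below trims or zero-pads each pitch period of an LPC residual. It is not the hybrid method described in the abstract, whose details are not given here. The function name `rescale_pitch_periods`, the use of pitch marks as period boundaries, and the choice to trim or pad at the end of each period are assumptions for illustration only.

```python
import numpy as np

def rescale_pitch_periods(residual, pitch_marks, f0_factor):
    """Shorten or lengthen every pitch period of a residual signal.

    pitch_marks: ascending sample indices; consecutive marks delimit one
                 period (an assumption of this sketch).
    f0_factor:   > 1 raises F0 (shorter periods), < 1 lowers it.
    """
    pieces = []
    for start, end in zip(pitch_marks[:-1], pitch_marks[1:]):
        period = residual[start:end]
        new_len = max(1, int(round(len(period) / f0_factor)))
        if new_len <= len(period):
            # Raising F0: drop the tail of the period.
            pieces.append(period[:new_len])
        else:
            # Lowering F0: extend the period with zeros.
            pieces.append(np.pad(period, (0, new_len - len(period))))
    return np.concatenate(pieces)

# Hypothetical residual with three 100-sample periods.
example_residual = np.arange(300, dtype=float)
example_marks = [0, 100, 200, 300]
raised = rescale_pitch_periods(example_residual, example_marks, 2.0)
lowered = rescale_pitch_periods(example_residual, example_marks, 0.5)
```

The modified residual would then be passed through the LPC synthesis filter for the voiced segment. Note that this sketch also changes the overall duration; keeping duration fixed would additionally require repeating or dropping whole periods, which is left out here.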

In the mid-1990s, work commenced on the integration of channel vocoder based diphone concatenation into the SHLRC TTS system. The techniques adopted were based on prior research by Clark and Mannell which had produced a substantial amount of quantitative evidence on the relationship between human speech perception and synthesiser design. The most intelligible speech was produced by channel synthesis utilising a model of the human auditory periphery (Bark frequency scale). The speech produced by such a system was significantly more intelligible (for consonants) than was formant-based synthesis. It was anticipated that the quality of the speech produced by such a Bark scaled channel synthesis system would be superior to that of existing synthesisers based on formant or LPC methods.

Word, syllable and diphone or demi-syllable concatenation synthesis systems are not new (Olive & Nakatani 1974, Olive 1977, Shadle & Atal 1978). Most such systems utilise formant synthesis methods (eg. ten Bosch et al 1989), LPC methods (eg. Rodet & Depalle 1985, Stella & Charpentier 1985) or waveform concatenation (Charpentier et al 1986, 1989). Unless LPC systems utilise a large number of coefficients they suffer from one of the drawbacks of formant systems. That is, they make a priori assumptions about the number of poles required to model the speech adequately. Further, LPC analysis is limited in its ability to model anti-resonances (but see Markel & Gray 1976, pp 271-275). LPC analysis also makes no assumptions about the perceptual importance of peaks in various parts of the spectrum. Two peaks which are prominent in the 0 - 4 kHz band will be treated identically to two equally prominent and equally separated peaks in the 0 - 1 kHz band. In other words, LPC systems are not normally modelled on the characteristics of the auditory system (although conceivably they could be weighted to simulate auditory models). The diphone concatenation system of Charpentier et al (1986, 1989) utilises pitch synchronous overlapping frames stored as 512 point FFTs which are added together using the overlap-add algorithm (also used in our channel vocoder). Charpentier's method is analogous to a uniform filterbank method consisting of 256 uniformly spaced BP filters and, in a similar fashion to the LPC methods, makes no attempt to utilise an auditory model. In consequence, the size of Charpentier's diphone library is approximately 7 Mbytes, much of which consists of auditorily redundant data.
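The point about equal frequency separations not being perceptually equal can be illustrated with a Hz-to-Bark conversion. The sketch below uses the widely cited Zwicker and Terhardt approximation of the Bark scale. Whether the system described here uses this exact mapping is an assumption, and the function names are illustrative only.

```python
import math

def hz_to_bark(f_hz):
    """Zwicker & Terhardt approximation of the Bark scale (assumed mapping)."""
    return (13.0 * math.atan(0.00076 * f_hz)
            + 3.5 * math.atan((f_hz / 7500.0) ** 2))

def band_width_in_bark(lo_hz, hi_hz):
    """Width of a frequency band expressed in Bark."""
    return hz_to_bark(hi_hz) - hz_to_bark(lo_hz)

# Two bands of identical width in Hz, compared on the auditory scale.
low_band = band_width_in_bark(0.0, 1000.0)
high_band = band_width_in_bark(3000.0, 4000.0)
print(f"0-1 kHz spans {low_band:.2f} Bark, 3-4 kHz spans {high_band:.2f} Bark")
```

Under this approximation the 0-1 kHz band covers several times as many Bark as the equally wide 3-4 kHz band. That is why a uniformly spaced filterbank, or an analysis that weights all frequencies alike, spends much of its resolution where the ear resolves least.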

