A Flite-based synthesizer has been thoroughly tested, and it runs on multiple platforms. The example voice distributed is an 8KHz diphone voice, the same voice as kallpc8K distributed with Festival. That voice is rather old and not very good, but we deliberately chose a stable voice as our first example so that we could properly verify that the quality in Flite is the same as it is under Festival.
The following table gives code/data size comparisons for the 8KHz kal voice.
Festival doesn't have a clear separation between its language implementation and its core code, so it is difficult to give a figure for its US English component. However, the Festival Scheme representation of a basic duration model alone is 35 kilobytes.
             Flite    Festival
core code    50K      2.6M
USEnglish    35K      ??
lexicon      1.6M     5M
diphonedb    2.1M     2.1M
Run-time memory requirements for Flite are less than twice the size of the largest waveform built. In its current form, a complete 16-bit waveform is built for each utterance being synthesized, and the complete run-time memory requirement is about 1.75 times the size of that waveform. For our test set of the first two chapters of ``Alice's Adventures in Wonderland,'' the requirement is less than 1 megabyte. For the same task, Festival using the equivalent 8KHz diphone voice requires about 16-20 megabytes.
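The relationship between utterance length and run-time memory can be sketched as a small calculation. This is only an illustration under stated assumptions: the function names are hypothetical, and the only inputs taken from the text are the 16-bit sample size, the 8KHz sample rate, and the approximate factor of 1.75.

```python
# Rough estimate of run-time memory from utterance length.
# Assumptions: 16-bit mono samples and the ~1.75x factor quoted in the text.
# The function names are hypothetical and not part of any Flite API.

BYTES_PER_SAMPLE = 2      # 16-bit waveform
OVERHEAD_FACTOR = 1.75    # approximate ratio of total memory to waveform size


def waveform_bytes(duration_seconds, sample_rate_hz=8000):
    """Size in bytes of a complete waveform of the given duration."""
    return int(duration_seconds * sample_rate_hz) * BYTES_PER_SAMPLE


def estimated_runtime_bytes(duration_seconds, sample_rate_hz=8000):
    """Estimated total run-time memory while synthesizing one utterance."""
    return int(waveform_bytes(duration_seconds, sample_rate_hz) * OVERHEAD_FACTOR)


if __name__ == "__main__":
    longest = 10.0  # seconds; an illustrative utterance length, not a measured one
    print(waveform_bytes(longest), estimated_runtime_bytes(longest))
```

Because the waveform is built per utterance, this estimate depends on the longest single utterance in a text rather than on the total length of the text.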
The current Flite system with an 8KHz diphone voice has a full footprint of 5M: 4M of code and data plus 1M of RAM. The equivalent for Festival is about 30-40M.
As for speed of synthesis, our test consists of the first two chapters of ``Alice's Adventures in Wonderland,'' which render to just under 22 minutes of speech. On a 500MHz PIII running Linux, Flite renders this in 19.1 seconds (70.6 times faster than real time), while the equivalent voice in Festival takes 97 seconds (13.4 times faster than real time). Thus Flite is over 5 times faster than Festival.
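The two kinds of ratio used here, speed relative to real time and speed relative to Festival, can be computed as follows. This is a minimal sketch with hypothetical function names. The render times are the ones reported above. The audio duration is an assumption, since the text gives it only as "just under 22 minutes", so the real-time factors come out close to the reported figures but not identical to them.

```python
# Speed ratios for the synthesis benchmark.
# The render times come from the text; AUDIO_SECONDS is an assumed round figure.

def real_time_factor(audio_seconds, render_seconds):
    """Seconds of speech produced per second of rendering time."""
    return audio_seconds / render_seconds


def relative_speedup(slow_seconds, fast_seconds):
    """How many times faster the fast system is on the same task."""
    return slow_seconds / fast_seconds


AUDIO_SECONDS = 22 * 60   # assumption: the text says "just under 22 minutes"
FLITE_SECONDS = 19.1      # reported Flite render time
FESTIVAL_SECONDS = 97.0   # reported Festival render time

if __name__ == "__main__":
    print(round(real_time_factor(AUDIO_SECONDS, FLITE_SECONDS), 1))
    print(round(real_time_factor(AUDIO_SECONDS, FESTIVAL_SECONDS), 1))
    print(round(relative_speedup(FESTIVAL_SECONDS, FLITE_SECONDS), 2))
```

The relative speedup does not depend on the audio duration at all, because both systems render the same text.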
Another key speed test was to time how quickly the system can start to speak. For a 20 word utterance, Flite starts writing to the audio device in 45ms; for a 40 word utterance it takes about 75ms. The startup time before the first synthesis function is called is about 23ms. For Festival running from the command line, the equivalent is about 4-5 seconds. Even when running Festival as a server and using the client access method, and thus excluding the startup time, we still can't make the time less than 1 second for the 20 word utterance, and it is nearer 2 seconds for the 40 word utterance.
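A time-to-first-audio measurement of this kind can be sketched as below. This is not how the figures above were obtained; it only illustrates the idea. The `fake_synthesizer` generator is a hypothetical stand-in that, like the whole-utterance approach described earlier, does all of its work before yielding any audio, so its latency grows with the number of words.

```python
import time

# Sketch of a time-to-first-audio measurement.
# fake_synthesizer is a hypothetical stand-in, not a Flite or Festival call.


def time_to_first_chunk(chunks):
    """Return (latency_seconds, first_chunk) for an iterator of audio chunks."""
    start = time.perf_counter()
    first = next(iter(chunks))          # blocks until the first audio is ready
    return time.perf_counter() - start, first


def fake_synthesizer(text, delay_per_word=0.001):
    """Simulate a synthesizer that builds the whole utterance before output."""
    words = text.split()
    time.sleep(delay_per_word * len(words))   # simulated synthesis work
    yield b"\x00\x00" * 8000                  # one second of 16-bit silence at 8KHz


if __name__ == "__main__":
    utterance = " ".join(["word"] * 20)
    latency, chunk = time_to_first_chunk(fake_synthesizer(utterance))
    print(f"first audio after {latency * 1000:.1f} ms, {len(chunk)} bytes")
```

Starting the clock before the first `next` call matters: a Python generator does no work until it is first advanced, so the measured interval covers the simulated synthesis.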