CPU optimised TTS - Kitten TTS - Sam Witteveen channel



https://www.youtube.com/watch?v=YpQWdrfzSzQ


🐱 Kitten TTS - Model Overview & Review

Kitten TTS is a new open-source text-to-speech framework developed by Kitten ML. The project prioritizes small file sizes and CPU-only inference, which makes it a good fit for edge devices and browser-based applications.


🚀 Key Features

  • Ultra-Lightweight: The smallest model (the Int8 version of Nano) is under 25 MB.
  • CPU Optimized: Designed to run without a GPU.
  • Edge Ready: Can run in browsers, mobile phones, and IoT devices with minimal RAM.
  • Open Source: Released under the permissive Apache 2.0 License.
  • Fast Inference: Optimized for real-time speech synthesis.
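One common way to check a "real-time" claim on your own hardware is the real-time factor (RTF): synthesis time divided by the duration of the audio produced. The sketch below is only an illustration. `fake_synthesize` is a hypothetical stand-in for an actual model call, and the 24 kHz sample rate is an assumption, not something stated in the video.

```python
import time
from typing import Callable, List, Sequence

SAMPLE_RATE = 24_000  # assumed output rate in Hz; check the real model's rate


def real_time_factor(
    synthesize: Callable[[str], Sequence[float]],
    text: str,
    sample_rate: int = SAMPLE_RATE,
) -> float:
    """Return synthesis time divided by the duration of the generated audio."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start

    audio_seconds = len(samples) / sample_rate
    if audio_seconds == 0:
        raise ValueError("synthesize() returned no audio")
    return elapsed / audio_seconds


def fake_synthesize(text: str) -> List[float]:
    """Hypothetical stand-in for a TTS call: 0.05 s of silence per character."""
    samples_per_char = SAMPLE_RATE // 20
    return [0.0] * (len(text) * samples_per_char)


if __name__ == "__main__":
    rtf = real_time_factor(fake_synthesize, "Hello from a tiny model.")
    print(f"RTF = {rtf:.4f} (below 1.0 means faster than real time)")
```

To measure a real model, pass its synthesis function in place of `fake_synthesize`; an RTF below 1.0 on a CPU-only machine would support the real-time claim for that text length.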

📦 Model Sizes & Variations

Kitten TTS offers three distinct model sizes, plus a quantized version of the smallest model.

Model Name       | Parameters | Disk Size | Description
-----------------|------------|-----------|---------------------------------------
Kitten-TTS-Mini  | 80 million | ~80 MB    | The “largest” model available.
Kitten-TTS-Micro | 40 million | ~41 MB    | Mid-range balance of size/quality.
Kitten-TTS-Nano  | 15 million | ~56 MB    | The smallest base model.
Nano (Int8)      | 15 million | < 25 MB   | 8-bit quantized version. Extremely portable.
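The table shows Nano taking more disk space than Micro despite having fewer parameters. A quick way to see what is going on is to divide disk size by parameter count. The sketch below uses only the approximate figures from the table and assumes 1 MB = 10^6 bytes.

```python
# (name, parameter count, approximate disk size in MB), taken from the table
MODELS = [
    ("Kitten-TTS-Mini", 80_000_000, 80),
    ("Kitten-TTS-Micro", 40_000_000, 41),
    ("Kitten-TTS-Nano", 15_000_000, 56),
]


def bytes_per_parameter(params: int, size_mb: float) -> float:
    """Approximate bytes stored per parameter, taking 1 MB as 10**6 bytes."""
    return size_mb * 1_000_000 / params


if __name__ == "__main__":
    for name, params, size_mb in MODELS:
        ratio = bytes_per_parameter(params, size_mb)
        print(f"{name:<18} {ratio:.2f} bytes/param")
```

Mini and Micro come out at roughly 1 byte per parameter and Nano at nearly 4. That pattern would be consistent with the base Nano file being stored at a higher weight precision than the other two, but this is an inference from the numbers, not something the video confirms.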

🧪 Performance & Audio Quality

The video compares the models in a Google Colab notebook running entirely on CPU.

  • General Quality: The output does not reach the realism of much larger systems (such as Qwen TTS or ElevenLabs), but it is impressive relative to the tiny file size.
  • Size vs. Quality: Surprisingly, voice character degrades only slightly between the 80M (Mini) and 15M (Nano) models.
  • The 8-Bit Quantized Model:
    • Pros: Runs very fast; the file is under 25 MB.
    • Cons: Introduces some audio artifacts and handles punctuation and pausing slightly worse, which sometimes produces run-on sentences.
  • Voices: The system uses voice embeddings, similar to Kokoro TTS. Available voices include:
    • Bella, Jasper, Luna, Bruno, Rosie, Hugo, Kiki, Leo.
    • Notable mentions: Hugo (formal/news anchor style) and Luna (storytelling style) performed well.
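One possible workaround for the run-on sentences noted for the quantized model is to split the text at sentence boundaries, synthesize each sentence separately, and insert a short silence between them. This is not shown in the video; it is a minimal sketch in which `synthesize` is a hypothetical callable standing in for the model, and the 24 kHz sample rate and 300 ms pause are assumed values.

```python
import re
from typing import Callable, List, Sequence

SAMPLE_RATE = 24_000  # assumed output rate in Hz
PAUSE_MS = 300        # illustrative pause length between sentences


def split_sentences(text: str) -> List[str]:
    """Split after '.', '!' or '?' followed by whitespace; drop empty pieces."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def synthesize_with_pauses(
    text: str,
    synthesize: Callable[[str], Sequence[float]],
    sample_rate: int = SAMPLE_RATE,
    pause_ms: int = PAUSE_MS,
) -> List[float]:
    """Synthesize sentence by sentence, joining the chunks with silence."""
    silence = [0.0] * (sample_rate * pause_ms // 1000)
    audio: List[float] = []
    for index, sentence in enumerate(split_sentences(text)):
        if index > 0:
            audio.extend(silence)  # explicit pause instead of relying on the model
        audio.extend(synthesize(sentence))
    return audio


if __name__ == "__main__":
    # Hypothetical stand-in model: one sample per character.
    def fake(sentence: str) -> List[float]:
        return [1.0] * len(sentence)

    print(split_sentences("Hello there. How are you? Fine!"))
    print(len(synthesize_with_pauses("Hi. Yo!", fake)))
```

The regular expression is deliberately simple and will split on abbreviations such as "Dr. Smith", so a real pipeline would likely need a more careful sentence splitter.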

🛠️ Technical Details

💭 Conclusion

Kitten TTS represents a shift toward TinyML in the audio space. It shows that TTS systems are becoming efficient enough to run fully client-side (in-browser or on-device) without relying on cloud APIs or expensive GPUs. The smallest versions have minor audio artifacts, but for mobile and web apps that trade-off buys a footprint of under 25 MB.


Resources: