https://www.youtube.com/watch?v=jZ8wPB-KI8g

Here’s a summary of the Qwen3-TTS family of models:

  • Open Source: The Qwen team has open-sourced the Qwen3-TTS family, which covers voice design, voice cloning, and text-to-speech generation. The models are available on Hugging Face.
  • Model Sizes: There are two main sizes:
    • 0.6B Model: A smaller version supporting 9 premium timbres across 10 languages, suitable for low-latency streaming.
    • 1.7B Model: A larger, more powerful version with advanced capabilities like instruction control for voice design and high-quality voice cloning.
  • Key Features:
    • Multilingual & Multi-dialect Support: Supports 10 mainstream languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian) and several dialects.
    • Voice Design: Allows users to describe a voice in natural language (e.g., “a deep, gravelly voice with wisdom”) to generate speech with specific characteristics.
    • Voice Cloning: Clones a voice from a reference audio sample as short as 3 seconds.
    • Instruction Control: Provides fine-grained control over emotions (happy, sad, etc.), speaking styles (whispering, shouting), and character traits.
    • End-to-End Architecture: Uses a discrete multi-codebook language model (LM) for end-to-end speech modeling, aiming at high-speed, high-fidelity speech reconstruction.
    • Smart Text Handling: Capable of correctly pronouncing complex text, such as mathematical equations and technical symbols, without needing phonetic transcriptions.
  • Demos: The models can be tried in a Hugging Face Space demo or in the provided Colab notebooks, which cover tasks such as multi-speaker comparison, batch inference, and long-form text generation.
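
To make the "discrete multi-codebook" idea from the architecture point a bit more concrete, here is a toy sketch of residual vector quantization, one common way to represent a single audio frame with several discrete codes. This is an illustration only: the summary does not say how Qwen3-TTS builds its codebooks, and the codebook values, the 2-D "frame", and the function names below are all made up for the example.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Return the index of the codebook entry closest to x (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], x)),
    )


def rvq_encode(codebooks: Sequence[Sequence[Vector]], x: Vector) -> List[int]:
    """Quantize x stage by stage; each stage encodes what the previous stages missed."""
    residual = list(x)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codebooks: Sequence[Sequence[Vector]], codes: Sequence[int]) -> List[float]:
    """Reconstruct the vector by summing the selected entry from every codebook."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, codes):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical toy codebooks: the first is coarse, the second refines the residual.
CODEBOOKS = [
    [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)],
    [(0.0, 0.0), (0.25, 0.0), (0.0, -0.25)],
]

frame = (1.2, 0.9)                      # stand-in for one frame of audio features
codes = rvq_encode(CODEBOOKS, frame)    # one discrete token per codebook
approx = rvq_decode(CODEBOOKS, codes)   # reconstruction from the tokens alone
print(codes, approx)
```

In a setup like this, each frame becomes a short tuple of integer tokens (one per codebook), which is the kind of discrete sequence a language model can predict. Adding more codebooks trades a longer token tuple for a closer reconstruction.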