Generated: 2026-04-29 · API: Gemini 2.5 Flash · Modes: Summary


Google DeepMind’s Gemma 4: Open-Source AI Models and Architectural Innovations

Clip title: Open Models at Google DeepMind — Cassidy Hardin, Google DeepMind
Author / channel: AI Engineer
URL: https://www.youtube.com/watch?v=_A367W_qvc8

Summary

Google DeepMind has introduced Gemma 4, the latest generation of its family of open-source AI models, which the presenter highlights for strong performance at small model sizes. Released under the Apache 2.0 license, Gemma 4 is meant to be more accessible to everyday developers and easier to integrate into development workflows, from testing through deployment. The release covers a range of models designed for different applications and hardware.

The Gemma 4 family includes four models: two smaller “effective” models (E2B and E4B) optimized for on-device use on phones, iPads, and laptops, and two larger models (26B and 31B) for more complex reasoning tasks. E2B and E4B accept text, vision, and audio inputs, while 26B and 31B accept text and vision. The 26B is the first Gemma Mixture of Experts (MoE) model; the 31B is a dense model. According to the presenter, both larger models have achieved top rankings on LMArena, outperforming significantly larger models and setting a new reference point for what small open-source models can achieve.

A core focus of Gemma 4’s enhancements is its refined architecture. The attention mechanism interleaves local and global layers at a 5:1 ratio (4:1 for E2B): local layers use sliding-window attention, while global layers attend to all preceding tokens. Grouped Query Attention (GQA) improves efficiency by having groups of queries share key and value heads, two for local layers and eight for global layers, with key/value head lengths doubled in the global layers. The 26B model introduces a Mixture of Experts (MoE) design with 128 total experts, eight of which are active per forward pass, alongside a larger shared expert that is always active. For the on-device effective models (E2B and E4B), Per-Layer Embeddings (PLE) store the embedding tables in flash memory instead of VRAM, which substantially reduces memory overhead and improves performance on constrained hardware.
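The layer interleaving, attention windows, head sharing, and expert selection described above can be illustrated with a small Python sketch. This is not Gemma 4's implementation. The function names, the placement of the global layer at the end of each group, the window size, and the reading of "two" and "eight" as the number of query heads per shared key/value head are all assumptions made for illustration.

```python
from typing import List


def layer_schedule(num_layers: int, local_per_global: int = 5) -> List[str]:
    """Label each layer 'local' or 'global'.

    Assumption: one global layer follows every `local_per_global` local
    layers (5 for the 5:1 ratio, 4 for the 4:1 ratio mentioned for E2B).
    """
    period = local_per_global + 1
    return ["global" if (i + 1) % period == 0 else "local"
            for i in range(num_layers)]


def visible_positions(query_pos: int, kind: str, window: int) -> List[int]:
    """Causal positions a query at `query_pos` may attend to.

    Global layers see every position up to the query; local layers see
    only the most recent `window` positions (hypothetical window size).
    """
    start = 0 if kind == "global" else max(0, query_pos - window + 1)
    return list(range(start, query_pos + 1))


def kv_head_for_query(query_head: int, queries_per_kv: int) -> int:
    """Grouped Query Attention: consecutive query heads share one KV head."""
    return query_head // queries_per_kv


def route_top_k(router_scores: List[float], k: int = 8) -> List[int]:
    """Indices of the k highest-scoring experts, in ascending index order."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda e: router_scores[e], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    print(layer_schedule(12))                # five local layers, then one global, twice
    print(visible_positions(10, "local", 4))  # last four positions only
    print(kv_head_for_query(5, 2))            # query head 5 in a group of two
```

With 128 router scores, `route_top_k` returns eight expert indices; in the design described above, the always-active shared expert would run in addition to those eight.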

Gemma 4 also extends its native multimodal capabilities, building on the vision support introduced in Gemma 3 and adding audio input to the E2B and E4B models. The vision encoder supports variable aspect ratios and resolutions, and developers can choose among five soft token budgets to suit tasks such as OCR or object recognition. Images are split into 16×16-pixel patches, which are then pooled into single embeddings and passed to the model while preserving spatial positional encoding. For audio, the E2B and E4B models incorporate a 305-million-parameter Conformer and an audio tokenizer, enabling speech recognition and translation by processing audio embeddings rather than raw tokens.
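To make the relationship between image size, 16×16 patches, and a soft token budget concrete, here is a rough arithmetic sketch in Python. Only the patch size comes from the summary. The pooling scheme (averaging square blocks of neighboring patches), the rounding-up for partial patches, the budget value of 256, and all function names are assumptions and may not match how Gemma 4's vision encoder actually works.

```python
import math
from typing import Tuple

PATCH = 16  # patch side length in pixels, as stated in the summary


def patch_grid(height: int, width: int) -> Tuple[int, int]:
    """Rows and columns of 16x16 patches (partial patches rounded up; assumed)."""
    return math.ceil(height / PATCH), math.ceil(width / PATCH)


def pooled_token_count(height: int, width: int, pool: int) -> int:
    """Embeddings left after pooling each pool x pool block of patches (assumed scheme)."""
    rows, cols = patch_grid(height, width)
    return math.ceil(rows / pool) * math.ceil(cols / pool)


def smallest_pool_for_budget(height: int, width: int, budget: int) -> int:
    """Smallest pooling factor whose token count fits within `budget`."""
    pool = 1
    while pooled_token_count(height, width, pool) > budget:
        pool += 1
    return pool


if __name__ == "__main__":
    h, w = 768, 1024
    rows, cols = patch_grid(h, w)
    print(rows, cols, rows * cols)                # 48 64 3072
    pool = smallest_pool_for_budget(h, w, 256)     # 256 is a hypothetical budget
    print(pool, pooled_token_count(h, w, pool))    # 4 192
```

Under these assumptions, a larger budget keeps more embeddings per image, which is the kind of trade-off a detail-sensitive task such as OCR would favor over coarse object recognition.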

In conclusion, Gemma 4 is presented as a substantial step forward for open-source AI in performance, efficiency, and accessibility across a range of model sizes and modalities. Developers can get started with Gemma 4 today by downloading and self-hosting any of the models via Hugging Face, Kaggle, or Ollama, or by accessing the larger 31B and 26B models through Google Cloud-hosted options such as AI Studio and Vertex AI for prototyping and deploying agentic workflows.