Google DeepMind Releases DiffusionGemma with 4x Speed Boost for Local AI

2026-06-10 19:39:38

Google DeepMind released DiffusionGemma, a new member of the Gemma 4 open model family that generates text through parallel processing rather than sequential token generation. The model is built to run faster and more efficiently on local hardware, including Nvidia DGX systems and consumer gaming GPUs. Unlike autoregressive models that produce text left to right one token at a time, DiffusionGemma uses a diffusion-based approach similar to image generation models: it starts with placeholder tokens and refines them across multiple passes to produce entire text blocks at once. This architectural shift enables approximately four times the output speed of similarly sized autoregressive Gemma models while fitting within the memory constraints of high-end consumer GPUs.

DiffusionGemma Uses Diffusion-Based Architecture for Parallel Text Generation

Most AI models are autoregressive, generating text left to right one token at a time. DiffusionGemma has more in common with image generation models, which start with static and then denoise it to create the desired content. The model starts with a field of placeholder tokens and runs over this canvas multiple times, generating likely tokens and using them to improve its estimates of the others. At the end of the process, the model finalizes its token outputs in one large block: the "denoised" text canvas.
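To make the multi-pass idea concrete, here is a minimal toy sketch in Python of iterative refinement over a canvas of placeholder tokens. It is not DiffusionGemma's actual algorithm or API: the `toy_predict` function is a hypothetical stand-in for the model (it simply looks up the answer and invents a confidence score), and the names `MASK`, `denoise`, and `steps` are assumptions made for illustration. The sketch also only commits tokens; it does not model the revision of already-placed tokens.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token, not a real Gemma token


def toy_predict(canvas, target, rng):
    """Hypothetical stand-in for the model.

    For every still-masked slot, return (proposed_token, confidence).
    A real model would predict tokens; this toy version reads them from
    `target` and fakes a confidence that grows when neighbours are filled.
    """
    proposals = {}
    for i, tok in enumerate(canvas):
        if tok != MASK:
            continue
        filled_neighbours = sum(
            1
            for j in (i - 1, i + 1)
            if 0 <= j < len(canvas) and canvas[j] != MASK
        )
        confidence = 0.3 + 0.3 * filled_neighbours + 0.1 * rng.random()
        proposals[i] = (target[i], confidence)
    return proposals


def denoise(target, steps=4, seed=0):
    """Fill a fully masked canvas over `steps` passes.

    Each pass proposes tokens for all masked slots in parallel, then
    commits only the most confident ones, so later passes can condition
    on what earlier passes settled.
    """
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    for step in range(steps):
        proposals = toy_predict(canvas, target, rng)
        if not proposals:
            break
        remaining_steps = steps - step
        # Ceiling division: spread the remaining slots over the remaining passes.
        k = -(-len(proposals) // remaining_steps)
        most_confident = sorted(
            proposals, key=lambda i: proposals[i][1], reverse=True
        )[:k]
        for i in most_confident:
            canvas[i] = proposals[i][0]
    return canvas


print(" ".join(denoise("the quick brown fox jumps over".split())))
# prints: the quick brown fox jumps over
```

The point of the sketch is the control flow: every pass looks at the whole canvas, and positions are settled in order of confidence rather than strictly left to right.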

DiffusionGemma is a Mixture of Experts (MoE) model with a total of 26 billion parameters, but only 3.8 billion are activated during inference. That means it should fit in the 18GB RAM allotment of a high-end GPU. This approach to text generation shifts the bottleneck from memory bandwidth to compute, generating up to 256 tokens in parallel.
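The memory claim can be sanity-checked with back-of-the-envelope arithmetic. The sketch below uses only the figures quoted above (26 billion total parameters, 3.8 billion active, an 18GB budget); the bit widths are assumptions, since the article does not say what precision the weights are stored in. It also counts weights only and ignores activations and runtime overhead. In a typical MoE deployment all experts stay resident in memory, so the total parameter count, not the active count, usually drives the footprint.

```python
TOTAL_PARAMS_B = 26.0   # total parameters in billions (from the article)
ACTIVE_PARAMS_B = 3.8   # parameters active per inference step (from the article)
BUDGET_GB = 18.0        # memory budget quoted in the article


def weight_footprint_gb(params_billion, bits_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9


# The bit widths below are assumptions; the article does not specify precision.
for bits in (16, 8, 4):
    size = weight_footprint_gb(TOTAL_PARAMS_B, bits)
    verdict = "fits" if size <= BUDGET_GB else "does not fit"
    print(f"{bits}-bit weights: {size:.1f} GB, {verdict} in {BUDGET_GB:.0f} GB")

active_share = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
print(f"Active share of parameters per step: {active_share:.1%}")
```

Under these assumptions, only the 4-bit case (about 13 GB) lands inside the 18GB figure, which suggests the quoted budget presumes quantized weights. That is an inference from the arithmetic, not something the article states.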

Model Achieves 700-1000+ Tokens Per Second Across Hardware Configurations

In testing with an RTX 5090, DiffusionGemma spits out around 700 tokens per second. With a single Nvidia H100 AI accelerator, DiffusionGemma can produce 1,000+ tokens per second. That's about four times the output of similarly sized autoregressive Gemma models.

DiffusionGemma Demonstrates Advantages in Non-Linear Task Solving

Google says the diffusion approach offers a measurable boost in non-linear tasks like in-line editing, molecular sequencing, and mathematical graphing. DiffusionGemma was tuned to solve Sudoku puzzles, a notoriously challenging task for standard autoregressive AI models because each token depends on future tokens. DiffusionGemma's ability to continuously self-correct large sets of tokens makes that easier.

FAQ

What is DiffusionGemma and how does it differ from other AI models?

DiffusionGemma is a new open AI model from Google DeepMind that uses a diffusion-based architecture to generate text in parallel rather than sequentially. Unlike autoregressive models that produce text one token at a time from left to right, DiffusionGemma starts with placeholder tokens and refines them across multiple passes, finalizing entire text blocks at once, much as image generation models denoise static into coherent images.

How fast is DiffusionGemma compared to other Gemma models?

DiffusionGemma produces around 700 tokens per second on an RTX 5090 GPU and over 1,000 tokens per second on a single Nvidia H100 AI accelerator. This represents approximately four times the output speed of similarly sized autoregressive Gemma models. Its Mixture of Experts architecture, with 26 billion total parameters and 3.8 billion activated during inference, is meant to fit within the 18GB RAM allocation of high-end consumer GPUs.

What types of tasks does DiffusionGemma perform better at?

Google states that DiffusionGemma offers measurable performance improvements in non-linear tasks including in-line editing, molecular sequencing, mathematical graphing, and solving Sudoku puzzles. The model's ability to continuously self-correct large sets of tokens makes it particularly effective for tasks where each token depends on future tokens, which are notoriously challenging for standard autoregressive AI models.
