Inception Labs’ Mercury 2 AI beats Google’s DiffusionGemma at its own game



In brief

  • Inception Labs’ Mercury 2 produces nearly 1,000 tokens per second and scores 90% on AIME 2026.
  • Google’s latest DiffusionGemma achieves similar speeds but performs worse on benchmarks.
  • DiffusionGemma is free and open-weight on Hugging Face. Mercury 2 is an API-only, closed-weight model.

Inception Labs introduced Mercury 2 on Thursday, calling it the world’s fastest reasoning language model. According to the company’s announcement, it generates about 1,000 tokens per second (tokens are the pieces of text that an AI model reads and writes), versus roughly 89 tokens per second for Anthropic’s Claude Haiku 4.5 Reasoning and 71 for OpenAI’s GPT-5 Mini.
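To make those throughput figures concrete, here is a rough back-of-the-envelope sketch. It uses only the tokens-per-second numbers quoted in the announcement; the 2,000-token response length is a hypothetical value chosen for illustration, and the calculation ignores network and queueing delay.

```python
# Throughput figures quoted in the announcement (tokens per second).
THROUGHPUT_TPS = {
    "Mercury 2": 1000,
    "Claude Haiku 4.5 Reasoning": 89,
    "GPT-5 Mini": 71,
}

# Hypothetical response length, chosen only for illustration.
RESPONSE_TOKENS = 2000


def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to emit num_tokens at a steady rate, ignoring network and queueing delay."""
    return num_tokens / tokens_per_second


for name, tps in THROUGHPUT_TPS.items():
    print(f"{name}: {generation_seconds(RESPONSE_TOKENS, tps):.1f} s")
```

Under those assumptions, a response that takes about two seconds at 1,000 tokens per second would take over 20 seconds at either of the slower quoted rates.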

That puts it in the same speed bracket that Google would later claim for DiffusionGemma.

Both models get there by abandoning the typewriter style of writing. A standard chatbot types one word, checks what it just typed, then types the next word, and repeats until the answer is finished. Diffusion models instead fill a block of text with random placeholders and clear out the noise over a series of parallel passes until the entire block locks into the final response at once. It is the same trick that turns random noise into a picture in image generators like Stable Diffusion.
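The difference in step counts can be shown with a toy sketch. This is not how Mercury 2 or DiffusionGemma is actually implemented: there is no real model here, the "prediction" simply copies from a known target, and the function names and the choice of four passes are assumptions made for illustration. It only shows why filling a block in parallel passes needs far fewer sequential steps than typing one token at a time.

```python
import random

MASK = "_"  # placeholder standing in for a noisy, undecided position


def autoregressive_decode(target):
    """One token per step: N tokens need N sequential steps."""
    out, steps = [], 0
    for tok in target:
        out.append(tok)  # stand-in for a model call predicting the next token
        steps += 1
    return out, steps


def block_diffusion_decode(target, num_passes=4, seed=0):
    """Start from placeholders; each pass finalizes a share of positions at once."""
    rng = random.Random(seed)
    block = [MASK] * len(target)
    order = list(range(len(target)))
    rng.shuffle(order)  # positions are resolved in no particular left-to-right order
    per_pass = max(1, -(-len(target) // num_passes))  # ceiling division, at least 1
    steps = 0
    for start in range(0, len(order), per_pass):
        for i in order[start:start + per_pass]:
            block[i] = target[i]  # stand-in for the model's denoised prediction
        steps += 1
    return block, steps


if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog today ok now".split()
    _, ar_steps = autoregressive_decode(words)
    _, diff_steps = block_diffusion_decode(words)
    print(f"autoregressive steps: {ar_steps}, parallel passes: {diff_steps}")
```

In a real system each pass is a full model call over the whole block, so the saving depends on how many passes are needed to reach acceptable quality.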

Where the two diverge is in what survives the process. On AIME 2026, which is built from real American Invitational Mathematics Examination problems and reported as the percentage solved correctly, Mercury 2 reached 90%. Google tested DiffusionGemma on the same set, where it scored 69.1%, while the standard non-diffusion Gemma 4 scored 88.3%.

On GPQA, a doctoral-level science benchmark scored the same way, the two models are roughly tied: Mercury 2 at 77% versus 73.2% for DiffusionGemma. But Google’s developer guide recommends the standard Gemma 4 for apps that require maximum quality, acknowledging that DiffusionGemma lags behind it across the board.

The speed claim also holds up outside the lab. Augment Code, an AI coding company, replaced Anthropic’s Claude Opus 4.7 with Mercury 2 on its context compression subagent and saw an 82% reduction in latency and a 90% reduction in cost while reporting the same output quality, according to a case study.

Inception was built on research by its founder Stefano Ermon, a Stanford University professor who co-authored some of the diffusion techniques that power today’s image generators. The startup’s $50 million funding round was backed by Nvidia’s venture arm and by individual investors Andrew Ng and Andrej Karpathy.

For non-technical users, the big thing that most people don’t notice until they feel it is “flow.” Traditional models make you wait between ideas in a long session. Diffusion models like this make AI feel like it’s keeping up with you: instant autocompletes, quick iterations on code or plans, and subagents that can handle tedious, high-volume work without dragging the entire system down.

That subagent layer is an interesting architectural shift. Complex AI systems are no longer one giant intelligent model. They’re orchestras of dedicated assistants: one for deep thinking, several for quick summarization, routing, tool lookup, output checking, and so on. Serial models make these utility calls expensive and slow. Parallel diffusion models make them cheap and fast enough to be used freely.
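The orchestra pattern can be sketched in a few lines. This is a minimal illustration, not any vendor's API: `call_subagent` is a hypothetical stand-in that sleeps instead of calling a model, the role names are taken loosely from the list above, and the latency value is made up. The point is that per-call latency is the parameter a faster model shrinks.

```python
import asyncio


async def call_subagent(role: str, payload: str, latency_s: float) -> str:
    """Hypothetical stand-in for a model API call; sleeps to simulate latency."""
    await asyncio.sleep(latency_s)
    return f"{role}: handled {payload!r}"


async def orchestrate(task: str, utility_latency_s: float = 0.01) -> list[str]:
    """Fan out the cheap utility calls concurrently and collect their results."""
    utility_roles = ["summarize", "route", "tool-lookup", "check-output"]
    calls = [call_subagent(role, task, utility_latency_s) for role in utility_roles]
    # gather returns results in the same order as the calls were listed
    return await asyncio.gather(*calls)


if __name__ == "__main__":
    for line in asyncio.run(orchestrate("refactor login flow")):
        print(line)
```

When utility calls depend on one another and cannot be fanned out like this, their latencies add up, which is where per-call speed matters most.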

Realistic caveats for casual users: these models are still best suited to the high-volume, speed-sensitive parts of a workflow rather than the hardest frontier reasoning, where the largest autoregressive (AR) models may still have an edge for now. Mercury 2 is not open-weight, so it is available only through the API/cloud for now. And as with Google’s version, the surrounding ecosystem (native runtimes, agent frameworks) is still evolving before it works seamlessly everywhere.

Use cases that click right away: fast real-time coding and “dynamic coding” where the model keeps up with your edits, multi-agent coding or support systems that make many fast subcalls, voice interfaces with no noticeable delay, and any latency-sensitive autocomplete or next-action prediction. At scale, the cost and energy savings from higher throughput on standard hardware add up quickly.

The numbers Inception shares (and independent reviews) make the case visually: Mercury 2 falls into the “fast and good” quadrant of diffusion models, bringing what used to require exotic hardware to commodity GPUs.
