Inception’s $50 million pitch: code models that don’t wait for the next token
Autoregressive language models still dominate. Inception is betting that for some coding workloads, they’re also the bottleneck.
The startup, led by Stanford professor Stefano Ermon, has raised $50 million in seed funding from Menlo Ventures, with Mayfield, Innovation Endeavors, Microsoft’s M12, Snowflake Ventures, Databricks Investment, NVentures, Andrew Ng, and Andrej Karpathy also participating. Alongside the round, it’s rolling out a new version of Mercury, a coding model that uses diffusion instead of standard next-token generation.
The pitch is straightforward. Code generation and long-context text may work better with a different inference pattern: refine a whole sequence in parallel over several steps instead of generating one token after another.
If that works outside company benchmarks, some current assumptions about LLM serving start to look pretty shaky.
Why this matters now
Most production coding systems still live with the same basic limit. They decode sequentially. Speculative decoding helps. So do better batching and tuned KV cache pipelines. But the model still moves token by token. That’s fine for chat. It gets awkward for repo-wide edits, long patches, or any task that needs to emit a lot of code while keeping the whole thing consistent.
Inception says Mercury can run at more than 1,000 tokens per second because it’s built for parallelism. Benchmark conditions matter, and startups are selective about framing, so that number should be treated carefully until someone compares quality, latency, and cost side by side. Still, the basic idea is credible enough to pay attention to.
Diffusion models don’t predict a single next token at each step. They start from a noisy representation and iteratively denoise the full sequence. That pattern is familiar from image generation. In code and text it’s harder, but the appeal is obvious: GPUs are very good at large parallel matrix operations, and much worse at long chains of dependent token steps.
That’s the opening Inception is going after.
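To make the contrast concrete, here is a minimal sketch of the two inference loops. The `next_token` and `denoise` functions are hypothetical stand-ins for real model calls, not anything Inception has published; the point is the loop structure, one dependent call per token versus K parallel passes over the whole sequence.

```python
# Minimal sketch of the two inference patterns. `next_token` and
# `denoise` are hypothetical stand-ins for real model calls.

def generate_autoregressive(prompt_ids, n_new, next_token):
    # One dependent model call per emitted token: latency grows with n_new.
    ids = list(prompt_ids)
    for _ in range(n_new):
        ids.append(next_token(ids))  # step t depends on step t-1
    return ids

def generate_diffusion(prompt_ids, n_new, denoise, k_steps, mask_id=0):
    # Start from a fully masked draft and refine it in parallel.
    # Latency grows with k_steps, not with n_new.
    draft = [mask_id] * n_new
    for _ in range(k_steps):
        draft = denoise(prompt_ids, draft)  # updates every position at once
    return list(prompt_ids) + draft
```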
What diffusion changes for code
For developers, the practical difference is simple enough. An autoregressive model writes linearly while trying to keep the whole repo in its head. A diffusion model keeps revising a draft of the full answer until it settles.
That changes the cost profile. For long outputs, cost is tied less tightly to output length and more to the number of refinement steps. If a model can produce a useful multi-file patch in 8 to 32 denoising passes, it may beat sequential generation on both speed and cost.
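A back-of-envelope comparison, with made-up numbers, shows why the step count dominates for long outputs. The figures below are illustrative assumptions, not measurements:

```python
# Back-of-envelope wall-clock comparison with illustrative numbers only.
# Caveat: a full-sequence denoising pass does more FLOPs than one cached
# autoregressive step, so this is a parallelism argument, not a FLOPs one.

n_tokens = 2000   # length of a multi-file patch (assumed)
t_pass = 0.02     # seconds per forward pass (assumed)

sequential = n_tokens * t_pass       # one dependent pass per token
for k in (8, 16, 32):
    parallel = k * t_pass            # one full-sequence pass per denoising step
    print(f"K={k:>2}: {parallel:.2f}s diffusion vs {sequential:.0f}s sequential")
```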
Code is a good test case because structure matters. Rename an interface, update imports, change a call signature, touch tests, and local fluency stops being enough very quickly. You need consistency across files. A diffusion-style model has a plausible edge there because it can work on the whole candidate sequence at once and fix mismatches earlier in the process.
That’s still theory. But it lines up well with tasks like:
- multi-file refactors
- repository-wide API migrations
- large generated diffs
- code transformation based on dependency graphs or symbol relationships
- long-context documentation or config generation tied to existing code
The case is weaker for short interactive completions, where streaming matters more than raw throughput.
The technical bet
There isn’t one standard diffusion LLM architecture yet. The field is still unsettled. But the ingredients are familiar.
For text and code, teams usually work with discrete or latent representations rather than image-style noise over pixels. The denoiser is often a Transformer or DiT-like model that updates a full sequence representation over several steps. Related approaches include D3PM, MaskGIT-style masked parallel generation, flow matching, rectified flow, and latent diffusion methods that compress sequence information before denoising.
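As one concrete point in that design space, a MaskGIT-style generation loop can be sketched in a few lines. This is an illustration of the general technique named above, not Mercury’s actual architecture, and the `model` call is a hypothetical stand-in:

```python
import math

def maskgit_generate(model, prompt, length, k_steps, mask_id=0):
    """MaskGIT-style masked parallel generation (illustrative sketch only).

    `model(prompt, draft)` is a hypothetical call returning a
    (token, confidence) pair for every currently masked position.
    """
    draft = [mask_id] * length
    for step in range(k_steps):
        preds = model(prompt, draft)            # {pos: (token, confidence)}
        masked = [i for i in range(length) if draft[i] == mask_id]
        # Cosine schedule: the fraction left masked shrinks to zero at the end.
        frac_masked = math.cos(math.pi / 2 * (step + 1) / k_steps)
        n_keep_masked = int(length * frac_masked)
        # Commit the most confident predictions; re-mask the rest for later steps.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[: max(0, len(masked) - n_keep_masked)]:
            draft[i] = preds[i][0]
    return draft
```

Here K is explicit: `k_steps` full-sequence passes replace length-many dependent decoding steps.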
The central engineering question is whether K, the number of refinement steps, stays low enough to beat autoregressive decoding on real workloads. If K climbs too high, the advantage fades. If it stays modest and quality holds up, the economics look much better.
Code adds another layer. Good developer models increasingly need structured context, not a pile of raw files stuffed into a prompt. That means AST signals, language server data, import graphs, test context, type information, maybe static analysis feedback. A diffusion model that can condition on repository structure has a better shot at useful codebase editing than a chat-tuned model treating a repo like a very long document.
That also helps explain why Inception is starting with software development. Code is constrained, structured, and testable. It’s a better target than open-ended prose.
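As a toy example of one such structural signal, Python’s standard `ast` module is enough to extract a rough import graph from a repository. Real systems would layer symbol resolution, type information, and test coverage on top; this sketch only shows the shape of the signal:

```python
import ast
from pathlib import Path

def import_graph(repo_root: str) -> dict[str, set[str]]:
    """Map each Python file in a repo to the modules it imports.

    A toy version of the 'import graph' signal discussed above.
    """
    graph: dict[str, set[str]] = {}
    for path in Path(repo_root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except SyntaxError:
            continue  # skip files that don't parse
        deps: set[str] = set()
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                deps.update(alias.name for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                deps.add(node.module)
        graph[str(path)] = deps
    return graph
```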
The limits
There are real trade-offs here.
Streaming is the obvious one. Developers are used to assistants that start producing output immediately. Diffusion models refine intermediate states, so the output can keep changing while generation is still underway. That’s fine for a final patch. It’s messier for token-by-token streaming inside an editor.
There are ways around it. A system can reveal chunks after certain denoising steps. It can use diffusion for planning and another decoder for final rendering. It can show a provisional diff and then stabilize it. All of those add complexity, and some of the appeal of chat-style coding tools comes from the feeling of immediate progress.
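One of those workarounds, revealing tokens only once they stop changing across denoising steps, might look like the sketch below. The `denoise` call is a hypothetical stand-in, and the trade-off is visible in the code: once a token is revealed, it is committed even if a later pass would have revised it.

```python
def stream_stable_prefix(prompt, draft, denoise, k_steps, patience=3):
    """Yield tokens once they have been unchanged for `patience` steps.

    Sketch of one streaming workaround: refine the whole draft, but only
    reveal the prefix that has stabilized. `denoise` is hypothetical.
    """
    unchanged = [0] * len(draft)
    revealed = 0
    for _ in range(k_steps):
        new_draft = denoise(prompt, draft)
        for i, (old, new) in enumerate(zip(draft, new_draft)):
            unchanged[i] = unchanged[i] + 1 if old == new else 0
        draft = new_draft
        # Reveal the longest prefix in which every token has been stable.
        while revealed < len(draft) and unchanged[revealed] >= patience:
            yield draft[revealed]
            revealed += 1
    # Flush whatever remains after the final denoising step.
    yield from draft[revealed:]
```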
Determinism is another concern. Parallel refinement may help with global consistency, but it can also make behavior harder to reason about in production. Teams that care about auditability, reproducibility, or tight latency SLOs will want more than a polished demo.
And quality still decides everything. Fast output doesn’t matter if semantic correctness is weak. No serious team is wiring a model into CI-connected workflows on throughput alone. The metrics that matter are pass@k, test pass rates, compile rates, diff acceptance, regression frequency, and whether the system breaks fewer builds than it creates.
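pass@k, at least, has a standard unbiased estimator, from Chen et al.’s 2021 Codex paper, and it is worth computing properly rather than by naive averaging over samples:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated per task, c: samples that pass the tests,
    k: draw budget. Returns the probability that at least one of k
    randomly drawn samples passes.
    """
    if n - c < k:
        return 1.0  # too few failures for any k-subset to be all-fail
    return 1.0 - comb(n - c, k) / comb(n, k)
```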
Why the investor list stands out
The round gives a decent read on where this might go.
Microsoft’s M12 and NVentures point to the infrastructure angle: cloud inference and GPU optimization. If diffusion-based sequence models need different serving patterns, relationships near the compute stack matter. Snowflake Ventures and Databricks Investment suggest something else: enterprise context systems, metadata layers, and pipeline integration. That’s a practical signal, not decoration. In coding tools, the model is only part of the product. The rest is retrieval, indexing, observability, policy, and evaluation.
Mercury will probably live or die on integration quality. Can it pull relevant symbols from a repo graph? Does it know which files are in scope? Can it propose a patch and immediately run tests, linting, and type checks? Can a team track security regressions or license risk?
That’s where the serious market is. Autocomplete demos don’t get you very far.
What engineering teams should watch
If you’re evaluating Mercury or anything like it, ignore the novelty of the architecture and ask the usual hard questions.
Start with workload shape. Diffusion has a stronger case for long outputs and repo-wide edits than for tiny inline completions.
Then look at latency. Median latency matters less than tail behavior, especially if the system runs several denoising passes over large contexts.
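Measuring that is cheap. Given per-request latency samples, the Python standard library is enough to report the percentiles that matter; a minimal sketch:

```python
import statistics

def latency_report(samples_ms: list[float]) -> dict[str, float]:
    """Summarize per-request latency; the tail tells the real story."""
    q = statistics.quantiles(samples_ms, n=100)  # q[i] is the (i+1)th percentile
    return {
        "p50": statistics.median(samples_ms),
        "p95": q[94],
        "p99": q[98],
        "max": max(samples_ms),
    }
```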
Quality testing should use real repository tasks (a minimal harness sketch follows the list):
- large refactors across multiple files
- dependency upgrades
- framework migrations
- test repair after API changes
- codegen that must compile and pass checks
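A minimal harness for tasks like these applies the model’s patch and runs the checks that actually matter. The toolchain below assumes a Python repo with mypy, ruff, and pytest; substitute whatever your repository really uses:

```python
import subprocess

def evaluate_patch(repo_dir: str, patch_file: str) -> dict[str, bool]:
    """Apply a model-generated patch, then compile-check, lint, and test.

    Tool commands are illustrative assumptions for a Python repo.
    """
    def run(*cmd: str) -> bool:
        return subprocess.run(cmd, cwd=repo_dir, capture_output=True).returncode == 0

    results = {"applies": run("git", "apply", "--check", patch_file)}
    if results["applies"]:
        run("git", "apply", patch_file)
        results["type_checks"] = run("mypy", ".")
        results["lints"] = run("ruff", "check", ".")
        results["tests_pass"] = run("pytest", "-q")
    return results
```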
Serving efficiency matters too. A full-sequence system may batch well and keep GPUs busy, but it can also hit memory pressure quickly if the context strategy is sloppy. File selection, graph-aware retrieval, and attention optimizations matter a lot here.
And the usual guardrails still apply. Static analysis, secret scanning, insecure pattern detection, and license hygiene don’t get less important because generation got faster.
Pressure on the incumbents
Even if Inception never becomes a major model provider, the argument behind it is worth taking seriously.
The current LLM stack has been built around autoregressive decoding because that’s what the dominant models use. If diffusion-style systems start posting strong code benchmarks with better throughput and lower cost on long tasks, providers will have to respond. That could mean hybrid systems, new serving stacks, different pricing, and different UX for coding assistants.
It could also shift the market away from chat-heavy benchmarks and toward something more grounded: how well a model handles repository-scale work under test.
That would be an improvement.
For the past few years, coding AI has too often been judged by how convincing it sounds while generating code. In production, the harder question is whether it can touch 20 files, keep types and imports aligned, and leave the test suite green. If diffusion helps there, it deserves attention. If it mostly makes benchmark slides look nicer, the market will move on.