Google launches Gemini 3 with a coding app and a 37.4 Humanity's Last Exam score
Google’s Gemini 3 looks strong on paper. Antigravity is the part developers should watch.
Google has launched Gemini 3, its latest flagship model, and the benchmark numbers are big. The company says Gemini 3 scored 37.4 on Humanity’s Last Exam, ahead of the previous top mark from GPT-5 Pro at 31.64. It also says the model now leads LMArena, the human preference benchmark vendors cite when they want to show people actually prefer using their model.
That matters. For working engineers, though, the more interesting part of this launch sits next to the model: Antigravity, a new coding app that gives the agent a chat pane, a terminal, and a browser in the same workspace.
That says more about where this market is headed than another benchmark chart does.
The model race is compressing
Gemini 3 arrives seven months after Gemini 2.5, one week after OpenAI’s GPT-5.1, and two months after Anthropic’s Sonnet 4.5.
Nobody at the frontier is waiting for clean annual release cycles anymore. The pitches are starting to blur too: better reasoning, better coding, better tool use, plus some kind of agent workflow wrapped around the model.
Google also has distribution that most rivals would kill for. The company says the Gemini app now has 650 million monthly active users, and 13 million developers have used Gemini in their workflow. Add Search, Workspace, Android, and Google Cloud, and Google can put a new model in front of a huge audience fast.
That doesn’t buy trust. It does buy exposure.
Antigravity is the practical part
Antigravity is a bad name for a pretty sensible product.
Google describes it as a coding interface where the model can work across your editor, terminal, and browser. That matters because a lot of coding assistants still act like they can build software from inside a chat box with only partial contact with the actual environment.
A tri-pane setup changes the loop:
- The model plans a task.
- It edits files.
- It runs commands.
- It checks the result in a browser.
- It revises based on what happened.
That’s closer to real development. You don’t judge a generated React component by how plausible it looks in text. You judge it by whether it compiles, whether the route works, whether validation fails the way it should, whether the browser renders what you expected, and whether the tests pass.
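In code terms, the loop is short. Here is a minimal sketch in Python, assuming a hypothetical Plan object and a revise callback standing in for the model; none of this is a real Antigravity or Gemini API.

    import subprocess
    from dataclasses import dataclass

    # Hypothetical shapes for an agent's plan and the feedback it gets back.
    # This only sketches the plan-edit-run-check-revise loop.

    @dataclass
    class Plan:
        edits: dict[str, str]    # path -> new file contents
        verify_cmd: list[str]    # command that proves the change works, e.g. the test suite

    def apply_edits(edits: dict[str, str]) -> None:
        for path, contents in edits.items():
            with open(path, "w") as f:
                f.write(contents)

    def agent_loop(plan: Plan, revise, max_rounds: int = 5) -> bool:
        for _ in range(max_rounds):
            apply_edits(plan.edits)
            # Run the verification command and capture real output,
            # instead of trusting the model's claim that it passed.
            result = subprocess.run(plan.verify_cmd, capture_output=True, text=True)
            if result.returncode == 0:
                return True                        # machine-verified, not just plausible
            plan = revise(plan, result.stdout, result.stderr)  # feed the failure back in
        return False

The point of the sketch is the exit condition: the loop ends on a real return code, not on the model announcing success.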
That’s why agentic coding apps keep showing up. Cursor, Warp, and a growing pile of IDE-adjacent tools are all chasing the same thing: less friction between “the model wrote code” and “the machine verified it.”
That verification loop is still where many AI coding tools fall apart. A good interface can matter more than a modest model gain.
Gemini 3’s benchmark jump probably matters
Google hasn’t said much about Gemini 3’s internals, so most outside analysis has to infer from behavior and product framing. Still, a jump this large on reasoning-heavy benchmarks usually points to work in a few areas:
- better multi-step decomposition
- stronger tool-calling discipline
- better long-context state tracking
- training that rewards intermediate reasoning quality, not just final answers
That last point matters for coding. A model that can spit out a final code block isn’t necessarily good at debugging a broken setup, tracing a build error across files, or recovering after a failed migration script. Those jobs depend on sustained reasoning under feedback.
Benchmarks like Humanity’s Last Exam are imperfect, but they do tell you something about whether a model can hold onto a chain of logic across harder tasks. For developers, that often shows up as fewer wasted detours. Less “I updated the config” when it didn’t. Fewer fake claims that tests passed when the output says otherwise.
LMArena is softer, but still useful. Human raters tend to reward systems that are coherent, responsive, and less irritating. For coding, “less irritating” counts.
Grounded tool use is still the dividing line
The technical bet behind Antigravity is straightforward: give the model real I/O and make it visible.
That usually means tools with explicit contracts, something like this:
    {
      "name": "execute_command",
      "description": "Run a shell command with a specified working directory",
      "parameters": {
        "type": "object",
        "properties": {
          "cmd": { "type": "string" },
          "cwd": { "type": "string" }
        },
        "required": ["cmd"]
      }
    }
Tool schemas like this are dull, but they matter. Models behave better when the action space is clear. A fuzzy instruction like “check the app” leads to sloppy output. A bounded action like read_file, write_file, execute_command, or navigate_url gives the orchestrator something it can validate, log, and restrict.
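What the orchestrator does with that bound is simple to sketch too. The allowlist below is hypothetical and mirrors the tool names above; it is not a documented Antigravity interface.

    # Hypothetical orchestrator-side gate in front of every tool call.
    # Tool names mirror the prose above; none of this is a documented
    # Antigravity or Gemini interface.

    ALLOWED_TOOLS: dict[str, set[str]] = {
        "read_file":       {"path"},
        "write_file":      {"path", "contents"},
        "execute_command": {"cmd", "cwd"},
        "navigate_url":    {"url"},
    }

    def validate_tool_call(name: str, args: dict) -> None:
        if name not in ALLOWED_TOOLS:
            raise ValueError(f"unknown tool: {name}")
        unexpected = set(args) - ALLOWED_TOOLS[name]
        if unexpected:
            raise ValueError(f"{name}: unexpected arguments {sorted(unexpected)}")
        # Log before executing, so the session keeps a record of every action
        # the model actually took, not just what it says it did.
        print(f"tool_call name={name} args={args}")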
For Antigravity to work well, Google probably needs a session layer that keeps shared state across:
- file diffs
- terminal output
- browser state or DOM snapshots
- the model’s running task plan
That’s harder than it sounds. If state sync gets loose, the agent starts reasoning over stale files or outdated command output, and the whole thing gets brittle fast. Good agent products live or die on orchestration details that never make the keynote.
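One plausible shape for that session layer, sketched as plain data; every field name here is an assumption, not something Google has described.

    import time
    from dataclasses import dataclass, field

    # Hypothetical shared session state an orchestrator would keep in sync.
    # Each surface carries a sync timestamp, so stale snapshots are detectable
    # before the model reasons over them.

    @dataclass
    class SessionState:
        file_diffs: dict[str, str] = field(default_factory=dict)     # path -> latest diff
        terminal_log: list[str] = field(default_factory=list)        # ordered command output
        browser_snapshot: str = ""                                    # DOM or screenshot reference
        task_plan: list[str] = field(default_factory=list)            # model's running plan
        last_synced: dict[str, float] = field(default_factory=dict)   # surface -> timestamp

        def mark_synced(self, surface: str) -> None:
            self.last_synced[surface] = time.time()

        def is_stale(self, surface: str, max_age_s: float = 30.0) -> bool:
            # If a surface has not been refreshed recently, force a re-read
            # instead of letting the model work from an outdated snapshot.
            return time.time() - self.last_synced.get(surface, 0.0) > max_age_s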
Nice demo, messy production reality
Developers should pay attention to Antigravity. They shouldn’t trust it blindly.
An agent that can touch your filesystem, shell, and browser is useful for scaffolding apps, wiring routes, fixing tests, and chewing through repetitive frontend work. It can also make a mess very quickly if the guardrails are weak.
A sane rollout path looks boring (a rough guard sketch follows the list):
- restrict writes to project directories
- sandbox shell access
- block outbound network access except approved domains
- use short-lived credentials
- log every tool call, diff, and command result
- require tests before merge
- scan generated code for secrets, unsafe shell patterns, SSRF risk, and insecure defaults
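Here is a rough sketch of the first three guards. The project root, allowed domains, and blocked patterns are placeholders a team would swap for its own policy, not recommendations.

    from pathlib import Path

    # Hypothetical pre-flight checks before an agent's write, shell, or browser
    # action runs. Values below are placeholders.

    PROJECT_ROOT = Path("/workspace/project").resolve()
    ALLOWED_DOMAINS = {"pypi.org", "registry.npmjs.org"}
    BLOCKED_SUBSTRINGS = ("rm -rf /", "curl ", "wget ", "sudo ")

    def allow_write(path: str) -> bool:
        # Restrict writes to project directories only.
        return Path(path).resolve().is_relative_to(PROJECT_ROOT)

    def allow_command(cmd: str) -> bool:
        # Crude but auditable: reject obviously dangerous shell patterns
        # before anything reaches the sandboxed shell.
        return not any(bad in cmd.lower() for bad in BLOCKED_SUBSTRINGS)

    def allow_url(url: str) -> bool:
        # Block outbound access except approved domains.
        host = url.split("//", 1)[-1].split("/", 1)[0]
        return host in ALLOWED_DOMAINS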
If you’re evaluating Gemini 3 for team use, test it against your own work. Skip the vendor benchmark and the social clip. Use tasks that resemble your stack and your failure modes.
Measure the stuff that actually affects delivery:
- time to passing tests
- retries per task
- average diff size
- number of manual fixes
- wall-clock completion time
- incident rate after agent-written changes
That last one matters more than any leaderboard.
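Capturing those numbers does not need a framework. A minimal per-task record with hypothetical field names is enough to start; the incident rate gets computed later, once agent-written changes have been in production long enough to fail.

    import json
    from dataclasses import dataclass, asdict

    # Hypothetical per-task record for an internal evaluation run.

    @dataclass
    class TaskRecord:
        task_id: str
        time_to_passing_tests_s: float
        retries: int
        diff_lines: int
        manual_fixes: int
        wall_clock_s: float

    def log_record(record: TaskRecord, path: str = "agent_eval.jsonl") -> None:
        # Append one JSON line per completed task; aggregate offline.
        with open(path, "a") as f:
            f.write(json.dumps(asdict(record)) + "\n")

    log_record(TaskRecord("fix-flaky-auth-test", 412.0, 2, 86, 1, 655.0))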
Google has an advantage, but no easy win
Gemini 3 looks strong. Antigravity is pointed at the right problem. Google’s hard part isn’t raw capability. It’s consistency and trust in real workflows.
OpenAI already has entrenched API usage and ChatGPT distribution. Anthropic has built a solid reputation with coding-heavy users, especially where careful reasoning and lower surprise rates matter. Cursor and similar tools have trained developers to expect the agent inside the editor, not sitting beside it. Google is pushing into all of those fronts at once.
Still, Antigravity fits the moment better than another assistant tab would have. Developers want models that can act, check, and recover. They want the terminal output visible. They want the browser result visible. They want fewer hallucinated “done” messages.
The harder part is proving that this holds up under load, across ugly codebases, and in the middle of non-demo work. Greenfield app generation is easy to show. Incrementally updating a large production monolith without breaking auth, build caching, and deployment scripts is where these systems still get humbled.
What to watch next
Antigravity on long sessions
Short agent runs are one thing. Multi-hour sessions with repeated edits, test loops, and browser checks are where context management starts to show cracks.
Tool discipline
Top coding agents need to know when to stop reading files, when to run a command, and when to ask for confirmation. Too much initiative can be as bad as too little.
Cost and latency
Tool-heavy workflows get expensive fast. If Gemini 3 needs too many back-and-forth turns and repeated file reads to finish a task, teams will notice in both bill and wait time.
Recovery behavior
The best models still fail. What matters is whether they inspect the output and recover cleanly instead of thrashing.
Google has shipped a model that looks impressive on the charts and a coding app that points in the right direction. For developers, Antigravity is the part worth testing first. If Gemini 3 can close the loop between code generation and verification with less friction than the current crop of tools, Google has something more useful than a benchmark win.
That’s the bar now: who gets to a working build with the fewest lies along the way.