Google launches Gemini 3 with a coding app and a 37.4 Humanity's Last Exam score
Google’s Gemini 3 looks strong on paper. Antigravity is the part developers should watch.
Google has launched Gemini 3, its latest flagship model, and the benchmark numbers are big. The company says Gemini 3 scored 37.4 on Humanity’s Last Exam, ahead of the previous top mark from GPT-5 Pro at 31.64. It also says the model now leads LMArena, the human preference benchmark vendors cite when they want to show people actually prefer using their model.
That matters. For working engineers, though, the more interesting part of this launch sits next to the model: Antigravity, a new coding app that gives the agent a chat pane, a terminal, and a browser in the same workspace.
That says more about where this market is headed than another benchmark chart does.
The model race is compressing
Gemini 3 arrives seven months after Gemini 2.5, one week after OpenAI’s GPT-5.1, and two months after Anthropic’s Sonnet 4.5.
Nobody at the frontier is waiting for clean annual release cycles anymore. The pitches are starting to blur too: better reasoning, better coding, better tool use, plus some kind of agent workflow wrapped around the model.
Google also has distribution that most rivals would kill for. The company says the Gemini app now has 650 million monthly active users, and 13 million developers have used Gemini in their workflow. Add Search, Workspace, Android, and Google Cloud, and Google can put a new model in front of a huge audience fast.
That doesn’t buy trust. It does buy exposure.
Antigravity is the practical part
Antigravity is a bad name for a pretty sensible product.
Google describes it as a coding interface where the model can work across your editor, terminal, and browser. That matters because a lot of coding assistants still act like they can build software from inside a chat box with only partial contact with the actual environment.
A tri-pane setup changes the loop:
- The model plans a task.
- It edits files.
- It runs commands.
- It checks the result in a browser.
- It revises based on what happened.
That’s closer to real development. You don’t judge a generated React component by how plausible it looks in text. You judge it by whether it compiles, whether the route works, whether validation fails the way it should, whether the browser renders what you expected, and whether the tests pass.
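In code terms, the loop is short. Here is a minimal sketch in Python, assuming a hypothetical Plan object and a revise callback standing in for the model; none of this is a real Antigravity or Gemini API.

    import subprocess
    from dataclasses import dataclass

    # Hypothetical shapes for an agent's plan and the feedback it gets back.
    # This only sketches the plan-edit-run-check-revise loop.

    @dataclass
    class Plan:
        edits: dict[str, str]    # path -> new file contents
        verify_cmd: list[str]    # command that proves the change works, e.g. the test suite

    def apply_edits(edits: dict[str, str]) -> None:
        for path, contents in edits.items():
            with open(path, "w") as f:
                f.write(contents)

    def agent_loop(plan: Plan, revise, max_rounds: int = 5) -> bool:
        for _ in range(max_rounds):
            apply_edits(plan.edits)
            # Run the verification command and capture real output,
            # instead of trusting the model's claim that it passed.
            result = subprocess.run(plan.verify_cmd, capture_output=True, text=True)
            if result.returncode == 0:
                return True                        # machine-verified, not just plausible
            plan = revise(plan, result.stdout, result.stderr)  # feed the failure back in
        return False

The point of the sketch is the exit condition: the loop ends on a real return code, not on the model announcing success.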
That’s why agentic coding apps keep showing up. Cursor, Warp, and a growing pile of IDE-adjacent tools are all chasing the same thing: less friction between “the model wrote code” and “the machine verified it.”
That verification loop is still where many AI coding tools fall apart. A good interface can matter more than a modest model gain.
Gemini 3’s benchmark jump probably matters
Google hasn’t said much about Gemini 3’s internals, so most outside analysis has to infer from behavior and product framing. Still, a jump this large on reasoning-heavy benchmarks usually points to work in a few areas:
- better multi-step decomposition
- stronger tool-calling discipline
- better long-context state tracking
- training that rewards intermediate reasoning quality, not just final answers
That last point matters for coding. A model that can spit out a final code block isn’t necessarily good at debugging a broken setup, tracing a build error across files, or recovering after a failed migration script. Those jobs depend on sustained reasoning under feedback.
Benchmarks like Humanity’s Last Exam are imperfect, but they do tell you something about whether a model can hold onto a chain of logic across harder tasks. For developers, that often shows up as fewer wasted detours. Less “I updated the config” when it didn’t. Fewer fake claims that tests passed when the output says otherwise.
LMArena is softer, but still useful. Human raters tend to reward systems that are coherent, responsive, and less irritating. For coding, “less irritating” counts.
Grounded tool use is still the dividing line
The technical bet behind Antigravity is straightforward: give the model real I/O and make it visible.
That usually means tools with explicit contracts, something like this:
    {
      "name": "execute_command",
      "description": "Run a shell command with a specified working directory",
      "parameters": {
        "type": "object",
        "properties": {
          "cmd": { "type": "string" },
          "cwd": { "type": "string" }
        },
        "required": ["cmd"]
      }
    }
Tool schemas like this are dull, but they matter. Models behave better when the action space is clear. A fuzzy instruction like “check the app” leads to sloppy output. A bounded action like read_file, write_file, execute_command, or navigate_url gives the orchestrator something it can validate, log, and restrict.
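What the orchestrator does with that bound is simple to sketch too. The allowlist below is hypothetical and mirrors the tool names above; it is not a documented Antigravity interface.

    # Hypothetical orchestrator-side gate in front of every tool call.
    # Tool names mirror the prose above; none of this is a documented
    # Antigravity or Gemini interface.

    ALLOWED_TOOLS: dict[str, set[str]] = {
        "read_file":       {"path"},
        "write_file":      {"path", "contents"},
        "execute_command": {"cmd", "cwd"},
        "navigate_url":    {"url"},
    }

    def validate_tool_call(name: str, args: dict) -> None:
        if name not in ALLOWED_TOOLS:
            raise ValueError(f"unknown tool: {name}")
        unexpected = set(args) - ALLOWED_TOOLS[name]
        if unexpected:
            raise ValueError(f"{name}: unexpected arguments {sorted(unexpected)}")
        # Log before executing, so the session keeps a record of every action
        # the model actually took, not just what it says it did.
        print(f"tool_call name={name} args={args}")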
For Antigravity to work well, Google probably needs a session layer that keeps shared state across:
- file diffs
- terminal output
- browser state or DOM snapshots
- the model’s running task plan
That’s harder than it sounds. If state sync gets loose, the agent starts reasoning over stale files or outdated command output, and the whole thing gets brittle fast. Good agent products live or die on orchestration details that never make the keynote.
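One plausible shape for that session layer, sketched as plain data; every field name here is an assumption, not something Google has described.

    import time
    from dataclasses import dataclass, field

    # Hypothetical shared session state an orchestrator would keep in sync.
    # Each surface carries a sync timestamp, so stale snapshots are detectable
    # before the model reasons over them.

    @dataclass
    class SessionState:
        file_diffs: dict[str, str] = field(default_factory=dict)     # path -> latest diff
        terminal_log: list[str] = field(default_factory=list)        # ordered command output
        browser_snapshot: str = ""                                    # DOM or screenshot reference
        task_plan: list[str] = field(default_factory=list)            # model's running plan
        last_synced: dict[str, float] = field(default_factory=dict)   # surface -> timestamp

        def mark_synced(self, surface: str) -> None:
            self.last_synced[surface] = time.time()

        def is_stale(self, surface: str, max_age_s: float = 30.0) -> bool:
            # If a surface has not been refreshed recently, force a re-read
            # instead of letting the model work from an outdated snapshot.
            return time.time() - self.last_synced.get(surface, 0.0) > max_age_s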
Nice demo, messy production reality
Developers should pay attention to Antigravity. They shouldn’t trust it blindly.
An agent that can touch your filesystem, shell, and browser is useful for scaffolding apps, wiring routes, fixing tests, and chewing through repetitive frontend work. It can also make a mess very quickly if the guardrails are weak.
A sane rollout path looks boring (a rough guard sketch follows the list):
- restrict writes to project directories
- sandbox shell access
- block outbound network access except approved domains
- use short-lived credentials
- log every tool call, diff, and command result
- require tests before merge
- scan generated code for secrets, unsafe shell patterns, SSRF risk, and insecure defaults
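Here is a rough sketch of the first three guards. The project root, allowed domains, and blocked patterns are placeholders a team would swap for its own policy, not recommendations.

    from pathlib import Path

    # Hypothetical pre-flight checks before an agent's write, shell, or browser
    # action runs. Values below are placeholders.

    PROJECT_ROOT = Path("/workspace/project").resolve()
    ALLOWED_DOMAINS = {"pypi.org", "registry.npmjs.org"}
    BLOCKED_SUBSTRINGS = ("rm -rf /", "curl ", "wget ", "sudo ")

    def allow_write(path: str) -> bool:
        # Restrict writes to project directories only.
        return Path(path).resolve().is_relative_to(PROJECT_ROOT)

    def allow_command(cmd: str) -> bool:
        # Crude but auditable: reject obviously dangerous shell patterns
        # before anything reaches the sandboxed shell.
        return not any(bad in cmd.lower() for bad in BLOCKED_SUBSTRINGS)

    def allow_url(url: str) -> bool:
        # Block outbound access except approved domains.
        host = url.split("//", 1)[-1].split("/", 1)[0]
        return host in ALLOWED_DOMAINS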
If you’re evaluating Gemini 3 for team use, test it against your own work. Skip the vendor benchmark and the social clip. Use tasks that resemble your stack and your failure modes.
Measure the stuff that actually affects delivery:
- time to passing tests
- retries per task
- average diff size
- number of manual fixes
- wall-clock completion time
- incident rate after agent-written changes
That last one matters more than any leaderboard.
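Capturing those numbers does not need a framework. A minimal per-task record with hypothetical field names is enough to start; the incident rate gets computed later, once agent-written changes have been in production long enough to fail.

    import json
    from dataclasses import dataclass, asdict

    # Hypothetical per-task record for an internal evaluation run.

    @dataclass
    class TaskRecord:
        task_id: str
        time_to_passing_tests_s: float
        retries: int
        diff_lines: int
        manual_fixes: int
        wall_clock_s: float

    def log_record(record: TaskRecord, path: str = "agent_eval.jsonl") -> None:
        # Append one JSON line per completed task; aggregate offline.
        with open(path, "a") as f:
            f.write(json.dumps(asdict(record)) + "\n")

    log_record(TaskRecord("fix-flaky-auth-test", 412.0, 2, 86, 1, 655.0))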
Google has an advantage, but no easy win
Gemini 3 looks strong. Antigravity is pointed at the right problem. Google’s hard part isn’t raw capability. It’s consistency and trust in real workflows.
OpenAI already has entrenched API usage and ChatGPT distribution. Anthropic has built a solid reputation with coding-heavy users, especially where careful reasoning and lower surprise rates matter. Cursor and similar tools have trained developers to expect the agent inside the editor, not sitting beside it. Google is pushing into all of those fronts at once.
Still, Antigravity fits the moment better than another assistant tab would have. Developers want models that can act, check, and recover. They want the terminal output visible. They want the browser result visible. They want fewer hallucinated “done” messages.
The harder part is proving that this holds up under load, across ugly codebases, and in the middle of non-demo work. Greenfield app generation is easy to show. Incrementally updating a large production monolith without breaking auth, build caching, and deployment scripts is where these systems still get humbled.
What to watch next
Antigravity on long sessions
Short agent runs are one thing. Multi-hour sessions with repeated edits, test loops, and browser checks are where context management starts to show cracks.
Tool discipline
Top coding agents need to know when to stop reading files, when to run a command, and when to ask for confirmation. Too much initiative can be as bad as too little.
Cost and latency
Tool-heavy workflows get expensive fast. If Gemini 3 needs too many back-and-forth turns and repeated file reads to finish a task, teams will notice in both bill and wait time.
Recovery behavior
The best models still fail. What matters is whether they inspect the output and recover cleanly instead of thrashing.
Google has shipped a model that looks impressive on the charts and a coding app that points in the right direction. For developers, Antigravity is the part worth testing first. If Gemini 3 can close the loop between code generation and verification with less friction than the current crop of tools, Google has something more useful than a benchmark win.
That’s the bar now: who gets to a working build with the fewest lies along the way.