Generative AI July 18, 2025

OpenAI launches ChatGPT Agent for multi-step planning, tool use, and app actions

OpenAI’s ChatGPT Agent puts planning, tools, and execution in one place

OpenAI has launched ChatGPT Agent, a general-purpose agent mode inside ChatGPT that can plan multi-step tasks, use external tools, run code, browse the web, and take actions across connected apps including Gmail, Google Calendar, GitHub, Slack, and Trello.

Plenty of AI companies have promised some version of this. OpenAI’s move is to put three separate ideas into one interface: Operator-style browser control, Deep Research-style synthesis, and terminal-backed code execution. For developers and technical leads, that matters more than the name. The model is being positioned as a workflow coordinator, not just a chatbot with tools attached.

The pitch is straightforward. Ask it to summarize recent customer issues from Gmail, open related GitHub tickets, pull context from Slack, and draft a slide deck for the next incident review.

Whether it can do that reliably enough for real use is still the hard part.

What shipped

The new mode lives inside ChatGPT. Users switch into agent mode from the tools menu, describe a task in natural language, and the agent breaks it into sub-tasks, picks tools, executes actions, checks the results, and retries when something fails.
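
That loop (plan, execute, validate, retry) can be sketched in a few lines. This is an illustrative reconstruction, not OpenAI's implementation; the `Step` structure and retry policy are assumptions.

```python
# Hypothetical sketch of the agent loop described above: break a task into
# steps, execute each one, validate the output, and retry on failure.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    run: Callable[[], str]        # executes the sub-task, returns its output
    check: Callable[[str], bool]  # validates that output
    max_retries: int = 2

def run_plan(steps: list[Step]) -> dict[str, str]:
    """Execute each step in order, retrying when validation fails."""
    results: dict[str, str] = {}
    for step in steps:
        for _attempt in range(step.max_retries + 1):
            output = step.run()
            if step.check(output):
                results[step.name] = output
                break
        else:
            # All retries exhausted: surface the failure instead of
            # quietly claiming success.
            raise RuntimeError(f"step {step.name!r} failed after retries")
    return results
```

The point of the sketch is the `check`/retry pair: validation and recovery are part of the loop, not an afterthought.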

OpenAI says the system combines:

  • web navigation and UI interaction from Operator
  • broad source synthesis from Deep Research
  • sandboxed code execution and API access via a terminal

That combination matters. A lot of so-called agents still break at the seams. They can reason but not act, act but not recover, or call tools and then stumble when the output gets messy. OpenAI is trying to patch that with a planner-executor setup that looks closer to real automation than the usual demo loop.

The feature set is practical:

  • email and calendar automation through Gmail, Outlook, and Google Calendar connectors
  • editable presentation generation
  • code execution for debugging scripts and running tests
  • third-party integrations with tools like GitHub, Trello, and Slack

That puts it squarely in range for software teams now.

Strong benchmarks, with the usual caveats

OpenAI reports 41.6% pass@1 on Humanity’s Last Exam and 27.4% on FrontierMath with tool access.

Those are strong numbers. Humanity’s Last Exam is broad, difficult, and built to resist cheap pattern matching. FrontierMath is a useful test of reasoning under pressure, especially with tools in the loop. A 41.6% pass@1 score points to a real jump in long-horizon problem solving.

It still says very little about your actual environment. Benchmarks won’t tell you how the agent handles an internal wiki full of stale pages, a messy Jira setup, or a GitHub repo held together by old build scripts. Tool use helps the model fetch, calculate, and verify. It also creates new failure modes: expired auth, connector bugs, ambiguous UI states, partial results, and retry loops that burn time and money.

The scores matter. They just don’t settle the deployment question.

Why the architecture matters

The interesting part of this release is the modular architecture OpenAI describes.

There’s a planner module that breaks requests into smaller steps and prioritizes actions using cost factors like latency and token usage. A tool router dispatches connector calls, spins up sandboxed code execution, and coordinates browser automation. An execution monitor validates outputs, catches failed steps, and hands control back to the planner when the workflow needs repair.
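
The tool-router piece is the easiest to picture in code. The sketch below is an assumption about the shape of such a component, not OpenAI's internals: handlers register by name, the router dispatches planner actions, and failures come back as structured errors the planner can repair from.

```python
# Illustrative tool-router sketch: dispatch planner actions to registered
# handlers and report failures instead of raising into the planner.
from typing import Callable

class ToolRouter:
    def __init__(self) -> None:
        self._tools: dict[str, Callable[[dict], dict]] = {}

    def register(self, name: str, handler: Callable[[dict], dict]) -> None:
        self._tools[name] = handler

    def dispatch(self, action: dict) -> dict:
        tool = self._tools.get(action["tool"])
        if tool is None:
            # Unknown tool: hand the problem back to the planner.
            return {"status": "error", "reason": f"no tool {action['tool']!r}"}
        try:
            return {"status": "ok", "result": tool(action.get("args", {}))}
        except Exception as exc:
            # The execution monitor catches failed steps the same way.
            return {"status": "error", "reason": str(exc)}
```

Returning errors as data rather than exceptions is what lets the planner decide whether to retry, reroute, or give up.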

That’s a sensible design. It also shows that agent systems are finally getting a little less naive.

A year ago, a lot of “autonomous agents” were basically an LLM loop with a tools array and misplaced confidence. Fine in demos. Messy in branching tasks. They repeated themselves, wandered off, or quietly claimed success. A planner plus execution monitor is a meaningful step toward reliability because it treats verification and recovery as core parts of the system.

OpenAI also says the model was fine-tuned on tool-use demonstrations and reinforced through self-play, where the agent attempts tasks, learns from failures, and iterates across simulated runs. That lines up with the product. Better tool coordination requires examples of good sequencing and signals for when the sequence breaks.

The cost-aware planning detail stands out too. If the planner is genuinely weighing API latency and token spend, that suggests agent systems are being shaped around operational constraints, not just raw reasoning scores.
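
Cost-aware selection is easy to sketch, even if OpenAI's actual weighting is unknown. The fields and weights below are invented for illustration: score each candidate action by expected value minus penalties for latency and token spend, then pick the best.

```python
# Hypothetical cost-aware action scoring: value minus latency and token
# penalties. Weights are illustrative, not OpenAI's.
def score(action: dict, latency_weight: float = 0.5,
          token_weight: float = 0.001) -> float:
    return (action["expected_value"]
            - latency_weight * action["latency_s"]
            - token_weight * action["tokens"])

def pick_action(candidates: list[dict]) -> dict:
    """Choose the candidate with the best value-for-cost score."""
    return max(candidates, key=score)
```

With equal expected value, a cheap API call should beat a slow, token-heavy browsing session, which is exactly the behavior the planner description implies.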

Security is where this gets real

OpenAI says the agent includes a safety layer that screens for biological and chemical weapon content, enforces policy in real time, and disables memory to reduce prompt-injection-driven data exfiltration.

That memory decision matters. Persistent memory is useful in a personal assistant. It also widens the attack surface when the model can read untrusted content and act across connected systems. Once an agent can browse the web, open emails, inspect docs, and send data elsewhere, prompt injection stops being a theory problem.

Anyone evaluating this should assume three things:

  • connectors increase the blast radius
  • browser automation is fragile and easy to manipulate
  • “sandboxed” code execution still needs hard boundaries and logging

OpenAI’s system card and safety report matter here, especially with the product described as “high capability.” In regulated environments, the built-in safeguards won’t be enough. You’ll still want audit trails, tight permission scopes, and approval gates for actions that have external effects.
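
An approval gate for actions with external effects is simple to wire in yourself. A minimal sketch, assuming you maintain your own allowlist of side-effecting action names (the names and callback shape here are hypothetical):

```python
# Hedged sketch of an approval gate: side-effect-free actions run
# immediately; anything with external effects needs explicit sign-off.
EXTERNAL_EFFECTS = {"send_email", "merge_pr", "update_calendar"}

def gate(action: str, execute, approve=input):
    """Execute immediately, or block until a human approves."""
    if action in EXTERNAL_EFFECTS:
        if approve(f"Allow {action}? [y/N] ").strip().lower() != "y":
            return {"status": "blocked", "action": action}
    return {"status": "done", "result": execute()}
```

In production the `approve` callback would post to a review queue rather than prompt on stdin, but the boundary is the same: the agent proposes, a human disposes.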

What to care about first

For teams, the first question is where this is worth trusting.

The safest starting point is bounded, annoying work with clear outputs. Good examples:

  • research summaries across a fixed source set
  • triaging GitHub issues into labels or draft responses
  • generating a slide outline from meeting notes and issue history
  • running tests in a sandbox and summarizing failures
  • collecting change context from Slack, tickets, and commits before a review

These are useful because they save time without giving the agent too much authority. They’re also easier to evaluate. You can compare the output with what a human would have assembled and see exactly where it slips.

The risky cases are obvious:

  • autonomous email handling with broad inbox access
  • calendar changes across multiple stakeholders
  • production repo actions with write permissions
  • API workflows that touch customer or billing data

That’s where governance stops being optional. OAuth scopes should stay narrow. RBAC should be explicit. Sensitive fields should be redacted before prompting when possible. And someone needs to watch token consumption, because multi-step execution can get expensive fast.
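
Redacting sensitive fields before prompting doesn't need to be elaborate. A minimal sketch, with an example field list that stands in for whatever your data policy actually names:

```python
# Minimal redaction sketch: strip sensitive fields before a record
# reaches the prompt. The field names are examples, not a full policy.
SENSITIVE = {"email", "phone", "card_number", "ssn"}

def redact(record: dict) -> dict:
    return {k: ("[REDACTED]" if k in SENSITIVE else v)
            for k, v in record.items()}
```

Even a crude allowlist/denylist like this shrinks what a prompt-injected agent can exfiltrate.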

That cost piece is easy to miss. One innocent “do this for me” request can fan out into browsing, connector calls, code execution, validation, and retries. Fine when it replaces an hour of tedious work. Less fine when a shell script or internal integration could do the job more predictably.
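
One cheap defense against runaway fan-out is a hard per-request token budget. A sketch, assuming each tool call reports its own token usage (the numbers are illustrative):

```python
# Sketch of a per-request token budget: once the limit is spent, the run
# stops loudly instead of silently fanning out further.
class TokenBudget:
    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.used = 0

    def charge(self, tokens: int) -> None:
        self.used += tokens
        if self.used > self.limit:
            raise RuntimeError(
                f"token budget exceeded: {self.used}/{self.limit}")
```

Charging before each step means a retry loop hits the ceiling fast instead of burning through a month's quota overnight.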

The broader shift

OpenAI’s strongest move here is product integration. A lot of agent tooling still feels like a starter kit for demos. ChatGPT Agent feels like an attempt to make this the default interface for getting work done.

That has consequences for SaaS vendors. If users expect one agent to read email, query GitHub, inspect docs, and update tasks, narrow single-app copilots start to look cramped. Vendors with strong connectors and clean permission models have an edge. The rest risk becoming plumbing behind someone else’s agent.

For engineering teams, the message is simpler. We’re moving from “LLM as chat interface” to “LLM as workflow coordinator.” That raises the ceiling. It also raises the operational burden. Reliability, observability, auth, auditability, and cost control all move closer to the center.

ChatGPT Agent looks like one of the clearest attempts so far to make that trade worth it. I’d treat it as a serious automation layer for bounded tasks. I wouldn’t hand it messy business processes and walk away. It’s ahead of a lot of agent hype. It still needs adult supervision.
