Garry Tan’s gstack puts a name on how serious teams are already using coding agents
Garry Tan’s open source Claude Code setup, gstack, got the kind of reaction you see when a tool lands on an idea people already care about. Some developers saw a useful pattern for AI-assisted software work. Others saw a prompt folder with founder branding.
Both reactions make sense.
Tan published the repo on GitHub on March 12 under an open source license, starting with six reusable Claude Code “skills” and later expanding to 13. The skills define roles like CEO, engineer, reviewer, designer, and documentation writer in structured skill.md files. He’s said he uses multiple workers across projects including a rebuilt Posterous, Posthaven, and GarrysList.org. One anecdote spread quickly: a CTO friend reportedly called the setup “god mode” after it caught a security issue.
The eye-rolling was predictable. Plenty of engineers already keep their own prompt libraries, review rubrics, and role presets. Packaging a familiar workflow and giving it a name can look thin.
Still, writing it off as “just prompts” misses the part that matters. Software engineering is full of systems that look trivial until someone makes them repeatable enough for other teams to adopt. That’s the interesting part of gstack.
Why it landed
For the past year, a lot of AI coding discussion has stayed stuck in demos. Prompt for a feature, get a patch, move on before anyone asks about cleanup, review, or regressions. gstack points at a more grounded setup: role separation, handoffs, review loops, and tool-backed checks.
That lines up with how good teams already work.
Tan’s setup seems to follow a simple flow:
- A “CEO” skill evaluates an idea.
- An “Engineer” skill implements it.
- A “Reviewer” skill checks quality and security.
- Other skills handle design, product, and docs.
On paper, that can look a little ridiculous. A fake org chart inside a prompt system is easy to mock. But language models usually do better with constrained jobs and narrow output formats than with one giant instruction blob asking for product judgment, implementation, testing, and security all at once.
That’s the practical value. Smaller role definitions tend to produce cleaner outputs. They also create checkpoints where you can run real tools instead of trusting the model to catch its own mistakes.
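Those checkpoints are easy to sketch. The following is a hypothetical illustration, not gstack's actual code: role separation modeled as a pipeline of narrow stages, each with a constrained job and a gate where real tools can run between steps.

```python
# Hypothetical sketch: role separation as a pipeline of narrow stages.
# Each stage has one job and a constrained output, with a checkpoint
# between stages where tool-backed checks can reject bad work.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Stage:
    name: str                                        # e.g. "ceo", "engineer", "reviewer"
    run: Callable[[dict], dict]                      # takes context, returns structured output
    gate: Callable[[dict], bool] = lambda out: True  # tool-backed checkpoint

def run_pipeline(stages: list[Stage], context: dict) -> dict:
    for stage in stages:
        out = stage.run(context)
        if not stage.gate(out):
            raise RuntimeError(f"{stage.name} checkpoint failed")
        context = {**context, stage.name: out}       # each stage sees prior outputs
    return context

# Toy usage with stubbed stages; real stages would call the model.
stages = [
    Stage("ceo", lambda ctx: {"approved": True}),
    Stage("engineer", lambda ctx: {"diff": "+ fix"}, gate=lambda o: bool(o["diff"])),
    Stage("reviewer", lambda ctx: {"findings": []}),
]
result = run_pipeline(stages, {"task": "add retry logic"})
```

The point of the gate callback is that "reviewer approved" can mean "tests and scanners passed," not "the model said it looked fine."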
The workflow matters more than the personas
The “CEO” label is cute and a bit much. The mechanics are the useful part.
A skill file in Claude Code is basically a structured role definition with goals, rules, allowed tools, and an expected output format. The technical pattern looks something like this:
```yaml
name: code_reviewer
role: Senior Security-Focused Code Reviewer
goals:
  - Identify defects, injection risks, and unsafe patterns
  - Propose minimal, concrete patches
guidelines:
  - Cite files and line ranges
  - Stay within existing architecture
tools:
  - static_analysis
  - unit_tests
output_format:
  - findings: list
  - diffs: unified
```
That structure matters more than the label. It gives the model a narrower lane. It tells the orchestration layer what to send in and what should come back. It also makes the system testable, which is where a lot of agent demos fall apart.
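"Testable" can be concrete even at the schema level. Here is a minimal validation sketch for a skill definition shaped like the example above; the field names mirror that example and should be read as illustrative, not as gstack's actual spec.

```python
# Minimal validation for a skill definition shaped like the example above.
# Field names are illustrative, not a documented gstack schema.
REQUIRED = {"name", "role", "goals", "guidelines", "tools", "output_format"}

def validate_skill(skill: dict) -> list[str]:
    """Return a list of problems; an empty list means the skill looks well-formed."""
    problems = [f"missing field: {k}" for k in sorted(REQUIRED - skill.keys())]
    for list_field in ("goals", "guidelines", "tools"):
        if not isinstance(skill.get(list_field, []), list):
            problems.append(f"{list_field} must be a list")
    return problems

reviewer = {
    "name": "code_reviewer",
    "role": "Senior Security-Focused Code Reviewer",
    "goals": ["Identify defects", "Propose minimal patches"],
    "guidelines": ["Cite files and line ranges"],
    "tools": ["static_analysis", "unit_tests"],
    "output_format": {"findings": "list", "diffs": "unified"},
}
issues = validate_skill(reviewer)   # empty list: this definition is well-formed
```

A check like this can run in CI, so a broken skill file fails a build instead of silently degrading agent behavior.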
A decent multi-agent coding loop has a few moving parts that are easy to miss if you only look at prompts.
Context assembly
The hard problem in repo-scale coding agents is still context selection. Not model cleverness. Context.
If the engineer skill gets the wrong files, stale interfaces, or missing constraints, the rest of the workflow is built on bad inputs. Good systems use repo indexing, symbol lookup, path heuristics, ripgrep, embeddings, or some combination of those. Long context windows help, but dumping half the monorepo into the prompt is sloppy and expensive.
Precision matters here.
Tool-backed review
If the reviewer skill just reads the generated patch and gives it a thumbs-up, you’ve built an expensive rubber stamp. Useful setups run tests, linters, static analysis, and security scanners alongside the model.
Think `pytest`, `npm test`, `eslint --max-warnings=0`, `semgrep`, `bandit`, `gosec`. The model should interpret those results, not replace them.
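The wiring for that is ordinary subprocess plumbing. In this sketch the commands are placeholders standing in for pytest or a linter, but the shape is the point: exit codes from real tools, not model opinion, decide whether the merge proceeds.

```python
# Sketch of a tool-backed review gate: pass/fail comes from real tools'
# exit codes. The commands below are harmless placeholders standing in
# for pytest / eslint / semgrep invocations.
import subprocess
import sys

def run_checks(commands: list[list[str]]) -> dict[str, bool]:
    """Run each check; a nonzero exit code marks it as failed."""
    results = {}
    for cmd in commands:
        proc = subprocess.run(cmd, capture_output=True)
        results[" ".join(cmd)] = proc.returncode == 0
    return results

checks = [
    [sys.executable, "-c", "print('tests ok')"],       # stand-in for pytest
    [sys.executable, "-c", "import sys; sys.exit(1)"], # stand-in for a failing linter
]
results = run_checks(checks)
merge_allowed = all(results.values())   # any failed tool blocks the merge
```

The model's job in this loop is to read the tool output and propose fixes; it never gets to overrule a red exit code.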
Structured outputs
Freeform prose becomes a problem the moment you want automation. If a reviewer returns JSON findings, diff hunks, line references, and a checklist, that can feed CI, dashboards, and follow-up steps. If it writes a long essay, someone has to parse it by hand.
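A structured reviewer payload makes that concrete. The schema below is illustrative, not gstack's actual format, but it shows how JSON findings can drive a CI decision with a few lines of code.

```python
# Example of why structured output helps: JSON findings can drive CI
# directly. This schema is illustrative, not gstack's actual format.
import json

review_json = """
{
  "findings": [
    {"file": "auth/session.py", "lines": "40-52",
     "severity": "high", "issue": "session token logged in plaintext"}
  ],
  "checklist": {"tests_added": true, "migrations_safe": true}
}
"""
review = json.loads(review_json)

# Machine-readable findings mean the gate is one expression, not a human
# reading an essay: block on any high/critical finding or failed checklist item.
blocking = [f for f in review["findings"] if f["severity"] in ("high", "critical")]
ci_should_fail = bool(blocking) or not all(review["checklist"].values())
```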
That’s why prompt engineering is starting to look a lot like ordinary software work. Versioned files. Schemas. Evaluations. Failure cases.
Why developers are split
The skepticism is healthy.
A lot of the praise for gstack, especially from outside engineering circles, drifts into magical thinking. Running five to ten Claude workers does not give you a software team in a box. In practice, more parallel agents often create more chances for duplicated work, contradictory edits, and context drift.
And “agentic” still covers a lot of weak systems. Some are nested prompts with a nice activity log. Others break as soon as they touch a codebase with odd conventions, undocumented dependencies, or ten years of migration debris. Critics are right to ask whether this is solid engineering or packaging.
But the packaging does matter because it makes the pattern visible.
Teams have been doing versions of this privately for months: prompt directories under version control, a patch writer agent, a reviewer agent, a security pass, maybe a docs pass, all stitched together with scripts and CI hooks. Tan’s repo gives that pattern a public artifact. Once that happens, people can argue about specifics instead of vague claims.
That’s useful.
Where it actually helps
gstack-style workflows fit some kinds of work very well.
They’re good for scoped feature work, CRUD-heavy changes, tests, docs, migrations, refactors with clear acceptance criteria, and review passes over code that already has decent structure. They can also help small teams cover more ground, especially when one person is effectively acting as PM, engineer, and reviewer.
They’re much weaker when the work depends on tacit product judgment, ugly legacy coupling, performance tuning under real load, or deep domain knowledge that isn’t written down anywhere. Models can sound confident there and still be badly wrong.
That trade-off matters for technical leads. You can get real throughput gains, but only if you pick the right tasks and put hard gates around the output.
The security angle needs more attention
The security anecdote around gstack got attention for a reason. Security review is one of the areas where these role-based setups can help, because a dedicated review pass with explicit rules often catches issues that a general “build this feature” prompt will miss.
There’s also a downside. Agent pipelines create new attack surfaces.
If you’re feeding repository content, tool output, tickets, and environment data into model contexts, you need to think about:
- secret exposure
- prompt injection through repo files or issue text
- over-broad filesystem access
- untrusted tool execution
- missing audit trails on generated diffs
A serious setup should sandbox tools, stub or mask sensitive config, log every change, and treat model output as a proposal until it passes tests and human review. If the reviewer agent times out or a scanner flags something ugly, the merge should stop.
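One piece of that hygiene is cheap to illustrate: masking likely secrets before repo or config content reaches a model context. The patterns below are a small, illustrative sample, not a complete defense; real setups pair this with allowlists, vaults, and sandboxing.

```python
# Illustrative secret masking before content enters a model context.
# These patterns are a sample, far from exhaustive; treat as a sketch.
import re

SECRET_PATTERNS = [
    # key=value or key: value style credentials
    re.compile(r"(?i)(api[_-]?key|secret|token|password)\s*[:=]\s*\S+"),
    # PEM private key blocks
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----[\s\S]*?-----END [A-Z ]*PRIVATE KEY-----"),
]

def mask_secrets(text: str) -> str:
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text

config = "db_host=localhost\napi_key = sk-live-abc123\ndebug=true"
safe = mask_secrets(config)   # credential line redacted, harmless lines kept
```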
One common mistake is assuming an AI reviewer means less process. Usually it means tighter process, because these systems fail fast, confidently, and in ways that can be genuinely strange.
What engineering teams should take from this
The strongest signal from gstack is that prompts are turning into operational artifacts.
Teams are going to manage agent skills the way they manage code and infrastructure:
- keep role definitions in the repo
- attach owners and review policies
- add evals and failure cases
- track token cost and latency by workflow
- record what context was selected and why
- gate outputs through CI and human review
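"Add evals and failure cases" can be as modest as a replay harness: recorded cases run against a skill, with assertions on the output contract. In this sketch `run_skill` is a stub standing in for whatever actually invokes the agent runtime.

```python
# Sketch of treating skills like code: a tiny eval harness that replays
# recorded cases against a skill and checks the output contract.
# run_skill is a stub; a real version would call the agent runtime.
def run_skill(skill_name: str, case_input: dict) -> dict:
    return {"findings": [], "diffs": ""}   # stubbed model output

EVAL_CASES = [
    {"name": "sql-injection-regression",
     "input": {"diff": "query = 'SELECT * FROM users WHERE id=' + uid"},
     "must_have_keys": {"findings", "diffs"}},
]

def evaluate(skill_name: str) -> list[str]:
    """Return contract violations; empty means every eval case passed."""
    failures = []
    for case in EVAL_CASES:
        out = run_skill(skill_name, case["input"])
        missing = case["must_have_keys"] - out.keys()
        if missing:
            failures.append(f"{case['name']}: missing {sorted(missing)}")
    return failures

failures = evaluate("code_reviewer")
```

Once this exists, a prompt edit that silently breaks the reviewer's output format fails a check instead of surfacing weeks later in production.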
This is also where standards like Model Context Protocol (MCP) matter. If tool and data access gets more portable, these workflows become easier to move across vendors and internal systems. That won’t fix quality on its own, but it does remove one obvious headache: rebuilding the plumbing for every model stack.
For senior engineers, the takeaway is straightforward. If your team is still treating AI coding as a chat window, you’re in the toy phase. The useful work starts when roles are defined, outputs are constrained, tools are wired in, and failure patterns are measured.
Tan’s repo didn’t invent that pattern. It did make it harder to ignore.
And yes, people will keep joking that it’s “just prompts.” Software has a long history of advancing through abstractions that look obvious in hindsight but still save time, standardize behavior, and hold up in real systems. gstack looks like one of those.