The person who leads Claude Code at Anthropic recently said he barely writes prompts anymore. He writes the loops that drive the AI, and lets the system work on its own. It sounds like something only a Silicon Valley insider could pull off. It isn't. You can do this work from anywhere, and so can the person you trust to guide you through it.
I do it from Indiana. Most nights, after the calls and the meetings, I am in what I call the sandbox: building, testing, breaking, and rebuilding small AI systems to learn what holds up. This is a look inside mine, and a way to judge whether the advisor you are listening to is in their own sandbox or only reading about everyone else's.
The hard skill keeps moving
A short history, because that recent remark only makes sense against it.
Two years ago the prized skill was writing a good prompt. Then people learned that the prompt was the small part, and the real work was feeding the model the right context: the documents, the examples, the memory. By mid-2025 that had a name, context engineering, after posts from builders like Tobi Lütke and Andrej Karpathy, and Anthropic later wrote it up formally.
Underneath all of it sat another idea. Back in December 2024, Anthropic had already described an agent as a model using tools in a loop: do a step, check the result, decide the next step, repeat. So when the head of Claude Code, Boris Cherny, described moving away from writing prompts and toward writing the loops that run the model, he was naming where the hard work had moved, not announcing something out of nowhere. (His comments came in his Acquired Unplugged interview; I am paraphrasing his point rather than quoting him word for word.)
The through-line: the skill that pays keeps climbing one rung higher. Write a prompt. Engineer the context. Build the agent. Now, design the loop the agent runs inside, and the check that catches it when it gets something wrong.
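For readers who want to see the shape of it, here is a minimal Python sketch of that do, check, decide, repeat cycle. The names (`run_loop`, `toy_act`, `toy_check`) are hypothetical, and the two toy functions stand in for a real model call and a real check. It illustrates the idea only and is not how Anthropic or Claude Code implements it.

```python
from typing import Callable, Optional, Tuple

# Hypothetical types: "act" stands in for a model call, "check" for the
# verification step. A check returns None on a pass, or feedback text on a fail.
Act = Callable[[str, Optional[str]], str]
Check = Callable[[str], Optional[str]]


def run_loop(act: Act, check: Check, task: str, max_steps: int = 5) -> Tuple[str, int]:
    """Do a step, check the result, decide the next step, repeat."""
    feedback: Optional[str] = None
    for step in range(1, max_steps + 1):
        result = act(task, feedback)   # do a step
        feedback = check(result)       # check the result
        if feedback is None:           # decide: the check passed, so stop
            return result, step
        # otherwise repeat, carrying the feedback into the next step
    raise RuntimeError("check never passed within max_steps")


def toy_act(task: str, feedback: Optional[str]) -> str:
    # Toy stand-in for a model: it only improves once it is told what is missing.
    return "summary" if feedback is None else "summary with sources"


def toy_check(result: str) -> Optional[str]:
    # Toy stand-in for verification: require the word "sources".
    return None if "sources" in result else "add sources"


result, steps = run_loop(toy_act, toy_check, "summarize the report")
print(result, steps)
```

The first attempt fails the check, the feedback goes back in, and the second attempt passes. The `max_steps` cap is the part worth noticing: a loop without a stopping rule is a loop that can run forever.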
The sandbox is not fenced off
That history should change how you feel about all of this. None of that work requires a job at a frontier lab or a zip code in the Bay Area. The tools are open. The models answer to anyone who shows up with a question and the willingness to try, fail, and try again.
You do not need permission to play in the AI sandbox. You need curiosity and reps. That is the whole entry fee.
What my sandbox looks like
So you can judge the claim, not just take it, here is what I actually run. I work with three named AI agents, each with a job:
- Barnabas is built on Claude. It is the operating brain of my company, and it does not work alone. Barnabas coordinates eleven specialist agents under it, each with one focus: one runs my content engine, one handles sales support, one acts as a CFO, and others cover legal, PR, analytics, and operations. Together they run most of the day-to-day work and help me build the website.
- Ezra is built on Codex. It takes the heavier, more complex coding and platform builds.
- Silas is the newest, and the one I am most interested in right now. Its only job is to audit the other two. I built it to be model-agnostic, so I can run the same loop across very different models and see what holds up. This week I am testing it across Claude Opus 4.8, Claude Sonnet 4.6, GPT-5.5, Kimi K2.6, Qwen3-Coder-Next, and DeepSeek R1.
I set Silas up this week, so I have no findings to report yet. The question I am chasing: how much of a good result is the loop, and how much is the model? If a well-built loop holds up across all six, then the loop is the durable asset and the model underneath is swappable.
Notice the part that is easy to skip. The agent I care about most has one job: checking the work of the other two. Plenty of setups stack agents that write, code, and ship. The skill that keeps the whole thing trustworthy is verification, so I gave verification its own agent on purpose and put it on a different model, which keeps its judgment independent of the model it is reviewing.
There is evidence behind that instinct. A 2025 study testing an agent built to judge other agents found it disagreed with the human-majority verdict only about 0.3 percent of the time, against roughly 31 percent for a single model acting as judge. The same paper is blunt that human oversight stays essential. That is the rule I build around: automate the work, keep a human on the part that decides.
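The setup described in this section can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code behind Silas: a "model" is assumed to be any function from prompt text to reply text, and every name here (`audit`, `release`, `strict_model`, `lenient_model`, `cautious_human`) is hypothetical. The toy lambdas stand in for real model calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: a "model" is any function from prompt text to reply text, so
# swapping models means swapping one entry in a dictionary.
Model = Callable[[str], str]


@dataclass
class Verdict:
    auditor: str
    passed: bool
    reply: str


def audit(work: str, auditors: Dict[str, Model]) -> List[Verdict]:
    """Run the same audit prompt through every auditor model."""
    prompt = "Review this work. Answer pass or fail, then explain.\n" + work
    verdicts = []
    for name, model in auditors.items():
        reply = model(prompt).strip()
        verdicts.append(Verdict(name, reply.lower().startswith("pass"), reply))
    return verdicts


def release(work: str, auditors: Dict[str, Model],
            human_decides: Callable[[List[Verdict]], bool]) -> bool:
    """Automate the audit, but leave the final call to a person."""
    return human_decides(audit(work, auditors))


# Toy stand-ins for two different models (hypothetical names).
auditors: Dict[str, Model] = {
    "strict_model": lambda p: "fail: unfinished TODO" if "TODO" in p else "pass",
    "lenient_model": lambda p: "pass",
}


def cautious_human(verdicts: List[Verdict]) -> bool:
    # Toy stand-in for the human: approve only when every auditor passed.
    return all(v.passed for v in verdicts)


print(release("Draft with TODO", auditors, cautious_human))
print(release("Finished draft", auditors, cautious_human))
```

Because the loop only ever sees a name and a function, the same audit runs unchanged no matter which model sits behind each entry. When the auditors disagree, as they do on the first draft here, that disagreement is exactly what gets handed to the person who decides.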
How to tell if an AI advisor is any good
If you are choosing someone to guide your company through this, you do not need to grade their slides. You need to know whether they are in the sandbox or narrating it from the stands. Five questions sort that out fast:
Do you build with these tools every day, or talk about them?
The honest answer is specific and recent, not a list of vendors they have read about.
Which models do you actually run?
Someone in the work can name them and tell you where each one is strong or weak.
What have you shipped with AI that real people use?
Not a demo. Something live, with users who are not them.
How do you check the AI's work before it goes out?
If there is no verification step, there is no system, only output.
What happens when the model you built on changes?
A good answer means they designed for it. A blank look means they did not.
A guide who is in their own sandbox passes these without blinking, because they live the answers. That is the whole point of asking.
Why this is the kind of guide you want
Here is what that daily practice does for you as a client, which is the reason any of this is worth your attention.
The testing carries across accounts. I prove a loop once, harden it, and the knowledge plus the reusable patterns reach everyone I work with. The system that runs my content is the reference build every client copies. So the sandbox behaves like capital that every client draws on.
It also lets me make a promise most cannot. Because the same loop is tested across many models, your AI system is model-agnostic. It is built so that it does not break when one model is retired, and so that it can move to a better or cheaper model when one shows up. Your agent outlives any single model. An advisor who only ever built on one tool, and never tested the loop apart from it, cannot say that and mean it. That is the heart of the way I work with owners and executives.
And it is the core of the job. A guide has to be out ahead on the trail to keep anyone from being left behind. The people who feel behind on AI do not need a guru who has already arrived. They need someone still climbing, a step or two ahead, who has walked the next stretch and can tell you where the footing is.
Anyone can start
I am not at the top of this. I am still early, and working to get better at it every day. That is the honest version, and it is also the invitation.
The sandbox is where the climb begins, not a place you reach fully formed. We measure that climb with The 7 Levels of AI Proficiency, a way to see where you stand today and what the next rung looks like. Level 1 is simply being aware. You do not start at the top, and you do not need to.
You do not need Silicon Valley. You need a tool, a real problem from your own work, and the nerve to try something tonight that might not work. That is the entry to the AI sandbox, and the door is open from wherever you are sitting.
Sources
- Building effective agents. Anthropic. December 19, 2024.
- Effective context engineering for AI agents. Anthropic. September 29, 2025.
- Boris Cherny on Acquired Unplugged. Acquired, presented by WorkOS. 2026.
- Agent-as-a-Judge: Evaluating Agents with Agents. arXiv 2508.02994. 2025.
- Claude models overview. Anthropic. Accessed June 8, 2026.
Frequently Asked Questions
What does writing loops mean?
A loop is a small system where the AI does a step, checks the result, decides the next step, and repeats, with a person watching the part that decides. Instead of typing one instruction at a time, you design the cycle the AI runs inside.
Do I need to be technical to build my own AI agent?
No. The entry fee is curiosity and practice, not a computer science degree. You start with one real task from your own work and one tool, and you build up from there.
What does model-agnostic mean, and why should I care?
It means the system is built so the model underneath can be swapped without breaking the work. You care because models change constantly. A model-agnostic setup means your AI keeps working, and can move to a better or cheaper model later.
How do I choose an AI advisor?
Ask whether they build with these tools daily, which models they run, what they have shipped that real people use, how they verify the AI's work, and what they do when a model changes. Someone in the work answers all five with specifics.
Find your AI Proficiency level
The free 7 Levels of AI Proficiency assessment places you across seven stages of AI capability. Under ten minutes. Research-backed scoring.