The Machine at the Keyboard: The Judgment You Can't Delegate to an AI Coder
An agentic coder types faster than you, in languages you don't know, for hours while you sleep. None of that makes it the engineer. A senior engineer's field guide to directing the machine — why it guesses, what context actually costs, and why the judgment that matters most is the judgment about what it didn't do.
An agentic coding tool can type faster than you, in languages you don't know, for hours while you sleep. None of that makes it the engineer. After two decades of building software at GeekyAnts (https://geekyants.com), I am convinced of one thing about this moment: the tool did not remove the work. It relocated it — out of your fingers and into your judgment. And the better your judgment, the more leverage you get.
Every few weeks I teach a cohort of our engineers. Lately the sessions have circled one idea so insistently that I want to write it down plainly. It is the thesis of everything we are learning about working with these machines:
The machine is the hands. You are the head.
That sounds like a slogan until you understand why it is true — and the why starts with what the machine actually is underneath the fluent voice.
The Machine Guesses
A language model does not calculate. It predicts. Given the text so far, it produces a probability distribution over what should come next, and samples from it. This single fact is the root of almost everything that follows.
It helps to hold two ways of producing an answer in your head. A deterministic process follows a known procedure and is guaranteed correct: long multiplication, a hash function, a sort. Run it a thousand times, get the same right answer. A probabilistic process produces a likely answer informed by patterns it has seen, with no guarantee. A model is probabilistic to its core — an extraordinarily capable guesser of the next token.
This is why a raw model stumbles on arithmetic. Ask it to multiply two fifteen-digit numbers and, left to itself, it produces a result of roughly the right shape — plausible leading digits, frequently wrong — because it never computed anything. It predicted what a correct-looking answer resembles. The fix is not to train it harder on multiplication. The fix is to hand it a tool: when the task is arithmetic, the model calls a calculator, the calculator computes the exact result, and the model uses it. The model decides what to do; a tool does the part that must be exact.
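A minimal sketch of that division of labor, under stated assumptions: the `TOOLS` registry and `call_tool` helper are hypothetical names, and the floating-point product is only a stand-in for a "right shape" guess, not a claim about how a model actually errs.

```python
import operator

# Hypothetical tool registry: each entry is an exact, deterministic function.
TOOLS = {"multiply": operator.mul, "add": operator.add}

def call_tool(name: str, a: int, b: int) -> int:
    """Dispatch to a deterministic tool; same inputs always give the same output."""
    return TOOLS[name](a, b)

a = 123_456_789_012_345
b = 987_654_321_098_765

exact = call_tool("multiply", a, b)   # arbitrary-precision integer arithmetic
approx = int(float(a) * float(b))     # stand-in for a plausible-looking guess

print(exact)
print(approx)
print(exact == approx)  # False: a 64-bit float cannot hold every digit of the product
```

The two results share their leading digits and differ further along, which is exactly the failure that is hard to spot by eye and trivial to avoid by routing the exact part to a tool.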
The senior instinct is to notice when a task has quietly drifted from probabilistic to deterministic — and to make sure something exact stands behind the guess.
A friendly summary can be a guess; that is fine, it is its nature. A customer's invoice total cannot. Knowing which parts of your problem tolerate a guess and which demand a guarantee is the first piece of judgment this whole subject asks of you. I went deep on exactly this line, and on why getting it wrong is the most expensive mistake in AI engineering, in The Human and the Machine.
Context Is Not Free
Here is the part that surprises people who picture a model as a fixed block of weights answered by a quick lookup.
When a model generates text, every new token has to "attend" to all the tokens before it. To avoid recomputing the earlier ones at every step, the model stores their attention keys and values in memory and reuses them — the KV cache. The weights are frozen knowledge, shared across everyone. The KV cache is the scratch pad for this conversation, allocated fresh at runtime, and it grows with the length of the context. (If you want the full mechanics of how that scratch pad works and why it dominates the cost of long context, I wrote a dedicated piece: The GPU KV Cache.) For a large model with a long window, that scratch pad can run into many gigabytes — per request, for every concurrent user, for as long as the conversation lives.
So a "one-million-token context window" is an impressive number that quietly implies a large and expensive scratch pad behind every active session.
Every token you keep in the window costs memory and bandwidth. Long context is a real engineering and cost constraint, not a headline figure.
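To put rough numbers on the scratch pad, here is a back-of-the-envelope sketch. The formula is the standard count of stored keys and values; the model shape (32 layers, 8 key-value heads, head dimension 128, 2 bytes per value) is a hypothetical configuration chosen for illustration, not a description of any particular model.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache for one sequence.

    The leading 2 counts one key tensor plus one value tensor per layer.
    """
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value

# Hypothetical model shape, for illustration only.
SHAPE = dict(layers=32, kv_heads=8, head_dim=128)

for tokens in (1, 131_072, 1_000_000):
    size = kv_cache_bytes(tokens=tokens, **SHAPE)
    print(f"{tokens:>9} tokens -> {size / 2**30:.2f} GiB")
```

Under these assumptions each token costs 128 KiB, so a 131,072-token context needs 16 GiB for a single sequence, before counting any other concurrent session.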
The day-to-day bottleneck in serving these models is rarely "is it smart enough." It is moving data. Generating text one token at a time is memory-bandwidth bound: to produce each token, the hardware streams the relevant weights and the KV cache out of memory and through the compute units. The arithmetic is fast; the data movement is the limit. You cannot prompt your way past a bandwidth wall. You can only move less data — fewer tokens, lower precision, caching what repeats. When you use a hosted agentic tool you never see this wall directly. But it explains everything you do feel: why larger context costs more, why a query over a huge repository slows down, why "just feed it the whole codebase" is never free. The constraint did not vanish. It moved behind a bill.
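The bandwidth wall can be estimated with one division. This is a simplified sketch for a single sequence: it assumes every generated token streams all the weights and the whole KV cache through memory once, and it ignores batching and compute time. The weight size, bandwidth, and cache size below are hypothetical round numbers.

```python
def decode_ceiling(weight_bytes: float, kv_bytes: float, bandwidth: float) -> float:
    """Upper bound on tokens per second for one sequence when each token
    requires reading all weights plus the KV cache from memory."""
    return bandwidth / (weight_bytes + kv_bytes)

WEIGHTS = 14e9     # hypothetical: 7 billion parameters at 2 bytes each
BANDWIDTH = 1e12   # hypothetical: 1 TB/s of memory bandwidth

short_ctx = decode_ceiling(WEIGHTS, 0, BANDWIDTH)
long_ctx = decode_ceiling(WEIGHTS, 16 * 2**30, BANDWIDTH)

print(f"{short_ctx:.1f} tokens/s ceiling with an empty context")
print(f"{long_ctx:.1f} tokens/s ceiling with a 16 GiB KV cache")
```

Nothing about the prompt changes either number. Only moving fewer bytes does: a shorter context, lower precision, or a cache hit.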
How It Got Good — Tools, Not Genius
The single biggest reason modern AI became genuinely useful is not that the models got smarter. It is that they learned to use tools.
Early on, asking a model for JSON would often return something shaped like JSON — a trailing comma, an unquoted key — because "looks like valid JSON" is not "is valid JSON." The fix was not to make the model love commas more. It was to add mechanisms that guarantee well-formed output and train the model to use them. Generalize that and you have the modern pattern: a tool is any deterministic capability the probabilistic model can invoke — a calculator, a code interpreter, a file writer, a web search. The model's job narrows to the thing it is genuinely good at: understanding intent and choosing which tool to call.
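The gap between "looks like JSON" and "is JSON" is easy to demonstrate with a strict parser. A minimal sketch using Python's standard `json` module; the `is_valid_json` helper and the sample strings are made up for illustration.

```python
import json

def is_valid_json(text: str) -> bool:
    """Return True only if a strict parser accepts the text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

samples = [
    '{"task": "buy milk", "done": false}',   # well-formed
    '{"task": "buy milk", "done": false,}',  # trailing comma
    '{task: "buy milk", "done": false}',     # unquoted key
]

for text in samples:
    print(is_valid_json(text), text)
```

A check like this is the deterministic half of the pattern: the model proposes, and something exact decides whether the proposal is acceptable.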
Much of what feels like a model's brilliance is really the tools and orchestration around it — reasoning, tool-calling, structured output, retrieval — cooperating behind one reply.
An agent is just a model given tools, a goal, and some autonomy to pursue it. An agentic coding tool is exactly that idea aimed at software: a model wrapped in tools to read files, write files, search a codebase, run commands, and drive a browser. Seen this way, it is not magic. It is something you can reason about and direct. Which is the whole point.
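Stripped to its skeleton, that definition is a loop. In the sketch below everything is a hypothetical stand-in: `scripted_model` replaces a real model with three hard-coded decisions, and the "file system" is an in-memory dictionary so the example has no side effects.

```python
from typing import Callable

# In-memory stand-in for a workspace.
files: dict[str, str] = {"todo.txt": "buy milk"}

def read_file(path: str) -> str:
    return files[path]

def write_file(path: str, text: str) -> str:
    files[path] = text
    return "ok"

TOOLS: dict[str, Callable[..., str]] = {
    "read_file": read_file,
    "write_file": write_file,
}

def scripted_model(goal: str, history: list) -> tuple:
    """Hypothetical stand-in for the model: choose the next tool call
    based on what has been observed so far."""
    if not history:
        return "read_file", ("todo.txt",)
    if len(history) == 1:
        current = history[0][1]  # result of the earlier read
        return "write_file", ("todo.txt", current + "\nwalk dog")
    return "finish", ()

def run_agent(goal: str, max_steps: int = 5) -> list:
    """Loop: ask the model for an action, run the tool, record the result."""
    history = []
    for _ in range(max_steps):
        action, args = scripted_model(goal, history)
        if action == "finish":
            break
        history.append((action, TOOLS[action](*args)))
    return history

trace = run_agent("add 'walk dog' to the list")
print([step[0] for step in trace])
print(files["todo.txt"])
```

The model only chooses; every change to the workspace goes through a tool you can inspect, restrict, or log.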
The Vague Statement Is Dangerous
Now watch what happens when you ask for too little.
Tell an agent "build a to-do application" and two things happen, both instructive. A good modern agent will often pause and ask clarifying questions — what framework, what database, which features. That is genuinely recent behavior, and it rewards engineers who can answer well. But give it a few thin answers and it will cheerfully build something functional that no human would actually want to use — say, a working backend API with no interface and no authentication, because nothing in the instruction said "a human will open this in a browser" or "users must log in." It may even write and run its own tests against that API and report, correctly, that everything passes.
Everything does pass. The agent did exactly what it was told. The problem was never the agent. It was the instruction.
An agentic coder is a fluent executor of underspecified wishes. It turns vagueness into running, tested, confident code so fast that motion is easily mistaken for correctness.
The old engineering lesson arrives in new clothes: the vague statement is dangerous. It was true when a human read your requirements. It is more true now, because the machine will build your ambiguity faster than you can review it.
So do it properly. A precise instruction is not a wish; it is a specification. A good one reads something like: Build a to-do app. Node and Express. Postgres. Set it up with Docker. Write the tests. React on the front end. Email-and-password, session-based auth. Schema via Prisma migrations. Service-repository pattern. Look at what that is made of — every phrase is a decision taken back from the agent and made by you. "Use Postgres" chose the database. "React on the front end" declared there is a human at the other end. "Session-based auth" specified the model instead of leaving it to chance.
The more precise the instruction, the better the output — where "precise" means more decisions made, not more words.
A long, waffly prompt that still never says where the data lives is no better than a short one. (If you want a repeatable structure for loading that precision into a prompt, I lay one out in The 3D Prompting Framework.) And precision has a sensible stopping point: you need not control every atom, only decide consciously which decisions are yours and which you are content to delegate. That conscious line is the engineering. The corollary is humbling: you can only specify what you understand. Where your instruction is vague, it is usually because you do not yet know what the decision is. That gap is the one to close.
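One way to make "decisions made, not words written" concrete is to treat the instruction as a set of named decisions and list the ones left blank. This is a sketch, not a prescribed format: the decision names in `DECISIONS` are my own labels for the choices in the to-do example, and `undecided` is a hypothetical helper.

```python
# Labels for the decisions in the to-do example; the names are illustrative.
DECISIONS = (
    "backend", "database", "packaging", "tests",
    "frontend", "auth", "migrations", "architecture",
)

def undecided(spec: dict) -> list:
    """Return the decisions the instruction leaves to the agent."""
    return [name for name in DECISIONS if not spec.get(name)]

vague = {}  # "build a to-do application"

precise = {
    "backend": "Node and Express",
    "database": "Postgres",
    "packaging": "Docker",
    "tests": "write the tests",
    "frontend": "React",
    "auth": "email-and-password, session-based",
    "migrations": "Prisma",
    "architecture": "service-repository pattern",
}

print("vague leaves open:  ", undecided(vague))
print("precise leaves open:", undecided(precise))
```

An empty list does not mean the specification is complete. It only means every decision you thought to name has an owner.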
The Judgment You Can't Delegate: What It Didn't Do
Here is the discipline that separates a senior from everyone else in the room.
A precise build can come back genuinely good — React front end, Postgres in Docker, migrations, session auth, passing tests. The temptation is to call it done. The discipline is to ask what is still missing. Because the things an agent does not do are invisible exactly when you are most pleased with it.
Think about what a build does not address unless told: where the data actually lives and who can reach it; how it deploys and into what environment; whether it serves ten users or ten thousand; what happens on export, backup, and deletion; rate limiting; input validation; behavior under load; observability. None of these are bugs. The agent did not fail. They simply lived outside the instruction, and an agent builds only what is inside the instruction.
This is why "it works" and "it is safe to depend on" are different claims — and the gap between them is filled entirely by questions the agent will never ask for you. Where exactly will this be used? What is the deployment environment? Who are the users? Do we even need this part?
The judgment you cannot delegate is the judgment about what is absent. The machine generates the present; you are responsible for what is missing. Anyone can produce a to-do app. Knowing it has no authentication, no defined data home, and no deployment story — and caring — is the engineering.
Architecting the Machine's Memory
If you take one practical habit from all of this, take the way you manage what the machine knows before it starts.
An agent that forgets everything between sessions has to be re-onboarded every morning. The fix is a memory file — in Claude Code it is a CLAUDE.md the agent reads into context before doing anything else. It is the standing brief: what the project is, which patterns it uses, the folder structure, the decisions already made, the things never to do. A useful habit is to let the agent maintain it — after it sets up or substantially changes a project, ask it to write that file with the current state. And commit it to the repository. If the project's memory lives on one laptop, every teammate who clones the repo runs an amnesiac agent that has lost the conventions.
Memory is useful until it grows fat. On a real project that file keeps accumulating until it is hundreds of lines, and every session loads all of it — costing context, and diluting the agent's attention with material irrelevant to the task. So split it: a lean top-level file with the essentials and an index, the detail in separate files — deployment, architecture, conventions — pulled in on demand when the work touches them.
You are no longer writing a memory file. You are designing a memory system — deciding what the machine knows up front, and arranging for it to learn the rest exactly when it needs to.
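The split can be pictured as a small lookup: one file that is always loaded, plus an index of detail files pulled in only when the task mentions their topic. This sketch illustrates the idea only; the `docs/` paths, the keyword matching, and `select_memory` are assumptions of mine, not a description of how Claude Code resolves memory.

```python
# Always loaded: the lean top-level brief.
ALWAYS = ["CLAUDE.md"]

# Hypothetical index from topic keyword to detail file.
INDEX = {
    "deployment": "docs/deployment.md",
    "architecture": "docs/architecture.md",
    "conventions": "docs/conventions.md",
}

def select_memory(task: str) -> list:
    """Pick the memory files worth loading for this task."""
    text = task.lower()
    return ALWAYS + [path for topic, path in INDEX.items() if topic in text]

print(select_memory("Fix the deployment script for staging"))
print(select_memory("Rename a variable in the task list component"))
```

The second task loads only the top-level file, so the context it pays for is the context it needs.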
This is, in effect, a retrieval system built on your own filesystem — the same instinct, scaled up, that turns a lone prompt into a durable working environment for an agent, which I argued for in Your Agents Need a Company, Not Just a Prompt. And it scales further. Sub-agents are separate workers with their own context windows — delegate the noisy investigation, get back only the summary, keep your main context clean. Effort levels let you buy deliberation on hard problems: more reasoning tokens spent before the agent acts, not more agents. Plan mode makes it write the steps before touching code. Hooks bake a team's discipline into the workflow — "after editing any file, run the tests" — so good practice stops depending on you remembering to ask. Each of these is a lever for pushing your judgment upstream, into the brief, so the machine's tirelessness executes it downstream.
The Team and the Machine
When several people and one agent share a repository, the question becomes how to share the machine without descending into chaos — and the answer turns everything above into a way of working.
Before anyone writes a feature, design the knowledge base. Each role owns the memory for its domain — testing conventions, schema rules, architecture — written down where the agent will read it. The precise instruction becomes the agent's brief. The committed, split memory becomes the team's shared context, so no one's agent goes amnesiac. Plan mode, hooks, and sub-agents become the choreography.
And the discipline that protects all of it is the oldest one we have: pull requests and code review. With an agent that generates code faster than anyone can read it, the team lead's most important job is no longer writing code. It is reviewing it — confirming the project stays in the agreed structure, that no stray framework got injected, that the build is going the direction it was designed to go. Permission-skipping "dangerous" mode is for a throwaway sandbox. On shared work, follow the normal flow: open a PR, have someone review, then merge.
The machine made code cheap. That makes judgment about code the scarce and valuable thing — and review is where that judgment is applied.
Closing Thought
Step back and the argument is simple. Underneath, the machine is a probabilistic guesser that cannot multiply without a tool, thinks in a costly scratch pad of memory, and became useful not by growing wiser but by being wrapped in tools and assembled from specialists. Put it at the keyboard and it will turn a one-line wish into running, tested, incomplete software in minutes — and the engineering reappears, untouched, in everything it did not do.
The machine relocated the work; it did not remove it. The typing left your fingers. The judgment did not leave your head. Every lever — the precise prompt, the committed memory, the split context, the effort dial, sub-agents, plans, hooks — is a way of pushing your judgment upstream so the machine can execute it downstream. The better your judgment, the more leverage you get. The vaguer it is, the faster you ship something dangerous. This is the same divide I keep returning to between AI-native and AI-assisted organizations: one drives the machine with judgment, the other just hopes.
The machine is the hands. You are the head. Keep it that way on purpose.
Reach out at pratik@geekyants.com or geekyants.com.