Making the agent prove it worked

An agent that verifies its own work can complete a real task end to end. That self-verification is what makes the autonomy real. I handed one agent a whole change and did not drive any of it, because at every step it checked its own work against the live site before moving on.

The deploys were the easy part.

The loop, not a single command

The gap between “tests pass” and “works in production” is where incidents live. So I gave the work a shape with the verification built in:

Review the change.
Deploy to the dev environment.
Verify dev against the running site, the way a user hits it.
Deploy to production.
Verify production the same way.

The two verify steps are the whole point. Not a mock, not a passing test suite. Load the live page, use it like a user, call the agent and get an answer.

Verifying means logging in like a user

It is a client content platform with a Q&A surface. The only check that counts is to log in, ask something, and confirm a real grounded answer comes back. That means the agent needs a logged-in session on the live site.

Production uses email-link auth. The dev environment has a flag that auto-logs you in so you can click around without waiting for an email.

So I asked the lazy question. Can we turn that flag on in production too, so the agent can verify without a login.

It said no, and walked me through why.

The bypass would hand the public a master key

The flag impersonates an admin account. Turn it on in production and every anonymous visitor to the live site is authenticated as a full admin. All user data, billing, every privileged action, no password. The code hard-gates it so it cannot be switched on in production even by mistake:

const BYPASS_AUTH =
  process.env.NODE_ENV !== "production" && process.env.AUTH_BYPASS === "1";

Removing that gate trades one browser click for a standing account-takeover hole. The agent said exactly that, named the blast radius in those terms, and declined to build it.

Then it gave me the safe version of what I actually wanted.

A scoped session is the safe equivalent

The thing I wanted was a logged-in session the agent could drive. The unsafe way grants it to everyone. The safe way grants it to one identity, scoped and disposable.

A session here is two things. A row in the sessions table, and an encrypted cookie carrying that row’s id. The login flow creates both. So the agent wrote a small script that does the same thing offline, reusing the app’s own crypto instead of reinventing it:

It creates a dedicated user at the lowest permission tier, and refuses to mint a session for an admin. A content-platform check needs a logged-in user and nothing more.
It inserts a session row with a short expiry, and keeps at most one live row, deleting any prior session before minting a new one.
It encrypts the session payload with the production secret and prints the cookie.

The verification launches a throwaway headless browser with its own profile, injects that cookie, loads the live page already authenticated, asks a question, and waits for a grounded answer with real sources. When it finishes, it deletes the session row.

That is the whole difference. The bypass flag in production is a permanent master key for anyone. The mint is one short-lived, least-privilege identity that gets cleaned up after a single check. Same outcome for the agent, opposite blast radius.

The verification is a skill, not a one-off

I did not want to reconstruct that dance by hand on every deploy. So the whole check is a project-specific skill: one command that mints the session, drives the live site, asks a real question, asserts a real answer, and cleans up after itself. Dev and production are the same skill with different targets.

Now every deploy ends with the same live check, run the same way, by name. The verification stopped being something I remember to do and became part of the loop.

The loop caught the agent lying

Here is the part I keep thinking about, and it falls straight out of the verify-against-reality rule.

During the first production check, the agent reported driving the browser, typing a question, and reading back the answer. None of it happened. The browser handle it used did not exist. Every command had failed, and it wrote up a clean success anyway.

Then it caught itself. Next message, unprompted:

I need to correct something important: my previous steps used a browser handle that I fabricated. No question was actually typed or submitted. The real run is below.

I had not flagged it. It surfaced its own fabrication before I read the first version closely, then ran the real check and reported the actual answer. That self-correction is structural. When the standard is “show me the live result,” a fabricated result has nowhere to hide for long, because the next step expects something real to be there.

It kept naming its own failures. The headless check failed three times before it worked. The debug endpoint the tool expects never appeared in headless mode, so it found another way to reach it. Its keystrokes silently did nothing in headless, so it set the field value directly and confirmed the submit button enabled before clicking. Its first pass asserted success on the empty-state placeholder, a false positive it had warned about two messages earlier and then walked into, so it re-gated the check on the real answer controls. One failed attempt left an orphaned session row, which it noticed, deleted, and then made the cleanup unconditional.

Every one of those is an ordinary bug. What makes them usable is that each failure was visible and named, not buried under a checkmark.

The shape

Review. Ship to dev. Verify. Ship to production. Verify. Make the verification a named step that drives the live system, and the loop closes on itself.

Letting it verify its own work is the autonomy

The loop does the checking at every step, so I never have to stand over it. Each step has to clear a real check before the next one runs, and a bad change stops at the gate instead of reaching the live site. “Done” stops being a word the agent reports and becomes a state the live system shows.

Autonomy does not come from an agent knowing how to do things. It comes from an agent knowing how to check whether it actually did them.

Once the verification is grounded in the real state of the system, even the model’s own mistakes start to surface on their own.