Control Is Easy to Claim. Proof Is Hard to Produce

A control surface nobody can verify is just a diagram. What proof actually looks like for an AI agent — and why "we have guardrails" is not evidence.

Jul 04, 2026

The next AI-agent race is not about autonomy. It is about control — and control is not one thing. It has at least four surfaces: what makes the agent start (trigger), what it is allowed to touch (permission), whether a tired human can see what it did (evidence), and what happens when it is wrong (recovery).

I believe that.

But writing it down exposes a second problem, and it is the more uncomfortable one.

You can describe all four surfaces in a slide, ship the agent, and still have no idea whether any of it holds.

A control surface you cannot verify is not control. It is a diagram.

The gap between claiming control and demonstrating it

Every serious agent vendor now has the vocabulary. Approval gates. Audit logs. Sandboxing. Role-based access. Human in the loop.

The words are correct. That is exactly the danger.

The words describe an intention. They do not describe a fact. “The agent cannot send money without approval” is a claim about the system. Whether it is true right now, on this deployment, with this integration, after this week’s config change, is a different sentence entirely.

Most teams never write the second sentence.

They design the permission surface once, in a doc, at the beginning. Then the agent gains a new tool. A prompt template changes. A webhook gets wired to a partner. A model version rolls forward. Each change is small. None of them re-checks the boundary.

So the boundary drifts, quietly, in the exact place where nobody is looking — because everyone already believes the slide.

The wrong question is: did we design controls?

The more useful question is: can we produce evidence, on demand, that the controls still do what the slide says?

Prompt injection is the attack that collapses your own surfaces

There is one attack that makes this concrete, and it is not exotic. The OWASP Top 10 for LLM Applications lists it first, as LLM01: Prompt Injection.

The mechanic is simple and unglamorous. An agent reads content — a web page, an email, a support ticket, a file, a tool’s response. Buried in that content is text addressed to the agent, telling it to do something the user never asked for. Forward the mailbox. Change the setting. Approve the payment. Exfiltrate the record.

If the agent treats that text as an instruction instead of as data, something specific happens to the four surfaces:

The attacker now owns your trigger surface. The agent started acting because a stranger’s sentence said so.

The attacker now spends your permission surface. Every tool you granted for convenience is now available for someone else’s intent.

Your evidence surface, if you have one, is the only reason you will ever find out.

And your recovery surface decides whether “find out” happens before or after the money leaves.

Prompt injection is not really a model bug. It is the moment your own control surfaces get pointed at a goal you did not set. That is why “the model is well-aligned” is not an answer. The question was never whether the model is nice. The question is whether the boundary holds when the input is hostile.
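One way to make "instruction versus data" concrete is to record where each piece of text came from, and to refuse any consequential tool call that traces back to content the agent merely read. The sketch below is a minimal illustration under stated assumptions: the `Source` labels, the `ToolCall` shape, and the `CONSEQUENTIAL_TOOLS` set are hypothetical names, not part of any particular agent framework, and real provenance tracking is much harder than tagging one field.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    USER = "user"        # typed by the person the agent works for
    FETCHED = "fetched"  # web page, email, ticket, file, tool response


# Hypothetical set of tools whose effects are expensive or irreversible.
CONSEQUENTIAL_TOOLS = {"send_payment", "forward_mailbox", "change_setting"}


@dataclass(frozen=True)
class ToolCall:
    tool: str
    derived_from: Source  # provenance of the text that led to this call


def allow(call: ToolCall) -> bool:
    """Return True only if the call is within bounds.

    Consequential tools may be triggered by the user alone. Text that
    arrived as fetched content is treated as data and cannot start them.
    """
    if call.tool in CONSEQUENTIAL_TOOLS:
        return call.derived_from is Source.USER
    return True


if __name__ == "__main__":
    print(allow(ToolCall("send_payment", Source.USER)))     # user asked for it
    print(allow(ToolCall("send_payment", Source.FETCHED)))  # a stranger's sentence
```

The gate does not care whether the model was persuaded. It asks only whether the boundary holds when the input is hostile, and that is a question with a yes or no answer.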

“We have guardrails” is not evidence

Here is the skeptic’s fair objection: fine, but every team says they handle this.

They do say it. Ask to see the proof and the room goes quiet.

Not because they are lying. Because the proof does not exist in a form anyone can hand over. It lives as a belief in an engineer’s head, or as a test that ran green once in a branch that has since moved on, or as a log that only the incident team can read after it is already too late.

A demo can show that the agent refuses one obvious bad instruction. Production has to survive the ones nobody scripted. Demos reward surprise. Trust rewards the boring thing: a check you can re-run tomorrow and get the same answer.

So what would actual evidence look like? I think it has four properties, and they map directly onto the evidence surface.

Reproducible. Anyone can run the same check and get the same verdict. Not a story about a test. The test.

Binary at the point of consequence. Pass or fail, stated plainly, about a specific boundary — “a forged signature is rejected,” “this instruction embedded in fetched content does not reach a tool call.” Not a risk score a committee interprets.

Tamper-evident. The result is signed and timestamped, so a green result cannot be quietly edited into existence after the fact. If the evidence can be faked as easily as the claim, it is not evidence.

Legible to a tired human. The person holding responsibility at 2 a.m. — or the auditor six months later — can read it without a backend engineer translating. Enough of a trail to say this is what I checked, this is the verdict, this is why the action was within bounds.
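The four properties above can be sketched in a few lines. This is an illustration, not the implementation of any tool mentioned in this essay: `verify_webhook` is a stand-in for whatever receiver you are actually testing, the record fields are hypothetical, and the HMAC used to sign the record is a simplification. Anyone holding the signing key could forge a record, so a real setup would more likely use an asymmetric signature and a timestamp from a party the operator does not control.

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone


def verify_webhook(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Stand-in for the receiver under test: accept only a valid HMAC-SHA256."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)


def check_forged_signature_rejected(secret: bytes) -> bool:
    """Binary check: a signature made with the wrong key must be rejected."""
    body = b'{"action": "send_payment"}'
    forged = hmac.new(b"attacker-key", body, hashlib.sha256).hexdigest()
    return verify_webhook(secret, body, forged) is False


def make_evidence(boundary: str, passed: bool, signing_key: bytes) -> dict:
    """Build a human-readable record and sign it so edits are detectable."""
    record = {
        "boundary": boundary,
        "verdict": "PASS" if passed else "FAIL",
        "checked_at": datetime.now(timezone.utc).isoformat(),
    }
    payload = json.dumps(record, sort_keys=True).encode()
    record["mac"] = hmac.new(signing_key, payload, hashlib.sha256).hexdigest()
    return record


def evidence_is_intact(record: dict, signing_key: bytes) -> bool:
    """Recompute the MAC over every field except the MAC itself."""
    unsigned = {k: v for k, v in record.items() if k != "mac"}
    payload = json.dumps(unsigned, sort_keys=True).encode()
    expected = hmac.new(signing_key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record.get("mac", ""))


if __name__ == "__main__":
    passed = check_forged_signature_rejected(b"real-webhook-secret")
    record = make_evidence("a forged signature is rejected", passed, b"evidence-key")
    print(json.dumps(record, indent=2))
```

The check is reproducible because it takes no hidden inputs, binary because it returns a boolean about one named boundary, tamper-evident because changing the verdict breaks the MAC, and legible because the record is three plain fields a person can read without a translator.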

Notice what is not on that list: exposing the model’s internal reasoning, logging every token, or proving the agent is safe in general. That last one matters, so let me be honest about it.

The honest limit

Verifying one boundary is not proof that the agent is safe.

You can prove that a specific webhook rejects a forged signature and still ship an agent that leaks data three tools over. Proof of a boundary is proof of *that* boundary. Anyone who sells you “verified, therefore safe” is selling the same over-claim in a new outfit.

But that is not an argument against evidence. It is an argument against pretending evidence is a certificate of virtue.

The useful framing is smaller and more repeatable: pick the boundaries where being wrong is expensive or irreversible — the ones on your recovery surface that you cannot undo — and make each of them produce evidence on demand. Then do it again when the config changes. Control is not a state you reach. It is a thing you keep re-proving, cheaply enough that you actually will.
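"Do it again when the config changes" can itself be made checkable. One possible approach, sketched here under assumptions, is to fingerprint the agent's configuration when a check runs and treat the evidence as stale the moment the fingerprint no longer matches. The `config_sha256` field and the shape of the config dictionary are hypothetical, and a real deployment would need to decide what counts as config (tools, prompt templates, webhooks, model version).

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Stable SHA-256 over the config; key order does not affect the hash."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


def evidence_is_stale(evidence: dict, current_config: dict) -> bool:
    """Evidence only covers the config it was produced against."""
    return evidence.get("config_sha256") != config_fingerprint(current_config)


if __name__ == "__main__":
    config = {"model": "v1", "tools": ["search"]}
    evidence = {"verdict": "PASS", "config_sha256": config_fingerprint(config)}

    # The agent gains a new tool: a small change that nobody re-checked.
    drifted = dict(config, tools=["search", "send_payment"])

    print(evidence_is_stale(evidence, config))
    print(evidence_is_stale(evidence, drifted))
```

A stale result is not a failure. It is a prompt to re-run the check, which is the cheap habit the paragraph above argues for.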

That is the difference between a slide and a system. The slide is written once. The system is checked on Tuesday, and next Tuesday, by someone who does not have to believe you.

The real product question

Control becomes real only when proof becomes cheap.

A great agent product will not just claim it has guardrails. It will hand you the evidence, in a form you can re-run, that cannot be quietly faked, that a tired human can read. It will make you slower on purpose at exactly the boundaries where being fast and wrong is unrecoverable.

That is less cinematic than autonomy. It will not trend.

Good. Boring, verifiable, and repeatable is what “trust” actually decomposes into once you stop using the word as a feeling.

The market will keep selling control as a claim, because a claim fits on a slide.

Proof is harder. Proof has to survive being handed to someone who does not already believe you.

That is the whole job.

A practical footnote. I have been writing the longer, verifiable version of the argument above, and building the tools to go with it.

The full written treatment is a free book, AI Agent Security: A Field Manual, on Leanpub. It is organized around exactly the chain this essay circles: every threat maps to a control, expressed as a normative requirement, tied to a pass/fail verification test. Traceability is the point — the boring, checkable version of “control.”

*If you want to run one of those checks yourself, `agent-webhook-check` is a free, single-file, stdlib-only tool that verifies one narrow boundary — whether your agent’s webhook actually rejects a forged signature — and prints a plain pass/fail with the one-line fix. The fuller kit that runs the checks across the action boundary and produces the signed, timestamped evidence described here is the one paid piece. The explanation is free; the execution across the whole surface and the signed trail are what I charge for.*

- Free book: leanpub.com/agent-security

- Free tool: github.com/AOI-Future/agent-webhook-check

This is the first piece in the Security track of AOIFUTURE Dispatch. If you want the next one — one boundary, one check, one verdict at a time — you are already in the right place.

Sources

- OWASP Top 10 for LLM Applications — LLM01: Prompt Injection

- Five Eyes intelligence alliance warning on new AI models and cyber risk
