Pilots That Only Prove What You Already Believe
Government AI pilots are designed to succeed, not to learn. Here's why that's the wrong architecture and what a real test looks like.
Jason Walker
State CISO, Florida
There is a specific kind of simulator session that every pilot dreads. Not the engine fire at rotation or the hydraulic failure on approach. Those are hard, but they are scripted. The one that gets you is the scenario where the instructor builds normal conditions, lets you settle into the routine, and then introduces a failure mode you did not expect in a context you thought was safe. That is where you find out what you actually know versus what you have been performing.
I flew enough of those sessions to understand something about testing. A simulator is not useful because it is controlled. It is useful because it can be adversarial. Remove the adversarial pressure and you do not have a test. You have a rehearsal.
State governments are building rehearsals and calling them pilots.
When Michigan's legislature began weighing a formal AI pilot program with a governing board to set standards around privacy, bias, and responsible use, Michigan joined a long line of states doing roughly the same thing. Controlled experimentation. Sandboxed environments. Oversight structures standing ready to evaluate results. The framing sounds rigorous. It is not.
The problem is not the sandbox. Sandboxes are fine. The problem is the question the sandbox is designed to answer.
Most government AI pilots are designed to answer: "Is this tool safe enough to use?" That question produces a very specific kind of test. You select a low-risk use case. You monitor outputs for obvious failures. You survey employees. You document the process. You declare results. And because you chose a low-risk case and monitored it carefully, the answer almost always comes back: safe enough.
This is not science. It is confirmation. You built the conditions to support the conclusion you needed to reach to move forward, and then you moved forward.
The question a real pilot answers is different: "How does this fail, and under what conditions does the failure become catastrophic?" That question requires building adversarial pressure into the test design before the experiment starts. Not a governing board that convenes after the vendor has already scoped the use case. Not a review panel that evaluates outputs from scenarios the vendor helped design. Adversarial assumptions baked into the architecture from day one.
I deal with this constantly in cybersecurity. An agency deploys a new detection tool and runs it against known threat signatures for ninety days. Nothing fires. They declare success. What they actually proved is that the tool works against threats they already knew about, in an environment they sanitized for the test. The threat actor operating in the gaps of that environment was never part of the scenario.
Same failure mode. Different domain.
The governing board structure that shows up in most AI pilot frameworks makes this worse, not better. Boards that set standards after the use case is already selected and the vendor is already engaged are not oversight. They are ratification. The consequential decisions (what to test, what failure looks like, what data the tool touches, what the tool is allowed to get wrong) get made by the people running the pilot before the board ever meets. By the time the board reviews results, the experiment is already optimized for the answer it produced.
If you want a governing board that actually changes outcomes, put it upstream. The board's job is to define the failure conditions before the pilot launches. What does a bad outcome look like? Not "the tool produces biased output" in the abstract. Specifically: this tool is processing eligibility determinations, and a bad outcome is a false negative rate above X percent for applicants in demographic group Y. The board approves that failure definition. The pilot is then designed to stress-test against it.
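As a sketch of what that looks like in practice, with hypothetical names, groups, and thresholds throughout, a board-approved failure definition can be written down as a testable check rather than a narrative goal:

```python
from dataclasses import dataclass

@dataclass
class FailureCondition:
    """A board-approved failure definition: concrete, measurable, set before the pilot."""
    name: str
    group: str         # the population the check is scoped to
    metric: str        # what gets measured
    threshold: float   # the rate above which the pilot has failed

def false_negative_rate(decisions, ground_truth, group_mask):
    """Share of truly eligible applicants in the group that the tool denied."""
    eligible = [i for i, truly_eligible in enumerate(ground_truth)
                if truly_eligible and group_mask[i]]
    if not eligible:
        return 0.0
    misses = sum(1 for i in eligible if not decisions[i])
    return misses / len(eligible)

# Hypothetical example: the board fixes this before any vendor is engaged.
condition = FailureCondition(
    name="eligibility_false_negatives",
    group="applicants_over_65",   # placeholder group
    metric="false_negative_rate",
    threshold=0.02,               # placeholder threshold
)

def pilot_failed(decisions, ground_truth, group_mask, condition):
    return false_negative_rate(decisions, ground_truth, group_mask) > condition.threshold
```

The point of writing it this way is that the pilot inherits the check; the pilot team does not get to decide later what counted as failure.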
That is a completely different exercise than what most states are running.
There is also a structural problem with using pilots to evaluate generative AI specifically. Generative AI fails in ways that are stochastic, context-dependent, and invisible at the aggregate level. A tool that performs well across ten thousand transactions can still be producing confident, wrong outputs on a specific subset of inputs that correlates with a protected characteristic or a high-stakes decision. You will not find that failure by measuring average performance in a controlled environment. You find it by designing the pilot around the worst-case user and the worst-case input and the worst-case downstream consequence, and then measuring whether the tool fails those cases at acceptable rates.
Most pilots are not designed around worst cases. They are designed around representative cases, which means they are designed around the middle of the distribution, which means they are systematically blind to tail risk.
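To make the aggregate-blindness point concrete, here is a minimal illustration, with invented numbers that do not come from any real pilot, of how a tool can look excellent on average while failing badly on one specific slice of inputs:

```python
# Invented numbers for illustration only: 10,000 transactions, 2% aggregate error rate.
transactions = 10_000
overall_errors = 200

# Suppose 300 of those transactions involve a specific high-stakes subgroup,
# and 90 of the 200 errors fall on that subgroup.
subgroup_size = 300
subgroup_errors = 90

print(f"aggregate error rate: {overall_errors / transactions:.1%}")    # 2.0%
print(f"subgroup error rate:  {subgroup_errors / subgroup_size:.1%}")  # 30.0%
```

A pilot that only reports the first number graduates a tool that is wrong nearly a third of the time for the people who can least afford it.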
In state government, tail risk is the whole story. The average transaction is fine. The edge case is where someone loses a benefit, makes a wrong decision based on a hallucinated legal summary, or trusts an output that leads to a harm that never appears in the aggregate metrics. You cannot run a statistically valid pilot with enough adversarial edge cases in a ninety-day window across two agencies. The sample size does not support the conclusion. But governments write reports based on those pilots as if they do.
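The sample-size point is not hand-waving. A back-of-the-envelope calculation, using standard binomial reasoning rather than anything specific to a given state's pilot, shows how little a clean ninety-day run actually bounds:

```python
# If a pilot observes zero failures across n adversarial cases, the exact binomial
# 95% upper confidence bound on the true failure rate is 1 - 0.05**(1/n),
# which is roughly the "rule of three": about 3/n.
def upper_bound_failure_rate(n, confidence=0.95):
    return 1 - (1 - confidence) ** (1 / n)

for n in (50, 200, 1000):
    print(f"{n:>5} clean cases -> true failure rate could still be up to "
          f"{upper_bound_failure_rate(n):.2%}")
```

Two hundred clean cases with zero observed failures still leaves room for a failure rate around 1.5 percent, which is catastrophic if those failures cluster on benefit denials.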
I am not arguing against pilots. I am arguing against pilots that are designed to graduate to deployment on a predetermined schedule regardless of what they find, with a governing board that provides process legitimacy rather than substantive friction.
The version of this that actually works looks different. You start with the failure taxonomy, not the use case. You define what "this went wrong" means in concrete, measurable terms before you choose the vendor or scope the experiment. You build adversarial test cases, not just representative ones. You staff the governing board with people who are explicitly tasked with finding reasons to stop, not just approve. And you create an explicit mechanism for "we learned enough to know we are not ready," which gets treated as a successful outcome, not a program failure.
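One way to make "we learned enough to know we are not ready" a first-class outcome, sketched here with hypothetical structure rather than as a prescribed framework, is to hard-code it as one of the possible pilot verdicts so the report template literally cannot omit it:

```python
from enum import Enum

class PilotVerdict(Enum):
    READY = "ready to proceed under stated conditions"
    NOT_READY = "learned enough to know we are not ready"  # a successful outcome, not a failure
    INCONCLUSIVE = "test design never exercised the failure conditions"

def verdict(failure_conditions_tested, failure_conditions_breached):
    if not failure_conditions_tested:
        return PilotVerdict.INCONCLUSIVE  # the pilot never stressed its own failure taxonomy
    if failure_conditions_breached:
        return PilotVerdict.NOT_READY
    return PilotVerdict.READY
```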
That last piece is the hardest part in government. A pilot that concludes "not yet" is politically costly. Nobody got budget authority to run a program that comes back empty. So pilots get designed to succeed, and the failures get discovered in production, after the contracts are signed and the workflows are rebuilt.
I have sat in enough post-incident reviews to know where this goes. The simulator session that only confirms what you already believe is not preparation. It is theater.
Real testing is adversarial by design. Build the pressure in from the start, or do not call it a test.