

Four Hallucinations in One Session: Grep-Verify Is the Cheap Counter

When parallel sub-agents touch many similar files, plausible-sounding but wrong quotes are the dominant failure mode. The cheapest counter is a discipline rule, not a model upgrade.

Jason Walker


Five hours into an audit of a long-form document, I caught four separate sub-agent hallucinations in a single session. Not subtle drift. Specific, on-topic, plausible-sounding claims about file contents that were just wrong. Each one would have produced a real correction in a real document if I had taken the agent's word for it. Each one died inside thirty seconds when I grepped the source.

The pattern matters because it generalizes. If you are running parallel LLM agents on any non-trivial codebase, document set, citation list, or knowledge graph, this is what failure looks like. The agents will not throw errors. They will not say "I am uncertain." They will return confident, structured, well-formatted reports containing fabricated specifics that pass face validity and fail mechanical verification.

The conventional framing of LLM hallucination is that the model invents facts that do not exist. That happens, but it is not the most dangerous case. The dangerous case is the model returning facts that exist somewhere nearby. The right paper but the wrong year. The right line number but the wrong quoted text. The right author but the wrong co-authors. The right concept but the wrong file. These hallucinations are dangerous because they cluster in the gap between what is true and what feels true to the agent, and that gap is exactly the space humans tend to skim past.

Here is what the four catches looked like in this session.

The first claim said one chapter in the document set read "more than 30" for a population count while another chapter read "35." Both should have matched. The fix would have been to update one chapter to match the other. I grepped the cited line numbers. Both chapters said 35. The "more than 30" string did not exist anywhere in the file. The agent had inferred a plausible-sounding inconsistency from nothing.

The second claim said one chapter used an abbreviation "Vuln" while every other chapter used the single letter "V" for the same concept. The fix would have been to standardize across the document. I grepped the chapter. Every occurrence used "V." There was no "Vuln" abbreviation in the chapter at all.

The third claim said the reference list of one chapter contained a different author list for a paper than another chapter. The fix would have been to harmonize the entries. I grepped both reference lists. They matched. The agent had confused the canonical chapter file with a supporting draft memo that did contain a different entry, and reported the memo's content as if it were the chapter's content.

The fourth claim came from a different agent doing a different job: building a reference list from in-text citations. The agent produced an entry that mixed the author list from a 2023 working-paper version of a paper with the URL from a 2024 preprint version of the same paper. Two real publications, one fabricated mashup. The year did not match the authors; the authors did not match the venue.

In each case, the verification step was identical. Take the specific claim the agent made (line number, quoted text, named entity, dated event). Grep the source. Read the actual line. The agents had been working for five to fifteen minutes each on these reports. The grep cost about ten seconds per claim. The cost ratio of catching the fabrication versus propagating it into a real document is at least ten to one in favor of catching, and that ignores the much larger downstream cost of finding the mistake later through some other route.
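
The check is small enough to script. Here is a minimal sketch in Python; the function name, the file path, and the line number in the example call are illustrative assumptions, not any real report format. The window around the cited line exists because, as noted below, agents tend to be within a few lines of correct rather than exact.

```python
from pathlib import Path

def verify_quote(path: str, quoted: str, line_no: int | None = None) -> str:
    """Check a sub-agent's literal quote against the actual bytes of a file."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    if line_no is not None:
        # Agents tend to be within a few lines of correct rather than exact,
        # so check a small window around the cited line, not just the line.
        window = lines[max(0, line_no - 4): line_no + 3]
        if any(quoted in line for line in window):
            return "CONFIRMED"
    if any(quoted in line for line in lines):
        return "FOUND_ELSEWHERE"  # real string, wrong location: still suspect
    return "NOT_FOUND"            # the quoted text is not in the file at all

# Hypothetical call: the first catch, checked against its cited chapter.
print(verify_quote("chapters/ch02.md", "more than 30", line_no=118))
```

FOUND_ELSEWHERE is deliberately distinct from CONFIRMED: a real string at the wrong location is exactly the nearby-fact failure mode described above, and it deserves the same suspicion as a string that does not exist.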

The reason agents fabricate this way is structural. A sub-agent reading many similar files (chapters with parallel structure, modules with similar APIs, log files with rotating timestamps) builds a partial mental index of what is in those files. When it composes a report, it pulls from that index. The index is approximately right but not exactly right. The first-author surname is in the right neighborhood but the actual canonical file has a slightly different surname. The line numbers are within a few of correct but not exact. The quoted text captures the gist but reorders words. None of these errors trip an internal consistency check because each one is locally plausible.

You cannot prompt-engineer this away. You can ask agents to "verify before reporting" and they will sincerely promise to do so and continue to fabricate. The model that confabulates the report is the same model that confabulates the verification. Internal verification is not adversarial enough to catch its own hallucinations.

Mechanical external verification is. Grep is adversarial in exactly the right way. Grep does not care what the agent meant. Grep does not interpret. Grep returns the literal bytes of the file or returns nothing. If the agent claims a specific quoted string exists at a specific location and grep does not find that string, the claim is wrong. The proof is binary. The verification is cheap.
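
The same binary property is available straight from grep's exit code. A minimal sketch, assuming POSIX grep is on the system path:

```python
import subprocess

def grep_confirms(quoted: str, path: str) -> bool:
    # -F treats the pattern as a fixed string, no regex interpretation;
    # -q suppresses output, so the exit code is the entire answer:
    # 0 means the literal bytes exist in the file, nonzero means they do not.
    result = subprocess.run(["grep", "-F", "-q", quoted, path])
    return result.returncode == 0
```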

The discipline rule that follows is simple: treat every sub-agent factual claim about file content as a hypothesis until you grep the source and confirm it. Do not delegate verification to another agent. Do not skip the check because the report "looks fine." Do not assume the agent must have checked. The agent has not checked. The agent will not check. You check.

This is more work than just reading the agent's report. It is less work than fixing the consequences of acting on a fabricated finding. Over a long session with five lenses returning eighteen findings between them, this discipline took about fifteen minutes total and caught four real errors that would have shipped if I had trusted the reports at face value. The math is not close.

If you are building multi-agent workflows for high-stakes work, build the grep discipline into the loop. Make it part of the synthesis step, not optional. Mark every agent claim with a verification status before applying any edit. The cost is minutes per session. The savings is hours per drift incident. The pattern works because it does not depend on the agents being better. It depends on the verifier being mechanical.
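
A sketch of what building it into the synthesis step can look like, reusing verify_quote from the earlier sketch; the Finding shape and its field names are assumptions about your report format, not any framework's API.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    file: str            # file the sub-agent made a claim about
    quoted: str          # the literal string the agent says is there
    line_no: int | None  # the line the agent cites, if any
    proposed_edit: str   # the correction the agent wants applied
    status: str = "UNVERIFIED"

def synthesis_gate(findings: list[Finding]) -> list[Finding]:
    """Stamp every claim with a verification status; pass only confirmed ones."""
    for f in findings:
        f.status = verify_quote(f.file, f.quoted, f.line_no)
    # Everything not CONFIRMED goes back for human review, never into an edit.
    return [f for f in findings if f.status == "CONFIRMED"]
```

Nothing in the gate depends on the agents getting better. The only requirement is that every claim arrives with a file, a literal quote, and ideally a line number, so the mechanical verifier has something to check.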

Trust is for things that have earned it. Sub-agent reports about file content have not earned it. Grep has.
