Why AI Projects Need Evidence, Not Just Documentation

Documentation Alone Doesn't Work

Every serious software project eventually develops a set of written rules — a style guide, a deployment checklist, a list of "things we learned the hard way, so please don't do them again." These documents multiply over time, grow headers like "IMPORTANT" and "HARD RULE," and are read carefully on the first day.

Then the rules quietly get violated.

Not because the team is careless. Not because the rules are wrong. The rules get violated because the AI assistants that now write much of the code in many projects are not deterministic. They read the rules at the start of a session, they genuinely try to follow them, and then, under the load of actually solving the problem, they drift. They answer from what they remember instead of re-reading. They take the shortcut that looks faster. They rationalize past the rule when it seems to conflict with the task. Human teams do the same thing, but an AI assistant can do it on every run, at machine speed. A documented rule with no enforcement mechanism is a suggestion. A coded check behaves the same way every time; suggestions decay.

This isn't a moral failure. It's an architectural failure in the process itself. And it shows up in AI projects more than most, because AI systems are composed of many small decisions — each one individually reasonable, collectively catastrophic if even a small fraction skip the intended checks.

The fix isn't better-worded documentation. The fix is to replace prose enforcement with evidence.

The Three Tiers of Process Enforcement

Every project's quality controls fall into one of three tiers, whether or not the team has thought about it explicitly:

Tier 1: Deterministic Enforcement

The strongest tier. A tool, script, or automated gate blocks the bad outcome entirely. Code can't be merged if tests fail. A deploy can't proceed without a preflight marker. A form can't be submitted while a required field is empty. The actor cannot skip these checks, whether by choice or by accident.

Tier 2: Structural Gates

A well-designed workflow forces you to produce specific artifacts at specific points. A deploy checklist has required fields. A pull request template requires named reviewers. A design document has to exist before implementation begins. These gates can be bypassed if someone's determined to, but the friction is real — and the artifact itself becomes evidence that the gate was cleared.

Tier 3: Advisory

The weakest tier, and the most common. The "rule" exists only as prose — in a style guide, a CLAUDE.md file, a checklist nobody runs, a slide deck nobody re-reads. Compliance is entirely behavioral. Every time the rule applies, the actor has to remember it, interpret it, and choose to follow it.

Most projects accumulate a long list of Tier 3 rules and slowly discover that the list doesn't actually change behavior — because prose enforcement has a hard ceiling that more prose can't raise.

The principle: If a rule has been violated even once, a better-worded rule won't prevent the next violation. The rule needs to move up a tier — from prose to an artifact gate, or from an artifact gate to a deterministic check.

What "Evidence-Gated" Actually Means

An evidence-gated workflow turns each step of a process into a step that produces a concrete artifact — a file, a record, a signed approval — proving the step actually happened and was done correctly. The next step of the workflow cannot proceed until the evidence from the previous step exists.

Consider a simple example: deploying a new feature. A prose-based version of the rule might say "always test your deploys in staging first." An evidence-gated version might require:

A design note describing what you're about to build — saved as a file, not a thought
A staging-deploy record produced by the deployment tool, timestamped and attached to the version
A test-run record showing the key user flows pass in staging
Only then does the production-deploy step unlock

Each artifact is small. None of it requires heroic discipline to produce. But collectively, they create a chain of evidence that anyone can audit later — and they make shortcuts visibly expensive, because the absence of the artifact is its own loud signal.
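The deploy example above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the step names, the artifact filenames, and the idea of keeping everything in a single evidence folder are all assumptions made for the example.

```python
from pathlib import Path

# Hypothetical steps and artifact names, in the order they must be completed.
STEPS = [
    ("design", "design-note.md"),
    ("staging-deploy", "staging-deploy-record.json"),
    ("staging-tests", "staging-test-run.log"),
]


def first_incomplete_step(evidence_dir: Path):
    """Return the first step whose artifact is absent or empty, or None."""
    for step, artifact_name in STEPS:
        artifact = evidence_dir / artifact_name
        if not artifact.is_file() or artifact.stat().st_size == 0:
            return step
    return None


def production_deploy_unlocked(evidence_dir: Path) -> bool:
    """Production unlocks only when every earlier step has left evidence."""
    return first_incomplete_step(evidence_dir) is None
```

Because the check reports the first missing artifact rather than a bare yes or no, a skipped step is named explicitly instead of being silently absent.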

The subtle win here is the word "audit." Prose-based processes often look clean because there's no lasting record of what was actually done. An evidence-gated process produces a trail. A month later, someone can walk that trail and see exactly which steps happened and which didn't.

How Evidence Gates Actually Get Enforced

An evidence gate only works if something mechanically checks for the artifact before letting the next step proceed. That "something" is a hook — a small piece of code that runs at a defined point in a workflow and can block the next action if the expected evidence isn't there. Without hooks, evidence gates are just another piece of documentation; with hooks, they become actual enforcement.

If you're working in Claude Code, one of the leading AI coding assistants, the simplest enforcement layer is a plugin called hookify. Hookify rules are small, pattern-matching checks that run before tool calls. A rule might say "block any commit whose diff contains a specific sensitive phrase" or "reject edits to files under a protected directory." They're stateless and keyword-based, but the enforcement is real: when the rule fires, the action doesn't happen.

For richer checks — "does the required design document exist on disk?" "has this file actually been read during the current session?" "is the deploy preflight marker present and recent?" — you write custom code hooks. These are ordinary Python or shell scripts that run at the same interception points and can carry out arbitrary logic. A custom hook opens the plan folder, confirms the expected evidence file is present and valid, and refuses the next step if it isn't. This is what turns an evidence-gated workflow from a documented description into genuine enforcement: each gate becomes a deterministic check, not a rule someone has to remember.
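As a rough sketch of what such a custom hook might look like, the Python below gates deploy commands on two pieces of evidence. Several things here are assumptions to verify against your tool's hook documentation: that the hook receives the pending tool call as JSON on standard input, that the payload has "tool_name" and "tool_input" fields, and that exit code 2 means "block." The filenames, the plan folder, and the one-hour freshness window are hypothetical.

```python
import json
import sys
import time
from pathlib import Path

ALLOW = 0
BLOCK = 2  # assumed "block this action" exit code; confirm in your hook docs

# Hypothetical evidence requirements for this project.
DESIGN_DOC = "design-spec.md"
PREFLIGHT_MARKER = "preflight.ok"
MAX_MARKER_AGE_SECONDS = 60 * 60  # markers older than one hour count as stale


def check_gate(payload: dict, plan_dir: Path, now=None):
    """Return (exit_code, reason) for a pending tool call."""
    now = time.time() if now is None else now

    # Only gate shell commands that mention "deploy"; let everything else through.
    command = payload.get("tool_input", {}).get("command", "")
    if payload.get("tool_name") != "Bash" or "deploy" not in command:
        return ALLOW, "not a gated action"

    design = plan_dir / DESIGN_DOC
    if not design.is_file() or design.stat().st_size == 0:
        return BLOCK, f"missing evidence: {design}"

    marker = plan_dir / PREFLIGHT_MARKER
    if not marker.is_file():
        return BLOCK, f"missing evidence: {marker}"
    if now - marker.stat().st_mtime > MAX_MARKER_AGE_SECONDS:
        return BLOCK, f"stale evidence: {marker}"

    return ALLOW, "evidence present"


def run_hook(stdin_text: str, plan_dir: Path) -> int:
    """Parse the payload, explain any block on stderr, and return the exit code."""
    code, reason = check_gate(json.loads(stdin_text), plan_dir)
    if code != ALLOW:
        print(reason, file=sys.stderr)
    return code

# Wired up as a real hook script, the entry point would be roughly:
#   sys.exit(run_hook(sys.stdin.read(), Path("plans/current")))
```

Keeping the decision in a plain function that takes the payload and a folder makes the gate itself testable without running the assistant at all.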

Common Root Causes That Prose Rules Can't Fix

When you start cataloging why documented rules get violated in AI-assisted workflows, a short list of root causes explains most incidents. Each is behavioral — a consequence of LLMs being statistical rather than deterministic — and each is invisible to prose enforcement:

Memory Substitution

An AI assistant answers from what it remembers about a file — from earlier in the session, or from patterns it's seen in similar codebases — rather than reading the file fresh. This is how stale code gets referenced, how outdated assumptions about configuration carry forward, how "I recall this is set to X" becomes a fact nobody verifies. LLMs are especially prone to this because answering from existing context is cheaper than issuing another tool call to re-read the source.

The evidence fix: require every file the assistant relied on to be logged as part of the work product, along with its checksum or last-modified time, so a reviewer can confirm the file was read fresh rather than recalled.
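One way to produce that log is sketched below. It assumes a JSON Lines file as the read log; the function names and the log format are hypothetical choices for illustration, not part of any particular tool.

```python
import hashlib
import json
import time
from pathlib import Path


def log_file_read(path: Path, log_path: Path) -> dict:
    """Read a file fresh and append proof of the read to a JSON Lines log."""
    data = path.read_bytes()
    entry = {
        "file": str(path),
        "sha256": hashlib.sha256(data).hexdigest(),
        "mtime": path.stat().st_mtime,
        "logged_at": time.time(),
    }
    with log_path.open("a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return entry


def read_is_current(path: Path, log_path: Path) -> bool:
    """True if the latest logged checksum for `path` matches the file as it is now."""
    if not log_path.is_file():
        return False
    current = hashlib.sha256(path.read_bytes()).hexdigest()
    latest = None
    for line in log_path.read_text(encoding="utf-8").splitlines():
        entry = json.loads(line)
        if entry["file"] == str(path):
            latest = entry["sha256"]  # later entries overwrite earlier ones
    return latest == current
```

A gate can then refuse to accept work that cites a file unless `read_is_current` returns true for it, which catches both "never read" and "read before it changed."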

State Blindness

An AI assistant edits a local copy of a configuration, but the live production system has drifted from what's checked into the repo. Or it writes a value to a write-only secret store without saving a recoverable copy anywhere else. Or it generates code against a database schema that no longer matches what's actually deployed. The assistant has no way to know the live state differs from the repo unless it's specifically instructed to fetch and compare — and most prompts don't ask it to.

The evidence fix: force a "fetch live, compare, report drift" step at the start of any change, and produce an artifact recording what was found.
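The compare-and-report half of that step might look like the sketch below. How the live configuration is fetched depends entirely on your platform and is left out; the example assumes both configurations have already been loaded as flat dictionaries, and the report filename is up to you.

```python
import json
from pathlib import Path

ABSENT = "<absent>"  # placeholder shown in the report when a key exists on one side only


def find_drift(repo_config: dict, live_config: dict) -> dict:
    """Compare two flat config mappings and describe every difference."""
    drift = {}
    for key in sorted(set(repo_config) | set(live_config)):
        in_both = key in repo_config and key in live_config
        if in_both and repo_config[key] == live_config[key]:
            continue
        drift[key] = {
            "repo": repo_config.get(key, ABSENT),
            "live": live_config.get(key, ABSENT),
        }
    return drift


def write_drift_report(repo_config: dict, live_config: dict, report_path: Path) -> dict:
    """Write the report even when it is empty, so the check itself leaves evidence."""
    drift = find_drift(repo_config, live_config)
    report_path.write_text(json.dumps({"drift": drift}, indent=2), encoding="utf-8")
    return drift
```

Writing an empty report is deliberate: "we checked and found nothing" and "nobody checked" should not look the same a month later.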

Workflow Substitution

An LLM skips a documented step of the process — a rule from a CLAUDE.md file, a style guide, a session-start checklist — because skipping feels faster and the output still looks complete without it. Across many runs, this compounds into a pattern of small rationalizations each of which was individually "not a big deal" but collectively erodes the quality of the work. Prose-based rules are especially easy for an LLM to talk itself past: "this rule doesn't quite apply here," "this is a simple case," "I'll skip the checklist just this once."

The evidence fix: make the skipped step produce a required artifact, so the absence is visible.

Transitive Dependency Blindness

An AI assistant writes code that runs cleanly in its development context — because every package it needs happens to be installed locally — but the production deploy is missing one, so the service crashes on the first real request. The assistant has no visibility into what's actually installed at deploy time. The local green light was taken as a signal, when in reality the only signal that matters is whether the production image contains every transitive dependency the code touches.

The evidence fix: require an explicit environment-diff artifact before every deploy, comparing the packages installed in development against those in the production image.
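For a Python service, one simple form of that artifact is a diff of two pinned package listings. The sketch below assumes you have captured `pip freeze` output from both the development environment and the production image; how you capture it is up to you, and the function names are hypothetical.

```python
def parse_freeze(text: str) -> dict:
    """Parse 'name==version' lines (the format `pip freeze` emits) into a dict."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and entries that are not pinned with '=='.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        packages[name.strip().lower()] = version.strip()
    return packages


def environment_diff(dev_freeze: str, prod_freeze: str) -> dict:
    """Report packages that production lacks or pins to a different version."""
    dev, prod = parse_freeze(dev_freeze), parse_freeze(prod_freeze)
    return {
        "missing_in_prod": sorted(set(dev) - set(prod)),
        "version_mismatch": {
            name: {"dev": dev[name], "prod": prod[name]}
            for name in sorted(set(dev) & set(prod))
            if dev[name] != prod[name]
        },
    }
```

A deploy gate can then block whenever `missing_in_prod` is non-empty, and save the full diff alongside the deploy record either way.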

Prose Rule Decay

A rule that was written into a CLAUDE.md file, a skill description, or a session-start checklist gets genuinely loaded at the beginning of a session — and then never referenced again during the actual work. The LLM read it, briefly incorporated it into context, and then moved on to solving the task. Without an enforcement loop, the rule lives as documentation rather than a gate — and nothing reminds the assistant it's there at the moment it would have mattered.

The evidence fix: turn the rule into a gate that produces an artifact. Gates stay fresh; prose forgets.

Applying the Pattern Without Enterprise Tooling

Most small businesses don't have a budget for enterprise workflow tooling. The good news is that the principles above don't require enterprise tools — they require a few concrete practices that fit inside ordinary file folders, version control, and shared documents.

Use Files as Evidence

For each non-trivial project, create a folder. Each major step of the work produces a named file in that folder: a design note, a plan, a test-run log, a deploy record. The folder itself is the audit trail. You can grep it, share it, and archive it when the project is done.

Name the Artifact Up Front

Before starting a step, decide what artifact it will produce. "I'll do the research" is not an evidence-gated step. "I'll produce a research-summary.md listing the 3 vendors evaluated and the criteria used" is. The act of naming the artifact forces clarity about what "done" means.

Make the Next Step Gate on the Artifact

"I can't start the build until design-spec.md exists." "I can't mark the deploy complete until deploy-log.txt shows a healthy response." The gating is voluntary at first, but over time it becomes habit, and eventually you can automate it with a pre-commit hook or a simple CI check.

Review Evidence, Not Claims

When a project is reported as done, the review is to walk the evidence folder, confirm each expected artifact exists, and read each one. If an artifact is missing, the step isn't done, regardless of what anyone says.
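The mechanical part of that review is easy to script, which is also how the voluntary gates described earlier can later become a pre-commit hook or CI check. The sketch below is one possible shape: the artifact names come from the examples in this article, while the "healthy" marker string and the validation rules are assumptions you would replace with your own.

```python
from pathlib import Path

# Hypothetical success marker; use whatever your deploy tool prints when healthy.
HEALTHY_MARKER = "health check: 200"


def _non_empty(text: str) -> bool:
    return bool(text.strip())


def _last_line_healthy(text: str) -> bool:
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return bool(lines) and HEALTHY_MARKER in lines[-1]


# Each expected artifact is paired with a check on its contents.
EXPECTED = {
    "design-spec.md": _non_empty,
    "deploy-log.txt": _last_line_healthy,
}


def review_evidence(folder: Path) -> dict:
    """Return 'ok', 'missing', or 'invalid' for every expected artifact."""
    results = {}
    for name, is_valid in EXPECTED.items():
        artifact = folder / name
        if not artifact.is_file():
            results[name] = "missing"
        elif not is_valid(artifact.read_text(encoding="utf-8")):
            results[name] = "invalid"
        else:
            results[name] = "ok"
    return results


def exit_code(results: dict) -> int:
    """0 when everything is 'ok', 1 otherwise, as Git hooks and CI steps expect."""
    return 0 if all(status == "ok" for status in results.values()) else 1
```

A script like this only confirms that the evidence exists and passes basic checks. A person still needs to read the artifacts themselves.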

A Checklist: Moving From Prose to Evidence

Audit Your Current Process

List the rules your team or project relies on today
For each rule, identify which tier it's enforced at (Tier 1, 2, or 3)
Highlight every rule that's been violated even once — these are your first promotion candidates

Design the Gates

For each Tier 3 rule: what artifact would prove the rule was followed?
Is that artifact producible as a normal part of the work, or would it be busywork?
Can the next step be made to check for the artifact's existence automatically?

Implement One at a Time

Pick the most-violated rule first
Define the evidence artifact and where it lives
Update your process documentation to describe the gate
Run the process for a month — see if the rule still gets violated

Keep a Ledger of Violations

Every time a rule is violated, log it with the date, context, and root cause
Patterns will emerge — the rules that keep getting violated are the ones most worth promoting to higher tiers
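A ledger like this does not need special tooling. As a minimal sketch, assuming a JSON Lines file and hypothetical field names that mirror the list above, it could be kept and summarized like so:

```python
import json
from collections import Counter
from datetime import date
from pathlib import Path
from typing import Optional


def log_violation(ledger: Path, rule: str, context: str, root_cause: str,
                  when: Optional[str] = None) -> dict:
    """Append one violation record (date, rule, context, root cause) to the ledger."""
    entry = {
        "date": when or date.today().isoformat(),
        "rule": rule,
        "context": context,
        "root_cause": root_cause,
    }
    with ledger.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry


def most_violated(ledger: Path, top: int = 3) -> list:
    """Return (rule, count) pairs, most frequently violated first."""
    if not ledger.is_file():
        return []
    rules = [
        json.loads(line)["rule"]
        for line in ledger.read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]
    return Counter(rules).most_common(top)
```

The rules at the top of `most_violated` are the natural first candidates for promotion to a higher tier.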

Conclusion

Documentation is valuable — but it is not the same thing as enforcement. A team or system of any size that relies solely on prose to govern its quality will drift over time, not because people are lazy, but because prose has a ceiling that no amount of better wording can push through.

Evidence gates — small artifacts produced at each step of a process — are the practical fix. They don't require enterprise tooling, they don't require new hires, and they don't require perfect discipline. They just require a willingness to write down what "done" means and then check for the artifact before moving on.

For small businesses deploying AI and automation, this pattern is especially valuable. AI projects are inherently composed of many interacting pieces, and the cost of a missed check compounds fast. Moving even a few of your most-violated rules from prose to evidence will, in the long run, do more for quality than any amount of additional documentation.

Obtainium.ai builds custom AI automation for service-based small businesses. 30+ years in IT and IT security, CISSP and CAISS certified — we build systems that run in production, not demos that look good in a sales meeting. Based in Reno, NV, serving businesses nationwide.
