How to build audit-ready AI systems for regulated workflows.

Audit-readiness is not a posture you assemble before an inspection. It is a property you engineer into the system: the ability to answer, for any past decision, one question. Why did it do that?

The auditor's question

Every regulated AI deployment we've shipped eventually faces the same question, asked by an internal auditor, an external auditor, a regulator, or, in the worst case, a court. The question is some variant of:

“On May 4th, at 14:22, your system told the operator to flag this transaction. Why?”

Your answer has to be reproducible. Not approximately. Not “we believe the model was responding to these features.” Reproducibly, deterministically: given the same inputs, and the same model, prompts, and corpus the system used at the time, it would produce the same decision today.

This is harder than it sounds, and it is the load-bearing requirement around which everything else in the system is designed.

Reproducibility is the bar

In conventional software, reproducibility comes free. You ran the function with these inputs. The function is deterministic. The output is what it was.

In AI systems, three forces fight reproducibility: model non-determinism (sampling temperatures), corpus drift (the retrieved documents change over time), and prompt drift (the prompts themselves evolve). To produce a reproducible decision, you have to pin all three.
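One way to make "pin all three" concrete is a check that rejects a configuration before it is allowed to serve decisions. The sketch below is illustrative only: the `PinnedConfig` shape, the `unpinnedReasons` helper, and the naming conventions it tests for (a dated model snapshot, a versioned template id) are assumptions, not the interface of any particular framework.

```typescript
// Hypothetical shape: the three things that must be pinned for a replay.
type PinnedConfig = {
  model: string;           // dated model snapshot, not a floating alias
  temperature: number;
  top_p: number;
  prompt_template: string; // versioned template id, e.g. "compliance.v3.2"
  corpus_snapshot: string; // content hash of the retrieval index
};

// Returns the reasons a config is not safely replayable; empty means pinned.
function unpinnedReasons(cfg: PinnedConfig): string[] {
  const reasons: string[] = [];
  // Assumed convention: a pinned model id ends in a date such as -2024-08-06.
  if (!/\d{4}-\d{2}-\d{2}$/.test(cfg.model)) {
    reasons.push(`model "${cfg.model}" is not a dated snapshot`);
  }
  if (cfg.temperature !== 0) reasons.push("temperature must be 0");
  if (cfg.top_p !== 1) reasons.push("top_p must be 1");
  // Assumed convention: template ids carry an explicit version suffix.
  if (!/\.v\d+(\.\d+)*$/.test(cfg.prompt_template)) {
    reasons.push(`prompt template "${cfg.prompt_template}" has no version`);
  }
  if (cfg.corpus_snapshot.length === 0) {
    reasons.push("corpus snapshot hash is missing");
  }
  return reasons;
}
```

A deployment gate could refuse to start when `unpinnedReasons` returns anything. Zero temperature narrows sampling variance, but some hosted models may still vary slightly between runs, which is one more reason to store every model response verbatim in the log.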

What to log, and how

We log every prompt, every model response, every tool call, every retrieval result, every approval action, and every input from every user. We log them to an append-only store with cryptographic chain-of-custody: each log entry references the hash of the previous entry.

// Each decision emits a structured trace.
type ISO8601 = string;         // ISO 8601 timestamp string

// One retrieved chunk, tied to the document version it came from.
type Candidate = {
  doc_id: string;
  version: string;
  chunk_id: string;
  hash: string;                // content hash of the chunk
  score: number;               // retrieval score
};

type AgentTrace = {
  trace_id: string;            // UUID for the decision
  ts: ISO8601;
  user: { id: string; role: string; clearance: string };
  pinned: {
    model: string;             // e.g. "gpt-4-2024-08-06"
    sampling: { temperature: 0; top_p: 1 };
    prompt_template: string;   // e.g. "compliance.v3.2"
    corpus_snapshot: string;   // content hash of the index used
  };
  retrieval: {
    query: string;
    candidates: Candidate[];
    used: string[];            // chunk_ids actually fed to the model
  };
  steps: Array<
    | { kind: 'tool';     name: string; input: unknown; output: unknown }
    | { kind: 'llm_call'; prompt_hash: string; response: string; tokens: number }
    | { kind: 'approval'; reviewer: string; decision: 'approve' | 'reject'; note?: string }
  >;
  decision: { outcome: string; confidence: number; evidence: string[] };
  prev_hash: string;           // chain-of-custody: hash of the previous entry
  hash: string;                // hash of this entry
};

The store is append-only. Entries are signed. Tampering breaks the chain, which the system detects on read. This is not over-engineering; it is the minimum architecture required for an answer to “why did it do that?” to survive cross-examination.
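The chaining and the detect-on-read check can be sketched in a few lines. This is a minimal illustration, not the store described above: `LogEntry`, `GENESIS`, `appendEntry`, and `firstBrokenIndex` are hypothetical names, the payload is assumed to be already serialized in a canonical form (so the same trace always yields the same bytes), and signing is left out. It uses Node's built-in `node:crypto` module for SHA-256.

```typescript
import { createHash } from "node:crypto";

type LogEntry = {
  payload: string;   // serialized trace body (assumed canonical)
  prev_hash: string; // hash of the previous entry, or GENESIS for the first
  hash: string;      // SHA-256 over prev_hash and payload
};

// Placeholder predecessor hash for the first entry (an assumption of this sketch).
const GENESIS = "0".repeat(64);

function entryHash(prevHash: string, payload: string): string {
  return createHash("sha256")
    .update(prevHash)
    .update("\n")
    .update(payload)
    .digest("hex");
}

// Appends by returning a new array; existing entries are never mutated.
function appendEntry(log: readonly LogEntry[], payload: string): LogEntry[] {
  const last = log[log.length - 1];
  const prev_hash = last ? last.hash : GENESIS;
  return [...log, { payload, prev_hash, hash: entryHash(prev_hash, payload) }];
}

// Returns the index of the first entry that breaks the chain, or -1 if intact.
function firstBrokenIndex(log: readonly LogEntry[]): number {
  let expectedPrev = GENESIS;
  let i = 0;
  for (const e of log) {
    const linked = e.prev_hash === expectedPrev;
    const selfConsistent = e.hash === entryHash(e.prev_hash, e.payload);
    if (!linked || !selfConsistent) return i;
    expectedPrev = e.hash;
    i++;
  }
  return -1;
}
```

Because each hash covers the previous hash, editing one entry invalidates that entry and forces every later one to be recomputed as well. A hash chain alone does not stop someone who can rewrite the whole log, which is why entries are also signed.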

Deterministic orchestration over auto-agents

We build agents as graphs, not as free-form auto-agents. A graph specifies the legal transitions: which steps can follow which, what conditions gate them, what state accumulates. The LLM makes decisions inside graph nodes, but the graph structure is code.

This matters for audit. With a graph, every execution trace fits a known shape. The auditor can verify that the system did not skip a required review step, did not bypass a guardrail, did not invoke a tool it should not have. With auto-agents, the trace is essentially a free-text story the model tells about itself.
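To illustrate what "every execution trace fits a known shape" can mean in practice, here is a small sketch of a graph expressed as data plus a checker that an auditor could run over a recorded path. The node names, the `EDGES` table, and `checkTrace` are invented for this example; a real workflow would have its own nodes and gating conditions.

```typescript
// Hypothetical node names for a transaction-review workflow.
type NodeId = "retrieve" | "analyze" | "human_review" | "decide";

// Legal transitions live in code; the model does not get to choose them.
const EDGES: Record<NodeId, readonly NodeId[]> = {
  retrieve: ["analyze"],
  analyze: ["human_review"],
  human_review: ["decide"],
  decide: [],
};

const START: NodeId = "retrieve";
const REQUIRED: readonly NodeId[] = ["human_review"];

// Returns a list of violations; an empty list means the path fits the graph.
function checkTrace(path: readonly NodeId[]): string[] {
  if (path.length === 0) return ["empty trace"];
  const problems: string[] = [];
  let prev: NodeId | null = null;
  for (const node of path) {
    if (prev === null) {
      if (node !== START) problems.push(`must start at ${START}`);
    } else if (!EDGES[prev].includes(node)) {
      problems.push(`illegal transition ${prev} -> ${node}`);
    }
    prev = node;
  }
  for (const required of REQUIRED) {
    if (!path.includes(required)) {
      problems.push(`required step ${required} was skipped`);
    }
  }
  return problems;
}
```

A path that jumps from `analyze` straight to `decide` yields two violations: the illegal transition and the skipped review. The same table can drive the orchestrator at run time, so the check and the execution cannot disagree about what is legal.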

Evidence chains

Every claim the agent makes must trace to evidence. We model this as a forward chain: a claim references the chunks that grounded it; the chunks reference the documents that contain them; the documents reference their source and version. The chain is machine-checkable and the auditor can walk it backwards from any output.

Claim → Chunks → Documents → Source · Version

Conceptual evidence chain: claim → chunks → documents → source registry.

The chain is also surfaced to operators inline. A reviewer reading the agent's output can click any claim and see the supporting evidence. The act of approval thereby covers both the decision and the evidence, and is itself logged.
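A machine check of the evidence chain can be as simple as resolving each link and failing on the first gap. The sketch below assumes in-memory lookup tables and hypothetical type names (`Claim`, `Chunk`, `Doc`, `Source`, `Registry`); it shows the shape of the walk rather than any specific storage layer.

```typescript
type Source = { id: string; name: string };
type Doc = { doc_id: string; source_id: string; version: string };
type Chunk = { chunk_id: string; doc_id: string };
type Claim = { text: string; chunk_ids: string[] };

// Hypothetical lookup tables standing in for the real stores.
type Registry = {
  chunks: Map<string, Chunk>;
  docs: Map<string, Doc>;
  sources: Map<string, Source>;
};

type WalkResult =
  | { ok: true; path: string[] }
  | { ok: false; reason: string };

// Walks claim -> chunks -> documents -> source and reports the first gap.
function walkEvidence(claim: Claim, reg: Registry): WalkResult {
  if (claim.chunk_ids.length === 0) {
    return { ok: false, reason: "claim cites no chunks" };
  }
  const path: string[] = [];
  for (const chunkId of claim.chunk_ids) {
    const chunk = reg.chunks.get(chunkId);
    if (!chunk) return { ok: false, reason: `unknown chunk ${chunkId}` };
    const doc = reg.docs.get(chunk.doc_id);
    if (!doc) return { ok: false, reason: `unknown document ${chunk.doc_id}` };
    const source = reg.sources.get(doc.source_id);
    if (!source) return { ok: false, reason: `unknown source ${doc.source_id}` };
    path.push(`${chunkId} -> ${doc.doc_id}@${doc.version} -> ${source.id}`);
  }
  return { ok: true, path };
}
```

Running this over every claim before an output reaches a reviewer means an ungrounded claim is caught as a failed check, not discovered later during an audit.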

The auditor contract

Before we deploy a regulated AI system, we negotiate an explicit contract with the auditor or compliance owner. The contract states:

What classes of decision the system makes autonomously.
What classes require human approval, by which role.
What classes are out of scope and must be refused.
What logging is preserved, for how long, in what form.
Which evaluation suite gates changes to the system, and who owns it.
How the auditor reproduces a past decision from the logs.

This contract is the spec the system is built against. It also informs the eval suite, the refusal patterns, the graph topology, and the data retention policy. Engineering backwards from the contract produces a system that is audit-ready by construction, not by retrofit.
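Part of such a contract can be written down as data that the system enforces at run time. The following sketch is an assumption about how that might look: the `AuditorContract` shape, the decision class names, the role, and the retention figure are all invented for illustration. The one design choice worth noting is that any decision class the contract does not name is refused.

```typescript
type Handling =
  | { mode: "autonomous" }
  | { mode: "approval"; role: string }
  | { mode: "refuse" };

// Hypothetical machine-readable slice of the auditor contract.
type AuditorContract = {
  classes: Map<string, Handling>;
  retention_days: number;
  eval_suite: { id: string; owner: string };
};

// Example values are invented for illustration only.
const exampleContract: AuditorContract = {
  classes: new Map<string, Handling>([
    ["flag_transaction", { mode: "autonomous" }],
    ["freeze_account", { mode: "approval", role: "compliance_officer" }],
    ["legal_advice", { mode: "refuse" }],
  ]),
  retention_days: 2555,
  eval_suite: { id: "compliance-evals", owner: "compliance" },
};

// Anything the contract does not name is treated as out of scope and refused.
function route(decisionClass: string, contract: AuditorContract): Handling {
  return contract.classes.get(decisionClass) ?? { mode: "refuse" };
}
```

Keeping the routing table in the contract, not scattered through prompts, means a change to what the system may do autonomously shows up as a reviewable diff.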

The systems that pass audits do not pass them because they were polished before the inspector arrived. They pass because they were designed, on day one, to make every past decision walkable from output back to input. That property does not retrofit. You either engineer for it from the start, or you rebuild later.
