    Technical Report

    How to Audit AI Systems in Production

    By Veratrace Research · Research Team
    February 3, 2026 | 6 min read | 1,158 words

    Auditing AI systems requires different approaches than traditional IT audits. AI systems are dynamic, probabilistic, and opaque in ways that challenge conventional audit methods. Here is what works.

    01 Auditing AI Is Different

    A national health insurer discovered this distinction during a state examination. Their internal audit team had included AI systems in their standard IT audit program. They verified access controls, change management, system availability—the usual controls. When the state insurance commissioner's examiner asked about algorithmic discrimination testing, the audit director pointed to their processing integrity controls. The examiner clarified: "Those controls verify that the system produces consistent outputs. I'm asking whether you've tested whether consistent outputs systematically disadvantage protected groups." The company had never audited for bias. Their IT controls couldn't answer the regulator's question. The examination became significantly more intensive than planned.

    This gap between traditional IT audit and AI audit is common—and increasingly costly.

    AI systems in production need audit, but traditional audit methods don't transfer directly. AI systems have characteristics that require adapted approaches: probabilistic outputs that vary even with identical inputs, learned behavior that can't be traced to written specifications, emergent bias that develops through data patterns, drift over time as the world changes, and complex interactions that are difficult to fully characterize.

    Standard IT audits verify that systems operate according to specifications. AI systems learn their behavior, creating new audit challenges.

    02 The AI Audit Framework

    Inventory and Classification

    What to verify: Does the organization have a complete inventory of AI systems? Are systems appropriately classified by risk?

    How to verify: Request the AI system inventory. Compare against other sources (contracts, infrastructure records, development backlogs). Verify classification methodology. Sample systems and validate classification.

    Common gaps: Missing systems, especially those embedded in vendor products. Inconsistent classification criteria. No process for updating inventory as systems change.
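The comparison against other record sources can be made concrete as a set reconciliation. This is a minimal sketch; the system names and record sources below are hypothetical illustrations, not a real schema.

```python
# Sketch: reconcile the declared AI inventory against other record sources.
# All system and source names are hypothetical.

declared_inventory = {"claims-triage-model", "fraud-scoring-model"}

# Systems referenced in contracts, infrastructure records, and dev backlogs
other_sources = {
    "contracts": {"claims-triage-model", "vendor-chatbot"},
    "infrastructure": {"claims-triage-model", "fraud-scoring-model", "doc-summarizer"},
    "backlog": {"fraud-scoring-model"},
}

referenced = set().union(*other_sources.values())

missing_from_inventory = referenced - declared_inventory   # likely audit findings
unreferenced = declared_inventory - referenced             # possibly stale entries

print(sorted(missing_from_inventory))  # → ['doc-summarizer', 'vendor-chatbot']
```

Systems that appear only in vendor contracts (like the hypothetical `vendor-chatbot` above) are exactly the embedded-AI gap the paragraph describes.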

    Governance Structure

    What to verify: Does governance exist with clear roles, processes, and authority?

    How to verify: Review governance documentation. Interview governance participants. Observe governance in action. Verify that governance decisions are implemented.

    Common gaps: Governance on paper but not in practice. Unclear accountability. Governance that doesn't cover all AI systems.

    Development and Validation

    What to verify: Were AI systems developed with appropriate rigor? Was validation sufficient for risk level?

    How to verify: Review development documentation. Examine validation methodology and results. Verify that validation addressed relevant risks. Compare development practices to stated policies.

    Common gaps: Undocumented development decisions. Insufficient validation for high-risk systems. Missing records of model selection rationale.

    Data Governance

    What to verify: Is training data appropriately sourced and documented? Are data quality controls in place? Is data provenance traceable?

    How to verify: Review data sourcing documentation. Examine data quality metrics and monitoring. Trace sample data from source to model. Verify consent and authorization for data use.

    Common gaps: Undocumented data sources. Missing data quality validation. Inability to trace data provenance.
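Tracing sample data from source to model presumes lineage records that can be walked backwards. A minimal sketch, assuming a hypothetical dataset-to-parents mapping (not a standard lineage format):

```python
# Sketch: trace a training dataset back to its source datasets.
# The lineage structure and dataset names are assumptions for illustration.

lineage = {
    "train-v3": ["cleaned-v3"],
    "cleaned-v3": ["raw-claims-2025", "raw-demographics-2025"],
    "raw-claims-2025": [],          # terminal node: documented source
    "raw-demographics-2025": [],    # terminal node: documented source
}

def trace_sources(dataset: str) -> set:
    """Walk lineage back to terminal (source) datasets; raise on gaps."""
    parents = lineage.get(dataset)
    if parents is None:
        raise KeyError(f"no lineage record for {dataset!r}")  # provenance gap
    if not parents:
        return {dataset}
    sources = set()
    for parent in parents:
        sources |= trace_sources(parent)
    return sources

print(trace_sources("train-v3"))
```

A dataset whose walk raises a `KeyError` is one whose provenance cannot be traced, which is the common gap noted above.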

    Deployment Controls

    What to verify: Are deployment processes controlled? Is configuration documented? Are access controls appropriate?

    How to verify: Review deployment procedures and approval records. Examine configuration management. Test access controls against policy. Verify separation of duties.

    Common gaps: Uncontrolled deployment paths. Undocumented configuration changes. Excessive access privileges.
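Testing access controls against policy can be sketched as a diff between documented role entitlements and actual grants. The role names and privileges below are hypothetical:

```python
# Sketch: compare actual access grants to documented role policy.
# Roles, users, and privilege names are illustrative assumptions.

policy = {
    "ml-engineer": {"deploy-staging"},
    "release-manager": {"deploy-staging", "deploy-production"},
}

actual_grants = {
    "dana": ("ml-engineer", {"deploy-staging", "deploy-production"}),  # excess
    "eli": ("release-manager", {"deploy-staging", "deploy-production"}),
}

# Privileges a user holds beyond what their documented role allows
findings = {
    user: grants - policy[role]
    for user, (role, grants) in actual_grants.items()
    if grants - policy[role]
}
print(findings)  # → {'dana': {'deploy-production'}}
```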

    Operational Monitoring

    What to verify: Is AI system behavior monitored? Are anomalies detected and investigated? Is performance tracked against expectations?

    How to verify: Review monitoring dashboards and alert configurations. Examine sample alert investigations. Compare actual performance to documented expectations. Verify escalation procedures.

    Common gaps: Insufficient monitoring coverage. Unaddressed alerts. Missing performance baselines.
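Comparing actual performance to documented expectations often involves a drift statistic. One common choice (an assumption here, not something the article prescribes) is the population stability index over binned score distributions:

```python
import math

def population_stability_index(expected, actual):
    """PSI between two binned distributions (proportions summing to 1).
    Assumes no empty bins; real implementations add smoothing."""
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))

baseline = [0.25, 0.25, 0.25, 0.25]   # score distribution at validation time
current  = [0.10, 0.20, 0.30, 0.40]   # score distribution in production

psi = population_stability_index(baseline, current)

# A common rule of thumb: PSI > 0.25 indicates significant drift
print(round(psi, 3))  # → 0.228
```

The 0.25 threshold is a widely used convention, not a universal standard; the point is that the baseline and threshold must be documented so alerts are auditable.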

    Decision Logging and Audit Trails

    What to verify: Are AI decisions logged comprehensively? Are logs immutable? Are retention requirements met?

    How to verify: Review logging configurations. Attempt log retrieval for sample decisions. Verify log integrity controls. Compare retention to requirements.

    Common gaps: Incomplete logging. Mutable log storage. Premature log deletion.
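Log integrity controls are often built on hash chaining, where each record commits to its predecessor. A minimal sketch of the idea (the record schema is hypothetical):

```python
import hashlib
import json

def chain_hash(prev_hash: str, record: dict) -> str:
    """Link a log record to its predecessor so tampering is detectable."""
    payload = prev_hash + json.dumps(record, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

records = [
    {"decision_id": 1, "output": "approve"},
    {"decision_id": 2, "output": "deny"},
]

# Build the chain from a fixed genesis value
hashes = ["0" * 64]
for rec in records:
    hashes.append(chain_hash(hashes[-1], rec))

def verify(records, hashes) -> bool:
    """Recompute every link and compare against the stored chain."""
    return all(
        chain_hash(hashes[i], rec) == hashes[i + 1]
        for i, rec in enumerate(records)
    )

print(verify(records, hashes))   # → True

records[1]["output"] = "approve"  # any alteration breaks the chain
print(verify(records, hashes))   # → False
```

An auditor's log retrieval test then becomes: pull sample decisions, recompute the chain, and confirm it still verifies.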

    Human Oversight

    What to verify: Is human oversight proportionate to risk? Are oversight actions documented? Are overrides captured and analyzed?

    How to verify: Review oversight procedures. Examine sample oversight decisions. Verify documentation of human review. Analyze override patterns.

    Common gaps: Rubber-stamp oversight. Undocumented review decisions. Ignored override patterns.
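Override pattern analysis can start from per-reviewer override rates. A sketch, with hypothetical reviewer records; both extremes are worth flagging, since a reviewer who never overrides may be rubber-stamping and one who constantly overrides may signal a failing model:

```python
from collections import Counter

# (reviewer_id, overrode_ai) — hypothetical oversight records
reviews = [
    ("alice", True), ("alice", False), ("alice", True),
    ("bob", False), ("bob", False), ("bob", False), ("bob", False),
]

totals = Counter(reviewer for reviewer, _ in reviews)
overrides = Counter(reviewer for reviewer, overrode in reviews if overrode)

rates = {reviewer: overrides[reviewer] / totals[reviewer] for reviewer in totals}

# Flag both extremes for follow-up interviews
flagged = [r for r, rate in rates.items() if rate == 0.0 or rate > 0.5]
print(rates, flagged)
```

The thresholds here (0.0 and 0.5) are illustrative; in practice they would be set against documented expectations for the system's risk level.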

    Bias and Fairness

    What to verify: Is bias monitored for protected characteristics? Are fairness metrics appropriate? Is remediation effective?

    How to verify: Review bias monitoring methodology. Examine fairness metrics and thresholds. Analyze outcomes by protected groups. Verify remediation actions.

    Common gaps: Incomplete protected class coverage. Inappropriate fairness metrics. Unaddressed bias findings.
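Analyzing outcomes by protected group often uses the disparate impact ratio, with the four-fifths (80%) rule as one common threshold. A sketch with illustrative counts (the group names and numbers are hypothetical, and the 80% rule is one convention among several fairness metrics):

```python
# Sketch: four-fifths rule check on approval outcomes by group.
# Counts are illustrative, not real data.

outcomes = {
    "group_a": {"approved": 80, "total": 100},
    "group_b": {"approved": 55, "total": 100},
}

rates = {g: v["approved"] / v["total"] for g, v in outcomes.items()}
reference = max(rates.values())

# Disparate impact ratio: each group's rate vs. the highest-rate group
ratios = {g: rate / reference for g, rate in rates.items()}
violations = [g for g, ratio in ratios.items() if ratio < 0.8]
print(ratios, violations)  # group_b falls below the 0.8 threshold
```

Whether this metric is *appropriate* for a given system is itself an audit question; the sketch only shows that the computation is simple enough to verify independently.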

    Incident Response

    What to verify: Are AI-specific incident procedures in place? Are incidents documented and analyzed? Is remediation tracked?

    How to verify: Review incident response procedures. Examine sample incident records. Verify root cause analysis. Track remediation completion.

    Common gaps: Generic procedures that miss AI-specific issues. Incomplete incident documentation. Untracked remediation.

    03 Audit Methods for AI Systems

    Document Review

    Examine policies, procedures, and records. For AI systems, this includes AI governance policies and standards, model development documentation, validation and testing records, deployment approvals, operational procedures, and incident records.

    Testing

    Verify that documented controls work as described. For AI systems, this includes input validation testing, output monitoring verification, access control testing, log retrieval testing, and alert response testing.

    Sampling

    Examine specific instances to verify general compliance. For AI systems, this includes sample decisions for audit trail completeness, sample models for documentation compliance, sample incidents for procedure compliance, and sample data sources for governance compliance.
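Sampling decisions for audit trail completeness can be sketched as drawing a reproducible random sample of decision IDs and checking each against the log store. The ID ranges below are hypothetical:

```python
import random

# Hypothetical: decision IDs issued in the audit period vs. IDs found in logs
issued_ids = set(range(1, 1001))
logged_ids = issued_ids - {17, 404, 873}   # three decisions missing a trail

rng = random.Random(42)  # fixed seed so the sample can be reproduced in workpapers
sample = rng.sample(sorted(issued_ids), k=50)

missing = [d for d in sample if d not in logged_ids]
print(f"{len(missing)} of {len(sample)} sampled decisions lack an audit trail")
```

Seeding the sampler matters for audit work: a reviewer can re-draw the identical sample and confirm the result.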

    Observation

    Watch processes as they occur. For AI systems, this includes observing oversight processes, watching incident response, monitoring operational procedures, and observing deployment processes.

    Inquiry

    Interview personnel to understand processes and identify gaps. For AI systems, this includes interviewing data scientists about development practices, interviewing operators about monitoring practices, interviewing oversight personnel about review processes, and interviewing leadership about governance understanding.

    04 Audit Evidence for AI Systems

    What Constitutes Good Evidence

    Evidence should be complete (capturing all relevant information), accurate (reflecting actual events correctly), timely (recorded contemporaneously with events), immutable (cannot be altered after creation), accessible (can be retrieved efficiently), and understandable (interpretable by auditors).

    Evidence Challenges Specific to AI

    Model behavior is difficult to capture completely. Probabilistic outputs vary even when the system is working as intended, making anomalies harder to define. Drift occurs gradually and may not trigger discrete events. System complexity may exceed auditor understanding.

    Addressing Evidence Challenges

    Capture decision context, not just outputs. Establish expected variation ranges and document departures. Implement drift detection with documented thresholds. Provide auditor training on AI-specific concepts.
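Establishing expected variation ranges and documenting departures can be as simple as a tolerance band around a validation-time baseline. A sketch, with hypothetical scores and a three-standard-deviation band (the band width is an illustrative choice, not a prescribed one):

```python
import statistics

# Baseline performance scores established during validation (hypothetical)
baseline_scores = [0.72, 0.74, 0.71, 0.73, 0.75, 0.72, 0.74]
mean = statistics.mean(baseline_scores)
stdev = statistics.stdev(baseline_scores)

# Documented tolerance band: mean ± 3 sample standard deviations
low, high = mean - 3 * stdev, mean + 3 * stdev

def departs(score: float) -> bool:
    """True when an observed score falls outside the documented band."""
    return not (low <= score <= high)

print(departs(0.73), departs(0.55))  # → False True
```

The point is not the statistics but the documentation: once the band is written down, a departure is an auditable event rather than a judgment call.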

    05 How Platforms Like Veratrace Support Audits

    AI governance platforms provide infrastructure that makes audits more effective: comprehensive audit trails that satisfy evidence requirements, system inventories that demonstrate coverage, approval workflows that create accountability records, monitoring dashboards that demonstrate operational oversight, and reporting that supports audit sampling and analysis.

    The goal is to make audit evidence a byproduct of normal operations rather than a special effort.

    06 Conclusion

    Auditing AI systems requires adapted approaches that address their unique characteristics. Organizations that build auditability into their AI systems from the start will find audits less disruptive and more effective.

    The key is treating audit readiness as an operational requirement, not an afterthought. AI systems should generate audit evidence continuously, making audits a matter of analysis rather than reconstruction.

    Cite this work

    Veratrace Research. "How to Audit AI Systems in Production." Veratrace Blog, February 3, 2026. https://veratrace.ai/blog/how-to-audit-ai-systems-production


    Veratrace Research

    Research Team

    Contributing to research on verifiable AI systems, hybrid workforce governance, and operational transparency standards.
