Local LLMs and SOC 2 evidence: talking to auditors

Auditors are starting to ask about AI use in SOC 2 cycles. The shops running local LLMs have a different story to tell than the shops running cloud, and the auditors mostly haven't internalized the difference. Worth being explicit about what evidence actually answers the questions.


The 2025 SOC 2 audit cycle is the first one where, from every public account I’ve followed and every practitioner I’ve talked to about it, auditors are coming in with prepared questions about AI use. Some of those questions are well-formed. A lot of them are imported from the cloud-vendor SaaS audit framework and don’t quite fit the local-LLM shape. The shops running serious local-LLM workloads have a different evidence story than the shops on hosted; the auditors mostly haven’t internalized the difference yet; the gap produces friction unless the audited side comes prepared.

Worth being explicit about what auditors are actually asking, what local-LLM evidence answers their underlying control objectives, and the artifact list that makes the audit go smoothly rather than turning into a quarter of back-and-forth.

What auditors are asking, in 2025

The question categories that have shown up in the programs I’ve watched practitioners navigate and the public reporting I’ve read:

Data access scope. “What data is going to your AI system?” Auditors want to understand whether sensitive data (customer data, employee data, financial data) is being exposed to AI workloads, and what controls govern that exposure.

Vendor concentration. “Which AI vendors are you dependent on?” This question imports cleanly from the cloud-vendor framework. For shops on hosted models, it’s straightforward. For shops on local LLMs, the answer is “we run our own infrastructure,” which the auditor’s framework doesn’t have a clean place to record.

Output integrity. “How do you ensure the AI’s output is reliable when it informs decisions?” Auditors want to see evidence that the AI isn’t producing decisions the org is then committing to without human verification.

Incident handling. “What happens when the AI does something wrong?” The auditor wants the playbook.

Access control to the AI system itself. "Who can use the AI? Who can configure it? Who can see its outputs?" Standard access-control questions, applied to a system the auditor isn't fully familiar with.

Audit trail completeness. “Show me the log of what the AI did over the past three months.” Common SOC 2 evidence request applied to a new system class.

These questions are mostly well-formed. The friction is in how the answers map to the auditor’s evidence framework.

Where the local-LLM story is stronger

For most of these question categories, the local-LLM answer is structurally stronger than the hosted-LLM answer:

Data access scope. With a local LLM, the data never leaves your network. The blast radius for “sensitive data exposed to AI” is bounded by your own infrastructure. The auditor sees a clean perimeter; the trust boundary is one you control.

Vendor concentration. A shop running local LLMs has substantially less vendor dependency than a shop on hosted. The model came from somewhere (Hugging Face, Meta, DeepSeek, etc.) but the runtime foundation is yours. The “what happens if the vendor disappears” question has a better answer.

Output integrity. Local-LLM workflows tend to have explicit human-in-the-loop checkpoints because the operators built them that way. The evidence of human review at decision points is concrete.

Audit trail. Logs are generated by your infrastructure, retained on your infrastructure, queryable by your tooling. No “we asked the vendor for the logs and they took two weeks” friction.

Access control. Standard org-wide IAM applies. The AI system isn’t a separate vendor SaaS with its own identity model; it’s a service in your stack with the same IAM your other services have.

These are real advantages. The challenge is that the auditor’s evidence framework is calibrated for the hosted case and the audited org has to translate.

Where the local-LLM story is weaker

Worth being honest about the cases where local LLMs make the audit harder:

Model provenance. When the auditor asks "where did this model come from and what's in it," the answer for hosted is "from the vendor under their certifications and policies." For local, the answer is "we downloaded the weights from Hugging Face," which the auditor reasonably wonders about. Provenance documentation is the gap to close: what model, what version, what license, when it was downloaded, whether the signature was verified.

Model behavior verification. The auditor’s framework for “does the AI do what it’s supposed to” assumes a vendor’s stated capabilities. For local, you’re stating your own capabilities and the auditor wants evidence. Evaluation suites and their results become important documentation.

Patch management. When a new version of the model comes out (with security improvements, bias fixes, capability updates), the org needs to have a process for evaluating and adopting it. The auditor wants to see this. Local-LLM shops often don’t have an obvious patch-cadence story; building one is part of the audit-readiness work.

Capacity and reliability evidence. Hosted vendors publish SLAs. Local infrastructure publishes whatever you measure. The auditor wants uptime evidence; you need to be measuring it.
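The measurement doesn't have to be elaborate. A minimal sketch of a scheduled availability probe, assuming a local health endpoint and an append-only JSONL log (the URL and path are placeholders, not part of any particular stack):

```python
# uptime_probe.py -- illustrative availability probe for a locally served model.
# Run it on a schedule (cron / systemd timer). The endpoint and log path are
# placeholders; the point is that uptime evidence exists because you recorded it.
import json
import time
from datetime import datetime, timezone
from urllib import error, request

ENDPOINT = "http://localhost:8000/health"   # hypothetical health check on the model server
LOG_PATH = "/var/log/ai-uptime.jsonl"       # hypothetical append-only probe log

def probe() -> dict:
    start = time.monotonic()
    try:
        with request.urlopen(ENDPOINT, timeout=5) as resp:
            up = resp.status == 200
    except (error.URLError, TimeoutError):
        up = False
    return {
        "ts": datetime.now(timezone.utc).isoformat(),
        "up": up,
        "latency_ms": round((time.monotonic() - start) * 1000),
    }

if __name__ == "__main__":
    # One JSON line per probe; quarterly uptime is the fraction of lines with "up": true.
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(probe()) + "\n")
```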

These are addressable. They require specific work that hosted-LLM shops don’t have to do.

The artifact list that makes audits go smoothly

Concrete documents and evidence to have ready for a SOC 2 audit covering local-LLM use:

An AI use inventory. Which AI systems exist, which models they run, who owns them, what data they access, what controls apply. One page per system; updated quarterly.

An access-control matrix. Who can use which AI system at which scope. Integrated with the broader IAM evidence the audit already collects.

Data flow diagrams. Where data goes when it touches the AI. Includes the scoped indexing patterns showing what’s deliberately included vs excluded.
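Where the scoping lives in configuration rather than only in a diagram, the configuration itself is good evidence. A minimal sketch of an include/exclude check for an indexer, with hypothetical path patterns:

```python
# index_scope.py -- illustrative scoping check for what an AI indexer may read.
# The patterns below are hypothetical examples, not a recommendation.
from fnmatch import fnmatch

INCLUDE = ["docs/*", "wiki/*", "runbooks/*"]   # deliberately in scope
EXCLUDE = ["*/payroll/*", "*.pem", "hr/*"]     # deliberately out of scope

def in_scope(path: str) -> bool:
    """A path is indexable only if it matches an include pattern and no exclude pattern."""
    if any(fnmatch(path, pattern) for pattern in EXCLUDE):
        return False
    return any(fnmatch(path, pattern) for pattern in INCLUDE)

assert in_scope("docs/onboarding/laptop-setup.md")
assert not in_scope("hr/payroll/2025-03.csv")
```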

Model provenance records. For each model in production: source, version, hash/signature, license, download date, evaluation results. A small JSON file per model is sufficient.
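A minimal sketch of generating that record, assuming the weights sit in a local directory; the field names, paths, and file extension are illustrative, not a standard:

```python
# provenance.py -- illustrative per-model provenance record for audit evidence.
# Names, paths, and fields are hypothetical; adapt to your own inventory.
import hashlib
import json
from datetime import date
from pathlib import Path

def sha256_of(path: Path) -> str:
    # Stream the file so multi-gigabyte weight files don't need to fit in memory.
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def provenance_record(model_dir: Path) -> dict:
    return {
        "name": "example-model",                     # hypothetical model name
        "source": "https://huggingface.co/...",      # where the weights came from
        "version": "v1.0",                           # upstream revision or tag
        "license": "apache-2.0",
        "downloaded": date.today().isoformat(),
        "weights_sha256": {p.name: sha256_of(p)
                           for p in sorted(model_dir.glob("*.safetensors"))},
        "evaluation": "evals/2025-q1-results.json",  # pointer to the latest eval run
    }

if __name__ == "__main__":
    record = provenance_record(Path("/models/example-model"))
    Path("provenance.json").write_text(json.dumps(record, indent=2))
```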

Evaluation suite results. Periodic runs of an evaluation suite against the production model. Demonstrates that you’re verifying behavior rather than assuming it. Quarterly cadence is reasonable.
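A minimal sketch of what a run could look like, assuming the model sits behind a local OpenAI-compatible chat endpoint; the URL, cases, and pass criteria are placeholders, and a real suite would be larger and task-specific:

```python
# eval_run.py -- illustrative quarterly evaluation run against a locally served model.
# Assumes an OpenAI-compatible chat endpoint (llama.cpp / vLLM style servers expose one);
# the endpoint, cases, and substring checks below are placeholders, not a real benchmark.
import json
from datetime import date
from urllib import request

ENDPOINT = "http://localhost:8000/v1/chat/completions"   # hypothetical local server

CASES = [
    {"prompt": "What is 17 + 25? Answer with just the number.", "must_contain": "42"},
    {"prompt": "Summarize in one sentence: revenue was up 4% quarter over quarter.",
     "must_contain": "4%"},
]

def ask(prompt: str) -> str:
    body = json.dumps({"model": "local", "messages": [{"role": "user", "content": prompt}]})
    req = request.Request(ENDPOINT, data=body.encode(),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

results = [{"prompt": c["prompt"], "passed": c["must_contain"] in ask(c["prompt"])}
           for c in CASES]
report = {"date": date.today().isoformat(),
          "passed": sum(r["passed"] for r in results),
          "total": len(CASES),
          "cases": results}
# Retain the report alongside the provenance record; that pair is the evidence.
print(json.dumps(report, indent=2))
```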

Audit logs. Conversation-level audit trails covering the relevant retention period. Queryable. The auditor will sample.
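"Queryable" and "the auditor will sample" can be as simple as pulling a random slice of records for a date range. A minimal sketch, assuming one JSON line per conversation turn in an append-only log with an illustrative schema:

```python
# audit_sample.py -- illustrative sampling of AI interaction logs for an auditor.
# Assumes an append-only JSONL file with one record per conversation turn; the
# field names (timestamp, user, model, data_scope) are hypothetical.
import json
import random
from datetime import datetime
from pathlib import Path

def records_between(log_path: Path, start: datetime, end: datetime) -> list[dict]:
    selected = []
    with log_path.open() as f:
        for line in f:
            record = json.loads(line)
            ts = datetime.fromisoformat(record["timestamp"])
            if start <= ts <= end:
                selected.append(record)
    return selected

if __name__ == "__main__":
    window = records_between(
        Path("/var/log/ai-gateway/conversations.jsonl"),   # hypothetical log location
        start=datetime(2025, 1, 1),
        end=datetime(2025, 3, 31),
    )
    # Auditors sample rather than read everything; hand over a random slice plus the totals.
    sample = random.sample(window, k=min(25, len(window)))
    print(json.dumps({"total_in_window": len(window), "sample": sample}, indent=2))
```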

Incident records. Any AI-related incidents with their root causes, remediation, and follow-up. Including the case where nothing went wrong but the team did a tabletop exercise.

Patch / version management policy. When a new model version is released, what’s the process for evaluating and adopting it. Documented even if the cadence is “annually.”

Vendor management for the model sources. Treating Hugging Face / Meta / DeepSeek as upstream vendors of the model artifacts, with the same vendor-management discipline applied to other open-source upstreams.

Governance framework. The lightweight version of AI governance: the policy doc, the approved-tools list, the request process for the long tail.

That’s the artifact list. None of it is exotic; all of it is gettable from infrastructure and process you should have anyway. The audit-readiness work is mostly making the existing controls visible to the auditor’s framework.

The conversation that helps the auditor

When the auditor asks an awkwardly-shaped question (importing a cloud-vendor question that doesn’t quite fit), the conversation that helps is the one that translates from their framework to your reality:

  • “We’re not using a hosted vendor for this workload; we’re running it on infrastructure we own. The vendor-concentration question therefore maps to model-provenance and infrastructure-availability questions. Here’s how we evidence both.”
  • “The data doesn’t leave our network. The data-handling question therefore maps to internal IAM and data-classification controls. Here’s the evidence for those.”
  • “The model output is verified by humans before it informs decisions. Here’s the evidence of human review, the rate at which AI suggestions are accepted vs rejected, and the spot-check process.”

The auditor’s job is to evaluate controls against objectives. When you make the mapping explicit between their framework and your reality, the evaluation goes smoothly. When you don’t, the evaluation gets stuck on the framework mismatch.

The bigger frame

SOC 2 with AI is going to be a normal part of audit cycles going forward. The shops doing it well are the ones who treat AI infrastructure with the same control discipline as their other infrastructure. The shops doing it badly are the ones who treat AI as exotic and outside the existing audit framework; they end up with friction every cycle.

Local-LLM shops have the structural advantage on most of the questions. They also have specific gaps to close (provenance, evaluation, patch cadence) that don't apply to hosted shops. The closing-the-gaps work is bounded; doing it once and then maintaining the artifacts is easier than the year-after-year audit friction the alternative produces.

For shops about to go through their first SOC 2 audit covering AI use: the artifact list above is the homework. Doing it before the auditor asks is the difference between a smooth audit and a stressful one. The auditors are still calibrating their AI-specific framework; the audited shops can shape that calibration by showing up prepared with the right evidence in the right shape.

Worth treating this as the long-run discipline it is rather than the one-off scramble it usually becomes.