ai · Local models · Leaf

Sometimes the model stays home.

Not every payments workload can send data to an API. For data residency, PCI scope and confidential client material, the right model is one that runs on your own infrastructure. Open weights, Ollama, on-prem — and the same gateway that routes the edge and frontier tiers keeps it portable.

Local models On-prem Data residency PCI Open weights

01 · What it is

The self-hosted tier of the model strategy.

The edge-default strategy assumes you can send a request to a model somewhere. Sometimes you cannot. Data residency rules, PCI scope and confidential client data can all make “send it to an API” a non-starter — regardless of how good the API model is.

Local and on-prem models are the answer to that constraint. You run open-weight models on infrastructure you control: in your own cloud tenancy, a private data centre, or on a regulated client’s estate. The data never leaves the boundary. The model is smaller than one served by a frontier API, but for the right workload that trade is exactly correct, and the same model gateway makes it a routing choice, not a separate build.

02 · How it works

Open weights, on your infrastructure.

Open-weight models — Llama, Mistral, Gemma-family and others — are downloaded and served locally. A runtime such as Ollama makes that a one-command affair for development; production serving uses the same open weights behind a hardened endpoint.

// Local serving — the model runs inside your boundary

  $ ollama pull llama3.1
  $ ollama run  llama3.1

  agent  →  model gateway  →  local endpoint (Ollama / on-prem)
                                  data never leaves the boundary

  // same gateway, same agent code — only the route changes.
  // residency & PCI policy decide when this route is used.
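As a minimal sketch of what a call to the local endpoint might look like from Python, the snippet below posts a prompt to Ollama's HTTP generate endpoint on its default local port. The helper names `build_request` and `generate` are hypothetical, and the sketch assumes an Ollama server is already running on the same machine with `llama3.1` pulled; a hardened production endpoint would sit behind the gateway and look different.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "llama3.1") -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3.1", timeout: float = 60.0) -> str:
    """Send the prompt to the local endpoint and return the generated text."""
    request = build_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    # With "stream": False, Ollama returns one JSON object whose
    # "response" field holds the full completion.
    return body["response"]
```

Calling `generate("Summarise this contract clause.")` sends the prompt only to `localhost`, so the text stays on the machine that runs the model.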

03 · When self-hosting beats an API

The constraints that flip the decision.

Data residency

When law or contract requires data to stay in-country or in-tenancy, a local model is the only compliant option — capability is secondary.

PCI scope

Sending cardholder or sensitive payment data to an external API pulls that API and its provider into PCI scope and adds risk. Keeping inference local keeps the boundary clean.

Confidential client data

Diligence and advisory work touches material that simply cannot go to a third-party endpoint. On-prem is the control that lets the work happen at all.

Air-gapped or regulated estates

Banks and regulated operators often run environments with no egress to public APIs. The model has to come to the data.

04 · What you trade

Self-hosting is not free.

Dimension	Local / on-prem	Frontier API
Data boundary	Stays inside your control	Leaves to the provider
Raw capability	Smaller open-weight models	Largest frontier models
Ops burden	You run the GPUs & the serving	Provider runs it
Cost shape	Capex / fixed infra	Per-token, elastic
Best for	Residency, PCI, confidential data	Deep reasoning, no residency limit
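The cost-shape row can be made concrete with a rough break-even calculation. This is only a sketch: the function name and the figures in the usage note are hypothetical, it ignores staffing, power and utilisation, and it applies only where no hard constraint has already decided the route.

```python
def break_even_tokens(fixed_monthly_cost: float, api_price_per_million: float) -> float:
    """Monthly token volume at which fixed infra cost equals per-token API spend.

    Both arguments must be in the same currency. Below the returned volume
    the elastic API is cheaper; above it the fixed infrastructure is.
    """
    if api_price_per_million <= 0:
        raise ValueError("api_price_per_million must be positive")
    return fixed_monthly_cost / api_price_per_million * 1_000_000
```

With illustrative numbers, a fixed cost of 3000 per month against an API price of 10 per million tokens gives `break_even_tokens(3000, 10)`, which is 300 million tokens a month.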

05 · The portability story

One gateway, every tier.

Local is just another route

Local models are not a separate system — they are one more route behind the same model gateway that serves the edge and frontier tiers. The agent code does not change; a residency-or-PCI policy decides when the local route is used. This is the portability principle of the 2nth-ai/agent-platform control plane made concrete: open weights running on-prem through to frontier APIs, never locked to one vendor, all behind one seam with one audit trail.
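A hedged sketch of how such a policy might pick a route is shown below. The `InferenceRequest` fields and the `choose_route` function are hypothetical names for illustration, not the 2nth-ai/agent-platform API; the real policy layer may be expressed quite differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceRequest:
    """Hypothetical request metadata the gateway could inspect."""
    residency_restricted: bool = False  # data must stay in-country or in-tenancy
    pci_data: bool = False              # cardholder or sensitive payment data
    confidential: bool = False          # confidential client material
    needs_deep_reasoning: bool = False  # task needs frontier-level reasoning


def choose_route(req: InferenceRequest) -> str:
    """Pick a tier: constraint first, capability second."""
    # Any hard constraint forces the local route, whatever the task needs.
    if req.residency_restricted or req.pci_data or req.confidential:
        return "local"
    # With no constraint, capability decides between the API tiers.
    if req.needs_deep_reasoning:
        return "frontier"
    return "edge"
```

The agent would pass the same request object on every call; only the returned route name changes, which is what keeps the agent code identical across tiers.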

06 · Where local models bite

On-prem is a control, not a free pass.

Self-hosting solves residency, but it introduces its own failure modes — and it does not change who is accountable. Watch these:

Smaller model, weaker reasoning

An open-weight model on-prem will not match a frontier API on nuanced regulatory reasoning. Do not let a residency requirement push a hard reasoning task onto a model too small for it — rescope the task instead.

You own the security now

A local endpoint is your attack surface, your patching and your access control. Residency compliance does not survive a badly-run GPU box.

Self-hosted is not automatically compliant

Running on-prem helps with residency, but PCI scope still depends on how the boundary is built and audited. Confirm it with a human; do not assume it.

Still never the accountable party

A local model signs off nothing, interprets no regulation and moves no money. The infrastructure changes; the human-in-the-loop rule does not.

07 · When to go local

Let the constraint decide.

Go local when a hard constraint — residency law, PCI scope, an air-gapped estate, or genuinely confidential client data — makes sending data to any API unacceptable. In that case the constraint, not the capability, picks the model, and on-prem open weights are the only correct answer.

Stay on the API tiers when no such constraint applies and the task needs reasoning depth: there, paying for a frontier model beats nursing a self-hosted box. The discipline is to decide by constraint first and capability second — and because everything routes through the gateway, you can change your mind without changing the agent.

08 · Connections

Where this sits in the tree.

Edge-default model strategy

The parent strategy — local models are its residency-and-PCI tier.

compliance

Compliance hub

PCI scope and data-protection leaves that drive the decision to keep inference local.

Research & regulatory-watch agent

The agent that runs across all tiers, local route included where residency demands it.

know.2nth.ai

agent-platform architecture

The partner-copyable control plane — Cerbos policy, Langfuse audit, model gateway — that these agents run on.

09 · Resources

Runtimes, weights and the platform.

Runtime	Ollama: run open models locally	ollama.com
Platform	Cloudflare Workers AI	developers.cloudflare.com/workers-ai
Architecture	know.2nth.ai: control-plane architecture	know.2nth.ai/explainers/tools/architecture
Standard	NIST AI Risk Management Framework (AI RMF 1.0)	nist.gov/itl/ai-risk-management-framework