Not every payments workload can send data to an API. For data residency, PCI scope and confidential client material, the right model is one that runs on your own infrastructure: open weights, served by a runtime such as Ollama, on-prem. The same gateway that routes the edge and frontier tiers keeps it portable.
The edge-default strategy assumes you can send a request to a model somewhere. Sometimes you cannot. Data residency rules, PCI scope and confidential client data can all make “send it to an API” a non-starter — regardless of how good the API model is.
Local & on-prem models are the answer to that constraint. You run open-weight models on infrastructure you control: in your own cloud tenancy, a private data centre, or a regulated client’s estate. The data never leaves the boundary. The model is smaller than a frontier API model, but for the right workload that is the correct trade-off, and the same model gateway makes it a routing choice, not a separate build.
Open-weight models — Llama, Mistral, Gemma-family and others — are downloaded and served locally. A runtime such as Ollama makes that a one-command affair for development; production serving uses the same open weights behind a hardened endpoint.
    // Local serving: the model runs inside your boundary
    $ ollama pull llama3.1
    $ ollama run llama3.1

    agent → model gateway → local endpoint (Ollama / on-prem)
                            data never leaves the boundary

    // same gateway, same agent code; only the route changes.
    // residency & PCI policy decide when this route is used.
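Once the model has been pulled, an application inside the boundary can reach the local endpoint over HTTP. The following is a minimal sketch, assuming Ollama's default listen address `http://localhost:11434` and its `/api/generate` route with streaming turned off. `build_generate_request` and `ask_local_model` are hypothetical helper names, not part of any gateway described here, and a production deployment would point at the hardened endpoint rather than the development runtime.

```python
import json
import urllib.request

# Assumption: Ollama's default local address; change this for your own deployment.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt: str, model: str = "llama3.1") -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_local_model(prompt: str, model: str = "llama3.1", timeout: float = 60.0) -> str:
    """Send the prompt to the local endpoint and return the generated text."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    # With "stream": False the server replies with one JSON object
    # whose "response" field holds the generated text.
    return body["response"]
```

Calling `ask_local_model("...")` needs a running Ollama server with `llama3.1` already pulled. The request targets `localhost`, so the prompt stays on the machine that serves the model.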
When law or contract requires data to stay in-country or in-tenancy, a local model is the only compliant option — capability is secondary.
Sending cardholder or sensitive payment data to an external API brings that API and its provider into PCI scope and adds risk. Keeping inference local keeps the boundary clean.
Diligence and advisory work touches material that simply cannot go to a third-party endpoint. On-prem is the control that lets the work happen at all.
Banks and regulated operators often run environments with no egress to public APIs. The model has to come to the data.
| Dimension | Local / on-prem | Frontier API |
|---|---|---|
| Data boundary | Stays inside your control | Leaves to the provider |
| Raw capability | Smaller open-weight models | Largest frontier models |
| Ops burden | You run the GPUs & the serving | Provider runs it |
| Cost shape | Capex / fixed infra | Per-token, elastic |
| Best for | Residency, PCI, confidential data | Deep reasoning, no residency limit |
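The cost-shape row can be made concrete with a break-even calculation. This is a rough sketch, assuming the local tier is modelled as one fixed monthly figure (amortised hardware plus operations) and the API tier as a flat per-token price. Both inputs are placeholders you would supply, and the function name is hypothetical. It ignores utilisation limits and staffing, and it only matters when no residency or PCI constraint has already decided the route.

```python
def break_even_tokens_per_month(
    fixed_monthly_cost: float,
    api_price_per_million_tokens: float,
) -> float:
    """Monthly token volume at which fixed local cost equals per-token API spend.

    Below this volume the elastic API is cheaper; above it the fixed
    infrastructure is cheaper, under the simplifying assumptions above.
    """
    if fixed_monthly_cost < 0:
        raise ValueError("fixed_monthly_cost must be non-negative")
    if api_price_per_million_tokens <= 0:
        raise ValueError("api_price_per_million_tokens must be positive")
    return fixed_monthly_cost / api_price_per_million_tokens * 1_000_000
```

Both arguments must be in the same currency. The result is a token count per month, to be compared against the workload's expected volume.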
Local models are not a separate system — they are one more route behind the same model gateway that serves the edge and frontier tiers. The agent code does not change; a residency-or-PCI policy decides when the local route is used. This is the portability principle of the 2nth-ai/agent-platform control plane made concrete: open weights running on-prem through to frontier APIs, never locked to one vendor, all behind one seam with one audit trail.
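The routing idea can be sketched as a small policy function that sits in front of the three tiers. This is an illustration only, not the 2nth-ai/agent-platform API. `Route`, `RequestContext`, `choose_route` and `route_with_audit` are hypothetical names, and the boolean flags stand in for whatever data classification a real gateway would apply.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    LOCAL = "local"
    EDGE = "edge"
    FRONTIER = "frontier"


@dataclass(frozen=True)
class RequestContext:
    # Hypothetical policy flags; a real gateway would derive them
    # from data classification rather than trust the caller.
    residency_restricted: bool = False
    in_pci_scope: bool = False
    confidential_client_data: bool = False
    needs_deep_reasoning: bool = False


def choose_route(ctx: RequestContext) -> Route:
    """Pick a tier: hard constraints first, capability second."""
    if ctx.residency_restricted or ctx.in_pci_scope or ctx.confidential_client_data:
        return Route.LOCAL
    if ctx.needs_deep_reasoning:
        return Route.FRONTIER
    return Route.EDGE


def route_with_audit(ctx: RequestContext, audit_log: list) -> Route:
    """Choose a route and record the decision in a shared audit log."""
    route = choose_route(ctx)
    audit_log.append({"route": route.value, "context": ctx})
    return route
```

The agent only ever calls `route_with_audit`, so changing the policy does not change agent code. In this sketch a constrained request goes to the local route even when it needs deep reasoning, which is the case where the task itself should be rescoped.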
Self-hosting solves residency, but it introduces its own failure modes — and it does not change who is accountable. Watch these:
An open-weight model on-prem will not match a frontier API on nuanced regulatory reasoning. Do not let a residency requirement push a hard reasoning task onto a model too small for it — rescope the task instead.
A local endpoint is your attack surface, your patching and your access control. Residency compliance does not survive a badly run GPU box.
Running on-prem helps with residency, but PCI scope still depends on how the boundary is built and audited. Confirm it with a human; do not assume it.
A local model signs off nothing, interprets no regulation and moves no money. The infrastructure changes; the human-in-the-loop rule does not.
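The human-in-the-loop rule can be enforced in code as well as in process. Below is a minimal sketch, assuming model output is treated as a draft that cannot be acted on until a named reviewer approves it. `DraftDecision`, `ApprovalRequired`, `approve` and `execute` are hypothetical names for illustration, and a real system would authenticate the reviewer instead of accepting a string.

```python
from dataclasses import dataclass
from typing import Optional


class ApprovalRequired(Exception):
    """Raised when a model-produced draft is executed without human sign-off."""


@dataclass
class DraftDecision:
    summary: str
    produced_by: str  # e.g. the model route that generated the draft
    approved_by: Optional[str] = None


def approve(draft: DraftDecision, reviewer: str) -> DraftDecision:
    """Record the human reviewer who signed off on the draft."""
    if not reviewer.strip():
        raise ValueError("reviewer must be a non-empty name")
    draft.approved_by = reviewer
    return draft


def execute(draft: DraftDecision) -> str:
    """Act on a draft only after a human has approved it."""
    if draft.approved_by is None:
        raise ApprovalRequired(f"draft from {draft.produced_by} has no human approval")
    return f"executed: {draft.summary} (approved by {draft.approved_by})"
```

The gate is the same whichever route produced the draft, so moving inference on-prem does not loosen it.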
Go local when a hard constraint — residency law, PCI scope, an air-gapped estate, or genuinely confidential client data — makes sending data to any API unacceptable. In that case the constraint, not the capability, picks the model, and on-prem open weights are the only correct answer.
Stay on the API tiers when no such constraint applies and the task needs reasoning depth: there, paying for a frontier model beats nursing a self-hosted box. The discipline is to decide by constraint first and capability second — and because everything routes through the gateway, you can change your mind without changing the agent.