Engineering · April 15, 2026 · 10 min read

Running Inference Across Multiple Colos Without a Central Brain

By Sev Geraskin

PolarGrid started with a single edge node in Vancouver, running a single model on a single GPU. It was a Dell PowerEdge blade server with a fan about as loud as an airplane engine on a transatlantic flight. The deployment process was: SSH into the box, pull the container, run it, point DNS at it. That prototype reduced inference network latency by more than 70% compared to centralized cloud routes.

When we expanded to Toronto and Montreal, we had to answer a question that dictates most of the architecture: how does a client request end up on the right GPU, and how do new models get onto the right nodes, without creating a bottleneck or putting a single service in the middle that everything depends on?

We run inference across three production colocations today, and every node operates independently. There is no central orchestrator in the request path, and no single service whose failure can take the whole fleet down. The rule we hold ourselves to: anything in the hot path is local to the node. Anything that can tolerate seconds or minutes of staleness can be centralized.

How a Node is Built

Each PolarGrid edge node is a bare-metal machine with NVIDIA GPUs that runs MicroK8s. The inference stack has two layers with a hard security boundary between them.

The first layer is the API gateway. It runs in a Kata container, providing VM-level isolation from the host. It faces the internet and handles TLS termination, authentication, and rate limiting. It speaks the OpenAI-compatible API: /v1/chat/completions, /v1/models, /health. It also exposes a /v1/models/load endpoint for dynamic model management.

The second layer is NVIDIA Triton Inference Server. Triton has direct GPU access via the NVIDIA runtime but has no external network exposure. A Kubernetes network policy blocks all traffic to Triton except from the gateway’s namespace. External requests hit the gateway; the gateway forwards to Triton over the internal cluster network. Triton runs inference on the GPU and returns the result.
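
As a rough illustration of the network policy described above, the sketch below builds a Kubernetes NetworkPolicy manifest as a Python dictionary. The namespace names, the `app: triton` pod label, and the port (8000, Triton's default HTTP port) are assumptions made for the example, not PolarGrid's actual values.

```python
import json

def triton_ingress_policy(triton_namespace: str, gateway_namespace: str, port: int = 8000) -> dict:
    """Build a NetworkPolicy that admits ingress to Triton pods only from one namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "triton-allow-gateway-only", "namespace": triton_namespace},
        "spec": {
            # Hypothetical label: selects the Triton pods this policy protects.
            "podSelector": {"matchLabels": {"app": "triton"}},
            # Declaring Ingress means anything not listed below is denied.
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    "from": [
                        {
                            # kubernetes.io/metadata.name is set automatically
                            # on every namespace in recent Kubernetes versions.
                            "namespaceSelector": {
                                "matchLabels": {"kubernetes.io/metadata.name": gateway_namespace}
                            }
                        }
                    ],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }

# Hypothetical namespace names.
policy = triton_ingress_policy("inference", "gateway")
manifest = json.dumps(policy, indent=2)  # kubectl apply -f accepts JSON as well as YAML
```

Because the policy lists a single allowed source, any other pod or external client trying to reach Triton is dropped at the network layer, provided the cluster's network plugin enforces NetworkPolicy.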

This separation matters because the component touching the internet has no GPU access, and the component with GPU access cannot be reached from the internet. The attack surface of a public-facing inference endpoint drops substantially when you split the security boundary from the compute boundary.

How Models Get Onto Nodes

Each node runs deployment scripts that handle the full lifecycle: validate prerequisites, build Docker images for the gateway and Triton backend, create Kubernetes secrets for JWT signing keys and Hugging Face tokens, apply the deployment manifests via kubectl, and wait for health checks to pass.

Models themselves are loaded dynamically via the gateway. The gateway exposes a POST /v1/models/load endpoint which accepts a Hugging Face model identifier. When called, the gateway downloads the model, converts it to Triton’s format, and writes the resulting files to a shared volume.
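
To make the load call concrete, here is a minimal sketch of how a client might build that request. Only the POST /v1/models/load path comes from the description above. The JSON field name `model`, the bearer-token header, the node URL, and the model identifier are all assumptions for illustration.

```python
import json
import urllib.request

def build_load_request(node_url: str, model_id: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a model-load request addressed to one specific node."""
    # Assumption: the gateway reads the Hugging Face identifier from a "model" field.
    body = json.dumps({"model": model_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{node_url.rstrip('/')}/v1/models/load",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
    )

# Placeholder node URL, model identifier, and token.
req = build_load_request("https://node.example.invalid", "org/model-name", "TOKEN")
# Sending it would be: urllib.request.urlopen(req, timeout=600)
```

The request targets the node directly, so loading a model on one node involves no other machine in the fleet.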

A sidecar container running alongside Triton watches that volume. When it detects a new model directory with a valid config, it copies the files into Triton’s model repository and triggers a reload via Triton’s native functionality.
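
The sidecar loop can be sketched roughly as follows. This is a simplified illustration, not the production sidecar: it assumes Triton runs in explicit model control mode and is reloaded through its model repository HTTP endpoint, it treats the presence of a config.pbtxt file as "valid config", and the function names, paths, and polling interval are hypothetical.

```python
import shutil
import time
import urllib.request
from pathlib import Path
from typing import Callable

def triton_load(model_name: str, triton_url: str = "http://localhost:8000") -> None:
    """Ask Triton to load (or reload) a model via its model repository HTTP API."""
    req = urllib.request.Request(
        f"{triton_url}/v2/repository/models/{model_name}/load", data=b"{}", method="POST"
    )
    urllib.request.urlopen(req, timeout=300).close()

def sync_new_models(staging: Path, repository: Path, load: Callable[[str], None]) -> list:
    """Copy staged models that have a config.pbtxt and are not yet in the repository."""
    loaded = []
    for model_dir in sorted(p for p in staging.iterdir() if p.is_dir()):
        target = repository / model_dir.name
        if target.exists() or not (model_dir / "config.pbtxt").is_file():
            continue  # already deployed, or the gateway has not finished writing it
        partial = repository / f".{model_dir.name}.partial"
        shutil.rmtree(partial, ignore_errors=True)  # clear leftovers from a crashed copy
        shutil.copytree(model_dir, partial)
        partial.rename(target)  # same filesystem, so the model directory appears in one step
        load(model_dir.name)
        loaded.append(model_dir.name)
    return loaded

def watch(staging: Path, repository: Path, interval_s: float = 5.0) -> None:
    """Poll the shared volume forever; a file-notification API would also work."""
    while True:
        sync_new_models(staging, repository, triton_load)
        time.sleep(interval_s)
```

Copying into a temporary directory and renaming it keeps Triton from ever being asked to load a half-copied model.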

The result: model deployment is an API call to the node itself. No central service needs to know about it. An operator or automated pipeline can load a new model onto a specific node in minutes, without touching any other part of the fleet.

No Central Brain in the Hot Path

The design constraint we enforce: no request in production transits a service that isn’t on the node handling that request.

Authentication is validated at the node’s gateway using a JWT signing key stored locally. No auth service call on the request path. Rate limiting state is local to the node. Node health is reported via the /health endpoint on the node itself — consumers pull it, we don’t push it.
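
Local validation of this kind can be sketched with the standard library alone. The example assumes HS256 (a shared HMAC secret) and a mandatory `exp` claim. The article does not say which algorithm or claims PolarGrid uses, and a production gateway would more likely rely on a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")

def _b64url_decode(segment: str) -> bytes:
    # JWT segments drop base64 padding; restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))

def _signature(signing_input: str, key: bytes) -> bytes:
    return hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()

def sign_hs256(claims: dict, key: bytes) -> str:
    """Issue a token. Included only so the verifier below can be exercised."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode("utf-8"))
    payload = _b64url_encode(json.dumps(claims).encode("utf-8"))
    return f"{header}.{payload}.{_b64url_encode(_signature(f'{header}.{payload}', key))}"

def verify_hs256(token: str, key: bytes, now: Optional[float] = None) -> Optional[dict]:
    """Return the claims if signature and expiry check out, else None. No network calls."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
        supplied = _b64url_decode(sig_b64)
        expected = _signature(f"{header_b64}.{payload_b64}", key)
    except ValueError:  # wrong segment count, bad base64, bad JSON, non-ASCII input
        return None
    if not isinstance(header, dict) or header.get("alg") != "HS256":
        return None
    if not hmac.compare_digest(expected, supplied):  # constant-time comparison
        return None
    exp = claims.get("exp") if isinstance(claims, dict) else None
    current = time.time() if now is None else now
    if not isinstance(exp, (int, float)) or exp <= current:
        return None
    return claims
```

Everything the verifier needs (the key and the clock) is already on the node, which is what keeps authentication out of any cross-node dependency.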

The only centralized infrastructure is the control plane: the system that tracks which models are on which nodes, collects aggregate metrics, and provides the data that the client-side router uses for node selection. This system runs on cheap, redundant infrastructure. When it’s unavailable, inference keeps working. Clients fall back to cached node lists. Models already loaded on nodes keep serving. The only things that stop working are deploying new models and refreshing routing tables, both of which can tolerate minutes of delay without affecting active voice sessions.
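
The cached-node-list fallback might look something like the sketch below on the client side. The `fetch` callable, the cache file location, and the JSON list format are assumptions. The point is only that a control plane outage is absorbed by the last known good list.

```python
import json
from pathlib import Path
from typing import Callable

def load_node_list(fetch: Callable[[], list], cache_path: Path) -> list:
    """Return the freshest node list available, preferring the control plane."""
    try:
        nodes = fetch()  # hypothetical call to the control plane
    except OSError:
        # Control plane unreachable (URLError, timeouts, and connection errors
        # are all OSError subclasses): serve the last list we saved.
        # If no cache exists yet, FileNotFoundError propagates to the caller.
        return json.loads(cache_path.read_text())
    cache_path.write_text(json.dumps(nodes))  # refresh the cache on every success
    return nodes
```

A stale list only costs routing quality, since each node still authenticates, rate limits, and serves requests on its own.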

What We Learned

The hardest part of running multi-colo inference without a central brain isn’t the technology. It’s the operational discipline of treating centralized state as a liability rather than a convenience.

Every time we’ve been tempted to add a centralized component to the hot path — a shared session store, a global rate limiter, a centralized model registry — we’ve caught it in design review and pushed the state local. The result is a fleet that degrades gracefully rather than failing catastrophically.

When the Toronto node had a network issue last month, Montreal and Vancouver kept serving. When we pushed a bad gateway config to one node, the client-side router detected the elevated error rate and routed around it within one probe cycle. No pages. No war room. The architecture handled it.
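
A client-side router that routes around a misbehaving node can be sketched as follows. The probe fields, the 5% error-rate threshold, and the lowest-latency tie-break are assumptions for illustration, not the production router's actual policy.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProbeResult:
    """What one probe cycle observed for one node (hypothetical shape)."""
    node: str
    latency_ms: float
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0

def pick_node(probes: list, max_error_rate: float = 0.05) -> Optional[str]:
    """Pick the lowest-latency node whose error rate is under the threshold."""
    healthy = [p for p in probes if p.error_rate <= max_error_rate]
    if not healthy:
        return None  # caller decides what to do when every node looks unhealthy
    return min(healthy, key=lambda p: p.latency_ms).node

# Illustrative numbers only: the closest node is failing 40% of requests.
probes = [
    ProbeResult("toronto", latency_ms=12.0, requests=100, errors=40),
    ProbeResult("montreal", latency_ms=18.0, requests=100, errors=1),
    ProbeResult("vancouver", latency_ms=55.0, requests=100, errors=0),
]
chosen = pick_node(probes)
```

With these numbers the router skips the nearest node and settles on the next closest healthy one, using only data the client gathered itself.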
