
A multi-model inference stack is only as efficient as the GPUs it keeps busy. If every small model gets its own GPU, the system is easy to reason about, but the hardware sits mostly idle. If too many models share one GPU without the right scheduling, the system becomes slow in a different way: short requests wait behind long kernels, caches fragment, and the GPU looks busy without doing the work that matters.
We built this inference engine to serve a routing stack from individual Blackwell GPUs instead of a fleet of lightly used single-model GPUs. The hot path needs several models: a 4B SLM, a query embedder, a document embedder, a small reranker, and a large reranker path. Individually, each model is small or bursty. Together, they fit in memory and justify keeping one GPU under load and fully utilised.
The goal was to turn unused GPU memory into resident model capacity, while keeping the accelerator compute-bound on useful work instead of reserved for idle services.
That made the core problem an infrastructure problem. The models fit in VRAM. The hard part was making them share the GPU without paying the churn tax.
Motivation
The naive deployment is one model per GPU. This isolates failure modes, but it wastes capacity. A 4B SLM call emits text snippets. An embedding call can be only a few milliseconds of GPU work. Reranking is bursty. Each service wants to stay warm, but none of them keeps a dedicated card saturated all the time.
The RTX PRO 6000 Blackwell has enough memory to keep the resident set on one card.
Figure 1: The shift is turning memory reserved by idle services into resident capacity on one accelerator, then adding the scheduling pieces that keep the shared GPU useful.
Each model is a separate local vLLM process. The gateway routes by model, applies per-model defaults, and forwards to the right localhost port.
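A minimal sketch of that routing step, assuming a static table keyed by model name. The model names, ports, and default values are hypothetical placeholders, and this illustrates the idea rather than the gateway's actual code.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Backend:
    port: int                                      # localhost port of one vLLM process
    defaults: dict = field(default_factory=dict)   # per-model request defaults


# Hypothetical model names, ports, and defaults, for illustration only.
BACKENDS = {
    "slm-4b": Backend(8001, {"max_tokens": 256}),
    "query-embedder": Backend(8002),
    "doc-embedder": Backend(8003),
    "reranker-small": Backend(8004),
    "reranker-large": Backend(8005),
}


def route(request: dict) -> tuple:
    """Return (base_url, body) for a request that names its model."""
    model = request.get("model")
    backend = BACKENDS.get(model)
    if backend is None:
        raise KeyError(f"unknown model: {model!r}")
    # Fields supplied by the caller override the per-model defaults.
    body = {**backend.defaults, **request}
    return f"http://127.0.0.1:{backend.port}", body
```

Keeping the table static makes the routing decision a dictionary lookup, so the gateway adds no scheduling state of its own at this step.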
A current GKE pod shows the shape we wanted: five vLLM engines resident on one Blackwell GPU, with MPS attached and room left in memory.
nvidia-smi reports the five engines as MPS clients and about 46 GiB used on a 98 GiB RTX PRO 6000 Blackwell. The exact resident set will change with the workload, but this one showed the useful constraint. Memory had stopped being scarce. Scheduling had become the bottleneck.
The first version proved that memory was not the blocker. The model set fit. The failures came from the interfaces between systems: CUDA context scheduling, container boundaries, Kubernetes MPS semantics, fleet routing, and queueing.
CUDA contexts were the first churn source
The first packed pod ran multiple vLLM processes on one GPU without CUDA MPS. It worked at idle and fell over under mixed load.
The heavy path being slow was expected. The failure was that small embedding requests started to slow down too. A request that should have been a few milliseconds of compute saw tail latency in the hundreds of milliseconds.
The models were resident. No weights were being swapped. The slowdown came from CUDA context time-slicing.
Without MPS, each vLLM process owns a separate CUDA context. A short embed kernel cannot reliably slip between longer generation kernels from another process, so it waits behind the other context. From the API side, this looks like unexplained model churn: a cheap request pays for whatever else the GPU is doing.
With MPS engaged, the mixed workload changed shape. On the 32-concurrent hot-path benchmark, throughput moved from about 190 req/s to 230 req/s. Under heavy embedding contention, it moved from about 28 req/s to 83 req/s.
Figure 2: MPS helped most when short embed kernels were contending with longer work from another local vLLM process. The hot-path mix improved too, but the larger lesson was about removing avoidable serialisation.
The latency moved in the same direction:
| Config | Throughput | Chat p95 | Embed p95 | Rerank p95 |
| no MPS | 190 req/s | 162 ms | 300 ms | 51 ms |
| MPS | 230 req/s | 132 ms | 253 ms | 37 ms |
The useful result is narrow: co-resident vLLM processes need GPU-side sharing support, or small kernels get serialised behind large ones.
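The relative changes are easier to compare than the raw numbers. The short calculation below uses only the figures quoted above; the dictionary keys are labels chosen for this example.

```python
# (before, after) pairs taken from the benchmark figures quoted above.
HOT_PATH = {
    "throughput_req_s": (190, 230),
    "chat_p95_ms": (162, 132),
    "embed_p95_ms": (300, 253),
    "rerank_p95_ms": (51, 37),
}
EMBED_CONTENTION_REQ_S = (28, 83)


def pct_change(before: float, after: float) -> float:
    """Signed percentage change from before to after."""
    return (after - before) / before * 100.0


for name, (before, after) in HOT_PATH.items():
    print(f"{name}: {pct_change(before, after):+.1f}%")

before, after = EMBED_CONTENTION_REQ_S
print(f"embed contention speedup: {after / before:.2f}x")
```

On the hot-path mix this works out to roughly a 21% throughput gain, with p95 latency down by about 16% to 27% depending on the task. Under heavy embedding contention the gain is close to 3x, which is where the serialisation was hurting most.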
The container boundary made MPS look broken
The first Docker attempt hit:
That error suggested a broad conclusion: maybe MPS and containers were not worth the trouble. The actual cause was narrower. The MPS control daemon ran as one user on the host, while the vLLM clients inside the containers ran as another. The pipe and IPC wiring existed, but the daemon and client UIDs did not match.
Running the MPS daemon as the same UID as the container clients, with host IPC and the right pipe mounts, fixed the issue. nvidia-smi showed the vLLM engine processes as MPS clients, and the mixed workload stopped behaving like a serialised queue.
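A preflight check along these lines can catch the mismatch before a client starts. This is a sketch that assumes the owner of the MPS pipe directory stands in for the daemon's UID, which is a heuristic rather than something MPS guarantees. `CUDA_MPS_PIPE_DIRECTORY` is the standard MPS environment variable; the function name is made up for this example.

```python
import os
from typing import Optional

DEFAULT_PIPE_DIR = "/tmp/nvidia-mps"  # MPS default when the env var is unset


def mps_pipe_owner_matches(pipe_dir: Optional[str] = None,
                           client_uid: Optional[int] = None) -> bool:
    """Return True if the pipe directory is owned by the client's UID.

    Heuristic: directory ownership is used as a proxy for the UID the
    MPS control daemon runs as.
    """
    if pipe_dir is None:
        pipe_dir = os.environ.get("CUDA_MPS_PIPE_DIRECTORY", DEFAULT_PIPE_DIR)
    if client_uid is None:
        client_uid = os.getuid()
    return os.stat(pipe_dir).st_uid == client_uid
```

Running it inside the container, as the user that will launch vLLM, turns a confusing connection failure into an explicit UID mismatch at startup.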
Kubernetes changed the memory model
The next failure appeared after moving the same model set into Kubernetes.
Single-box MPS gave us compute sharing. Kubernetes MPS, through the GPU Operator, exposed multiple MPS-backed nvidia.com/gpu slices. That made scheduling explicit, but it also partitioned GPU memory.
With four MPS replicas on a 94.97 GiB card, each slice was about 23.75 GiB. The 4B SLM was configured with:
On the full card, that is about 28.49 GiB. On a one-quarter slice, it is too large. The 4B SLM crashed with:
In this mode, the rule is mechanical:
The same MPS label hid two different operating models. Bare MPS shared compute without imposing an equal per-client VRAM cap. Kubernetes MPS shared compute and divided memory evenly. The resident model set had to be sized for the resource model Kubernetes actually enforced.
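The sizing check implied by those numbers can be written down directly. This sketch assumes each MPS replica gets an equal share of the card and that a model's budget is expressed in absolute GiB; the function names are illustrative.

```python
def slice_gib(card_gib: float, mps_replicas: int) -> float:
    """Memory available to one MPS replica when the card is split evenly."""
    if mps_replicas < 1:
        raise ValueError("mps_replicas must be at least 1")
    return card_gib / mps_replicas


def fits_in_slice(budget_gib: float, card_gib: float, mps_replicas: int) -> bool:
    """True if a model's memory budget fits inside one slice."""
    return budget_gib <= slice_gib(card_gib, mps_replicas)


CARD_GIB = 94.97      # card size quoted above
SLM_BUDGET_GIB = 28.49  # 4B SLM budget on the full card, quoted above

print(fits_in_slice(SLM_BUDGET_GIB, CARD_GIB, mps_replicas=1))  # whole card
print(fits_in_slice(SLM_BUDGET_GIB, CARD_GIB, mps_replicas=4))  # quarter slice
```

The budget fits on the whole card and fails on a quarter slice, which is the crash described above. Running a check like this at deploy time is cheaper than discovering the limit when the engine starts.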
Routing needed locality, not even spread
Once one packed pod worked, the fleet router introduced a different churn source.
Evenly spreading the same model across all pods looks fair, but it fragments inference state. Each pod has its own prefix cache, continuous batching state, and queue. If requests for the same model are distributed everywhere, the cache gets colder and batches get thinner.
The router had to prefer affinity. Same-model traffic should stay on the preferred pod until that pod crosses a load threshold. Spillover still matters, but it should be a saturation response, not the default state.
Figure 3: Even spread looks fair from the load balancer's view, but it splits the serving state. Affinity keeps same-model traffic local until the preferred pod crosses a load threshold.
That keeps the local queue warmer, gives vLLM denser batches, and avoids treating even distribution as the default good.
For model serving, locality is capacity.
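One way to express affinity with spillover is a stable per-model ordering of pods plus a load threshold. The sketch below uses rendezvous-style hashing for the ordering and a normalised load value between 0 and 1. Both are assumptions made for illustration, as is the 0.8 threshold; the router's real policy is not shown here.

```python
import hashlib


def preferred_order(model: str, pods: list) -> list:
    """Stable per-model ordering of pods; the first entry is the preferred pod."""
    def score(pod: str) -> str:
        return hashlib.sha256(f"{model}|{pod}".encode()).hexdigest()
    return sorted(pods, key=score)


def pick_pod(model: str, pods: list, load: dict, threshold: float = 0.8) -> str:
    """Keep traffic on the preferred pod until it crosses the load threshold."""
    order = preferred_order(model, pods)
    for pod in order:
        if load.get(pod, 0.0) < threshold:
            return pod
    # Every pod is past the threshold: fall back to the least loaded one.
    return min(order, key=lambda pod: load.get(pod, 0.0))
```

Because the ordering depends only on the model and pod names, every router instance agrees on the preferred pod without coordination, and spillover always goes to the same second choice.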
Different tasks needed different queues
The gateway also needed different backpressure for different task types.
Generation already has vLLM's continuous batcher. The gateway should not hold generation requests to assemble its own batch. It should protect the backend with admission control: concurrency, estimated token pressure, and priority.
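A sketch of admission control over those three signals, assuming the gateway tracks in-flight requests and an estimated token count per request. The class, its fields, and the idea of reserving headroom for high-priority traffic are illustrative choices, not the production policy.

```python
from dataclasses import dataclass


@dataclass
class AdmissionController:
    max_inflight: int                   # concurrency limit
    max_inflight_tokens: int            # estimated token pressure limit
    low_priority_share: float = 0.8     # low-priority traffic may use this share
    inflight: int = 0
    inflight_tokens: int = 0

    def try_admit(self, est_tokens: int, high_priority: bool = False) -> bool:
        """Admit the request if it fits within the limits for its priority."""
        share = 1.0 if high_priority else self.low_priority_share
        if self.inflight + 1 > self.max_inflight * share:
            return False
        if self.inflight_tokens + est_tokens > self.max_inflight_tokens * share:
            return False
        self.inflight += 1
        self.inflight_tokens += est_tokens
        return True

    def release(self, est_tokens: int) -> None:
        """Return capacity when a request finishes."""
        self.inflight -= 1
        self.inflight_tokens -= est_tokens
```

Rejected requests are never held to form a batch. The gateway only decides whether the backend can take more work, and vLLM's continuous batcher does the batching.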
Embeddings and reranking are one-shot calls. For those paths, per-request overhead matters enough that a few milliseconds of waiting can produce a better batch. They want deadline-aware micro-batching: hold briefly, flush at the earliest caller deadline, or flush when full.
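The three flush conditions can be sketched as a small batcher with an explicit clock. This assumes each request carries a deadline meaning the latest time it may leave the queue; the batch size and hold time are hypothetical defaults, and time is passed in as an argument so the logic stays easy to test.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MicroBatcher:
    max_batch: int = 16          # flush when this many items are queued
    max_hold_s: float = 0.005    # longest time the first item may wait
    _items: list = field(default_factory=list, init=False)
    _flush_at: float = field(default=float("inf"), init=False)

    def add(self, item, now: float, deadline: float) -> Optional[list]:
        """Queue an item; return the batch if this item filled it."""
        if not self._items:
            self._flush_at = now + self.max_hold_s
        self._items.append(item)
        # Never hold past the earliest caller deadline in the batch.
        self._flush_at = min(self._flush_at, deadline)
        if len(self._items) >= self.max_batch:
            return self._drain()
        return None

    def poll(self, now: float) -> Optional[list]:
        """Return the batch if the hold time or a deadline has been reached."""
        if self._items and now >= self._flush_at:
            return self._drain()
        return None

    def _drain(self) -> list:
        batch, self._items = self._items, []
        self._flush_at = float("inf")
        return batch
```

A caller would invoke `add` on each request and `poll` from a timer. A real gateway would also subtract expected backend service time from each deadline, which this sketch leaves out.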
The request path became:
Figure 4: Generate requests are admitted into vLLM's continuous batcher. Embed and rerank requests are briefly accumulated because one-shot work benefits from denser batches.
One queue was wrong because the failure modes are different. Generate can overfill KV cache capacity. Embed and rerank can waste the GPU with thin one-shot work. The packed pod became a small scheduler: MPS for compute sharing, Kubernetes-aware memory sizing, affinity routing, and task-specific backpressure.
Startup became the next bottleneck
Packing the hot path into one pod raises the value of that pod. Replacement and scale-up have to be fast.
The cold-start path had four separate bills:
| Startup work | What it costs | What a snapshot can skip |
| node and image setup | boot, pull, runtime setup | only if the image/runtime is already present |
| weight reads | large sequential reads into host/GPU memory | not really; the bytes still move |
| engine init | Python import, CUDA setup, graph capture, warmup | yes, if captured at the right point |
| restore reads | checkpoint bytes read back into the process | no; this becomes the new bottleneck |
That is the useful way to think about snapshots: you stop replaying expensive initialisation, then pay to read back the state you saved.
Snapshot restore looked like the right primitive for the generate path: warm the 4B SLM, checkpoint the process after vLLM has done its expensive setup, and restore it instead of cold-loading from scratch. In warm-cache tests, this worked well. On a fresh node, the numbers changed:
| Restore path | Time |
| warm restore | ~23 s |
| fresh-node restore | ~107 s |
| cold-read delta | ~84 s |
On the fresh-node run, the agent-level breakdown was:
| Fresh-node component | Time |
| criu_restore | 61 s |
| nsrestore | 45 s |
The warm number was a floor, not a cold-start number. The fresh-node run still had to read the checkpoint bytes. If that read comes from slow shared storage, restore inherits the cold-read latency.
Once image pull and node boot are out of the critical path, scale-up becomes a bandwidth problem. The question changes from “can we checkpoint the process?” to “can we serve the checkpoint bytes fast enough for the checkpoint to matter?”
Weights and checkpoints are different artifacts
The storage design became clearer once we separated two artifacts that are easy to mix together.
Weights are immutable in the serving path. Populate them once, then fan them out read-only to every GPU node. That fits a read-only multi-attach store such as Hyperdisk ML on GKE.
Checkpoints are different. The snapshot agent writes the CRIU/CUDA dump, and workers later read it back. That artifact needs writable shared storage. A read-only-many disk can be a good fit for weights and the wrong primitive for checkpoints.
The contract became:
| Artifact | Access pattern | Store shape |
| weights | immutable, populate once, fan out to every GPU node | read-only multi-attach |
| checkpoints | snapshot agent writes CRIU dump, workers read restore bytes | writable shared storage |
Weights want read-only fanout. Checkpoints need a writable restore path. Fast read-only storage improves weight reads; it does not, by itself, solve writable checkpoint restore.
The storage bandwidth numbers explain why this mattered:
| Store / path | Read bandwidth |
| GCS, single stream | 0.20 GB/s |
| GCS, parallel/sliced | 0.60 GB/s |
| Local SSD, 1 device | 0.75 GB/s |
| Local SSD, RAID0 x2 | 1.49 GB/s |
| G4 single-GPU Hyperdisk cap | 1.68 GB/s |
| Hyperdisk ML measured on c3-standard-44 | 2.5 GB/s |
Once scale-up is waiting on bytes, the storage primitive becomes part of the serving path. The measured Hyperdisk ML result is from c3-standard-44; the G4 single-GPU Hyperdisk cap is lower. At the G4 cap, the same cold read that cost about 84 seconds maps to roughly 10 seconds of read time. That turns the fresh-node restore from a minute-plus event into something close to the warm-cache floor.
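The 84-second-to-10-second mapping can be reproduced from the table under one assumption: that the cold read ran at roughly the GCS single-stream rate. The text does not say which store served the fresh-node read, so treat the implied checkpoint size as an estimate.

```python
# Read bandwidths from the table above, in GB/s.
READ_GB_S = {
    "GCS, single stream": 0.20,
    "GCS, parallel/sliced": 0.60,
    "Local SSD, 1 device": 0.75,
    "Local SSD, RAID0 x2": 1.49,
    "G4 single-GPU Hyperdisk cap": 1.68,
    "Hyperdisk ML on c3-standard-44": 2.5,
}

COLD_READ_S = 84.0                       # cold-read delta measured above
ASSUMED_SOURCE = "GCS, single stream"    # assumption, not stated in the text


def read_seconds(size_gb: float, gb_per_s: float) -> float:
    """Time to read size_gb at a sustained bandwidth of gb_per_s."""
    return size_gb / gb_per_s


# Bytes implied by an 84 s read at the assumed source bandwidth.
implied_size_gb = COLD_READ_S * READ_GB_S[ASSUMED_SOURCE]

for store, gb_per_s in READ_GB_S.items():
    print(f"{store}: {read_seconds(implied_size_gb, gb_per_s):.1f} s")
```

Under that assumption the read is about 16.8 GB, which takes about 10 seconds at the G4 cap. That matches the figure above.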
Snapshot the expensive init
Snapshotting is not automatically the right default for every resident model.
A snapshot saves initialisation time, but it creates bytes that must be read during restore. It also creates a compatibility contract. The checkpoint belongs to a specific runtime shape: GPU class, driver, serve arguments, model length, memory budget, and warmup profile. Change the shape and the safest answer is to create a new checkpoint.
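One way to make that contract explicit is to fingerprint the runtime shape and store the fingerprint with the checkpoint. The sketch below hashes the fields listed above. The field names, the 16-character truncation, and the example values are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RuntimeShape:
    gpu_class: str
    driver_version: str
    serve_args: tuple
    max_model_len: int
    memory_budget_gib: float
    warmup_profile: str

    def fingerprint(self) -> str:
        """Short, stable hash of every field that the checkpoint depends on."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()[:16]


def can_restore(checkpoint_fingerprint: str, current: RuntimeShape) -> bool:
    """Restore only when the current shape matches the one that was captured."""
    return checkpoint_fingerprint == current.fingerprint()


# Placeholder values for illustration only.
captured = RuntimeShape("example-gpu", "000.00", ("--example-flag",), 8192, 28.49, "default")
changed = replace(captured, max_model_len=4096)
print(can_restore(captured.fingerprint(), captured))  # same shape
print(can_restore(captured.fingerprint(), changed))   # shape changed
```

A mismatch does not have to be an error. It can simply route the pod to the cold-load path and trigger a new checkpoint for the new shape.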
For the 4B SLM, that trade is good: save roughly 30 seconds of initialisation, pay only a few seconds of extra read time, and keep one snapshot profile tied to the generate path.
For the query embedder or document embedder, the trade can flip. If cold-loading costs only a few seconds, and the snapshot adds more read overhead than it saves in init time, restore loses.
The rule is:
| Model class | Snapshot tradeoff | Decision |
| 4B SLM | saves ~30 s init, pays only a few seconds of extra read | snapshot |
| query/document embedder | saves ~2-3 s init, can add more read overhead than it saves | cold-load |
For a packed pod, this points to one heavy snapshot plus cold-loaded embedders and rerankers, not a checkpoint for every resident model. It is also less fragile operationally. Restoring several GPU processes means coordinating VRAM, reconnecting through MPS, and bringing the set back in the right shape. The packed pod should restore the expensive resident state and rebuild the cheap state normally.
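The rule in the table reduces to comparing two times per model. In the sketch below, the 30-second and 2.5-second init figures come from the text, while the extra read times are hypothetical stand-ins for "a few seconds" and "more than it saves".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StartupProfile:
    name: str
    init_saved_s: float   # initialisation time a restore would skip
    extra_read_s: float   # extra time spent reading checkpoint bytes on restore

    @property
    def net_saving_s(self) -> float:
        return self.init_saved_s - self.extra_read_s


def decide(profile: StartupProfile, min_saving_s: float = 0.0) -> str:
    """Snapshot only when restore saves more time than it adds."""
    return "snapshot" if profile.net_saving_s > min_saving_s else "cold-load"


# extra_read_s values are hypothetical; measure them per store and per model.
profiles = [
    StartupProfile("4B SLM", init_saved_s=30.0, extra_read_s=5.0),
    StartupProfile("query embedder", init_saved_s=2.5, extra_read_s=4.0),
]
for profile in profiles:
    print(profile.name, decide(profile))
```

Raising `min_saving_s` above zero is a way to account for the operational cost of maintaining a checkpoint, so that marginal wins still default to cold-load.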
Results
The design that held up was boring in the right places. Each model kept its own process and serving defaults. The gateway owned admission, batching, and routing. MPS handled GPU-side sharing. Kubernetes owned placement and memory slices. Storage stayed explicit about which bytes were immutable weights and which bytes were writable checkpoints.
That split matters because a packed GPU fails through boundaries, not through the model list. A CUDA context boundary can serialise short kernels. A container boundary can make MPS look enabled while clients cannot connect. A fleet boundary can scatter one model's cache state across too many pods. A storage boundary can turn a warm restore into a minute of checkpoint reads.
The useful rule is simple: pack resident compute, then keep the serving state local enough to make the packing pay. Snapshot the initialisation that is expensive to replay. Cold-load the small models when their checkpoint bytes cost more than their startup. Scale on queue pressure, because a fresh packed pod still needs time before it can take traffic.
A single GPU can behave like a small model fleet when the fleet mechanics stay in place. The system still needs memory accounting, admission, micro-batching, locality, and a startup path that knows which bytes matter. The win was fewer idle GPUs, with enough scheduling around the accelerator to keep requests from fighting over it.