Engineering

Designing Capacity Reservation for Deep Research

>_ Ansley Neo, Alexander Ng

When hundreds of AI tasks share the same LLM provider rate limits, static capacity estimates waste throughput and dynamic forecasting breaks at phase boundaries. This post explains how we built an adaptive reservation system that sizes capacity based on what tasks actually consume, separated by mode and phase, to maximise throughput without breaching rate limits.

The problem: Distributed coordination

Services like OpenRouter and Vercel AI Gateway route requests to LLM providers on behalf of their users. For individual calls, that routing is a solved problem. These gateways handle fallbacks, retries and model selection per call. But they don’t coordinate capacity across an entire workflow where dozens of calls share the same model provider rate limit. LLM providers enforce per-minute limits on requests and tokens. A single provider’s rate limits aren't enough for high throughput workloads, so systems route the same model through multiple providers to get more total capacity. With agentic workflows, where a single task triggers dozens of LLM calls across different models and providers, each call draws from a shared pool of rate limits that every other concurrent task also depends on. That makes capacity management a distributed coordination problem.

At Valyu, our Deep Research service runs thousands of concurrent tasks, each making many LLM calls across several providers and models. Each task needs to preempt capacity to a model provider before it runs. A problem arises when the system either reserves too much capacity upfront and throughput goes to waste or reserves too little and model APIs start rate limiting requests.

What makes multi-step capacity harder than single-request routing?

  1. The token profile flips between phases: Each deep research task passes through distinct phases with completely different token profiles. The research phase fans out into multiple parallel calls that search and summarise results, each consuming many input tokens but producing few output tokens. The writing phase inverts that, synthesising everything into a long-form report in a single call that consumes fewer input tokens but produces far more output tokens. The ratio between input and output tokens can flip by an order of magnitude between phases, while both phases draw from the same provider limits.

Fig 1. Token profile flips between phases

  2. Deep research task modes vary widely: Our API lets users configure how deep a research task goes, from a quick summary to an exhaustive investigation. Tasks range from a quick task finishing in under ten minutes to a deep task running for over two hours, sometimes more than four hours! A short task makes a handful of model calls, whereas a long task might make dozens across different models. All of them draw from the same per-provider rate limits.
  3. At scale, provider rate limits leave almost no extra capacity: LLM providers enforce per-minute limits on both requests and tokens. Split across many concurrent tasks, the per-task headroom gets thin. A single capacity reservation off by 10–15% is enough to breach a limit. Individual tasks can also burst unpredictably when they retrieve large amounts of context from search results, spiking input token consumption well above the estimate. When a limit is breached, the provider rejects requests for the remainder of the rate limit window, and errors propagate to every other concurrent task that depends on that model from that provider until the window resets.
  4. Priority routing to certain providers creates concentration risk: We have preferred providers for each model based on reliability and latency, so tasks route to providers in priority order: the preferred provider first, with fallbacks behind it. By design, the primary provider therefore absorbs the majority of the load, which is risky. If a task's TPM (tokens per minute) usage is miscalculated, every task routed there stalls until the rate limit window resets; a reservation error on the primary cascades into queue delays for every task. The system also needs resilience: if the primary provider goes down or starts returning errors, all that load needs to cascade to fallback providers without breaching their limits in turn.
  5. Concurrent bursts amplify problems: Tasks that start around the same time tend to consume tokens around the same time. Our API supports batch requests where users submit many research tasks at once, meaning dozens of tasks kick off simultaneously and tend to hit the same phases at the same time. When several tasks transition from research to writing phase simultaneously, the demand profile shifts across multiple tasks within seconds. Each individual reservation fits within limits in isolation, but collectively they exceed the limit.
  6. Capacity spans multiple models and providers: Different models have different strengths. Some are faster, while others produce higher quality outputs. The system routes across multiple models and providers simultaneously, tracking utilisation independently for each combination and coordinating reservations across all of them in real time.

Fig 2. Correlated phase-transition burst

In our case, we designed the system as follows:

Each deep research task runs inside a container on a Kubernetes cluster. A container picks up a task, orchestrates its phases and delivers the final report. Hundreds of these containers run concurrently, each executing tasks independently, but all of them make LLM calls against the same set of provider rate limits. No single container has a complete picture of cluster-wide usage, hence the system needs a shared state that every container can read and write to in real time, and an orchestration layer that coordinates reservations across all of them without bottlenecking throughput (figure 3).

The system has two layers. An infrastructure layer handles routing, reservation, and coordination across hundreds of containers. An adaptive layer sits on top, adjusting reservation sizes based on what tasks actually consume rather than static worst-case estimates (figure 4). Additionally, we designed it around a few core principles:

  • Config-driven: Capacity limits and routing logic should live as a configuration so we can respond to provider changes without having to redeploy code.
  • Strong consistency: Capacity leases should be synced across hundreds of distributed containers with no room for race conditions when checking or reserving capacity.
  • Graceful degradation: When provider capacity runs out, tasks should queue and not fail.
  • Adaptive: Reservations should reflect what tasks actually use, minimising wasted capacity without under-reserving and trading throughput for rate limit violations.

Fig 3. System overview


 Fig 4. Reservation flow

  1. Config over code (Config-driven)

Deep research workflow definitions live in configuration, from what agents each mode needs, to what tools they use and what capacity they require. Everything can be adjusted without redeployment. At the capacity level, each agent carries an ordered list of models with overflow thresholds:

TypeScript
interface AgentCapacityBlock {
  models: { model: string; overflowThreshold: number }[];
  inputTokens: number;
  outputTokens: number;
  parallelCalls: number;
}

When the capacity system receives a task for a given mode, it fetches the latest utilisation data for each model in the agent’s list and selects the first one running below its overflow threshold. If the preferred model is saturated, the task overflows to the next model in the list automatically.
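As a sketch, that overflow selection is a walk down the agent's ordered model list; the utilisation source and threshold values here are illustrative assumptions, not our actual config:

```typescript
// Hypothetical overflow routing: given current utilisation per model, pick the
// first model in the agent's ordered list running below its overflow threshold.
interface ModelEntry {
  model: string;
  overflowThreshold: number; // fraction of the provider limit, e.g. 0.8
}

function selectModel(
  models: ModelEntry[],
  utilisation: Map<string, number> // current utilisation fraction per model
): string | null {
  for (const entry of models) {
    const used = utilisation.get(entry.model) ?? 0;
    if (used < entry.overflowThreshold) return entry.model;
  }
  return null; // every model saturated: the task must queue
}

// The preferred model is saturated at 0.92, so the task overflows.
const choice = selectModel(
  [
    { model: "model-primary", overflowThreshold: 0.8 },
    { model: "model-fallback", overflowThreshold: 0.9 },
  ],
  new Map([
    ["model-primary", 0.92],
    ["model-fallback", 0.4],
  ])
);
// choice === "model-fallback"
```

Returning `null` corresponds to the "no capacity anywhere" case, which hands the task to the queueing layer described later.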

A circuit breaker tracks consecutive failures per provider. When a provider crosses the failure threshold, routing to that provider stops across all containers until it recovers. Provider outages and rate limit changes can be updated by editing the config.
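A minimal sketch of that circuit breaker, assuming a simple consecutive-failure counter; the threshold and class shape are illustrative, and the real breaker state is shared via Redis across containers:

```typescript
// Hypothetical per-provider circuit breaker: consecutive failures open it,
// any success closes it again.
class ProviderBreaker {
  private consecutiveFailures = 0;
  constructor(private readonly threshold: number = 5) {}

  recordSuccess(): void {
    this.consecutiveFailures = 0; // a success resets the run of failures
  }
  recordFailure(): void {
    this.consecutiveFailures++;
  }
  isOpen(): boolean {
    // Open means: stop routing to this provider until it recovers.
    return this.consecutiveFailures >= this.threshold;
  }
}

const breaker = new ProviderBreaker(3);
breaker.recordFailure();
breaker.recordFailure();
breaker.recordSuccess(); // resets: the two failures were not consecutive with what follows
breaker.recordFailure();
// breaker.isOpen() === false: only one consecutive failure since the last success
```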

  2. Shared state in Redis (Strong consistency)

Hundreds of containers run concurrently, each processing multiple tasks, so every container needs a synchronised view of cluster-wide capacity usage. A Redis cluster coordinates this: each model from each provider has three sorted sets, tracking RPM (requests per minute), input TPM, and output TPM. Every container reads from and writes to these same sets, without the overhead of direct container-to-container coordination.
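In production each of those counters is a Redis sorted set scored by timestamp; as an illustration, the same sliding-window semantics for a single dimension can be sketched in-memory (the one-minute window matches the per-minute provider limits, everything else is an assumption):

```typescript
// Hypothetical in-memory mirror of one sorted set: entries scored by
// timestamp, trimmed to the rate limit window before summing usage.
class SlidingWindowCounter {
  private entries: { ts: number; amount: number }[] = [];
  constructor(private readonly windowMs: number = 60_000) {}

  add(amount: number, now: number): void {
    this.entries.push({ ts: now, amount });
  }

  // Drop entries that have aged out of the window, then sum what remains:
  // the in-memory analogue of a ZREMRANGEBYSCORE trim followed by a sum.
  usage(now: number): number {
    this.entries = this.entries.filter((e) => now - e.ts < this.windowMs);
    return this.entries.reduce((sum, e) => sum + e.amount, 0);
  }
}

const inputTpm = new SlidingWindowCounter();
inputTpm.add(40_000, 0);
inputTpm.add(25_000, 30_000);
// At t = 65s the first entry has aged out of the one-minute window,
// so inputTpm.usage(65_000) === 25_000.
```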

Redis also handles other cross-container state. A user credit cache uses atomic decrements to prevent double-spending race conditions: a single user might have tasks running across ten or more containers simultaneously, all billing against the same account, so every credit deduction must be atomic to prevent concurrent tasks from spending past the user’s balance. Redis also stores failure counters that propagate provider failures to all instances to enable provider fallbacks, and lifecycle profile data used by the adaptive reservation layer.

Each container maintains local in-memory caches for status polling, API key validation, and context tracking.

  3. Atomic reservations (Strong consistency)

A reservation must check RPM, input TPM, and output TPM against their limits for every model a task needs, and either reserve all of them or none. A Lua script that runs atomically inside Redis enforces this: it checks all three dimensions for each model, writes the task into each sorted set if everything passes, and returns an at-capacity flag if any check fails.

Some tasks use multiple models concurrently. All models are reserved atomically, preventing the case where one of two reservations succeeds, the other fails, and a dangling reservation is left behind.

On task completion, a release closure removes the task from all sorted sets. If a container crashes without releasing, entries auto-expire via TTL.
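The all-or-nothing check can be sketched as follows. In production it runs as a Lua script inside Redis so the check-and-write is atomic cluster-wide; this in-memory TypeScript version shows the same two-pass logic, with field names and limit values as illustrative assumptions:

```typescript
// Hypothetical all-or-nothing reservation across multiple models.
interface Dimensions {
  rpm: number;
  inputTpm: number;
  outputTpm: number;
}

type Usage = Map<string, Dimensions>;  // current reserved usage per model
type Limits = Map<string, Dimensions>; // provider limits per model

function tryReserve(
  usage: Usage,
  limits: Limits,
  wanted: Map<string, Dimensions>
): boolean {
  // First pass: verify every dimension of every model fits within its limit.
  for (const [model, w] of wanted) {
    const u = usage.get(model) ?? { rpm: 0, inputTpm: 0, outputTpm: 0 };
    const lim = limits.get(model);
    if (!lim) return false;
    if (u.rpm + w.rpm > lim.rpm) return false;
    if (u.inputTpm + w.inputTpm > lim.inputTpm) return false;
    if (u.outputTpm + w.outputTpm > lim.outputTpm) return false;
  }
  // Second pass: all checks passed, so write every reservation.
  for (const [model, w] of wanted) {
    const u = usage.get(model) ?? { rpm: 0, inputTpm: 0, outputTpm: 0 };
    usage.set(model, {
      rpm: u.rpm + w.rpm,
      inputTpm: u.inputTpm + w.inputTpm,
      outputTpm: u.outputTpm + w.outputTpm,
    });
  }
  return true; // false above corresponds to the at-capacity flag
}

const usage: Usage = new Map();
const limits: Limits = new Map([
  ["model-x", { rpm: 10, inputTpm: 100_000, outputTpm: 20_000 }],
  ["model-y", { rpm: 10, inputTpm: 100_000, outputTpm: 20_000 }],
]);
const ok = tryReserve(usage, limits, new Map([
  ["model-x", { rpm: 1, inputTpm: 40_000, outputTpm: 5_000 }],
  ["model-y", { rpm: 1, inputTpm: 10_000, outputTpm: 2_000 }],
]));
// ok === true; both models now carry the reservation
const blocked = tryReserve(usage, limits, new Map([
  ["model-x", { rpm: 1, inputTpm: 10_000, outputTpm: 1_000 }],
  ["model-y", { rpm: 1, inputTpm: 95_000, outputTpm: 1_000 }],
]));
// blocked === false, and model-x's usage is untouched: no dangling reservation
```

The point of the two passes is that a failure on any model is detected before anything is written, so a partial reservation can never leak.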

  4. Handling phase transitions (Strong consistency, Graceful degradation)

When a task moves between phases, its capacity needs change. Consider a task that uses three models: Model A (used in research only), Model B (used in both research and writing), and Model C (used in writing only). When the task transitions from research to writing:

  • Model A is released entirely since the task no longer needs it; its reservation is removed and that capacity becomes available for other tasks.
  • Model B is shared across both phases, but the research phase reserved it for high input tokens while the writing phase needs it for high output tokens. Rather than releasing and re-reserving Model B, which would leave a gap where another task could claim that capacity, the system overwrites Model B's reservation in place with the writing phase's token estimate. This atomic swap frees the unneeded input token headroom while keeping the reservation held.
  • Model C is new to the writing phase, so it goes through the standard reservation logic. If Model C's capacity is unavailable, the task queues until it opens up rather than proceeding without it.
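The three cases above can be sketched as operations on a per-task reservation map; the model names and token figures are illustrative, and the queueing path for Model C is elided:

```typescript
// Hypothetical research -> writing transition for the three-model example.
type Reservation = { inputTokens: number; outputTokens: number };

const reservations = new Map<string, Reservation>([
  ["model-a", { inputTokens: 80_000, outputTokens: 4_000 }], // research only
  ["model-b", { inputTokens: 60_000, outputTokens: 4_000 }], // both phases
]);

// 1. Model A: research-only, release entirely.
reservations.delete("model-a");

// 2. Model B: overwrite in place rather than release-and-re-reserve, so no
//    other task can claim the slot in the gap between the two operations.
reservations.set("model-b", { inputTokens: 10_000, outputTokens: 30_000 });

// 3. Model C: new to the writing phase, goes through standard reservation
//    (and would queue if capacity were unavailable).
reservations.set("model-c", { inputTokens: 8_000, outputTokens: 2_000 });
```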

  5. Queues and fairness (Graceful degradation)

When capacity is unavailable, tasks wait in a queue rather than failing. Two queues run in parallel: one for tasks waiting on capacity for a later phase, and one for new tasks that cannot be admitted. The queue processor prioritises nearly-complete tasks to free up capacity fastest when they finish. This fairness rule prevents new incoming requests from jumping ahead of already-queued tasks.
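A sketch of that admission order, assuming a simple two-array queue and a phases-remaining counter (both illustrative; the real processor works against shared Redis-backed state):

```typescript
// Hypothetical queue processor: mid-flight tasks waiting on a later phase are
// served before new tasks, and the nearly-complete task goes first, since
// finishing it frees capacity fastest.
interface QueuedTask {
  id: string;
  phasesRemaining: number;
}

function nextToAdmit(
  phaseQueue: QueuedTask[], // tasks mid-flight, waiting on a later phase
  newQueue: QueuedTask[]    // new tasks not yet admitted
): QueuedTask | undefined {
  if (phaseQueue.length > 0) {
    // Fewest phases left first: the fairness rule from the text.
    phaseQueue.sort((a, b) => a.phasesRemaining - b.phasesRemaining);
    return phaseQueue.shift();
  }
  return newQueue.shift(); // new tasks only once no queued task is waiting
}

const midFlight: QueuedTask[] = [
  { id: "t1", phasesRemaining: 2 },
  { id: "t2", phasesRemaining: 1 }, // nearly complete
];
const fresh: QueuedTask[] = [{ id: "t3", phasesRemaining: 4 }];
const admitted = nextToAdmit(midFlight, fresh);
// admitted?.id === "t2": the nearly-complete task jumps the queue
```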

All of the above assumes we know how many tokens each task will use before it starts. Right now, each task mode has its own static token estimate in the config: a quick task reserves one amount, a standard task reserves more, a deep task reserves the most. But within a given mode, every task reserves the same fixed number regardless of the actual query. Some runs consume far less than the estimate, wasting capacity that other tasks could use. Others exceed the estimate and risk breaching provider limits. The coordination layer works, yet the reservation sizes are the bottleneck.

Implementing an adaptive reservation layer

Three initial approaches to replacing static estimates

The goal was to replace static estimates with predictions that reflect what tasks actually consume. We built a simulation environment with Poisson task arrivals, real token distributions from production, and the same Lua scripts and Redis structures used in prod, and tested three approaches.

  1. Polynomial regression fits a polynomial to historical token consumption across all concurrent tasks and uses it to forecast demand for the next reservation window.

 Fig 5. Task level token variance across phases

  2. Mid-task reservation adjustment monitors each task's actual consumption against its reservation and shrinks the reservation for tasks running ahead of schedule, freeing capacity for other tasks.
  3. Multiplexing removes per-task reservation blocks entirely and lets tasks share capacity probabilistically, sizing the total pool based on statistical utilisation rather than individual reservations.

Why aggregate forecasting breaks down at phase boundaries

Each approach improved throughput, but each also introduced rate limit violations that tuning alone could not eliminate.

  1. Polynomial regression added 24% more throughput at high concurrency, proving that large gains were possible, but the error between predicted and actual token usage during phase transitions exceeded safety margins by 5–10x. A single polynomial averages across both phases, over-predicting for one and under-predicting for the other. Fitting separate polynomials per phase doesn't help either. Looking at Fig 5, even within a single phase, token consumption fluctuates significantly from call to call depending on how much context each call synthesises. A polynomial fitted to this noisy signal can't distinguish between a temporary dip and a genuine trend, so its predictions oscillate with the same variance it's trying to smooth out. (Figure 6a)
  2. Mid-task adjustment revealed how tightly coupled routing decisions are to reservation sizes. When a task uses less than its reservation, the system shrinks the reservation to free capacity, which the router sees and immediately admits a new task, increasing load on the same provider. Now the provider carries more total load. As seen with polynomial regression, token consumption fluctuates significantly within a phase, so the original task’s call may spike above its reduced reservation. But that headroom is now held by the newly admitted task. The original task either breaches the provider's rate limit, causing failures that propagate to every task on that provider for the remainder of the rate window, or it queues waiting for capacity that won't free up until the new task finishes. As other tasks' reservations shrink in turn, the system recursively admits more tasks until there is no headroom left for any task to absorb a burst. (Figure 6b)
  3. Multiplexing maximised throughput but made concurrent token usage bursts across multiple tasks uncontrollable. Per-task isolation to solve this introduced too much scheduling overhead. Splitting a shared pool by phase didn't solve it either. (Figure 6c)

 Fig 6. Why each approach failed

Together, these experiments narrowed the problem. All three treated it as a high-level forecasting problem, predicting total tokens across all tasks at some future point. What broke each approach was the same thing: the difference in token usage between phases was too drastic. That pointed the solution toward tracking what a typical task actually consumes in each phase, rather than forecasting aggregate demand. This led us to Phased P80 (PP80), which sizes reservations based on the 80th percentile of observed token usage, separated by task mode and phase.

Phased P80 (PP80)

PP80 records actual output token usage from every completed task, separated by mode and phase, and uses the 80th percentile to size future reservations.

Every completed task writes its output tokens to a Redis sorted set keyed by mode and phase, scored by timestamp, capped at a maximum number of entries, with a 24-hour TTL. For new reservations, the system computes the P80 of those samples and scales the static estimate accordingly.

P80 was selected after testing P50 through P95. P50 under-reserves for half of the tasks. P95 barely saves anything over static. P80 gives 15–30% savings while ensuring 80% of tasks fit within their reservation, with the remaining 20% having headroom from completed tasks releasing capacity. Without enough samples, the system falls back to static estimates and transitions to the PP80 system automatically as usage data accumulates.
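As a sketch, sizing one (mode, phase) bucket might look like this; the minimum-sample count and the exact percentile arithmetic are assumptions about details the post doesn't spell out, and the sketch uses the P80 value directly where the real system scales the static estimate:

```typescript
// Hypothetical PP80 sizing for one (mode, phase) bucket.
function percentile(sorted: number[], p: number): number {
  // Nearest-rank percentile on an ascending-sorted array.
  const idx = Math.min(sorted.length - 1, Math.ceil(p * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

function sizeReservation(
  samples: number[],      // output tokens of completed tasks, this mode+phase
  staticEstimate: number, // config fallback for the cold-start case
  minSamples: number = 20
): number {
  if (samples.length < minSamples) return staticEstimate; // fall back to static
  const sorted = [...samples].sort((a, b) => a - b);
  return percentile(sorted, 0.8); // 80% of tasks fit within this reservation
}
```

With too few samples the function returns the static estimate, mirroring the automatic transition to PP80 as usage data accumulates.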


Fig 7. Percentile selection trade-off

Tuning the predictions

As task mix and provider limits change, PP80 predictions shift, so we layer an exponential moving average over the prediction error to correct them.

If tasks consistently exceed predictions, the exponential moving average (EMA) nudges future predictions up. The correction is clamped to prevent outliers from overly skewing the limits. The updates to the capacity blocks run as a Lua script to prevent stale task completions from racing on the read-compute-write operation.
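A sketch of that correction loop, assuming the error is tracked as an actual-to-predicted ratio; the smoothing factor and clamp bounds here are illustrative:

```typescript
// Hypothetical EMA correction: each completed task reports how far actual
// output tokens diverged from the prediction, and the smoothed ratio nudges
// future reservations. In production this update runs atomically in Redis.
function updateCorrection(
  prev: number,            // previous correction factor (1.0 = no correction)
  actualTokens: number,
  predictedTokens: number,
  alpha: number = 0.2,     // EMA smoothing factor (assumed)
  lo: number = 0.8,        // clamp bounds (assumed) so a single outlier
  hi: number = 1.3         // cannot skew the limits too far in one update
): number {
  const ratio = actualTokens / predictedTokens;
  const ema = alpha * ratio + (1 - alpha) * prev;
  return Math.min(hi, Math.max(lo, ema));
}
```

Tasks consistently exceeding predictions push the factor above 1, nudging future reservations up; the clamp caps how far any one burst can move it.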

All output reservations take a flat 5% reduction, which improves throughput without a net increase in rate limit violations. Combined with the adaptive correction, the final reserved output tokens for a given model are the P80 estimate, scaled by the clamped EMA correction, minus the flat 5% reduction.

Apart from tracking output tokens per phase, we also track them per model within each phase. The writing phase uses one model for long-form report generation and another for structured output like citations and metadata. If PP80 only tracked the phase as a whole, it would compute a single P80 across both models combined. When that P80 is used to size the report model's reservation, it includes the tokens the structured model consumes, over-reserving the report model by ~25%. Splitting the tracking so each model has its own P80 means the report model and the structured model each reserve based on their own token usage. This frees up capacity on the report model's provider, which is where output limits bind hardest during phase-transition bursts.

Results under high concurrency

The PP80 system was validated in a staging stress test under high concurrency across multiple model providers.


Fig 8. Stress test results

The queue processor admitted tasks immediately when capacity was available and held the rest until slots freed up, draining the full queue in under 11 minutes. Zero rate limit failures across all providers. With static estimates, provider rate limits supported a maximum of 57 concurrent deep research tasks before queuing kicked in. PP80’s smaller per-task reservations raise that ceiling to approximately 71 concurrent tasks, a 25% increase, without any changes to provider rate limits.


Fig 9. Completion time distribution

What's next

The core insight was that aggregate forecasting was the wrong framing entirely. Observing what individual tasks actually consume, separated by mode and phase, turned a prediction problem into a distribution problem. 

But constraints shift as provider rate limits grow more granular and task workloads evolve, and the system is evolving with them. PP80 cold-starts on new task modes with zero history, the 5% output reduction is still empirical rather than dynamic, and per-minute sorted sets will need rethinking as providers introduce burst allowances and better tools for context caching. For now, PP80 remains our answer to the core challenge: sizing capacity reservations that are tight enough to maximise throughput without breaching rate limits that bind the whole system.

Capacity reservation is one piece of what makes our Deep Research workflows run reliably at scale. Try Valyu Deep Research and see it in action at https://platform.valyu.network .