Model failover
Prioritize the failover of requests across different models and LLM providers. Use outlier detection to identify unhealthy LLM backends and automatically fail over when a provider returns errors or throttles your requests.
About failover
Use failover (automatic fallback) to keep services running by switching to a backup when the main system fails or becomes unavailable.
For agentgateway, you can set up failover across models and LLM providers. When a provider becomes unhealthy (such as returning errors or getting rate-limited), the system automatically switches to a backup provider. This configuration keeps the service running without interruptions.
Failover in agentgateway has two parts:
- Priority groups in the AgentgatewayBackend define the failover order. Each group is a tier. Models within the same group are load balanced. When all models in a group are evicted, requests fail over to the next group.
- A health policy in an AgentgatewayPolicy defines what counts as an unhealthy response (such as 5xx errors or 429 rate limits) and how to evict unhealthy backends. Without a health policy, backends are not evicted and failover does not occur.
This approach increases the resiliency of your network environment by ensuring that apps that call LLMs keep working, even if one model has issues.
Example flow
Failover works through backend eviction, as described in the following diagram.
```mermaid
flowchart LR
  A[Response arrives from provider] --> B{Response unhealthy?}
  B -->|"Yes (e.g. 5xx, 429)"| C[Evict backend from priority group]
  B -->|No| H[Complete request]
  C --> D{All backends in group evicted?}
  D -->|Yes| F[Fail over to next priority group]
  D -->|No| G[Route to remaining backends in group]
  C --> J["Restore backend after eviction duration"]
```
- A response arrives from a provider.
- The `unhealthyCondition` CEL expression is evaluated. If `true`, the response is marked unhealthy.
- If eviction thresholds are met (such as `consecutiveFailures`), the backend is evicted from its priority group for the configured `duration`.
- When all backends in a priority group are evicted, the load balancer automatically routes to the next available group.
- Evicted backends are restored after their eviction duration expires. The eviction duration uses multiplicative backoff on repeated evictions.

Rate-limit handling: When a 429 response includes a `Retry-After` header, agentgateway uses that duration as the eviction time (overriding the configured `duration`). However, 429 responses only trigger eviction if your `unhealthyCondition` includes them (for example, `response.code >= 500 || response.code == 429`).
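The eviction flow above can be sketched as a small state machine. The following Python sketch is illustrative only: the class and function names, the doubling backoff factor, and the in-memory bookkeeping are assumptions for clarity, not agentgateway's actual implementation (which, for example, also honors `Retry-After` headers).

```python
# Illustrative sketch of the eviction flow described above. All names and
# the doubling backoff factor are assumptions, not agentgateway internals.

class Backend:
    def __init__(self, name):
        self.name = name
        self.consecutive_failures = 0
        self.eviction_count = 0
        self.evicted_until = 0.0  # timestamp until which the backend is evicted

    def available(self, now):
        return now >= self.evicted_until

def record_response(backend, unhealthy, base_duration, threshold, now):
    """Apply one response to the backend's eviction state."""
    if not unhealthy:
        backend.consecutive_failures = 0
        return
    backend.consecutive_failures += 1
    if backend.consecutive_failures >= threshold:
        # Multiplicative backoff: each repeated eviction doubles the duration
        # (the factor of 2 is an assumption for illustration).
        backend.evicted_until = now + base_duration * (2 ** backend.eviction_count)
        backend.eviction_count += 1
        backend.consecutive_failures = 0

def pick_group(groups, now):
    """Return the first priority group that still has an available backend."""
    for group in groups:
        if any(b.available(now) for b in group):
            return group
    return None
```

For example, with `base_duration=10` and `threshold=1`, a backend's first eviction lasts 10 seconds and a repeated eviction lasts 20. While every backend in the first group is evicted, `pick_group` falls through to the next priority group, mirroring the failover step in the diagram.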
Failover vs. traffic splitting
Failover uses priority groups to automatically switch between backends when failures occur.
For weight-based traffic distribution (A/B testing, traffic splitting, or canary deployments), see Traffic splitting.
Before you begin
- Set up an agentgateway proxy.
- Set up API access to each LLM provider that you want to use. The examples in this guide use OpenAI and Anthropic.
Fail over to other models
You can configure failover across multiple models and providers by using priority groups. Each priority group represents a set of providers that share the same priority level. Failover priority is determined by the order in which the priority groups are listed in the AgentgatewayBackend. The priority group that is listed first is assigned the highest priority.
Models within the same priority group are load balanced by using the Power of Two Choices (P2C) algorithm, which routes requests based on health, latency, and current load rather than simple round-robin. This pattern of P2C load balancing within a tier, combined with failover across tiers, typically outperforms fixed strategies such as round-robin.
For weight-based traffic distribution within a priority group (such as 80/20 splits for A/B testing or canary rollouts), see Traffic splitting.
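To illustrate the idea behind P2C, the following Python sketch shows a minimal, generic version of the algorithm: sample two backends at random, then route to the one with the lower current load. The function name and the in-flight request counts are hypothetical; agentgateway's implementation additionally weighs health and latency.

```python
import random

def p2c_pick(backends, load):
    """Power of Two Choices: sample two distinct backends at random,
    then route to whichever currently has the lower load."""
    a, b = random.sample(backends, 2)
    return a if load[a] <= load[b] else b

# Hypothetical in-flight request counts for the backends in one priority group.
load = {"openai-gpt-3.5-turbo": 7, "claude-haiku": 2}
choice = p2c_pick(list(load), load)  # with exactly two backends, the less loaded one wins
```

A useful property of P2C is that the most loaded backend can never win the comparison, so hotspots drain without the cost of scanning every backend on every request.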
Create or update the AgentgatewayBackend for your LLM providers.
In this example, you configure separate priority groups for failover across multiple models from the same LLM provider, OpenAI. Each model is in its own priority group. The order of the groups determines the failover priority. If the first model is evicted, requests fail over to the second group, and so on.
- OpenAI `gpt-4.1` model (highest priority)
- OpenAI `gpt-5.1` model (fallback)
- OpenAI `gpt-3.5-turbo` model (lowest priority)
```sh
kubectl apply -f- <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayBackend
metadata:
  name: model-failover
  namespace: agentgateway-system
spec:
  ai:
    groups:
    - providers:
      - name: openai-gpt-41
        openai:
          model: gpt-4.1
        policies:
          auth:
            secretRef:
              name: openai-secret
    - providers:
      - name: openai-gpt-51
        openai:
          model: gpt-5.1
        policies:
          auth:
            secretRef:
              name: openai-secret
    - providers:
      - name: openai-gpt-3-5-turbo
        openai:
          model: gpt-3.5-turbo
        policies:
          auth:
            secretRef:
              name: openai-secret
EOF
```

In this example, you configure failover across multiple providers with cost-based priority. The first priority group contains cheaper models. Responses are load balanced across these models. In the event that both models are unavailable, requests fall back to the second priority group of more premium models.
- Highest priority: Load balance across the cheaper OpenAI `gpt-3.5-turbo` and Anthropic `claude-haiku-4-5-20251001` models.
- Fallback: Load balance across the more premium OpenAI `gpt-4.1` and Anthropic `claude-opus-4-6` models.
Make sure that you configured both Anthropic and OpenAI providers.
```sh
kubectl apply -f- <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayBackend
metadata:
  name: model-failover
  namespace: agentgateway-system
spec:
  ai:
    groups:
    - providers:
      - name: openai-gpt-3.5-turbo
        openai:
          model: gpt-3.5-turbo
        policies:
          auth:
            secretRef:
              name: openai-secret
      - name: claude-haiku
        anthropic:
          model: claude-haiku-4-5-20251001
        policies:
          auth:
            secretRef:
              name: anthropic-secret
    - providers:
      - name: openai-gpt-4.1
        openai:
          model: gpt-4.1
        policies:
          auth:
            secretRef:
              name: openai-secret
      - name: claude-opus
        anthropic:
          model: claude-opus-4-6
        policies:
          auth:
            secretRef:
              name: anthropic-secret
EOF
```
Create an HTTPRoute resource that routes incoming traffic on the `/model` path to the AgentgatewayBackend that you created in the previous step. In this example, the URLRewrite filter rewrites the path from `/model` to the path of the API in the LLM provider that you want to use, such as `/v1/chat/completions` for OpenAI.

```sh
kubectl apply -f- <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: model-failover
  namespace: agentgateway-system
spec:
  parentRefs:
  - name: agentgateway-proxy
    namespace: agentgateway-system
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /model
    filters:
    - type: URLRewrite
      urlRewrite:
        path:
          type: ReplaceFullPath
          replaceFullPath: /v1/chat/completions
    backendRefs:
    - name: model-failover
      namespace: agentgateway-system
      group: agentgateway.dev
      kind: AgentgatewayBackend
EOF
```

Create an AgentgatewayPolicy with a health policy that targets the AgentgatewayBackend. The health policy defines which responses are considered unhealthy and how to evict backends. Without this policy, backends are not evicted and failover does not occur.
The `unhealthyCondition` field is an optional CEL expression that classifies each response. When you set it, `true` means the response counts as unhealthy toward eviction. The `eviction` settings control how many failures are required and how long an unhealthy backend stays out of its priority group. Review the following table to understand this configuration.

This configuration evicts backends on both server errors (5xx) and rate-limit responses (429). This way, when you get throttled by one LLM provider, agentgateway automatically fails over to another.
```sh
kubectl apply -f- <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: model-failover-health
  namespace: agentgateway-system
spec:
  targetRefs:
  - group: agentgateway.dev
    kind: AgentgatewayBackend
    name: model-failover
  backend:
    health:
      unhealthyCondition: "response.code >= 500 || response.code == 429"
      eviction:
        duration: 10s
        consecutiveFailures: 1
EOF
```

This configuration evicts backends only on server errors (5xx) or connection failures. Rate-limited (429) responses lower the backend's health score but do not trigger eviction.
```sh
kubectl apply -f- <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: model-failover-health
  namespace: agentgateway-system
spec:
  targetRefs:
  - group: agentgateway.dev
    kind: AgentgatewayBackend
    name: model-failover
  backend:
    health:
      unhealthyCondition: "response.code >= 500"
      eviction:
        duration: 10s
        consecutiveFailures: 3
EOF
```

| Setting | Description |
| --- | --- |
| `unhealthyCondition` | Optional CEL expression that classifies each response as healthy or unhealthy. When you set this field, `true` means the response counts as unhealthy toward eviction (together with `eviction`). When you omit this field, 5xx responses and connection failures still lower the backend health score for load balancing, but that built-in behavior does not trigger eviction by itself. |
| `eviction.duration` | Base time to remove an unhealthy backend from its priority group. Increases with multiplicative backoff on repeated evictions. When a 429 response includes `Retry-After`, that value is used instead. You might try `10s`-`60s` depending on how quickly you want failover versus avoiding flapping on brief errors. Shorter durations fail over faster. If you omit this field, the default is `3s`. |
| `eviction.consecutiveFailures` | Number of consecutive unhealthy responses required before evicting. You might start with `3` so that a single transient error does not evict the backend. For tests, use `1` for immediate eviction. |

Verify that failover works by temporarily configuring the health policy to treat all responses as unhealthy. This policy forces each backend to be evicted after its first response, so you can watch requests progress through the priority groups.
Update the AgentgatewayPolicy to set `unhealthyCondition` to `"true"`:

```sh
kubectl apply -f- <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: model-failover-health
  namespace: agentgateway-system
spec:
  targetRefs:
  - group: agentgateway.dev
    kind: AgentgatewayBackend
    name: model-failover
  backend:
    health:
      unhealthyCondition: "true"
      eviction:
        duration: 30s
        consecutiveFailures: 1
EOF
```

Send multiple requests in sequence. Check the `model` field in each response to confirm that requests progress through the priority groups as each backend is evicted.

If your Gateway is exposed through an external address:

```sh
for i in 1 2 3; do
  echo "=== Request $i ==="
  curl -s "$INGRESS_GW_ADDRESS/model" -H content-type:application/json -d '{
    "messages": [{"role": "user", "content": "Say hello in one word."}]
  }' | jq '{model, status: .choices[0].finish_reason}'
  echo
done
```

If you port-forward the Gateway locally:

```sh
for i in 1 2 3; do
  echo "=== Request $i ==="
  curl -s "localhost:8080/model" -H content-type:application/json -d '{
    "messages": [{"role": "user", "content": "Say hello in one word."}]
  }' | jq '{model, status: .choices[0].finish_reason}'
  echo
done
```

With the OpenAI model priority configuration, each request evicts the current group's backend and the next request routes to the next group. You can see the `model` field change with each request:

```
=== Request 1 ===
{
  "model": "gpt-4.1-2025-04-14",
  "status": "stop"
}
=== Request 2 ===
{
  "model": "gpt-5.1-2025-04-14",
  "status": "stop"
}
=== Request 3 ===
{
  "model": "gpt-3.5-turbo-0125",
  "status": "stop"
}
```

With the cost-based configuration, the first two requests are load balanced across the two providers in the first priority group. After both are evicted, the third request fails over to the second priority group:
```
=== Request 1 ===
{
  "model": "gpt-3.5-turbo-0125",
  "status": "stop"
}
=== Request 2 ===
{
  "model": "claude-haiku-4-5-20251001",
  "status": "stop"
}
=== Request 3 ===
{
  "model": "gpt-4.1-2025-04-14",
  "status": "stop"
}
```
Now that you tested failover, restore the health policy to your production configuration. Reapply the policy from step 3 with your `unhealthyCondition` settings (such as `response.code >= 500 || response.code == 429`, where `>= 500` matches HTTP 5xx server errors).
Cleanup
You can remove the resources that you created in this guide.

```sh
kubectl delete AgentgatewayBackend model-failover -n agentgateway-system
kubectl delete AgentgatewayPolicy model-failover-health -n agentgateway-system
kubectl delete httproute model-failover -n agentgateway-system
```

Next
Explore other agentgateway features.
- Learn more about load balancing strategies and the P2C algorithm.
- Pass in functions for an LLM to request, as a step towards agentic AI.
- Set up prompt guards to block unwanted requests and mask sensitive data.
- Enrich your prompts with system prompts to improve LLM outputs.