Load Balancing & Fallback Models

This guide explains how to configure and use load balancing and fallback models in the LiteLLM proxy server. These features ensure high availability, optimal performance, and automatic failover when models are unavailable or rate-limited.

Load Balancing

Load balancing distributes requests across multiple deployments of the same model to:

Maximize throughput by utilizing multiple API keys/deployments
Avoid rate limits by spreading load across different endpoints
Improve reliability by having redundant deployments
Optimize costs by routing to the most available deployment

Supported Strategies

1. Usage-Based Routing v2 (Recommended)

Routes requests based on real-time TPM (Tokens Per Minute) and RPM (Requests Per Minute) availability.

router_settings:
  routing_strategy: "usage-based-routing-v2"
  enable_pre_call_checks: true

How it works:

Tracks token and request usage in real-time
Routes to deployment with most available capacity
Prevents hitting rate limits before they occur
Automatically updates based on actual usage

2. Simple Shuffle

Randomly distributes requests across available deployments.

router_settings:
  routing_strategy: "simple-shuffle"

Use case

Equal load distribution without capacity tracking

3. Latency-Based Routing

Routes to the fastest responding deployment.

router_settings:
  routing_strategy: "latency-based-routing"

Use case

Minimize response time for latency-sensitive applications

4. Least Busy

Routes to the deployment handling the fewest requests.

router_settings:
  routing_strategy: "least-busy"

Use case

Prevent overloading individual deployments

Load Balancing Configuration

model_list:
  # Multiple deployments of the same model for load balancing
  - model_name: gpt-4o  # Same model name
    litellm_params:
      model: azure/gpt-4o-2  # Different deployment
      api_base: https://openai-248.openai.azure.com/
      api_key: YOUR_KEY_1
      client: azure
    model_info:
      tpm: 1200000  # Tokens per minute limit
      rpm: 500      # Requests per minute limit

  - model_name: gpt-4o  # Same model name
    litellm_params:
      model: azure/gpt-4o-3  # Different deployment
      api_base: https://openai-248.openai.azure.com/
      api_key: YOUR_KEY_2
      client: azure
    model_info:
      tpm: 916000
      rpm: 500

  - model_name: gpt-4o  # Same model name
    litellm_params:
      model: azure/gpt-4o-4  # Different deployment
      api_base: https://openai-248.openai.azure.com/
      api_key: YOUR_KEY_3
      client: azure
    model_info:
      tpm: 977000
      rpm: 500

router_settings:
  routing_strategy: "usage-based-routing-v2"
  num_retries: 3
  enable_pre_call_checks: true  # Check TPM/RPM before routing
  cooldown_time: 60  # Cooldown period after rate limit (seconds)

Key Points:

Use same model_name for all deployments you want to load balance
Use different deployment names in litellm_params.model
Specify accurate TPM/RPM limits for optimal routing
Enable pre_call_checks to prevent rate limit errors

How Requests Are Routed

Client Request: "gpt-4o"
         ↓
    Router checks:
    - Which gpt-4o deployments are available?
    - Which has most available TPM/RPM?
    - Any deployments in cooldown?
         ↓
    Routes to: azure/gpt-4o-3 (most available capacity)
         ↓
    If rate limited → Cooldown 60s → Try azure/gpt-4o-2

Fallback Models

Fallback models are alternative models used when the primary model:

Is rate-limited
Times out
Returns an error
Is temporarily unavailable

1. Model-Specific Fallbacks

Define fallbacks for specific models:

litellm_settings:
  fallbacks:
    - {gpt-5-chat: [gpt-4o]}          # gpt-5-chat → gpt-4o
    - {gpt-5-mini: [gpt-4o]}          # gpt-5-mini → gpt-4o
    - {gpt-5-nano: [gpt-5-mini, gpt-4o]}  # gpt-5-nano → gpt-5-mini → gpt-4o
    - {gpt-35-turbo: [gpt-4o]}        # gpt-35-turbo → gpt-4o

Fallback Chain:

Request: gpt-5-nano
    ↓
Try: gpt-5-nano (failed)
    ↓
Try: gpt-5-mini (failed)
    ↓
Try: gpt-4o (success) ✓

2. Context Window Fallbacks

Automatically fallback when context is too large:

litellm_settings:
  context_window_fallbacks:
    - {gpt-4o: [gpt-5-chat]}  # If gpt-4o context exceeded → use gpt-5-chat

Use case

Request with 200K tokens → gpt-4o (128K limit) → gpt-5-chat (256K limit)

3. Default Fallbacks

Fallback for models without specific configuration:

litellm_settings:
  default_fallbacks: [gpt-4o]  # Any undefined model → gpt-4o

Important

The default fallback model should have HIGH RATE LIMITS (high TPM/RPM) since it will handle overflow traffic from all other models. Choose a model with the highest quota or multiple load-balanced deployments to prevent it from becoming a bottleneck.

Recommended Default Fallback Configuration:

# GOOD: Default fallback with high limits and load balancing
model_list:
  - model_name: gpt-4o  # Default fallback
    litellm_params:
      model: azure/gpt-4o-2
      api_key: KEY_1
    model_info:
      tpm: 1200000  # High limit
      rpm: 500

  - model_name: gpt-4o  # Load balanced
    litellm_params:
      model: azure/gpt-4o-3
      api_key: KEY_2
    model_info:
      tpm: 916000
      rpm: 500

  - model_name: gpt-4o  # Load balanced
    litellm_params:
      model: azure/gpt-4o-4
      api_key: KEY_3
    model_info:
      tpm: 977000
      rpm: 500

litellm_settings:
  default_fallbacks: [gpt-4o]  # Uses load-balanced gpt-4o with 3M+ total TPM

# BAD: Low-limit model as default fallback
litellm_settings:
  default_fallbacks: [gpt-5-nano]  # Only 100K TPM - will bottleneck!

Retry Policy Configuration

Control which errors trigger fallbacks:

router_settings:
  retry_policy:
    ContentPolicyViolationErrorRetries: 0  # Never retry content violations
    BadRequestErrorRetries: 0              # Never retry bad requests
    TimeoutErrorRetries: 3                 # Retry timeouts 3 times
    InternalServerErrorRetries: 3          # Retry server errors 3 times
    RateLimitErrorRetries: 3               # Retry rate limits 3 times

Allowed Fails Configuration

Set how many failures before moving to fallback:

router_settings:
  allowed_fails_policy:
    ContentPolicyViolationErrorAllowedFails: 0  # Fail immediately
  allowed_fails: 0  # Default for other errors
  cooldown_time: 60  # Cooldown after rate limit

Configuration Examples

Example 1: High-Availability Setup

Goal: Maximum uptime with automatic failover

model_list:
  # Primary: 3x gpt-4o deployments (load balanced)
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-2
      api_base: https://openai-248.openai.azure.com/
      api_key: KEY_1
      client: azure
    model_info:
      tpm: 1200000
      rpm: 500

  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-3
      api_base: https://openai-248.openai.azure.com/
      api_key: KEY_2
      client: azure
    model_info:
      tpm: 916000
      rpm: 500

  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-4
      api_base: https://openai-248.openai.azure.com/
      api_key: KEY_3
      client: azure
    model_info:
      tpm: 977000
      rpm: 500

  # Fallback: gpt-5-mini (cheaper, faster)
  - model_name: gpt-5-mini
    litellm_params:
      model: azure/gpt-5-mini
      api_base: https://sohan-mbtd9z9j-eastus2.openai.azure.com/
      api_key: KEY_4
      client: azure
    model_info:
      tpm: 120000
      rpm: 500

litellm_settings:
  fallbacks:
    - {gpt-4o: [gpt-5-mini]}  # If all gpt-4o deployments fail → gpt-5-mini
  default_fallbacks: [gpt-4o]  # High TPM with load balancing

router_settings:
  routing_strategy: "usage-based-routing-v2"
  num_retries: 3
  enable_pre_call_checks: true
  cooldown_time: 60

Request Flow:

1. Request gpt-4o
2. Router tries: gpt-4o-2 (most available TPM)
3. If rate limited → tries gpt-4o-3
4. If rate limited → tries gpt-4o-4
5. If all fail → fallback to gpt-5-mini

Example 2: Cost-Optimized Setup

Goal: Use cheaper models first, expensive models as backup

model_list:
  # Primary: gpt-5-nano (cheapest)
  - model_name: gpt-5-nano
    litellm_params:
      model: azure/gpt-5-nano
      api_base: https://sohan-mbtd9z9j-eastus2.openai.azure.com/
      api_key: KEY_1
      client: azure
    model_info:
      tpm: 100000
      rpm: 500

  # Fallback 1: gpt-5-mini (moderate cost)
  - model_name: gpt-5-mini
    litellm_params:
      model: azure/gpt-5-mini
      api_base: https://sohan-mbtd9z9j-eastus2.openai.azure.com/
      api_key: KEY_2
      client: azure
    model_info:
      tpm: 120000
      rpm: 500

  # Fallback 2: gpt-4o (expensive, reliable, HIGH LIMIT)
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-2
      api_base: https://openai-248.openai.azure.com/
      api_key: KEY_3
      client: azure
    model_info:
      tpm: 1200000  # ✅ High limit for default fallback
      rpm: 500

litellm_settings:
  fallbacks:
    - {gpt-5-nano: [gpt-5-mini, gpt-4o]}  # nano → mini → gpt-4o
  default_fallbacks: [gpt-4o]  # ✅ High-limit model as default

Example 3: Performance-Optimized Setup

Goal: Minimize latency, maximize speed

router_settings:
  routing_strategy: "latency-based-routing"  # Route to fastest deployment
  num_retries: 1  # Fast fail
  enable_pre_call_checks: false  # Skip checks for speed
  cooldown_time: 30  # Shorter cooldown

litellm_settings:
  fallbacks:
    - {gpt-5-chat: [gpt-5-mini]}  # Smaller model if needed
  default_fallbacks: [gpt-4o]  # ✅ Reliable high-limit fallback
  drop_params: true  # Drop unsupported params instead of failing

Best Practices

1. Load Balancing Best Practices

DO:

Use 2-4 deployments per model for optimal balance
Set accurate TPM/RPM limits based on Azure quotas
Enable enable_pre_call_checks to prevent rate limits
Monitor usage patterns and adjust capacities
Use usage-based-routing-v2 for production

DON'T:

Use more than 5 deployments (diminishing returns)
Set TPM/RPM limits higher than actual quotas
Mix different model versions under same model name
Disable retries in production

2. Fallback Best Practices

DO:

Choose fallback models with similar capabilities
Use cheaper models as fallbacks when appropriate
Set reasonable retry counts (2-3)
Define context_window_fallbacks for large requests
Test fallback chains before production
CRITICAL: Ensure default fallback has HIGH rate limits
Use load-balanced models as default fallbacks

DON'T:

Create circular fallbacks (A → B → A)
Use drastically different models as fallbacks
Set ContentPolicyViolationErrorRetries > 0
Have more than 3 levels in fallback chain
Use low-limit models as default fallback
Use single-deployment models with low TPM as default

3. Default Fallback Selection Guide

Scenario	Recommended Default Fallback	Reason
`High Traffic`	Load-balanced gpt-4o (3+ deployments)	Handles overflow from all models
`Medium Traffic`	Single gpt-4o with high TPM	Sufficient capacity for occasional fallback
`Low Traffic`	gpt-5-mini (if high TPM available)	Cost-effective for low usage
`Cost-Sensitive`	gpt-4o with moderate TPM	Balance between cost and reliability

Example Capacity Planning

# Total expected traffic: 500K TPM
# Primary models: 300K TPM capacity
# Default fallback should handle: 200K+ TPM overflow

# GOOD: 3x gpt-4o = 3M TPM total capacity
default_fallbacks: [gpt-4o]  # With 3 load-balanced deployments

# BAD: 1x gpt-5-nano = 100K TPM (insufficient)
default_fallbacks: [gpt-5-nano]  # Will bottleneck at 100K TPM

4. Cooldown Configuration

router_settings:
  cooldown_time: 60  # Recommended: 60-120 seconds

Guidelines:

30s: High traffic, quick recovery needed
60s: Standard production use (recommended)
120s: Conservative, prevent repeated failures
300s+: Very conservative, long-running errors

5. Monitoring Configuration

general_settings:
  store_model_in_db: true  # Enable for tracking
  disable_spend_logs: false  # Track costs

litellm_settings:
  logging: true  # Enable detailed logs

Monitoring & Troubleshooting

Check Router Status

# Check health
curl http://localhost:4000/health

# Response includes:
{
  "status": "healthy",
  "models_available": ["gpt-4o", "gpt-5-chat", ...],
  "load_balancing": "enabled"
}

View Model Statistics

# Get model usage stats (if implemented)
curl http://localhost:4000/model/info

Common Issues & Solutions

Issue 1: All Deployments Rate Limited

Symptoms:

Error: All models exhausted. Rate limit exceeded on all deployments.

Solutions: 1. Add more deployments:

- model_name: gpt-4o
  litellm_params:
    model: azure/gpt-4o-5  # Add 5th deployment

Increase cooldown time:

router_settings:
  cooldown_time: 120  # Give more recovery time

Add fallback models:

litellm_settings:
  fallbacks:
    - {gpt-4o: [gpt-5-mini, gpt-5-nano]}

Issue 2: Fallback Not Working

Symptoms:

Error: Model failed, no fallback attempted

Check: 1. Fallback model is configured:

litellm_settings:
  fallbacks:
    - {gpt-5-chat: [gpt-4o]}  # Must be defined

Retry policy allows retries:

router_settings:
  retry_policy:
    RateLimitErrorRetries: 3  # Must be > 0

Fallback model exists in model_list:

model_list:
  - model_name: gpt-4o  # Fallback model must exist

Issue 3: Uneven Load Distribution

Symptoms:

One deployment handles all traffic
Other deployments idle

Solutions:

Use correct routing strategy:

router_settings:
  routing_strategy: "usage-based-routing-v2"  # Not "simple-shuffle"

Set accurate TPM/RPM:

model_info:
  tpm: 1200000  # Match Azure quota exactly
  rpm: 500      # Match Azure quota exactly

Enable pre-call checks:

router_settings:
  enable_pre_call_checks: true

Issue 4: Circular Fallback Loop

Symptoms:

Error: Maximum fallback depth exceeded

Bad Configuration:

fallbacks:
  - {gpt-5-mini: [gpt-5-nano]}
  - {gpt-5-nano: [gpt-5-mini]}  # Circular!

Good Configuration:

fallbacks:
  - {gpt-5-mini: [gpt-4o]}     # Linear chain
  - {gpt-5-nano: [gpt-5-mini, gpt-4o]}  # Multi-level

Issue 5: Default Fallback Bottleneck

Symptoms:

Error: Rate limit on default fallback model
Multiple models failing simultaneously
High latency during peak traffic

Root Cause:

Default fallback has insufficient capacity for overflow traffic.

Solutions:

1. Use load-balanced default fallback:

# Add multiple deployments of default fallback
- model_name: gpt-4o
  litellm_params: {model: azure/gpt-4o-2, ...}
  model_info: {tpm: 1200000, rpm: 500}
- model_name: gpt-4o
  litellm_params: {model: azure/gpt-4o-3, ...}
  model_info: {tpm: 916000, rpm: 500}
- model_name: gpt-4o
  litellm_params: {model: azure/gpt-4o-4, ...}
  model_info: {tpm: 977000, rpm: 500}

litellm_settings:

  default_fallbacks: [gpt-4o]  # Now has 3M+ total TPM

2. Increase quota on default fallback model:

Contact Azure support to increase TPM/RPM limits
Switch to a higher-tier deployment

3. Add secondary default fallback:

litellm_settings:
  default_fallbacks: [gpt-4o, gpt-5-mini]  # Chain of fallbacks

Performance Metrics

Expected Improvements

Configuration	Availability	Throughput	Cost	Latency
`Single Model`	99.0%	1x	Low	Baseline
`3x Load Balanced`	99.9%	3x	Medium	+5-10ms
`Load Balanced + Fallback`	99.99%	3x	Medium	+10-15ms
`Multi-tier Fallbacks`	99.999%	3-4x	High	+15-25ms

Real-World Example

Setup: 3x gpt-4o deployments + gpt-5-mini fallback

Results:

Throughput: 300% increase (3x deployments)
⏱Latency: +12ms average (routing overhead)
Cost: 15% savings (cheaper fallback handles 10% of traffic)
Availability: 99.95% uptime (vs 99.0% single deployment)

Testing Your Configuration

Test Load Balancing

import asyncio
from litellm import acompletion

async def test_load_balancing():
    """Send 10 requests and see which deployments are used"""
    for i in range(10):
        response = await acompletion(
            model="gpt-4o",
            messages=[{"role": "user", "content": f"Test {i}"}],
            api_base="http://localhost:4000"
        )
        print(f"Request {i}: Used model: {response.model}")

asyncio.run(test_load_balancing())

Expected Output:

Request 0: Used model: azure/gpt-4o-2
Request 1: Used model: azure/gpt-4o-3
Request 2: Used model: azure/gpt-4o-4
Request 3: Used model: azure/gpt-4o-2  # Balanced distribution
...

Test Fallback

async def test_fallback():
    """Request a model that will fail and fallback"""
    try:
        response = await acompletion(
            model="gpt-5-chat",
            messages=[{"role": "user", "content": "Test"}],
            api_base="http://localhost:4000",
            max_tokens=999999  # Force context limit error
        )
        print(f"Used fallback model: {response.model}")
    except Exception as e:
        print(f"Fallback failed: {e}")

asyncio.run(test_fallback())

Test Default Fallback Capacity

async def stress_test_default_fallback():
    """Stress test default fallback with concurrent requests"""
    import time

    async def make_request(i):
        start = time.time()
        try:
            response = await acompletion(
                model="undefined-model",  # Will use default fallback
                messages=[{"role": "user", "content": f"Test {i}"}],
                api_base="http://localhost:4000"
            )
            elapsed = time.time() - start
            print(f"Request {i}: Success in {elapsed:.2f}s - Model: {response.model}")
        except Exception as e:
            elapsed = time.time() - start
            print(f"Request {i}: Failed in {elapsed:.2f}s - Error: {str(e)[:50]}")

    # Send 50 concurrent requests to test fallback capacity
    tasks = [make_request(i) for i in range(50)]
    await asyncio.gather(*tasks)

asyncio.run(stress_test_default_fallback())

Summary

Quick Reference

Feature	Configuration Location	Key Parameter
Load Balancing	`model_list`	Same `model_name`, different deployments
Routing Strategy	`router_settings`	`routing_strategy: "usage-based-routing-v2"`
Model Fallbacks	`litellm_settings.fallbacks`	`{primary: [fallback1, fallback2]}`
Context Fallbacks	`litellm_settings.context_window_fallbacks`	`{small_context: [large_context]}`
Default Fallback	`litellm_settings.default_fallbacks`	`[gpt-4o]` Must have high TPM/RPM
Retry Policy	`router_settings.retry_policy`	`RateLimitErrorRetries: 3`
Cooldown	`router_settings`	`cooldown_time: 60`

Recommended Production Config

model_list:
  # Default fallback: Load-balanced gpt-4o with HIGH capacity
  - model_name: gpt-4o
    litellm_params: {model: azure/gpt-4o-2, ...}
    model_info: {tpm: 1200000, rpm: 500}  # High limit
  - model_name: gpt-4o
    litellm_params: {model: azure/gpt-4o-3, ...}
    model_info: {tpm: 916000, rpm: 500}
  - model_name: gpt-4o
    litellm_params: {model: azure/gpt-4o-4, ...}
    model_info: {tpm: 977000, rpm: 500}

litellm_settings:
  fallbacks:
    - {gpt-4o: [gpt-5-mini]}
  default_fallbacks: [gpt-4o]  # High-capacity load-balanced model

router_settings:
  routing_strategy: "usage-based-routing-v2"
  num_retries: 3
  enable_pre_call_checks: true
  cooldown_time: 60
  retry_policy:
    RateLimitErrorRetries: 3
    TimeoutErrorRetries: 3

Critical Reminders

DEFAULT FALLBACK MUST HAVE HIGH RATE LIMITS

It handles overflow from ALL models
Should be load-balanced with multiple deployments
Typical requirement: 2-5x the TPM of any single primary model
Monitor closely during production to ensure adequate capacity

Additional Resources

LiteLLM Docs: https://docs.litellm.ai/docs/routing
Azure OpenAI Rate Limits: https://learn.microsoft.com/en-us/azure/ai-services/openai/quotas-limits
Load Balancing Strategies: https://docs.litellm.ai/docs/routing#advanced-routing-strategies