Load Balancing & Fallback Models
This guide explains how to configure and use load balancing and fallback models in the LiteLLM proxy server. These features ensure high availability, optimal performance, and automatic failover when models are unavailable or rate-limited.
Load Balancing
Load balancing distributes requests across multiple deployments of the same model to:
- Maximize throughput by utilizing multiple API keys/deployments
- Avoid rate limits by spreading load across different endpoints
- Improve reliability by having redundant deployments
- Optimize costs by routing to the most available deployment
Supported Strategies
1. Usage-Based Routing v2 (Recommended)
Routes requests based on real-time TPM (Tokens Per Minute) and RPM (Requests Per Minute) availability.
router_settings:
  routing_strategy: "usage-based-routing-v2"
  enable_pre_call_checks: true
How it works:
- Tracks token and request usage in real-time
- Routes to deployment with most available capacity
- Prevents hitting rate limits before they occur
- Automatically updates based on actual usage
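The selection step described above can be sketched in plain Python. This is an illustration only, not LiteLLM's implementation: the `Deployment` class, the `pick_by_usage` function, and the usage figures are hypothetical, while the TPM/RPM limits mirror the sample deployments used in this guide.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    name: str
    tpm_limit: int      # configured tokens-per-minute quota
    rpm_limit: int      # configured requests-per-minute quota
    tpm_used: int = 0   # tokens consumed in the current minute (hypothetical)
    rpm_used: int = 0   # requests sent in the current minute (hypothetical)


def pick_by_usage(deployments, estimated_tokens):
    """Return the deployment with the most TPM headroom that still fits the request."""
    candidates = [
        d for d in deployments
        if d.tpm_used + estimated_tokens <= d.tpm_limit
        and d.rpm_used + 1 <= d.rpm_limit
    ]
    if not candidates:
        return None  # every deployment would exceed a limit
    return max(candidates, key=lambda d: d.tpm_limit - d.tpm_used)


pool = [
    Deployment("azure/gpt-4o-2", 1_200_000, 500, tpm_used=900_000, rpm_used=100),
    Deployment("azure/gpt-4o-3", 916_000, 500, tpm_used=100_000, rpm_used=50),
    Deployment("azure/gpt-4o-4", 977_000, 500, tpm_used=400_000, rpm_used=500),
]
chosen = pick_by_usage(pool, estimated_tokens=2_000)
print(chosen.name)  # azure/gpt-4o-3: most TPM headroom, RPM not exhausted
```

Note that `azure/gpt-4o-4` is skipped even though it has TPM headroom, because its RPM quota is already used up. Filtering before sending the request is the idea behind pre-call checks.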
2. Simple Shuffle
Randomly distributes requests across available deployments.
router_settings:
  routing_strategy: "simple-shuffle"
Use case: Equal load distribution without capacity tracking
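As a rough illustration of random distribution, the sketch below picks a deployment uniformly at random for each request. The function name `pick_simple_shuffle` is hypothetical and the code is not taken from LiteLLM; it only shows why the load evens out over many requests.

```python
import random
from collections import Counter


def pick_simple_shuffle(deployments, rng=random):
    """Pick one deployment uniformly at random, ignoring capacity and latency."""
    return rng.choice(deployments)


names = ["azure/gpt-4o-2", "azure/gpt-4o-3", "azure/gpt-4o-4"]
rng = random.Random(0)  # seeded so the demo is repeatable
counts = Counter(pick_simple_shuffle(names, rng) for _ in range(3000))
print(counts)  # each deployment receives roughly a third of the requests
```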
3. Latency-Based Routing
Routes to the fastest responding deployment.
router_settings:
  routing_strategy: "latency-based-routing"
Use case: Minimize response time for latency-sensitive applications
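One simple way to picture latency-based selection is to keep recent response times per deployment and pick the lowest average. The sketch below assumes that approach. The function name and the sample timings are made up for illustration, and the real router may weigh samples differently.

```python
from statistics import mean


def pick_lowest_latency(latency_samples):
    """latency_samples maps deployment name -> recent response times in seconds."""
    measured = {
        name: mean(samples)
        for name, samples in latency_samples.items()
        if samples  # skip deployments with no measurements yet
    }
    if not measured:
        return None
    return min(measured, key=measured.get)


recent = {
    "azure/gpt-4o-2": [0.82, 0.91, 0.88],
    "azure/gpt-4o-3": [0.45, 0.52, 0.49],
    "azure/gpt-4o-4": [1.10, 0.95],
}
fastest = pick_lowest_latency(recent)
print(fastest)  # azure/gpt-4o-3
```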
4. Least Busy
Routes to the deployment handling the fewest requests.
router_settings:
  routing_strategy: "least-busy"
Use case: Prevent overloading individual deployments
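The least-busy idea reduces to tracking in-flight requests per deployment and choosing the minimum. A minimal sketch, with a hypothetical `pick_least_busy` helper and invented counts:

```python
def pick_least_busy(in_flight):
    """in_flight maps deployment name -> number of requests currently being served."""
    return min(in_flight, key=in_flight.get)


in_flight = {"azure/gpt-4o-2": 7, "azure/gpt-4o-3": 2, "azure/gpt-4o-4": 5}
target = pick_least_busy(in_flight)
in_flight[target] += 1  # the new request is now in flight on the chosen deployment
print(target, in_flight[target])  # azure/gpt-4o-3 3
```

The counter would be decremented again when the request completes, so a slow deployment naturally accumulates in-flight requests and receives less new traffic.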
Load Balancing Configuration
model_list:
  # Multiple deployments of the same model for load balancing
  - model_name: gpt-4o # Same model name
    litellm_params:
      model: azure/gpt-4o-2 # Different deployment
      api_base: https://openai-248.openai.azure.com/
      api_key: YOUR_KEY_1
      client: azure
    model_info:
      tpm: 1200000 # Tokens per minute limit
      rpm: 500 # Requests per minute limit
  - model_name: gpt-4o # Same model name
    litellm_params:
      model: azure/gpt-4o-3 # Different deployment
      api_base: https://openai-248.openai.azure.com/
      api_key: YOUR_KEY_2
      client: azure
    model_info:
      tpm: 916000
      rpm: 500
  - model_name: gpt-4o # Same model name
    litellm_params:
      model: azure/gpt-4o-4 # Different deployment
      api_base: https://openai-248.openai.azure.com/
      api_key: YOUR_KEY_3
      client: azure
    model_info:
      tpm: 977000
      rpm: 500
router_settings:
  routing_strategy: "usage-based-routing-v2"
  num_retries: 3
  enable_pre_call_checks: true # Check TPM/RPM before routing
  cooldown_time: 60 # Cooldown period after rate limit (seconds)
Key Points:
- Use the same model_name for all deployments you want to load balance
- Use different deployment names in litellm_params.model
- Specify accurate TPM/RPM limits for optimal routing
- Enable enable_pre_call_checks to prevent rate limit errors
How Requests Are Routed
Client Request: "gpt-4o"
↓
Router checks:
- Which gpt-4o deployments are available?
- Which has most available TPM/RPM?
- Any deployments in cooldown?
↓
Routes to: azure/gpt-4o-3 (most available capacity)
↓
If rate limited → Cooldown 60s → Try azure/gpt-4o-2
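The cooldown step in the flow above can be sketched as a small tracker that hides a deployment for `cooldown_time` seconds after it is rate limited. The `CooldownTracker` class is hypothetical and only illustrates the bookkeeping; an injectable clock is used so the demo does not need to sleep.

```python
import time


class CooldownTracker:
    """Remember when each rate-limited deployment becomes eligible again."""

    def __init__(self, cooldown_time=60, clock=time.monotonic):
        self.cooldown_time = cooldown_time
        self.clock = clock
        self._eligible_at = {}

    def mark_rate_limited(self, name):
        self._eligible_at[name] = self.clock() + self.cooldown_time

    def available(self, names):
        now = self.clock()
        return [n for n in names if self._eligible_at.get(n, float("-inf")) <= now]


fake_now = [1000.0]  # fake clock value, advanced manually below
tracker = CooldownTracker(cooldown_time=60, clock=lambda: fake_now[0])
names = ["azure/gpt-4o-2", "azure/gpt-4o-3", "azure/gpt-4o-4"]

tracker.mark_rate_limited("azure/gpt-4o-3")
during = tracker.available(names)   # gpt-4o-3 is excluded while cooling down
fake_now[0] += 61
after = tracker.available(names)    # all three are eligible again
print(during, after)
```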
Fallback Models
Fallback models are alternative models used when the primary model:
- Is rate-limited
- Times out
- Returns an error
- Is temporarily unavailable
1. Model-Specific Fallbacks
Define fallbacks for specific models:
litellm_settings:
  fallbacks:
    - {gpt-5-chat: [gpt-4o]} # gpt-5-chat → gpt-4o
    - {gpt-5-mini: [gpt-4o]} # gpt-5-mini → gpt-4o
    - {gpt-5-nano: [gpt-5-mini, gpt-4o]} # gpt-5-nano → gpt-5-mini → gpt-4o
    - {gpt-35-turbo: [gpt-4o]} # gpt-35-turbo → gpt-4o
Fallback Chain:
Request: gpt-5-nano
↓
Try: gpt-5-nano (failed)
↓
Try: gpt-5-mini (failed)
↓
Try: gpt-4o (success) ✓
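The chain above amounts to trying each model in order until one succeeds. The following sketch shows that control flow with a stubbed `fake_send` function standing in for the real API call. Both `call_with_fallbacks` and `fake_send` are hypothetical names, and the fallback map is flattened into a single dict for brevity.

```python
def call_with_fallbacks(model, fallbacks, send):
    """Try `model`, then each configured fallback in order; return (model_used, result)."""
    chain = [model] + fallbacks.get(model, [])
    last_error = None
    for candidate in chain:
        try:
            return candidate, send(candidate)
        except Exception as exc:  # a real router would inspect the error type
            last_error = exc
    raise RuntimeError(f"all models failed: {chain}") from last_error


fallbacks = {"gpt-5-nano": ["gpt-5-mini", "gpt-4o"]}
attempts = []


def fake_send(name):
    """Stub: only gpt-4o is healthy in this demo."""
    attempts.append(name)
    if name != "gpt-4o":
        raise ConnectionError(f"{name} unavailable")
    return "ok"


used, result = call_with_fallbacks("gpt-5-nano", fallbacks, fake_send)
print(attempts, "->", used)
# ['gpt-5-nano', 'gpt-5-mini', 'gpt-4o'] -> gpt-4o
```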
2. Context Window Fallbacks
Automatically fall back when the request exceeds the model's context window:
litellm_settings:
  context_window_fallbacks:
    - {gpt-4o: [gpt-5-chat]} # If gpt-4o context exceeded → use gpt-5-chat
Use case: Request with 200K tokens → gpt-4o (128K limit) → gpt-5-chat (256K limit)
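The use case can be expressed as a size check against each candidate's context limit. In the sketch below, the limits are the figures quoted in the use case, the token count is assumed to be known up front, and `choose_for_context` is a hypothetical helper rather than a LiteLLM API.

```python
# Context limits as quoted in the use case above (in tokens).
CONTEXT_LIMITS = {"gpt-4o": 128_000, "gpt-5-chat": 256_000}
context_window_fallbacks = {"gpt-4o": ["gpt-5-chat"]}


def choose_for_context(model, prompt_tokens):
    """Return the first model in the chain whose context window fits the prompt."""
    for candidate in [model] + context_window_fallbacks.get(model, []):
        if prompt_tokens <= CONTEXT_LIMITS[candidate]:
            return candidate
    return None  # nothing in the chain can hold the prompt


print(choose_for_context("gpt-4o", 50_000))    # gpt-4o
print(choose_for_context("gpt-4o", 200_000))   # gpt-5-chat
print(choose_for_context("gpt-4o", 300_000))   # None
```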
3. Default Fallbacks
Fallback for models without specific configuration:
litellm_settings:
  default_fallbacks: [gpt-4o] # Any undefined model → gpt-4o
Important
The default fallback model should have HIGH RATE LIMITS (high TPM/RPM) since it will handle overflow traffic from all other models. Choose a model with the highest quota or multiple load-balanced deployments to prevent it from becoming a bottleneck.
Recommended Default Fallback Configuration:
# GOOD: Default fallback with high limits and load balancing
model_list:
  - model_name: gpt-4o # Default fallback
    litellm_params:
      model: azure/gpt-4o-2
      api_key: KEY_1
    model_info:
      tpm: 1200000 # High limit
      rpm: 500
  - model_name: gpt-4o # Load balanced
    litellm_params:
      model: azure/gpt-4o-3
      api_key: KEY_2
    model_info:
      tpm: 916000
      rpm: 500
  - model_name: gpt-4o # Load balanced
    litellm_params:
      model: azure/gpt-4o-4
      api_key: KEY_3
    model_info:
      tpm: 977000
      rpm: 500
litellm_settings:
  default_fallbacks: [gpt-4o] # Uses load-balanced gpt-4o with 3M+ total TPM

# BAD: Low-limit model as default fallback
litellm_settings:
  default_fallbacks: [gpt-5-nano] # Only 100K TPM - will bottleneck!
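The "3M+ total TPM" figure is simply the sum of the per-deployment limits. A small sketch that adds them up for any model group, using the TPM values from this guide (`total_tpm` is a hypothetical helper, and the `gpt-5-nano` entry uses the 100K figure quoted above):

```python
model_list = [
    {"model_name": "gpt-4o", "model_info": {"tpm": 1_200_000, "rpm": 500}},
    {"model_name": "gpt-4o", "model_info": {"tpm": 916_000, "rpm": 500}},
    {"model_name": "gpt-4o", "model_info": {"tpm": 977_000, "rpm": 500}},
    {"model_name": "gpt-5-nano", "model_info": {"tpm": 100_000, "rpm": 500}},
]


def total_tpm(model_list, model_name):
    """Sum the TPM limits of every deployment registered under model_name."""
    return sum(
        entry["model_info"]["tpm"]
        for entry in model_list
        if entry["model_name"] == model_name
    )


print(total_tpm(model_list, "gpt-4o"))      # 3093000
print(total_tpm(model_list, "gpt-5-nano"))  # 100000
```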
Retry Policy Configuration
Control how many times each error type is retried before the request fails over to a fallback:
router_settings:
  retry_policy:
    ContentPolicyViolationErrorRetries: 0 # Never retry content violations
    BadRequestErrorRetries: 0 # Never retry bad requests
    TimeoutErrorRetries: 3 # Retry timeouts 3 times
    InternalServerErrorRetries: 3 # Retry server errors 3 times
    RateLimitErrorRetries: 3 # Retry rate limits 3 times
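A per-error retry policy can be pictured as a lookup from exception type to retry budget. The sketch below defines stand-in exception classes with the same names as the policy keys. The classes and the `call_with_retries` and `retries_for` helpers are illustrative assumptions, not LiteLLM's own types or functions.

```python
class RateLimitError(Exception):
    """Stand-in for a provider rate-limit error."""


class BadRequestError(Exception):
    """Stand-in for a malformed-request error."""


retry_policy = {"RateLimitErrorRetries": 3, "BadRequestErrorRetries": 0}


def retries_for(exc, retry_policy, default=0):
    """Look up the retry budget for this exception type, e.g. 'RateLimitErrorRetries'."""
    return retry_policy.get(f"{type(exc).__name__}Retries", default)


def call_with_retries(send, retry_policy):
    """Call send(); retry while the raised error type still has retries left."""
    attempt = 0
    while True:
        try:
            return send()
        except Exception as exc:
            if attempt >= retries_for(exc, retry_policy):
                raise  # budget exhausted: let the caller move to a fallback
            attempt += 1


calls = {"n": 0}


def flaky():
    """Stub: rate limited twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError("429")
    return "ok"


result = call_with_retries(flaky, retry_policy)
print(result, calls["n"])  # ok 3
```

With `BadRequestErrorRetries: 0`, a `BadRequestError` is re-raised on the first failure, since repeating an invalid request cannot succeed.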
Allowed Fails Configuration
Set how many failures a deployment is allowed before it is placed in cooldown and traffic moves to other deployments or fallbacks:
router_settings:
  allowed_fails_policy:
    ContentPolicyViolationErrorAllowedFails: 0 # Fail immediately
  allowed_fails: 0 # Default for other errors
  cooldown_time: 60 # Cooldown after rate limit (seconds)
Configuration Examples
Example 1: High-Availability Setup
Goal: Maximum uptime with automatic failover
model_list:
  # Primary: 3x gpt-4o deployments (load balanced)
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-2
      api_base: https://openai-248.openai.azure.com/
      api_key: KEY_1
      client: azure
    model_info:
      tpm: 1200000
      rpm: 500
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-3
      api_base: https://openai-248.openai.azure.com/
      api_key: KEY_2
      client: azure
    model_info:
      tpm: 916000
      rpm: 500
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-4
      api_base: https://openai-248.openai.azure.com/
      api_key: KEY_3
      client: azure
    model_info:
      tpm: 977000
      rpm: 500
  # Fallback: gpt-5-mini (cheaper, faster)
  - model_name: gpt-5-mini
    litellm_params:
      model: azure/gpt-5-mini
      api_base: https://sohan-mbtd9z9j-eastus2.openai.azure.com/
      api_key: KEY_4
      client: azure
    model_info:
      tpm: 120000
      rpm: 500
litellm_settings:
  fallbacks:
    - {gpt-4o: [gpt-5-mini]} # If all gpt-4o deployments fail → gpt-5-mini
  default_fallbacks: [gpt-4o] # High TPM with load balancing
router_settings:
  routing_strategy: "usage-based-routing-v2"
  num_retries: 3
  enable_pre_call_checks: true
  cooldown_time: 60
Request Flow:
1. Request gpt-4o
2. Router tries: gpt-4o-2 (most available TPM)
3. If rate limited → tries gpt-4o-3
4. If rate limited → tries gpt-4o-4
5. If all fail → fallback to gpt-5-mini
Example 2: Cost-Optimized Setup
Goal: Use cheaper models first, expensive models as backup
model_list:
  # Primary: gpt-5-nano (cheapest)
  - model_name: gpt-5-nano
    litellm_params:
      model: azure/gpt-5-nano
      api_base: https://sohan-mbtd9z9j-eastus2.openai.azure.com/
      api_key: KEY_1
      client: azure
    model_info:
      tpm: 100000
      rpm: 500
  # Fallback 1: gpt-5-mini (moderate cost)
  - model_name: gpt-5-mini
    litellm_params:
      model: azure/gpt-5-mini
      api_base: https://sohan-mbtd9z9j-eastus2.openai.azure.com/
      api_key: KEY_2
      client: azure
    model_info:
      tpm: 120000
      rpm: 500
  # Fallback 2: gpt-4o (expensive, reliable, HIGH LIMIT)
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-2
      api_base: https://openai-248.openai.azure.com/
      api_key: KEY_3
      client: azure
    model_info:
      tpm: 1200000 # ✅ High limit for default fallback
      rpm: 500
litellm_settings:
  fallbacks:
    - {gpt-5-nano: [gpt-5-mini, gpt-4o]} # nano → mini → gpt-4o
  default_fallbacks: [gpt-4o] # ✅ High-limit model as default
Example 3: Performance-Optimized Setup
Goal: Minimize latency, maximize speed
router_settings:
  routing_strategy: "latency-based-routing" # Route to fastest deployment
  num_retries: 1 # Fast fail
  enable_pre_call_checks: false # Skip checks for speed
  cooldown_time: 30 # Shorter cooldown
litellm_settings:
  fallbacks:
    - {gpt-5-chat: [gpt-5-mini]} # Smaller model if needed
  default_fallbacks: [gpt-4o] # ✅ Reliable high-limit fallback
  drop_params: true # Drop unsupported params instead of failing
Best Practices
1. Load Balancing Best Practices
DO:
- Use 2-4 deployments per model for optimal balance
- Set accurate TPM/RPM limits based on Azure quotas
- Enable enable_pre_call_checks to prevent rate limits
- Monitor usage patterns and adjust capacities
- Use usage-based-routing-v2 for production
DON'T:
- Use more than 5 deployments (diminishing returns)
- Set TPM/RPM limits higher than actual quotas
- Mix different model versions under same model name
- Disable retries in production
2. Fallback Best Practices
DO:
- Choose fallback models with similar capabilities
- Use cheaper models as fallbacks when appropriate
- Set reasonable retry counts (2-3)
- Define context_window_fallbacks for large requests
- Test fallback chains before production
- CRITICAL: Ensure default fallback has HIGH rate limits
- Use load-balanced models as default fallbacks
DON'T:
- Create circular fallbacks (A → B → A)
- Use drastically different models as fallbacks
- Set ContentPolicyViolationErrorRetries > 0
- Have more than 3 levels in fallback chain
- Use low-limit models as default fallback
- Use single-deployment models with low TPM as default
3. Default Fallback Selection Guide
| Scenario | Recommended Default Fallback | Reason |
|---|---|---|
| High Traffic | Load-balanced gpt-4o (3+ deployments) | Handles overflow from all models |
| Medium Traffic | Single gpt-4o with high TPM | Sufficient capacity for occasional fallback |
| Low Traffic | gpt-5-mini (if high TPM available) | Cost-effective for low usage |
| Cost-Sensitive | gpt-4o with moderate TPM | Balance between cost and reliability |
Example Capacity Planning
# Total expected traffic: 500K TPM
# Primary models: 300K TPM capacity
# Default fallback should handle: 200K+ TPM overflow
# GOOD: 3x gpt-4o = 3M TPM total capacity
default_fallbacks: [gpt-4o] # With 3 load-balanced deployments
# BAD: 1x gpt-5-nano = 100K TPM (insufficient)
default_fallbacks: [gpt-5-nano] # Will bottleneck at 100K TPM
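The capacity check above is plain arithmetic: the default fallback needs at least as much TPM as the traffic the primaries cannot absorb. A short sketch using the figures from the example (`fallback_is_sufficient` is a hypothetical helper):

```python
expected_traffic_tpm = 500_000   # total expected traffic
primary_capacity_tpm = 300_000   # combined capacity of primary models

# Traffic the primaries cannot serve and the default fallback must absorb.
overflow_tpm = max(0, expected_traffic_tpm - primary_capacity_tpm)


def fallback_is_sufficient(fallback_capacity_tpm, overflow_tpm):
    """True when the fallback can carry the whole overflow."""
    return fallback_capacity_tpm >= overflow_tpm


print(overflow_tpm)                                     # 200000
print(fallback_is_sufficient(3_093_000, overflow_tpm))  # True  (3x gpt-4o)
print(fallback_is_sufficient(100_000, overflow_tpm))    # False (1x gpt-5-nano)
```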
4. Cooldown Configuration
router_settings:
  cooldown_time: 60 # Recommended: 60-120 seconds
Guidelines:
- 30s: High traffic, quick recovery needed
- 60s: Standard production use (recommended)
- 120s: Conservative, prevent repeated failures
- 300s+: Very conservative, long-running errors
5. Monitoring Configuration
general_settings:
  store_model_in_db: true # Enable for tracking
  disable_spend_logs: false # Track costs
litellm_settings:
  logging: true # Enable detailed logs
Monitoring & Troubleshooting
Check Router Status
# Check health
curl http://localhost:4000/health
# Response includes:
{
  "status": "healthy",
  "models_available": ["gpt-4o", "gpt-5-chat", ...],
  "load_balancing": "enabled"
}
View Model Statistics
# Get model usage stats (if implemented)
curl http://localhost:4000/model/info
Common Issues & Solutions
Issue 1: All Deployments Rate Limited
Symptoms:
Error: All models exhausted. Rate limit exceeded on all deployments.
Solutions:
1. Add more deployments:
- model_name: gpt-4o
  litellm_params:
    model: azure/gpt-4o-5 # Add 5th deployment
2. Increase cooldown time:
router_settings:
  cooldown_time: 120 # Give more recovery time
3. Add fallback models:
litellm_settings:
  fallbacks:
    - {gpt-4o: [gpt-5-mini, gpt-5-nano]}
Issue 2: Fallback Not Working
Symptoms:
Error: Model failed, no fallback attempted
Check:
1. Fallback model is configured:
litellm_settings:
  fallbacks:
    - {gpt-5-chat: [gpt-4o]} # Must be defined
2. Retry policy allows retries:
router_settings:
  retry_policy:
    RateLimitErrorRetries: 3 # Must be > 0
3. Fallback model exists in model_list:
model_list:
  - model_name: gpt-4o # Fallback model must exist
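The first and third checks can be automated with a few lines that compare the fallback map against the configured model names. This is a sketch over already-parsed config data. `missing_fallback_targets` is a hypothetical helper, and the demo config deliberately omits gpt-4o from `model_list` to show a failure.

```python
def missing_fallback_targets(model_list, fallbacks):
    """Return model names referenced in fallbacks that are absent from model_list."""
    known = {entry["model_name"] for entry in model_list}
    missing = set()
    for mapping in fallbacks:                      # each item looks like {primary: [targets]}
        for primary, targets in mapping.items():
            missing.update(m for m in [primary, *targets] if m not in known)
    return sorted(missing)


model_list = [{"model_name": "gpt-5-chat"}]        # gpt-4o intentionally left out
fallbacks = [{"gpt-5-chat": ["gpt-4o"]}]

problems = missing_fallback_targets(model_list, fallbacks)
print(problems)  # ['gpt-4o']
```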
Issue 3: Uneven Load Distribution
Symptoms:
- One deployment handles all traffic
- Other deployments idle
Solutions:
1. Use correct routing strategy:
router_settings:
  routing_strategy: "usage-based-routing-v2" # Not "simple-shuffle"
2. Set accurate TPM/RPM:
model_info:
  tpm: 1200000 # Match Azure quota exactly
  rpm: 500 # Match Azure quota exactly
3. Enable pre-call checks:
router_settings:
  enable_pre_call_checks: true
Issue 4: Circular Fallback Loop
Symptoms:
Error: Maximum fallback depth exceeded
Bad Configuration:
fallbacks:
  - {gpt-5-mini: [gpt-5-nano]}
  - {gpt-5-nano: [gpt-5-mini]} # Circular!
Good Configuration:
fallbacks:
  - {gpt-5-mini: [gpt-4o]} # Linear chain
  - {gpt-5-nano: [gpt-5-mini, gpt-4o]} # Multi-level
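Circular chains can be caught before deployment with a depth-first search over the fallback map. The sketch below assumes the map has been flattened into a single dict, and `find_fallback_cycle` is a hypothetical helper. It is run against both configurations shown above.

```python
def find_fallback_cycle(fallbacks):
    """Return the models forming a cycle (first model repeated at the end), or None."""
    visiting, done = [], set()

    def visit(model):
        if model in visiting:                      # reached a model already on the current path
            return visiting[visiting.index(model):] + [model]
        if model in done:                          # already fully explored, no cycle through it
            return None
        visiting.append(model)
        for nxt in fallbacks.get(model, []):
            cycle = visit(nxt)
            if cycle:
                return cycle
        visiting.pop()
        done.add(model)
        return None

    for model in fallbacks:
        cycle = visit(model)
        if cycle:
            return cycle
    return None


bad = {"gpt-5-mini": ["gpt-5-nano"], "gpt-5-nano": ["gpt-5-mini"]}
good = {"gpt-5-mini": ["gpt-4o"], "gpt-5-nano": ["gpt-5-mini", "gpt-4o"]}

print(find_fallback_cycle(bad))   # ['gpt-5-mini', 'gpt-5-nano', 'gpt-5-mini']
print(find_fallback_cycle(good))  # None
```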
Issue 5: Default Fallback Bottleneck
Symptoms:
Error: Rate limit on default fallback model
Multiple models failing simultaneously
High latency during peak traffic
Root Cause:
Default fallback has insufficient capacity for overflow traffic.
Solutions:
1. Use load-balanced default fallback:
# Add multiple deployments of default fallback
- model_name: gpt-4o
  litellm_params: {model: azure/gpt-4o-2, ...}
  model_info: {tpm: 1200000, rpm: 500}
- model_name: gpt-4o
  litellm_params: {model: azure/gpt-4o-3, ...}
  model_info: {tpm: 916000, rpm: 500}
- model_name: gpt-4o
  litellm_params: {model: azure/gpt-4o-4, ...}
  model_info: {tpm: 977000, rpm: 500}
litellm_settings:
  default_fallbacks: [gpt-4o] # Now has 3M+ total TPM
2. Increase quota on default fallback model:
- Contact Azure support to increase TPM/RPM limits
- Switch to a higher-tier deployment
3. Add secondary default fallback:
litellm_settings:
  default_fallbacks: [gpt-4o, gpt-5-mini] # Chain of fallbacks
Performance Metrics
Expected Improvements
| Configuration | Availability | Throughput | Cost | Latency |
|---|---|---|---|---|
| Single Model | 99.0% | 1x | Low | Baseline |
| 3x Load Balanced | 99.9% | 3x | Medium | +5-10ms |
| Load Balanced + Fallback | 99.99% | 3x | Medium | +10-15ms |
| Multi-tier Fallbacks | 99.999% | 3-4x | High | +15-25ms |
Real-World Example
Setup: 3x gpt-4o deployments + gpt-5-mini fallback
Results:
- Throughput: 300% increase (3x deployments)
- Latency: +12ms average (routing overhead)
- Cost: 15% savings (cheaper fallback handles 10% of traffic)
- Availability: 99.95% uptime (vs 99.0% single deployment)
Testing Your Configuration
Test Load Balancing
import asyncio
from litellm import acompletion

async def test_load_balancing():
    """Send 10 requests and see which deployments are used"""
    for i in range(10):
        response = await acompletion(
            model="gpt-4o",
            messages=[{"role": "user", "content": f"Test {i}"}],
            api_base="http://localhost:4000"
        )
        print(f"Request {i}: Used model: {response.model}")

asyncio.run(test_load_balancing())
Expected Output:
Request 0: Used model: azure/gpt-4o-2
Request 1: Used model: azure/gpt-4o-3
Request 2: Used model: azure/gpt-4o-4
Request 3: Used model: azure/gpt-4o-2 # Balanced distribution
...
Test Fallback
import asyncio
from litellm import acompletion

async def test_fallback():
    """Request a model that will fail and fall back"""
    try:
        response = await acompletion(
            model="gpt-5-chat",
            messages=[{"role": "user", "content": "Test"}],
            api_base="http://localhost:4000",
            max_tokens=999999  # Force context limit error
        )
        print(f"Used fallback model: {response.model}")
    except Exception as e:
        print(f"Fallback failed: {e}")

asyncio.run(test_fallback())
Test Default Fallback Capacity
import asyncio
import time
from litellm import acompletion

async def stress_test_default_fallback():
    """Stress test default fallback with concurrent requests"""

    async def make_request(i):
        start = time.time()
        try:
            response = await acompletion(
                model="undefined-model",  # Will use default fallback
                messages=[{"role": "user", "content": f"Test {i}"}],
                api_base="http://localhost:4000"
            )
            elapsed = time.time() - start
            print(f"Request {i}: Success in {elapsed:.2f}s - Model: {response.model}")
        except Exception as e:
            elapsed = time.time() - start
            print(f"Request {i}: Failed in {elapsed:.2f}s - Error: {str(e)[:50]}")

    # Send 50 concurrent requests to test fallback capacity
    tasks = [make_request(i) for i in range(50)]
    await asyncio.gather(*tasks)

asyncio.run(stress_test_default_fallback())
Summary
Quick Reference
| Feature | Configuration Location | Key Parameter |
|---|---|---|
| Load Balancing | model_list | Same model_name, different deployments |
| Routing Strategy | router_settings | routing_strategy: "usage-based-routing-v2" |
| Model Fallbacks | litellm_settings.fallbacks | {primary: [fallback1, fallback2]} |
| Context Fallbacks | litellm_settings.context_window_fallbacks | {small_context: [large_context]} |
| Default Fallback | litellm_settings.default_fallbacks | [gpt-4o] Must have high TPM/RPM |
| Retry Policy | router_settings.retry_policy | RateLimitErrorRetries: 3 |
| Cooldown | router_settings | cooldown_time: 60 |
Recommended Production Config
model_list:
  # Default fallback: Load-balanced gpt-4o with HIGH capacity
  - model_name: gpt-4o
    litellm_params: {model: azure/gpt-4o-2, ...}
    model_info: {tpm: 1200000, rpm: 500} # High limit
  - model_name: gpt-4o
    litellm_params: {model: azure/gpt-4o-3, ...}
    model_info: {tpm: 916000, rpm: 500}
  - model_name: gpt-4o
    litellm_params: {model: azure/gpt-4o-4, ...}
    model_info: {tpm: 977000, rpm: 500}
litellm_settings:
  fallbacks:
    - {gpt-4o: [gpt-5-mini]}
  default_fallbacks: [gpt-4o] # High-capacity load-balanced model
router_settings:
  routing_strategy: "usage-based-routing-v2"
  num_retries: 3
  enable_pre_call_checks: true
  cooldown_time: 60
  retry_policy:
    RateLimitErrorRetries: 3
    TimeoutErrorRetries: 3
Critical Reminders
DEFAULT FALLBACK MUST HAVE HIGH RATE LIMITS
- It handles overflow from ALL models
- Should be load-balanced with multiple deployments
- Typical requirement: 2-5x the TPM of any single primary model
- Monitor closely during production to ensure adequate capacity
Additional Resources
- LiteLLM Docs: https://docs.litellm.ai/docs/routing
- Azure OpenAI Rate Limits: https://learn.microsoft.com/en-us/azure/ai-services/openai/quotas-limits
- Load Balancing Strategies: https://docs.litellm.ai/docs/routing#advanced-routing-strategies