
System Design: Service Degradation and Rate Limiting at Scale

At scale, outages are often not caused by a single hard failure. They come from overload, retries, queue buildup, and dependencies timing out together.

Two controls prevent this cascade:

  1. Rate limiting to control demand
  2. Service degradation to preserve core functionality under stress

The goal is not to keep every feature alive. The goal is to keep critical paths alive. Concretely, the design goals are:

  1. Protect core APIs from traffic spikes and abusive clients.
  2. Keep p95/p99 latency bounded during partial failures.
  3. Degrade non-critical features before core features fail.
  4. Recover quickly and automatically when pressure drops.

Use layered protection, not a single limiter:

  1. Edge limiter at CDN or API gateway (cheap rejection early).
  2. Service-level limiter inside each API service (local safety).
  3. Dependency limiter for expensive downstream calls (DB/search/payments).
  4. Queue and worker backpressure controls (bounded concurrency).
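
The fourth layer amounts to bounding queued and in-flight work so overload turns into fast rejections instead of unbounded buffering. A minimal asyncio sketch; the queue size and worker count are illustrative, not recommendations:

import asyncio

QUEUE_LIMIT = 1000   # bounded queue: submissions fail fast when it is full
WORKERS = 50         # bounded concurrency: at most 50 jobs in flight

work_queue: asyncio.Queue = asyncio.Queue(maxsize=QUEUE_LIMIT)

def submit(job) -> bool:
    # Reject instead of buffering without bound (backpressure to the caller).
    try:
        work_queue.put_nowait(job)
        return True
    except asyncio.QueueFull:
        return False

async def worker():
    while True:
        job = await work_queue.get()
        try:
            await job()
        finally:
            work_queue.task_done()

async def start_workers():
    # Concurrency is bounded by the number of worker tasks.
    return [asyncio.create_task(worker()) for _ in range(WORKERS)]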

High-level request path:

Client -> CDN/WAF (coarse limit) -> API Gateway (tenant limit) -> Service -> Dependencies (per-dependency limit)

Use different algorithms for different layers:

  1. Token bucket at edge for burst tolerance (a minimal sketch follows this list).
  2. Sliding window counter for tenant/account fairness.
  3. Concurrency limits for expensive handlers.
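
For the edge variant in item 1, here is a minimal in-process token bucket; the rate and capacity values are illustrative:

import time

class TokenBucket:
    """Allows bursts up to `capacity` while sustaining `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Example: 100 requests/second sustained, bursts up to 200.
edge_bucket = TokenBucket(rate=100, capacity=200)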

Recommended key hierarchy:

  1. Global limit: protect total system capacity.
  2. Per-tenant limit: prevent noisy neighbors.
  3. Per-endpoint limit: protect expensive routes.
  4. Per-user/IP fallback: mitigate abuse.
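
A sketch of composing keys for this hierarchy. The prefixes mirror the ones used in the example later on this page; the limits are illustrative, and the per-user key does not appear in that example:

def limit_keys(tenant: str, user: str, endpoint: str, window_id: int) -> list[tuple[str, int]]:
    # Returns (redis_key, limit) pairs, checked in order from coarsest to finest.
    return [
        (f"rl:global:{window_id}", 2000),             # total system capacity
        (f"rl:tenant:{tenant}:{window_id}", 300),     # noisy-neighbor protection
        (f"rl:endpoint:{endpoint}:{window_id}", 120), # expensive route protection
        (f"rl:user:{user}:{window_id}", 60),          # per-user/IP abuse fallback
    ]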

Response behavior:

  1. Return 429 Too Many Requests.
  2. Include Retry-After.
  3. Include rate limit headers for client adaptation.
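
A sketch of such a response; the X-RateLimit-* header names follow a common convention rather than a formal standard:

from fastapi.responses import JSONResponse

def rate_limited_response(limit: int, reset_epoch: int, retry_after: int = 5) -> JSONResponse:
    # 429 plus headers that let well-behaved clients back off and adapt.
    return JSONResponse(
        {"error": "rate limit exceeded"},
        status_code=429,
        headers={
            "Retry-After": str(retry_after),
            "X-RateLimit-Limit": str(limit),
            "X-RateLimit-Remaining": "0",
            "X-RateLimit-Reset": str(reset_epoch),
        },
    )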

Define degradation behavior before incidents happen. Example ladder:

  1. Level 0 (Normal): all features enabled.
  2. Level 1 (Guarded): disable non-essential background refresh jobs.
  3. Level 2 (Constrained): serve cached results for read-heavy endpoints.
  4. Level 3 (Critical): disable optional features (recommendations, analytics widgets, exports).
  5. Level 4 (Emergency): read-only mode for non-critical domains; core transactions only.

Degradation should be controlled by feature flags and SLO triggers, not manual code edits.
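
A sketch of the trigger side, assuming p99 latency and error rate are the driving signals. The thresholds are illustrative, and the computed level would normally be published through the flag system rather than hard-coded per instance:

def compute_degrade_level(p99_latency_ms: float, error_rate: float, slo_p99_ms: float = 500) -> int:
    # Thresholds are illustrative; tune them against real SLOs.
    if error_rate > 0.20 or p99_latency_ms > 4 * slo_p99_ms:
        return 4  # Emergency: core transactions only
    if error_rate > 0.10 or p99_latency_ms > 3 * slo_p99_ms:
        return 3  # Critical: optional features off
    if error_rate > 0.05 or p99_latency_ms > 2 * slo_p99_ms:
        return 2  # Constrained: cached reads
    if p99_latency_ms > slo_p99_ms:
        return 1  # Guarded: background refresh off
    return 0      # Normal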

Separate policy distribution from request decisions:

  1. Control plane stores limits, tenant tiers, and degradation rules.
  2. Data plane enforces limits locally with low-latency caches.

Implementation pattern:

  1. Source of truth in config DB.
  2. Push updates through pub/sub.
  3. Service instances keep in-memory snapshot with TTL fallback.

This avoids making each request depend on a remote policy lookup.
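
A sketch of the data-plane side of this pattern: an in-memory snapshot refreshed by pub/sub pushes, with a TTL-based reload if pushes are missed. The key and channel names (policy:current, policy:updates) and the TTL are assumptions, not part of the blueprint:

import json
import time
import redis.asyncio as redis

class PolicyCache:
    """In-memory policy snapshot, refreshed by pub/sub pushes with a TTL fallback."""

    def __init__(self, r: redis.Redis, ttl_seconds: int = 60):
        self.r = r
        self.ttl = ttl_seconds
        self.snapshot: dict = {}
        self.loaded_at = 0.0

    async def load(self):
        # The control plane mirrors the config DB into this key (assumed name).
        raw = await self.r.get("policy:current")
        if raw:
            self.snapshot = json.loads(raw)
            self.loaded_at = time.monotonic()

    def get(self, key: str, default=None):
        return self.snapshot.get(key, default)

    async def run(self):
        # Run as a background task, e.g. asyncio.create_task(cache.run()).
        await self.load()
        pubsub = self.r.pubsub()
        await pubsub.subscribe("policy:updates")
        while True:
            msg = await pubsub.get_message(ignore_subscribe_messages=True, timeout=1.0)
            if msg or (time.monotonic() - self.loaded_at) > self.ttl:
                await self.load()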

Minimal practical blueprint:

  1. API gateway enforces global + tenant token buckets.
  2. Service middleware enforces endpoint-level sliding window.
  3. Redis used for shared counters where cross-instance fairness is required.
  4. Local in-memory fallback limiter when Redis is unavailable.
  5. Circuit breakers for downstream dependencies.
  6. Prioritized worker queues for async jobs.

Pseudo flow:

if global_limit_exceeded: return 429
if tenant_limit_exceeded: return 429
if endpoint_concurrency_high:
    if endpoint_optional: return degraded_response
    else: queue_or_fail_fast
if dependency_unhealthy: use cache/fallback
process_request
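
The dependency_unhealthy check is typically backed by a circuit breaker, which the blueprint lists but the example below omits. A minimal sketch, with illustrative thresholds:

import time

class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures; after `reset_timeout`
    seconds it lets trial calls through again (half-open)."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.failures < self.failure_threshold:
            return True  # closed: normal operation
        # Open: only allow calls once the reset timeout has elapsed.
        return time.monotonic() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0  # close the breaker

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()  # (re)open and restart the timer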

Below is a minimal implementation of layered limits and degradation controls.

import os
import time

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
import redis.asyncio as redis

app = FastAPI()
r = redis.from_url(os.getenv("REDIS_URL", "redis://localhost:6379/0"), decode_responses=True)

# Windowed counter in Redis: INCR the key for the current window, set its TTL on
# the first hit, reject once the count exceeds the limit.
LUA_LIMIT = """
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local current = redis.call("INCR", key)
if current == 1 then redis.call("EXPIRE", key, window) end
if current > limit then return 0 end
return 1
"""
limit_script = r.register_script(LUA_LIMIT)

POLICY = {
    "global": {"limit": 2000, "window": 60},
    "tenant_default": {"limit": 300, "window": 60},
    "endpoint": {
        "/search": {"limit": 120, "window": 60},
        "/feed": {"limit": 600, "window": 60},
    },
}

DEGRADE_LEVEL = int(os.getenv("DEGRADE_LEVEL", "0"))


async def allow(key: str, limit: int, window: int) -> bool:
    try:
        result = await limit_script(keys=[key], args=[limit, window])
        return result == 1
    except Exception:
        # Fallback choice depends on endpoint criticality; this is fail-open.
        return True


def tenant_id(req: Request) -> str:
    return req.headers.get("x-tenant-id", "anonymous")


@app.middleware("http")
async def protection_middleware(request: Request, call_next):
    path = request.url.path
    t_id = tenant_id(request)
    minute = int(time.time() // 60)

    # Layer 1: global limit protects total capacity.
    g = POLICY["global"]
    if not await allow(f"rl:global:{minute}", g["limit"], g["window"]):
        return JSONResponse({"error": "global rate limit"}, status_code=429, headers={"Retry-After": "5"})

    # Layer 2: per-tenant limit prevents noisy neighbors.
    t = POLICY["tenant_default"]
    if not await allow(f"rl:tenant:{t_id}:{minute}", t["limit"], t["window"]):
        return JSONResponse({"error": "tenant rate limit"}, status_code=429, headers={"Retry-After": "5"})

    # Layer 3: per-endpoint limit protects expensive routes.
    ep = POLICY["endpoint"].get(path)
    if ep and not await allow(f"rl:endpoint:{path}:{minute}", ep["limit"], ep["window"]):
        return JSONResponse({"error": "endpoint rate limit"}, status_code=429, headers={"Retry-After": "5"})

    # Degradation ladder: shed optional work before core paths fail.
    if DEGRADE_LEVEL >= 2 and path == "/feed":
        return JSONResponse({"items": [], "degraded": True, "reason": "high load"}, status_code=200)
    if DEGRADE_LEVEL >= 3 and path == "/search":
        return JSONResponse({"error": "temporarily unavailable", "degraded": True}, status_code=503)

    return await call_next(request)


@app.get("/feed")
async def feed():
    return {"items": ["a", "b", "c"], "degraded": False}


@app.get("/search")
async def search(q: str):
    return {"query": q, "results": ["r1", "r2"]}

Run locally (if Redis is not reachable, the limiter fails open and all requests pass):

Terminal window
pip install fastapi uvicorn redis
uvicorn app:app --reload

Quick test:

Terminal window
curl -H "x-tenant-id: t1" "http://127.0.0.1:8000/search?q=test"

You can run the Python example end-to-end with Docker using these files.

app.py:

import os
import time

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
import redis.asyncio as redis

app = FastAPI()
r = redis.from_url(os.getenv("REDIS_URL", "redis://redis:6379/0"), decode_responses=True)

LUA_LIMIT = """
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local current = redis.call("INCR", key)
if current == 1 then redis.call("EXPIRE", key, window) end
if current > limit then return 0 end
return 1
"""
limit_script = r.register_script(LUA_LIMIT)

POLICY = {
    "global": {"limit": 2000, "window": 60},
    "tenant_default": {"limit": 120, "window": 60},
    "endpoint": {"/search": {"limit": 20, "window": 60}},
}

DEGRADE_LEVEL = int(os.getenv("DEGRADE_LEVEL", "0"))


async def allow(key: str, limit: int, window: int) -> bool:
    try:
        return (await limit_script(keys=[key], args=[limit, window])) == 1
    except Exception:
        return True


@app.middleware("http")
async def limiter(request: Request, call_next):
    path = request.url.path
    tenant = request.headers.get("x-tenant-id", "anonymous")
    minute = int(time.time() // 60)

    g = POLICY["global"]
    if not await allow(f"rl:g:{minute}", g["limit"], g["window"]):
        return JSONResponse({"error": "global limit"}, status_code=429, headers={"Retry-After": "5"})

    t = POLICY["tenant_default"]
    if not await allow(f"rl:t:{tenant}:{minute}", t["limit"], t["window"]):
        return JSONResponse({"error": "tenant limit"}, status_code=429, headers={"Retry-After": "5"})

    ep = POLICY["endpoint"].get(path)
    if ep and not await allow(f"rl:e:{path}:{minute}", ep["limit"], ep["window"]):
        return JSONResponse({"error": "endpoint limit"}, status_code=429, headers={"Retry-After": "5"})

    if DEGRADE_LEVEL >= 3 and path == "/search":
        return JSONResponse({"error": "temporarily unavailable", "degraded": True}, status_code=503)

    return await call_next(request)


@app.get("/health")
async def health():
    return {"ok": True, "degrade_level": DEGRADE_LEVEL}


@app.get("/search")
async def search(q: str):
    return {"query": q, "results": ["r1", "r2"]}

requirements.txt:

fastapi==0.116.1
uvicorn[standard]==0.35.0
redis==6.4.0

Dockerfile:

FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

docker-compose.yml:

version: "3.9"
services:
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
  api:
    build: .
    environment:
      REDIS_URL: redis://redis:6379/0
      DEGRADE_LEVEL: "${DEGRADE_LEVEL:-0}"
    ports:
      - "8000:8000"
    depends_on:
      - redis

Run:

Terminal window
docker compose up --build

Validate health:

Terminal window
curl http://127.0.0.1:8000/health

Generate traffic and observe limits:

Terminal window
for i in $(seq 1 30); do
curl -s -o /dev/null -w "%{http_code}\n" -H "x-tenant-id: demo" \
"http://127.0.0.1:8000/search?q=test-$i"
done

Test degradation mode:

Terminal window
DEGRADE_LEVEL=3 docker compose up --build -d
curl -i -H "x-tenant-id: demo" "http://127.0.0.1:8000/search?q=test"

Track these metrics per endpoint and per tenant:

  1. rate_limited_requests_total
  2. degraded_responses_total
  3. request_latency_ms (p50/p95/p99)
  4. dependency_timeout_total
  5. queue_depth and queue_wait_ms
  6. circuit_breaker_state
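
A sketch of wiring the first three metrics with prometheus_client (an assumed choice; any metrics library with labeled counters and histograms works):

from prometheus_client import Counter, Histogram

RATE_LIMITED = Counter(
    "rate_limited_requests_total",
    "Requests rejected by a rate limiter",
    ["endpoint", "tenant", "layer"],  # layer: global / tenant / endpoint
)
DEGRADED = Counter(
    "degraded_responses_total",
    "Responses served in a degraded mode",
    ["endpoint", "level"],
)
LATENCY = Histogram(
    "request_latency_ms",
    "Request latency in milliseconds",
    ["endpoint"],
    buckets=(5, 10, 25, 50, 100, 250, 500, 1000, 2500),
)

# Example, inside the middleware on a tenant-level rejection:
# RATE_LIMITED.labels(endpoint=path, tenant=t_id, layer="tenant").inc()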

Key alerts:

  1. Sharp increase in 429s for paid tiers.
  2. Sustained p99 latency above SLO.
  3. Degradation level at 3 or higher for more than N minutes.

Common failure modes and mitigations:

  1. Retry storms from clients
    • Mitigation: jittered backoff guidance (see the sketch after this list), stricter edge limits, retry budgets.
  2. Noisy neighbor tenants
    • Mitigation: per-tenant quotas and weighted fairness.
  3. Shared cache or Redis outage
    • Mitigation: local fallback limiter with conservative defaults.
  4. Over-degrading core traffic
    • Mitigation: separate critical and optional routes with distinct policies.
  5. Slow downstream dependency
    • Mitigation: timeout budgets, circuit breaker, stale-cache fallback.
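
The jittered backoff from the first mitigation is client-side behavior. A full-jitter sketch; the base delay, cap, and attempt count are illustrative:

import random
import time

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    # "Full jitter": pick uniformly between 0 and the capped exponential delay,
    # so retries from many clients do not synchronize into a retry storm.
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def call_with_retries(fn, max_attempts: int = 5):
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted
            time.sleep(backoff_delay(attempt))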

Suggested rollout sequence:

  1. Start in shadow mode (observe-only, no blocking).
  2. Enable 429 enforcement for a small traffic percentage.
  3. Validate false-positive rate and tenant impact.
  4. Enable degradation levels one by one with canary traffic.
  5. Run load tests and game days for dependency failures.
  6. Document incident runbook with exact SLO thresholds and owner actions.

Rate limiting without degradation causes hard failures during incidents. Degradation without rate limiting allows overload to spread.

At scale, both must work together:

  1. Limit demand early.
  2. Protect critical paths first.
  3. Degrade intentionally and recover automatically.