System Design: Service Degradation and Rate Limiting at Scale
Table of Contents
- Overview
- Design Goals
- Architecture
- Rate Limiting Strategy
- Service Degradation Ladder
- Control Plane and Data Plane
- Implementation Blueprint
- Python Implementation (FastAPI + Redis)
- Docker Compose Demo (FastAPI + Redis)
- Observability and SLOs
- Failure Modes and Mitigations
- Rollout Plan
- Conclusion
Overview
At scale, outages are often not caused by a single hard failure. They come from overload, retries, queue buildup, and dependencies timing out together.
Two controls prevent this cascade:
- Rate limiting to control demand
- Service degradation to preserve core functionality under stress
The goal is not to keep every feature alive. The goal is to keep critical paths alive.
Design Goals
- Protect core APIs from traffic spikes and abusive clients.
- Keep p95/p99 latency bounded during partial failures.
- Degrade non-critical features before core features fail.
- Recover quickly and automatically when pressure drops.
Architecture
Use layered protection, not a single limiter:
- Edge limiter at CDN or API gateway (cheap rejection early).
- Service-level limiter inside each API service (local safety).
- Dependency limiter for expensive downstream calls (DB/search/payments).
- Queue and worker backpressure controls (bounded concurrency).
High-level request path:
```
Client -> CDN/WAF -> API Gateway -> Service -> Dependencies
              |            |                        |
         coarse limit  tenant limit      per-dependency limit
```

Rate Limiting Strategy
Use different algorithms for different layers:
- Token bucket at edge for burst tolerance.
- Sliding window counter for tenant/account fairness.
- Concurrency limits for expensive handlers.
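The token bucket named above can be sketched in a few lines. This is an illustrative single-process version (the `TokenBucket` name and parameters are assumptions, not part of the implementation below); a shared limiter would keep this state in Redis instead.

```python
import time

class TokenBucket:
    """Allows bursts up to `capacity`; refills at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(rate=5.0, capacity=10.0)
allowed = [bucket.allow() for _ in range(12)]  # the burst of 10 passes; later calls wait for refill
```

The burst tolerance comes from `capacity`: a quiet client accumulates tokens and may spend them all at once, while a sustained flood is held to `rate` requests per second.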
Recommended key hierarchy:
- Global limit: protect total system capacity.
- Per-tenant limit: prevent noisy neighbors.
- Per-endpoint limit: protect expensive routes.
- Per-user/IP fallback: mitigate abuse.
Response behavior:
- Return 429 Too Many Requests.
- Include Retry-After.
- Include rate limit headers for client adaptation.
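As a sketch of that last point, a helper can assemble headers clients use to adapt. The `X-RateLimit-*` names follow a widespread convention (the IETF draft standard uses `RateLimit-*` instead), and `rate_limit_headers` is a hypothetical helper, not part of the implementation below.

```python
import time

def rate_limit_headers(limit: int, remaining: int, reset_epoch: int) -> dict:
    """Build conventional X-RateLimit-* headers plus Retry-After."""
    retry_after = max(0, reset_epoch - int(time.time()))
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),  # never report negative
        "X-RateLimit-Reset": str(reset_epoch),
        "Retry-After": str(retry_after),
    }

headers = rate_limit_headers(limit=120, remaining=0, reset_epoch=int(time.time()) + 30)
```

Well-behaved clients watch `X-RateLimit-Remaining` and slow down before they ever see a 429.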
Service Degradation Ladder
Define behavior before incidents happen. Example ladder:
- Level 0 (Normal): all features enabled.
- Level 1 (Guarded): disable non-essential background refresh jobs.
- Level 2 (Constrained): serve cached results for read-heavy endpoints.
- Level 3 (Critical): disable optional features (recommendations, analytics widgets, exports).
- Level 4 (Emergency): read-only mode for non-critical domains; core transactions only.
Degradation should be controlled by feature flags and SLO triggers, not manual code edits.
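One way to express the ladder as flags is a table mapping each feature to the highest degrade level at which it stays enabled. The feature names and thresholds below are illustrative, chosen to mirror the example ladder above.

```python
# Highest degrade level at which each feature remains enabled (illustrative).
FEATURE_MAX_LEVEL = {
    "background_refresh": 0,   # disabled at Level 1+
    "live_reads": 1,           # served from cache at Level 2+
    "recommendations": 2,      # disabled at Level 3+
    "exports": 2,
    "non_critical_writes": 3,  # read-only at Level 4
    "core_transactions": 4,    # never disabled
}

def feature_enabled(feature: str, degrade_level: int) -> bool:
    # Unknown features default to "most fragile" so they shed first.
    return degrade_level <= FEATURE_MAX_LEVEL.get(feature, 0)

enabled = feature_enabled("recommendations", 3)  # False at Level 3
```

Because the table is data, it can live in the control plane and change without a deploy, which is exactly the "feature flags, not code edits" property the ladder needs.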
Control Plane and Data Plane
Separate policy distribution from request decisions:
- Control plane stores limits, tenant tiers, and degradation rules.
- Data plane enforces limits locally with low-latency caches.
Implementation pattern:
- Source of truth in config DB.
- Push updates through pub/sub.
- Service instances keep in-memory snapshot with TTL fallback.
This avoids making each request depend on a remote policy lookup.
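A minimal sketch of the data-plane side of that pattern, assuming pub/sub updates arrive via a callback. `PolicySnapshot` is a hypothetical name; the key behavior is the TTL fallback to conservative defaults when no push has arrived recently.

```python
import time

class PolicySnapshot:
    """In-memory policy cache fed by control-plane pushes.

    If the snapshot goes stale (no push within `ttl_seconds`), reads fall
    back to conservative defaults instead of trusting old limits.
    """

    def __init__(self, defaults: dict, ttl_seconds: float = 300.0):
        self.defaults = defaults
        self.ttl = ttl_seconds
        self.policy = dict(defaults)
        self.updated_at = time.monotonic()

    def on_update(self, new_policy: dict) -> None:
        # Called from the pub/sub subscriber when the control plane pushes changes.
        self.policy = new_policy
        self.updated_at = time.monotonic()

    def get(self, key: str):
        if time.monotonic() - self.updated_at > self.ttl:
            return self.defaults.get(key)  # stale snapshot: fail to safe defaults
        return self.policy.get(key)

snap = PolicySnapshot(defaults={"tenant_limit": 100})
snap.on_update({"tenant_limit": 300})
value = snap.get("tenant_limit")  # 300 while the snapshot is fresh
```

Request handling only ever touches local memory; the remote lookup happens asynchronously on the push path.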
Implementation Blueprint
Minimal practical blueprint:
- API gateway enforces global + tenant token buckets.
- Service middleware enforces endpoint-level sliding window.
- Redis used for shared counters where cross-instance fairness is required.
- Local in-memory fallback limiter when Redis is unavailable.
- Circuit breakers for downstream dependencies.
- Prioritized worker queues for async jobs.
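The circuit breaker from the blueprint can be sketched as a small state machine: open after consecutive failures, half-open after a cooldown to probe recovery. The thresholds and the `CircuitBreaker` name are illustrative.

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures; half-opens after
    `reset_after` seconds to let a probe request through."""

    def __init__(self, threshold: int = 5, reset_after: float = 30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        if time.monotonic() - self.opened_at >= self.reset_after:
            return True  # half-open: allow a probe to test recovery
        return False     # open: fail fast, use cache/fallback instead

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()

cb = CircuitBreaker(threshold=3)
for _ in range(3):
    cb.record_failure()
print(cb.allow_request())  # False: breaker is open, fail fast
```

Failing fast here is what keeps a slow dependency from tying up every worker thread upstream.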
Pseudo flow:
```
if global_limit_exceeded:
    return 429
if tenant_limit_exceeded:
    return 429
if endpoint_concurrency_high:
    if endpoint_optional:
        return degraded_response
    else:
        queue_or_fail_fast
if dependency_unhealthy:
    use cache/fallback
process_request
```

Python Implementation (FastAPI + Redis)
Below is a minimal implementation of layered limits and degradation controls.
```python
import os
import time

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
import redis.asyncio as redis

app = FastAPI()
r = redis.from_url(os.getenv("REDIS_URL", "redis://localhost:6379/0"), decode_responses=True)

LUA_LIMIT = """
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local current = redis.call("INCR", key)
if current == 1 then redis.call("EXPIRE", key, window) end
if current > limit then return 0 end
return 1
"""

limit_script = r.register_script(LUA_LIMIT)

POLICY = {
    "global": {"limit": 2000, "window": 60},
    "tenant_default": {"limit": 300, "window": 60},
    "endpoint": {
        "/search": {"limit": 120, "window": 60},
        "/feed": {"limit": 600, "window": 60},
    },
}

DEGRADE_LEVEL = int(os.getenv("DEGRADE_LEVEL", "0"))

async def allow(key: str, limit: int, window: int) -> bool:
    try:
        result = await limit_script(keys=[key], args=[limit, window])
        return result == 1
    except Exception:
        # Fallback choice depends on endpoint criticality; this is fail-open.
        return True

def tenant_id(req: Request) -> str:
    return req.headers.get("x-tenant-id", "anonymous")

@app.middleware("http")
async def protection_middleware(request: Request, call_next):
    path = request.url.path
    t_id = tenant_id(request)
    minute = int(time.time() // 60)

    g = POLICY["global"]
    if not await allow(f"rl:global:{minute}", g["limit"], g["window"]):
        return JSONResponse({"error": "global rate limit"}, status_code=429, headers={"Retry-After": "5"})

    t = POLICY["tenant_default"]
    if not await allow(f"rl:tenant:{t_id}:{minute}", t["limit"], t["window"]):
        return JSONResponse({"error": "tenant rate limit"}, status_code=429, headers={"Retry-After": "5"})

    ep = POLICY["endpoint"].get(path)
    if ep and not await allow(f"rl:endpoint:{path}:{minute}", ep["limit"], ep["window"]):
        return JSONResponse({"error": "endpoint rate limit"}, status_code=429, headers={"Retry-After": "5"})

    if DEGRADE_LEVEL >= 2 and path == "/feed":
        return JSONResponse({"items": [], "degraded": True, "reason": "high load"}, status_code=200)

    if DEGRADE_LEVEL >= 3 and path == "/search":
        return JSONResponse({"error": "temporarily unavailable", "degraded": True}, status_code=503)

    return await call_next(request)

@app.get("/feed")
async def feed():
    return {"items": ["a", "b", "c"], "degraded": False}

@app.get("/search")
async def search(q: str):
    return {"query": q, "results": ["r1", "r2"]}
```

Run locally:

```shell
pip install fastapi uvicorn redis
uvicorn app:app --reload
```

Quick test:

```shell
curl -H "x-tenant-id: t1" "http://127.0.0.1:8000/search?q=test"
```

Docker Compose Demo (FastAPI + Redis)
You can run the Python example end-to-end with Docker using these files.
app.py:
```python
import os
import time

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
import redis.asyncio as redis

app = FastAPI()
r = redis.from_url(os.getenv("REDIS_URL", "redis://redis:6379/0"), decode_responses=True)

LUA_LIMIT = """
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local current = redis.call("INCR", key)
if current == 1 then redis.call("EXPIRE", key, window) end
if current > limit then return 0 end
return 1
"""

limit_script = r.register_script(LUA_LIMIT)

POLICY = {
    "global": {"limit": 2000, "window": 60},
    "tenant_default": {"limit": 120, "window": 60},
    "endpoint": {"/search": {"limit": 20, "window": 60}},
}

DEGRADE_LEVEL = int(os.getenv("DEGRADE_LEVEL", "0"))

async def allow(key: str, limit: int, window: int) -> bool:
    try:
        return (await limit_script(keys=[key], args=[limit, window])) == 1
    except Exception:
        return True

@app.middleware("http")
async def limiter(request: Request, call_next):
    path = request.url.path
    tenant = request.headers.get("x-tenant-id", "anonymous")
    minute = int(time.time() // 60)

    g = POLICY["global"]
    if not await allow(f"rl:g:{minute}", g["limit"], g["window"]):
        return JSONResponse({"error": "global limit"}, status_code=429, headers={"Retry-After": "5"})

    t = POLICY["tenant_default"]
    if not await allow(f"rl:t:{tenant}:{minute}", t["limit"], t["window"]):
        return JSONResponse({"error": "tenant limit"}, status_code=429, headers={"Retry-After": "5"})

    ep = POLICY["endpoint"].get(path)
    if ep and not await allow(f"rl:e:{path}:{minute}", ep["limit"], ep["window"]):
        return JSONResponse({"error": "endpoint limit"}, status_code=429, headers={"Retry-After": "5"})

    if DEGRADE_LEVEL >= 3 and path == "/search":
        return JSONResponse({"error": "temporarily unavailable", "degraded": True}, status_code=503)

    return await call_next(request)

@app.get("/health")
async def health():
    return {"ok": True, "degrade_level": DEGRADE_LEVEL}

@app.get("/search")
async def search(q: str):
    return {"query": q, "results": ["r1", "r2"]}
```

requirements.txt:

```
fastapi==0.116.1
uvicorn[standard]==0.35.0
redis==6.4.0
```

Dockerfile:

```dockerfile
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```

docker-compose.yml:

```yaml
version: "3.9"
services:
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  api:
    build: .
    environment:
      REDIS_URL: redis://redis:6379/0
      DEGRADE_LEVEL: "0"
    ports:
      - "8000:8000"
    depends_on:
      - redis
```

Run:

```shell
docker compose up --build
```

Validate health:

```shell
curl http://127.0.0.1:8000/health
```

Generate traffic and observe limits:

```shell
for i in $(seq 1 30); do
  curl -s -o /dev/null -w "%{http_code}\n" -H "x-tenant-id: demo" \
    "http://127.0.0.1:8000/search?q=test-$i"
done
```

Test degradation mode:

```shell
DEGRADE_LEVEL=3 docker compose up --build
curl -i -H "x-tenant-id: demo" "http://127.0.0.1:8000/search?q=test"
```

Observability and SLOs
Track these metrics per endpoint and per tenant:
- rate_limited_requests_total
- degraded_responses_total
- request_latency_ms (p50/p95/p99)
- dependency_timeout_total
- queue_depth and queue_wait_ms
- circuit_breaker_state
Key alerts:
- Sharp increase in 429s for paid tiers.
- Sustained p99 latency above SLO.
- Degradation level at 3 or higher for more than N minutes.
Failure Modes and Mitigations
- Retry storms from clients
- Mitigation: jittered backoff guidance, stricter edge limits, retry budgets.
- Noisy neighbor tenants
- Mitigation: per-tenant quotas and weighted fairness.
- Shared cache or Redis outage
- Mitigation: local fallback limiter with conservative defaults.
- Over-degrading core traffic
- Mitigation: separate critical and optional routes with distinct policies.
- Slow downstream dependency
- Mitigation: timeout budgets, circuit breaker, stale-cache fallback.
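The jittered backoff mentioned above can be sketched as full-jitter exponential backoff: each retry sleeps a uniform random time up to an exponentially growing cap, so synchronized clients spread out instead of retrying in lockstep. `backoff_delay` and its defaults are illustrative.

```python
import random

def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Full-jitter exponential backoff.

    Returns a uniform random delay in [0, min(cap, base * 2**attempt)].
    The cap bounds worst-case wait; the jitter breaks retry synchronization.
    """
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))

delays = [backoff_delay(n) for n in range(6)]  # grows toward the 10s cap
```

Pairing this with a retry budget (e.g. abandon after N attempts) is what turns client retries from an amplifier into a bounded cost.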
Rollout Plan
- Start in shadow mode (observe-only, no blocking).
- Enable 429 enforcement for a small traffic percentage.
- Validate false-positive rate and tenant impact.
- Enable degradation levels one by one with canary traffic.
- Run load tests and game days for dependency failures.
- Document incident runbook with exact SLO thresholds and owner actions.
Conclusion
Rate limiting without degradation causes hard failures during incidents. Degradation without rate limiting allows overload to spread.
At scale, both must work together:
- Limit demand early.
- Protect critical paths first.
- Degrade intentionally and recover automatically.