
System Design: Service Degradation and Rate Limiting at Scale

At scale, outages are often not caused by a single hard failure. They come from overload, retries, queue buildup, and dependencies timing out together.

Two controls prevent this cascade:

  1. Rate limiting to control demand
  2. Service degradation to preserve core functionality under stress

The goal is not to keep every feature alive. The goal is to keep critical paths alive. Concretely, the design goals are:

  1. Protect core APIs from traffic spikes and abusive clients.
  2. Keep p95/p99 latency bounded during partial failures.
  3. Degrade non-critical features before core features fail.
  4. Recover quickly and automatically when pressure drops.

Use layered protection, not a single limiter:

  1. Edge limiter at CDN or API gateway (cheap rejection early).
  2. Service-level limiter inside each API service (local safety).
  3. Dependency limiter for expensive downstream calls (DB/search/payments).
  4. Queue and worker backpressure controls (bounded concurrency).
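
The fourth layer amounts to bounding queued and in-flight work so overload turns into fast rejections instead of unbounded buffering. A minimal asyncio sketch; the queue size and worker count are illustrative, not recommendations:

import asyncio

QUEUE_LIMIT = 1000   # bounded queue: submissions fail fast when it is full
WORKERS = 50         # bounded concurrency: at most 50 jobs in flight

work_queue: asyncio.Queue = asyncio.Queue(maxsize=QUEUE_LIMIT)

def submit(job) -> bool:
    # Reject instead of buffering without bound (backpressure to the caller).
    try:
        work_queue.put_nowait(job)
        return True
    except asyncio.QueueFull:
        return False

async def worker():
    while True:
        job = await work_queue.get()
        try:
            await job()
        finally:
            work_queue.task_done()

async def start_workers():
    # Concurrency is bounded by the number of worker tasks.
    return [asyncio.create_task(worker()) for _ in range(WORKERS)]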

High-level request path:

Client -> CDN/WAF (coarse limit) -> API Gateway (tenant limit) -> Service -> Dependencies (per-dependency limit)

Use different algorithms for different layers:

  1. Token bucket at edge for burst tolerance (a minimal sketch follows this list).
  2. Sliding window counter for tenant/account fairness.
  3. Concurrency limits for expensive handlers.
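
For the edge variant in item 1, here is a minimal in-process token bucket; the rate and capacity values are illustrative:

import time

class TokenBucket:
    """Allows bursts up to `capacity` while sustaining `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Example: 100 requests/second sustained, bursts up to 200.
edge_bucket = TokenBucket(rate=100, capacity=200)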

Recommended key hierarchy:

  1. Global limit: protect total system capacity.
  2. Per-tenant limit: prevent noisy neighbors.
  3. Per-endpoint limit: protect expensive routes.
  4. Per-user/IP fallback: mitigate abuse.
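
A sketch of composing keys for this hierarchy. The prefixes mirror the ones used in the example later on this page; the limits are illustrative, and the per-user key does not appear in that example:

def limit_keys(tenant: str, user: str, endpoint: str, window_id: int) -> list[tuple[str, int]]:
    # Returns (redis_key, limit) pairs, checked in order from coarsest to finest.
    return [
        (f"rl:global:{window_id}", 2000),             # total system capacity
        (f"rl:tenant:{tenant}:{window_id}", 300),     # noisy-neighbor protection
        (f"rl:endpoint:{endpoint}:{window_id}", 120), # expensive route protection
        (f"rl:user:{user}:{window_id}", 60),          # per-user/IP abuse fallback
    ]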

Response behavior:

  1. Return 429 Too Many Requests.
  2. Include Retry-After.
  3. Include rate limit headers for client adaptation.
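
A sketch of such a response; the X-RateLimit-* header names follow a common convention rather than a formal standard:

from fastapi.responses import JSONResponse

def rate_limited_response(limit: int, reset_epoch: int, retry_after: int = 5) -> JSONResponse:
    # 429 plus headers that let well-behaved clients back off and adapt.
    return JSONResponse(
        {"error": "rate limit exceeded"},
        status_code=429,
        headers={
            "Retry-After": str(retry_after),
            "X-RateLimit-Limit": str(limit),
            "X-RateLimit-Remaining": "0",
            "X-RateLimit-Reset": str(reset_epoch),
        },
    )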

Define degradation behavior before incidents happen. Example ladder:

  1. Level 0 (Normal): all features enabled.
  2. Level 1 (Guarded): disable non-essential background refresh jobs.
  3. Level 2 (Constrained): serve cached results for read-heavy endpoints.
  4. Level 3 (Critical): disable optional features (recommendations, analytics widgets, exports).
  5. Level 4 (Emergency): read-only mode for non-critical domains; core transactions only.

Degradation should be controlled by feature flags and SLO triggers, not manual code edits.
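
A sketch of the trigger side, assuming p99 latency and error rate are the driving signals. The thresholds are illustrative, and the computed level would normally be published through the flag system rather than hard-coded per instance:

def compute_degrade_level(p99_latency_ms: float, error_rate: float, slo_p99_ms: float = 500) -> int:
    # Thresholds are illustrative; tune them against real SLOs.
    if error_rate > 0.20 or p99_latency_ms > 4 * slo_p99_ms:
        return 4  # Emergency: core transactions only
    if error_rate > 0.10 or p99_latency_ms > 3 * slo_p99_ms:
        return 3  # Critical: optional features off
    if error_rate > 0.05 or p99_latency_ms > 2 * slo_p99_ms:
        return 2  # Constrained: cached reads
    if p99_latency_ms > slo_p99_ms:
        return 1  # Guarded: background refresh off
    return 0      # Normal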

Separate policy distribution from request decisions:

  1. Control plane stores limits, tenant tiers, and degradation rules.
  2. Data plane enforces limits locally with low-latency caches.

Implementation pattern:

  1. Source of truth in config DB.
  2. Push updates through pub/sub.
  3. Service instances keep in-memory snapshot with TTL fallback.

This avoids making each request depend on a remote policy lookup.
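
A sketch of the data-plane side of this pattern: an in-memory snapshot refreshed by pub/sub pushes, with a TTL-based reload if pushes are missed. The key and channel names (policy:current, policy:updates) and the TTL are assumptions, not part of the blueprint:

import json
import time
import redis.asyncio as redis

class PolicyCache:
    """In-memory policy snapshot, refreshed by pub/sub pushes with a TTL fallback."""

    def __init__(self, r: redis.Redis, ttl_seconds: int = 60):
        self.r = r
        self.ttl = ttl_seconds
        self.snapshot: dict = {}
        self.loaded_at = 0.0

    async def load(self):
        # The control plane mirrors the config DB into this key (assumed name).
        raw = await self.r.get("policy:current")
        if raw:
            self.snapshot = json.loads(raw)
            self.loaded_at = time.monotonic()

    def get(self, key: str, default=None):
        return self.snapshot.get(key, default)

    async def run(self):
        # Run as a background task, e.g. asyncio.create_task(cache.run()).
        await self.load()
        pubsub = self.r.pubsub()
        await pubsub.subscribe("policy:updates")
        while True:
            msg = await pubsub.get_message(ignore_subscribe_messages=True, timeout=1.0)
            if msg or (time.monotonic() - self.loaded_at) > self.ttl:
                await self.load()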

Minimal practical blueprint:

  1. API gateway enforces global + tenant token buckets.
  2. Service middleware enforces endpoint-level sliding window.
  3. Redis used for shared counters where cross-instance fairness is required.
  4. Local in-memory fallback limiter when Redis is unavailable.
  5. Circuit breakers for downstream dependencies.
  6. Prioritized worker queues for async jobs.

Pseudo flow:

if global_limit_exceeded: return 429
if tenant_limit_exceeded: return 429
if endpoint_concurrency_high:
    if endpoint_optional: return degraded_response
    else: queue_or_fail_fast
if dependency_unhealthy: use cache/fallback
process_request
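
The dependency_unhealthy check is typically backed by a circuit breaker, which the blueprint lists but the example below omits. A minimal sketch, with illustrative thresholds:

import time

class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures; after `reset_timeout`
    seconds it lets trial calls through again (half-open)."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.failures < self.failure_threshold:
            return True  # closed: normal operation
        # Open: only allow calls once the reset timeout has elapsed.
        return time.monotonic() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0  # close the breaker

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()  # (re)open and restart the timer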

Below is a minimal implementation of layered limits and degradation controls.

import os
import time

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
import redis.asyncio as redis

app = FastAPI()
r = redis.from_url(os.getenv("REDIS_URL", "redis://localhost:6379/0"), decode_responses=True)

# Windowed counter in Redis: INCR the key for the current window, set its TTL on
# the first hit, reject once the count exceeds the limit.
LUA_LIMIT = """
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local current = redis.call("INCR", key)
if current == 1 then redis.call("EXPIRE", key, window) end
if current > limit then return 0 end
return 1
"""
limit_script = r.register_script(LUA_LIMIT)

POLICY = {
    "global": {"limit": 2000, "window": 60},
    "tenant_default": {"limit": 300, "window": 60},
    "endpoint": {
        "/search": {"limit": 120, "window": 60},
        "/feed": {"limit": 600, "window": 60},
    },
}

DEGRADE_LEVEL = int(os.getenv("DEGRADE_LEVEL", "0"))


async def allow(key: str, limit: int, window: int) -> bool:
    try:
        result = await limit_script(keys=[key], args=[limit, window])
        return result == 1
    except Exception:
        # Fallback choice depends on endpoint criticality; this is fail-open.
        return True


def tenant_id(req: Request) -> str:
    return req.headers.get("x-tenant-id", "anonymous")


@app.middleware("http")
async def protection_middleware(request: Request, call_next):
    path = request.url.path
    t_id = tenant_id(request)
    minute = int(time.time() // 60)

    # Layer 1: global limit protects total capacity.
    g = POLICY["global"]
    if not await allow(f"rl:global:{minute}", g["limit"], g["window"]):
        return JSONResponse({"error": "global rate limit"}, status_code=429, headers={"Retry-After": "5"})

    # Layer 2: per-tenant limit prevents noisy neighbors.
    t = POLICY["tenant_default"]
    if not await allow(f"rl:tenant:{t_id}:{minute}", t["limit"], t["window"]):
        return JSONResponse({"error": "tenant rate limit"}, status_code=429, headers={"Retry-After": "5"})

    # Layer 3: per-endpoint limit protects expensive routes.
    ep = POLICY["endpoint"].get(path)
    if ep and not await allow(f"rl:endpoint:{path}:{minute}", ep["limit"], ep["window"]):
        return JSONResponse({"error": "endpoint rate limit"}, status_code=429, headers={"Retry-After": "5"})

    # Degradation ladder: shed optional work before core paths fail.
    if DEGRADE_LEVEL >= 2 and path == "/feed":
        return JSONResponse({"items": [], "degraded": True, "reason": "high load"}, status_code=200)
    if DEGRADE_LEVEL >= 3 and path == "/search":
        return JSONResponse({"error": "temporarily unavailable", "degraded": True}, status_code=503)

    return await call_next(request)


@app.get("/feed")
async def feed():
    return {"items": ["a", "b", "c"], "degraded": False}


@app.get("/search")
async def search(q: str):
    return {"query": q, "results": ["r1", "r2"]}

Run locally (if Redis is not reachable, the limiter fails open and all requests pass):

Terminal window
pip install fastapi uvicorn redis
uvicorn app:app --reload

Quick test:

Terminal window
curl -H "x-tenant-id: t1" "http://127.0.0.1:8000/search?q=test"

You can run the Python example end-to-end with Docker using these files.

app.py:

import os
import time

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
import redis.asyncio as redis

app = FastAPI()
r = redis.from_url(os.getenv("REDIS_URL", "redis://redis:6379/0"), decode_responses=True)

LUA_LIMIT = """
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local current = redis.call("INCR", key)
if current == 1 then redis.call("EXPIRE", key, window) end
if current > limit then return 0 end
return 1
"""
limit_script = r.register_script(LUA_LIMIT)

POLICY = {
    "global": {"limit": 2000, "window": 60},
    "tenant_default": {"limit": 120, "window": 60},
    "endpoint": {"/search": {"limit": 20, "window": 60}},
}

DEGRADE_LEVEL = int(os.getenv("DEGRADE_LEVEL", "0"))


async def allow(key: str, limit: int, window: int) -> bool:
    try:
        return (await limit_script(keys=[key], args=[limit, window])) == 1
    except Exception:
        return True


@app.middleware("http")
async def limiter(request: Request, call_next):
    path = request.url.path
    tenant = request.headers.get("x-tenant-id", "anonymous")
    minute = int(time.time() // 60)

    g = POLICY["global"]
    if not await allow(f"rl:g:{minute}", g["limit"], g["window"]):
        return JSONResponse({"error": "global limit"}, status_code=429, headers={"Retry-After": "5"})

    t = POLICY["tenant_default"]
    if not await allow(f"rl:t:{tenant}:{minute}", t["limit"], t["window"]):
        return JSONResponse({"error": "tenant limit"}, status_code=429, headers={"Retry-After": "5"})

    ep = POLICY["endpoint"].get(path)
    if ep and not await allow(f"rl:e:{path}:{minute}", ep["limit"], ep["window"]):
        return JSONResponse({"error": "endpoint limit"}, status_code=429, headers={"Retry-After": "5"})

    if DEGRADE_LEVEL >= 3 and path == "/search":
        return JSONResponse({"error": "temporarily unavailable", "degraded": True}, status_code=503)

    return await call_next(request)


@app.get("/health")
async def health():
    return {"ok": True, "degrade_level": DEGRADE_LEVEL}


@app.get("/search")
async def search(q: str):
    return {"query": q, "results": ["r1", "r2"]}

requirements.txt:

fastapi==0.116.1
uvicorn[standard]==0.35.0
redis==6.4.0

Dockerfile:

FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

docker-compose.yml:

version: "3.9"
services:
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
  api:
    build: .
    environment:
      REDIS_URL: redis://redis:6379/0
      DEGRADE_LEVEL: "${DEGRADE_LEVEL:-0}"
    ports:
      - "8000:8000"
    depends_on:
      - redis

Run:

Terminal window
docker compose up --build

Validate health:

Terminal window
curl http://127.0.0.1:8000/health

Generate traffic and observe limits:

Terminal window
for i in $(seq 1 30); do
curl -s -o /dev/null -w "%{http_code}\n" -H "x-tenant-id: demo" \
"http://127.0.0.1:8000/search?q=test-$i"
done

Test degradation mode:

Terminal window
DEGRADE_LEVEL=3 docker compose up --build -d
curl -i -H "x-tenant-id: demo" "http://127.0.0.1:8000/search?q=test"

Track these metrics per endpoint and per tenant:

  1. rate_limited_requests_total
  2. degraded_responses_total
  3. request_latency_ms (p50/p95/p99)
  4. dependency_timeout_total
  5. queue_depth and queue_wait_ms
  6. circuit_breaker_state
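
A sketch of wiring the first three metrics with prometheus_client (an assumed choice; any metrics library with labeled counters and histograms works):

from prometheus_client import Counter, Histogram

RATE_LIMITED = Counter(
    "rate_limited_requests_total",
    "Requests rejected by a rate limiter",
    ["endpoint", "tenant", "layer"],  # layer: global / tenant / endpoint
)
DEGRADED = Counter(
    "degraded_responses_total",
    "Responses served in a degraded mode",
    ["endpoint", "level"],
)
LATENCY = Histogram(
    "request_latency_ms",
    "Request latency in milliseconds",
    ["endpoint"],
    buckets=(5, 10, 25, 50, 100, 250, 500, 1000, 2500),
)

# Example, inside the middleware on a tenant-level rejection:
# RATE_LIMITED.labels(endpoint=path, tenant=t_id, layer="tenant").inc()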

Key alerts:

  1. Sharp increase in 429s for paid tiers.
  2. Sustained p99 latency above SLO.
  3. Degradation level at 3 or higher for more than N minutes.

Common failure modes and mitigations:

  1. Retry storms from clients
    • Mitigation: jittered backoff guidance (see the sketch after this list), stricter edge limits, retry budgets.
  2. Noisy neighbor tenants
    • Mitigation: per-tenant quotas and weighted fairness.
  3. Shared cache or Redis outage
    • Mitigation: local fallback limiter with conservative defaults.
  4. Over-degrading core traffic
    • Mitigation: separate critical and optional routes with distinct policies.
  5. Slow downstream dependency
    • Mitigation: timeout budgets, circuit breaker, stale-cache fallback.
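
The jittered backoff from the first mitigation is client-side behavior. A full-jitter sketch; the base delay, cap, and attempt count are illustrative:

import random
import time

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    # "Full jitter": pick uniformly between 0 and the capped exponential delay,
    # so retries from many clients do not synchronize into a retry storm.
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def call_with_retries(fn, max_attempts: int = 5):
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted
            time.sleep(backoff_delay(attempt))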

Suggested rollout sequence:

  1. Start in shadow mode (observe-only, no blocking).
  2. Enable 429 enforcement for a small traffic percentage.
  3. Validate false-positive rate and tenant impact.
  4. Enable degradation levels one by one with canary traffic.
  5. Run load tests and game days for dependency failures.
  6. Document incident runbook with exact SLO thresholds and owner actions.

Rate limiting without degradation causes hard failures during incidents. Degradation without rate limiting allows overload to spread.

At scale, both must work together:

  1. Limit demand early.
  2. Protect critical paths first.
  3. Degrade intentionally and recover automatically.