Rate Limiter Design: Algorithms and Distributed Patterns
Every production API needs a rate limiter. Without one, a single misbehaving client can bring down your entire service — whether through a bug that fires requests in a tight loop, a DDoS attack, or simply an unexpectedly viral feature that causes a traffic spike. Rate limiting is not optional infrastructure; it is survival infrastructure.
This post covers the core algorithms, their tradeoffs, distributed implementations with Redis, and where rate limiting belongs in your architecture.
Why Rate Limiting Matters
Rate limiting serves four critical purposes:
- Prevent resource starvation: A single user should not be able to consume all your server resources
- Improve reliability: By rejecting excess traffic early, you keep the service healthy for everyone else
- Reduce cost: Especially with cloud/serverless billing, runaway traffic translates directly to money
- Security: Rate limiting is your first line of defense against brute-force attacks, credential stuffing, and scraping
Real-World Examples
- GitHub (2018): A memcached amplification DDoS attack sent 1.35 Tbps of traffic to GitHub. Rate limiting at the edge helped them recover.
- Twitter API abuse: Without rate limits, bots would scrape the entire platform in hours. Twitter enforces 300 requests per 15-minute window on most endpoints.
- Stripe's approach: Stripe uses a sophisticated token bucket with separate limits per endpoint, per API key, and per IP — protecting both their infrastructure and their customers.
Algorithm 1: Token Bucket
The token bucket is the most widely used rate limiting algorithm. It is simple, memory-efficient, and allows bursts.
How It Works
- A bucket holds tokens, with a maximum capacity of b tokens
- Tokens are added to the bucket at a fixed rate of r tokens per second
- When a request arrives, it consumes one token
- If the bucket is empty, the request is rejected
- The bucket never exceeds its maximum capacity
Time 0s: Bucket has 5 tokens (capacity = 5, rate = 1/sec)
Request 1: ✓ (4 tokens remaining)
Request 2: ✓ (3 tokens remaining)
Request 3: ✓ (2 tokens remaining)
Time 1s: Refill +1 token (3 tokens)
Request 4: ✓ (2 tokens remaining)
Request 5: ✓ (1 token remaining)
Request 6: ✓ (0 tokens remaining)
Request 7: ✗ REJECTED (bucket empty)
Time 2s: Refill +1 token (1 token)
Request 8: ✓ (0 tokens remaining)
Implementation
import time

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow_request(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_refill
        # Refill tokens based on elapsed time
        self.tokens = min(
            self.capacity,
            self.tokens + elapsed * self.refill_rate
        )
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Example: 10 requests per second, burst of 20
limiter = TokenBucket(capacity=20, refill_rate=10)
Pros and Cons
Pros:
- Memory efficient — only stores token count and timestamp
- Allows bursts up to the bucket capacity
- Smooth rate limiting over time
- Simple to understand and implement
Cons:
- Tuning two parameters (capacity and rate) requires thought
- Race conditions in distributed environments without proper locking
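Within a single process, the race can be closed by holding a mutex across the refill-and-spend step. A minimal sketch (class name illustrative); note that a lock only coordinates threads inside one process, so multi-server deployments still need the atomic Redis approach covered later:

```python
import threading
import time

class ThreadSafeTokenBucket:
    """Token bucket guarded by a mutex so two threads cannot both
    spend the last token. Single-process only; distributed setups
    still need an atomic store such as Redis."""

    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()
        self._lock = threading.Lock()

    def allow_request(self) -> bool:
        with self._lock:  # refill and spend happen as one critical section
            now = time.monotonic()
            self.tokens = min(
                self.capacity,
                self.tokens + (now - self.last_refill) * self.refill_rate,
            )
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False
```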
Algorithm 2: Leaky Bucket
The leaky bucket is like a FIFO queue with a fixed drain rate. Requests enter the bucket and are processed at a constant rate, regardless of input burstiness.
How It Works
- Requests enter a queue (the bucket)
- The queue has a fixed capacity — if full, new requests are rejected
- Requests are drained (processed) from the queue at a constant rate
- Unlike token bucket, the output rate is perfectly smooth
Input rate varies: ||| | ||||| || |
v v v v v v v v v v v v
┌─────────────────────────┐
Bucket (queue): │ ● ● ● ● ● ● │ capacity = 10
└─────────┬───────────────┘
│
Output rate constant: ─ ● ─ ● ─ ● ─ ● ─ ● ─ (drain rate)
Implementation
from collections import deque
import time

class LeakyBucket:
    def __init__(self, capacity: int, drain_rate: float):
        self.capacity = capacity
        self.drain_rate = drain_rate  # requests per second
        self.queue = deque()
        self.last_drain = time.monotonic()

    def allow_request(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_drain
        # Drain processed requests
        drain_count = int(elapsed * self.drain_rate)
        if drain_count > 0:
            for _ in range(min(drain_count, len(self.queue))):
                self.queue.popleft()
            self.last_drain = now
        # Check if bucket has room
        if len(self.queue) < self.capacity:
            self.queue.append(now)
            return True
        return False
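A quick check of the clipping behavior (class reproduced from above so the snippet runs standalone): a burst larger than the queue capacity is rejected because the queue fills faster than it drains.

```python
from collections import deque
import time

class LeakyBucket:
    # Same implementation as above, reproduced so this snippet runs standalone
    def __init__(self, capacity: int, drain_rate: float):
        self.capacity = capacity
        self.drain_rate = drain_rate
        self.queue = deque()
        self.last_drain = time.monotonic()

    def allow_request(self) -> bool:
        now = time.monotonic()
        drain_count = int((now - self.last_drain) * self.drain_rate)
        if drain_count > 0:
            for _ in range(min(drain_count, len(self.queue))):
                self.queue.popleft()
            self.last_drain = now
        if len(self.queue) < self.capacity:
            self.queue.append(now)
            return True
        return False

# With capacity 3 and a slow drain, the fourth rapid request finds the
# bucket full (essentially no time has elapsed, so nothing has drained)
bucket = LeakyBucket(capacity=3, drain_rate=1.0)
results = [bucket.allow_request() for _ in range(4)]
print(results)  # → [True, True, True, False]
```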
Pros and Cons
Pros:
- Produces a perfectly smooth output rate
- Predictable processing time for queued requests
Cons:
- Bursts of traffic fill the queue — recent requests may wait behind old ones
- Not ideal when you want to allow controlled bursts
- Uses more memory (stores queue of request timestamps)
Algorithm 3: Fixed Window Counter
The simplest algorithm. Divide time into fixed windows and count requests per window.
How It Works
- Divide time into windows of size w (e.g., 1 minute)
- Each window has a counter starting at 0
- Each request increments the counter for the current window
- If the counter exceeds the limit, reject the request
Window 1 (0:00 - 0:59) Window 2 (1:00 - 1:59)
Limit: 5 requests Limit: 5 requests
|●|●|●|●|●|✗| |●|●|●| |
1 2 3 4 5 rejected 1 2 3
But what about the boundary?
←─── 1 minute ───→
Window 1: .........|●●●●●|
Window 2: |●●●●●|.........
↑
Boundary burst: 10 requests in ~1 second!
The Boundary Problem
The critical flaw is that a user can send the full limit's worth of requests at the very end of one window and another full limit's worth at the start of the next, effectively doubling their rate at the boundary.
Implementation
import time
from collections import defaultdict

class FixedWindowCounter:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        # Note: old window keys are never cleaned up in this in-memory
        # sketch; a production version would keep counters in Redis
        # with a TTL (see INCR+EXPIRE below)
        self.counters = defaultdict(int)

    def _get_window_key(self, user_id: str) -> str:
        window = int(time.time() // self.window_seconds)
        return f"{user_id}:{window}"

    def allow_request(self, user_id: str) -> bool:
        key = self._get_window_key(user_id)
        self.counters[key] += 1
        return self.counters[key] <= self.limit
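The boundary burst is easy to demonstrate by freezing the clock with unittest.mock (a test-only trick, with the class reproduced so the snippet runs standalone):

```python
import time
from collections import defaultdict
from unittest.mock import patch

class FixedWindowCounter:
    # Same logic as above, reproduced so this snippet runs standalone
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counters = defaultdict(int)

    def allow_request(self, user_id: str) -> bool:
        window = int(time.time() // self.window_seconds)
        key = f"{user_id}:{window}"
        self.counters[key] += 1
        return self.counters[key] <= self.limit

limiter = FixedWindowCounter(limit=5, window_seconds=60)

# Five requests just before the minute boundary...
with patch("time.time", return_value=59.5):
    late_burst = [limiter.allow_request("alice") for _ in range(5)]
# ...and five more just after it: all ten pass within one second
with patch("time.time", return_value=60.5):
    early_burst = [limiter.allow_request("alice") for _ in range(5)]

print(sum(late_burst) + sum(early_burst))  # → 10, double the intended limit
```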
Pros and Cons
Pros:
- Extremely simple and memory-efficient (one counter per window)
- Easy to implement with Redis INCR+EXPIRE
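That pattern is small enough to sketch: one counter key per (user, window) pair, incremented atomically, with a TTL so stale windows expire on their own. The function name, key format, and optional now parameter below are illustrative; `r` can be any Redis-style client exposing incr and expire.

```python
import time

def fixed_window_allow(r, user_id: str, limit: int,
                       window_seconds: int, now=None) -> bool:
    """Fixed-window check via the INCR + EXPIRE pattern.

    `r` is any object with Redis-style incr(key) and expire(key, ttl)
    methods, e.g. a redis.Redis client. `now` is injectable for tests.
    """
    now = time.time() if now is None else now
    window = int(now // window_seconds)
    key = f"rl:{user_id}:{window}"
    count = r.incr(key)  # atomic in Redis, so safe across servers
    if count == 1:
        # First hit in this window: set a TTL so the key cleans itself up
        r.expire(key, window_seconds * 2)
    return count <= limit
```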
Cons:
- Boundary burst problem allows 2x the intended rate
- Not suitable for strict rate limiting requirements
Algorithm 4: Sliding Window Log
Fixes the boundary problem by tracking individual request timestamps.
How It Works
- Store the timestamp of every request in a sorted set
- When a new request arrives, remove all timestamps older than the window
- Count the remaining timestamps
- If count exceeds the limit, reject
import time

class SlidingWindowLog:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.requests = {}  # user_id -> list of timestamps

    def allow_request(self, user_id: str) -> bool:
        now = time.time()
        window_start = now - self.window_seconds
        if user_id not in self.requests:
            self.requests[user_id] = []
        # Remove expired timestamps
        self.requests[user_id] = [
            ts for ts in self.requests[user_id]
            if ts > window_start
        ]
        if len(self.requests[user_id]) < self.limit:
            self.requests[user_id].append(now)
            return True
        return False
Pros and Cons
Pros:
- Perfectly accurate — no boundary burst problem
- Smooth, fair rate limiting
Cons:
- High memory usage — stores every request timestamp
- O(n) cleanup on each request (can optimize with sorted sets)
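The sorted-set optimization can be sketched in pure Python with bisect, since timestamps are appended in time order and the log is therefore always sorted; in Redis the same idea maps to a sorted set trimmed with ZREMRANGEBYSCORE. The class name and the injectable now parameter are illustrative:

```python
import bisect
import time
from typing import Dict, List, Optional

class SlidingWindowLogFast:
    """Sliding window log with sorted timestamps: expiring old entries
    is a binary search plus one slice instead of a full scan. In Redis
    the same idea is a sorted set trimmed with ZREMRANGEBYSCORE."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window_seconds = window_seconds
        self.requests: Dict[str, List[float]] = {}

    def allow_request(self, user_id: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now  # injectable clock for tests
        log = self.requests.setdefault(user_id, [])
        # Appends happen in time order, so `log` is sorted: find the first
        # timestamp still inside the window and drop everything before it
        cutoff = bisect.bisect_right(log, now - self.window_seconds)
        if cutoff:
            del log[:cutoff]
        if len(log) < self.limit:
            log.append(now)
            return True
        return False
```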
Algorithm 5: Sliding Window Counter (Recommended)
The best of both worlds. Combines the memory efficiency of fixed windows with the accuracy of sliding windows.
How It Works
- Track counters for the current and previous fixed windows
- Estimate the count in the sliding window using a weighted average
Previous window count: 7 (limit: 10)
Current window count: 3
Current position: 30% into the current window
Estimated count = (7 * 0.70) + (3 * 1.0) = 4.9 + 3 = 7.9
Since 7.9 < 10: ALLOW
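The weighted estimate above is a one-liner (helper name illustrative):

```python
def sliding_window_estimate(prev_count: int, curr_count: int,
                            progress: float) -> float:
    """Estimate requests in the sliding window.
    `progress` is how far we are into the current window (0.0 to 1.0)."""
    return prev_count * (1 - progress) + curr_count

# Numbers from the walkthrough above: 30% into the current window
print(round(sliding_window_estimate(7, 3, 0.30), 2))  # → 7.9
```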
Implementation
import time

class SlidingWindowCounter:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.windows = {}  # user_id -> {window_id: count}

    def allow_request(self, user_id: str) -> bool:
        now = time.time()
        current_window = int(now // self.window_seconds)
        previous_window = current_window - 1
        # Position within current window (0.0 to 1.0)
        window_progress = (now % self.window_seconds) / self.window_seconds
        if user_id not in self.windows:
            self.windows[user_id] = {}
        user_windows = self.windows[user_id]
        # Drop windows older than the previous one so memory stays bounded
        for window_id in list(user_windows):
            if window_id < previous_window:
                del user_windows[window_id]
        prev_count = user_windows.get(previous_window, 0)
        curr_count = user_windows.get(current_window, 0)
        # Weighted estimate
        estimated = prev_count * (1 - window_progress) + curr_count
        if estimated < self.limit:
            user_windows[current_window] = curr_count + 1
            return True
        return False
Why This Is the Best Choice
| Algorithm | Memory | Accuracy | Burst Handling |
|---|---|---|---|
| Token Bucket | Low | Good | Allows controlled bursts |
| Leaky Bucket | Medium | Good | Smooths output |
| Fixed Window | Low | Poor (boundary) | 2x burst at boundary |
| Sliding Window Log | High | Perfect | No bursts |
| Sliding Window Counter | Low | Near-perfect | Minimal boundary effect |
For most production systems, use sliding window counter for general rate limiting and token bucket when you explicitly want to allow bursts.
Distributed Rate Limiting with Redis
In a multi-server environment, you cannot use in-memory rate limiters — each server would maintain its own count, effectively multiplying the limit by the number of servers. You need a centralized counter, and Redis is the standard choice.
Token Bucket in Redis (Lua Script)
We use a Lua script to make the check-and-update atomic:
-- rate_limit.lua
-- KEYS[1] = rate limit key
-- ARGV[1] = bucket capacity
-- ARGV[2] = refill rate (tokens per second)
-- ARGV[3] = current timestamp (seconds)
-- ARGV[4] = tokens to consume (usually 1)
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])
-- Get current state
local data = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(data[1]) or capacity
local last_refill = tonumber(data[2]) or now
-- Calculate refill
local elapsed = math.max(0, now - last_refill)
tokens = math.min(capacity, tokens + elapsed * refill_rate)
-- Check if request is allowed
local allowed = 0
if tokens >= requested then
    tokens = tokens - requested
    allowed = 1
end
-- Save state
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) * 2)
return { allowed, tokens }
Sliding Window Counter in Redis
import redis
import time

class DistributedSlidingWindow:
    def __init__(self, redis_client: redis.Redis,
                 limit: int, window_seconds: int):
        self.redis = redis_client
        self.limit = limit
        self.window = window_seconds

    def allow_request(self, identifier: str) -> bool:
        now = time.time()
        current_window = int(now // self.window)
        previous_window = current_window - 1
        window_progress = (now % self.window) / self.window
        curr_key = f"rl:{identifier}:{current_window}"
        prev_key = f"rl:{identifier}:{previous_window}"
        pipe = self.redis.pipeline()
        pipe.get(prev_key)
        pipe.get(curr_key)
        prev_count, curr_count = pipe.execute()
        prev_count = int(prev_count or 0)
        curr_count = int(curr_count or 0)
        estimated = prev_count * (1 - window_progress) + curr_count
        if estimated >= self.limit:
            return False
        # Increment current window counter.
        # Note: the read and the increment are not atomic, so concurrent
        # requests can slightly overshoot the limit; for strict enforcement,
        # wrap the whole check in a Lua script as with the token bucket above.
        pipe = self.redis.pipeline()
        pipe.incr(curr_key)
        pipe.expire(curr_key, self.window * 2)
        pipe.execute()
        return True
Handling Redis Failures
What happens when Redis is down? You have three options:
- Fail open: Allow all requests (risk overload, but maintain availability)
- Fail closed: Reject all requests (safe, but causes outages)
- Local fallback: Switch to an in-memory rate limiter per server (best option)
class ResilientRateLimiter:
    def __init__(self, redis_limiter, local_limiter):
        self.redis_limiter = redis_limiter
        self.local_limiter = local_limiter

    def allow_request(self, identifier: str) -> bool:
        try:
            return self.redis_limiter.allow_request(identifier)
        except redis.ConnectionError:
            # Fall back to a local limiter whose limit is the global
            # limit divided by the estimated server count
            return self.local_limiter.allow_request(identifier)
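Wiring the fallback together looks like this. The sketch substitutes a stand-in RedisDown exception for redis.ConnectionError, and always-allow stubs for the two limiters, so it runs without a Redis client; all class names here are illustrative.

```python
class RedisDown(Exception):
    """Stand-in for redis.ConnectionError in this self-contained sketch."""

class FailingRedisLimiter:
    """Simulates Redis being unreachable."""
    def allow_request(self, identifier: str) -> bool:
        raise RedisDown("connection refused")

class LocalLimiter:
    """Always-allow stub; in practice this would be a TokenBucket with
    the global rate divided by the estimated server count."""
    def allow_request(self, identifier: str) -> bool:
        return True

class ResilientRateLimiter:
    def __init__(self, redis_limiter, local_limiter):
        self.redis_limiter = redis_limiter
        self.local_limiter = local_limiter

    def allow_request(self, identifier: str) -> bool:
        try:
            return self.redis_limiter.allow_request(identifier)
        except RedisDown:
            # Redis unreachable: degrade to the per-server local limiter
            return self.local_limiter.allow_request(identifier)

rl = ResilientRateLimiter(FailingRedisLimiter(), LocalLimiter())
print(rl.allow_request("user-1"))  # → True (fell back to the local limiter)
```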
Rate Limiting at Different Layers
Rate limiting is not a single-point solution. You should apply it at multiple layers:
Layer 1: Client-Side
Prevent well-behaved clients from accidentally flooding your API:
class RateLimitedClient {
    private queue: Array<() => Promise<void>> = [];
    private processing = false;
    private minInterval = 100; // ms between requests

    async request(url: string, options?: RequestInit): Promise<Response> {
        return new Promise((resolve, reject) => {
            this.queue.push(async () => {
                try {
                    const response = await fetch(url, options);
                    if (response.status === 429) {
                        const retryAfter = response.headers.get('Retry-After');
                        await this.sleep(parseInt(retryAfter || '1', 10) * 1000);
                        resolve(await fetch(url, options));
                    } else {
                        resolve(response);
                    }
                } catch (err) {
                    reject(err);
                }
            });
            this.processQueue();
        });
    }

    private async processQueue() {
        if (this.processing) return;
        this.processing = true;
        while (this.queue.length > 0) {
            const task = this.queue.shift()!;
            await task();
            await this.sleep(this.minInterval);
        }
        this.processing = false;
    }

    private sleep(ms: number) {
        return new Promise(resolve => setTimeout(resolve, ms));
    }
}
Layer 2: API Gateway / Load Balancer
This is your coarse-grained defense. Apply broad limits here:
- Per-IP rate limiting: Protect against DDoS
- Per-API-key limiting: Enforce plan-based quotas
- Global rate limiting: Protect downstream services from overload
Most cloud providers and reverse proxies offer this built-in (AWS API Gateway, Cloudflare, nginx):
# nginx rate limiting
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

server {
    location /api/ {
        limit_req zone=api burst=20 nodelay;
        limit_req_status 429;
        proxy_pass http://backend;
    }
}
Layer 3: Application Level
Fine-grained, business-logic-aware rate limiting:
# Different limits for different operations
RATE_LIMITS = {
    "login": {"limit": 5, "window": 300},            # 5 per 5 min
    "password_reset": {"limit": 3, "window": 3600},  # 3 per hour
    "api_read": {"limit": 1000, "window": 60},       # 1000 per min
    "api_write": {"limit": 100, "window": 60},       # 100 per min
    "file_upload": {"limit": 10, "window": 3600},    # 10 per hour
}

# One shared limiter per operation; constructing a fresh limiter on every
# request (an easy mistake) would discard all counting state
LIMITERS = {
    op: SlidingWindowCounter(cfg["limit"], cfg["window"])
    for op, cfg in RATE_LIMITS.items()
}

async def rate_limit_middleware(request, call_next):
    operation = classify_request(request)
    config = RATE_LIMITS.get(operation)
    if config is None:
        return await call_next(request)
    identifier = get_identifier(request)  # user ID or IP
    if not LIMITERS[operation].allow_request(identifier):
        return JSONResponse(
            status_code=429,
            content={"error": "Rate limit exceeded"},
            headers={
                "Retry-After": str(config["window"]),
                "X-RateLimit-Limit": str(config["limit"]),
                "X-RateLimit-Remaining": "0",
                "X-RateLimit-Reset": str(int(time.time()) + config["window"]),
            },
        )
    return await call_next(request)
Response Headers: Be a Good API Citizen
Always include rate limit information in your responses so clients can self-regulate:
HTTP/1.1 200 OK
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 742
X-RateLimit-Reset: 1679270400

HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1679270400
These headers tell the client exactly how many requests they have left and when the window resets. Well-designed clients use these to avoid hitting the limit in the first place.
Advanced Patterns
Tiered Rate Limiting
Different users get different limits based on their plan:
TIER_LIMITS = {
    "free": {"rpm": 60, "rpd": 1000},
    "basic": {"rpm": 600, "rpd": 10000},
    "pro": {"rpm": 6000, "rpd": 100000},
    "enterprise": {"rpm": 60000, "rpd": 1000000},
}

def get_rate_limit(user) -> dict:
    return TIER_LIMITS.get(user.plan, TIER_LIMITS["free"])
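Enforcing a tier means checking two budgets at once: both the per-minute and the per-day counter must have room. A sketch with an abridged tier table and an illustrative in-memory counter (WindowCounter and allow are made-up names; production would back these with Redis):

```python
from collections import defaultdict

TIER_LIMITS = {
    "free": {"rpm": 60, "rpd": 1000},
    "pro": {"rpm": 6000, "rpd": 100000},
}

class WindowCounter:
    """Minimal fixed-window counter keyed by (user, window size, window id)."""
    def __init__(self):
        self.counts = defaultdict(int)

    def hit(self, key: str, window_seconds: int, now: float) -> int:
        window = int(now // window_seconds)
        self.counts[(key, window_seconds, window)] += 1
        return self.counts[(key, window_seconds, window)]

counter = WindowCounter()

def allow(user_id: str, plan: str, now: float) -> bool:
    limits = TIER_LIMITS.get(plan, TIER_LIMITS["free"])
    # Both the per-minute and the per-day budget must have room
    rpm_ok = counter.hit(user_id, 60, now) <= limits["rpm"]
    rpd_ok = counter.hit(user_id, 86400, now) <= limits["rpd"]
    return rpm_ok and rpd_ok
```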
Adaptive Rate Limiting
Adjust limits dynamically based on system health:
class AdaptiveRateLimiter:
    def __init__(self, base_limit: int):
        self.base_limit = base_limit
        self.current_limit = base_limit

    def adjust(self, cpu_usage: float, error_rate: float):
        """Called periodically by health monitor."""
        if cpu_usage > 0.8 or error_rate > 0.05:
            # System under stress — reduce limit
            self.current_limit = max(
                self.base_limit * 0.5,
                self.current_limit * 0.9
            )
        elif cpu_usage < 0.5 and error_rate < 0.01:
            # System healthy — gradually restore
            self.current_limit = min(
                self.base_limit,
                self.current_limit * 1.1
            )
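Under sustained stress, the multiplicative decrease converges to the floor of half the base limit (class reproduced from above so the snippet runs standalone):

```python
class AdaptiveRateLimiter:
    # Same class as above, reproduced so this snippet runs standalone
    def __init__(self, base_limit: int):
        self.base_limit = base_limit
        self.current_limit = base_limit

    def adjust(self, cpu_usage: float, error_rate: float):
        if cpu_usage > 0.8 or error_rate > 0.05:
            self.current_limit = max(self.base_limit * 0.5,
                                     self.current_limit * 0.9)
        elif cpu_usage < 0.5 and error_rate < 0.01:
            self.current_limit = min(self.base_limit,
                                     self.current_limit * 1.1)

limiter = AdaptiveRateLimiter(base_limit=1000)
for _ in range(10):
    limiter.adjust(cpu_usage=0.95, error_rate=0.0)  # sustained stress
print(limiter.current_limit)  # clamped at half the base limit
```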
Rate Limiting with Priorities
Not all requests are equal. A payment webhook should never be rate-limited while an analytics query can wait:
PRIORITY_OVERRIDES = {
    "webhook": float('inf'),        # Never rate limit webhooks
    "health_check": float('inf'),
    "critical_write": 10.0,         # 10x normal limit
    "normal": 1.0,                  # Base limit
    "background": 0.5,              # Half the normal limit
}
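One way to apply these multipliers is to scale the base limit before the rate check (effective_limit is an illustrative helper, not part of the config above):

```python
PRIORITY_OVERRIDES = {
    "webhook": float('inf'),
    "health_check": float('inf'),
    "critical_write": 10.0,
    "normal": 1.0,
    "background": 0.5,
}

def effective_limit(base_limit: int, request_class: str) -> float:
    """Scale the base limit by the priority multiplier; unknown
    classes fall back to the normal (1.0x) multiplier."""
    return base_limit * PRIORITY_OVERRIDES.get(request_class, 1.0)

print(effective_limit(100, "background"))      # → 50.0
print(effective_limit(100, "critical_write"))  # → 1000.0
```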
Summary: Choosing the Right Algorithm
Here is a decision framework:
- Need to allow bursts? Use token bucket (most APIs)
- Need smooth, constant output? Use leaky bucket (message queues, stream processing)
- Need simplicity and approximate limiting? Use fixed window counter (internal services)
- Need accuracy without boundary issues? Use sliding window counter (production APIs)
- Need perfect accuracy? Use sliding window log (compliance, billing)
For a system design interview, start with the sliding window counter for general rate limiting, then discuss token bucket when the interviewer asks about burst handling. Always mention the distributed Redis implementation and the multi-layer approach. These details demonstrate that you understand how rate limiting works in real production systems, not just in textbook examples.