API Rate Limiting: Strategies, Algorithms, and Best Practices

Every API that's reachable from the internet needs rate limiting. Without it, a single misbehaving client — whether a buggy script or an intentional attacker — can degrade service for everyone else. Rate limiting is not complicated to implement, but the algorithm you choose and how you communicate limits to clients make a real difference in practice.

Why Rate Limiting Exists

The reasons stack. At the lowest level, your server has finite resources — CPU, memory, database connections, third-party API quotas. A single client hammering you with 1,000 requests per second can exhaust any of them.

Beyond raw capacity, rate limiting protects against specific abuse patterns: credential stuffing (automated login attempts), scraping (bulk data extraction), DDoS amplification (using your API as an attack multiplier), and runaway bugs in client code that loops without backoff.

It also enables fair usage — giving all clients a reasonable slice of capacity rather than letting aggressive ones crowd out the rest.

Fixed Window

The simplest algorithm. Divide time into fixed intervals (say, 1 minute), count requests per window, reject once the count exceeds the limit.

Window: 00:00 - 01:00  → 100 requests allowed
Window: 01:00 - 02:00  → 100 requests allowed, counter resets

Implementation is trivial: one counter per client, reset on the minute boundary. Redis makes this easy:

async function isRateLimited(clientId) {
  const window = Math.floor(Date.now() / 60000);   // current minute number
  const key = `ratelimit:${clientId}:${window}`;
  const count = await redis.incr(key);
  if (count === 1) await redis.expire(key, 60);    // first request in the window sets the TTL
  return count > 100;
}

The weakness is the boundary burst: a client can send 100 requests at 00:59 and 100 more at 01:01, effectively getting 200 requests in 2 seconds while technically staying within limits. For most APIs that's fine; for sensitive endpoints it isn't.

Sliding Window

Sliding window tracks requests within the last N seconds relative to the current moment, not relative to a fixed clock boundary. The burst problem disappears.

A practical approximation uses two fixed-window counters (current and previous window) and weights them by how far into the current window you are:

effective_count = previous_count × (1 - elapsed_fraction) + current_count

This is how many Redis-based rate limiters work in practice — it's much cheaper than storing a timestamp for every request and gives a close enough approximation of a true sliding window.
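A minimal in-memory sketch of the two-counter approximation (class and method names here are illustrative; a production version would keep the counters in Redis, as above):

```javascript
// Sliding window approximation: weight the previous window's count by how
// much of it still overlaps the trailing N-second window.
class SlidingWindowLimiter {
  constructor(limit, windowMs) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.counts = new Map(); // clientId -> { windowStart, current, previous }
  }

  isRateLimited(clientId, now = Date.now()) {
    const windowStart = Math.floor(now / this.windowMs) * this.windowMs;
    let state = this.counts.get(clientId);
    if (!state || state.windowStart !== windowStart) {
      // Rolled into a new window: the old "current" becomes "previous",
      // unless the last activity was more than one window ago.
      const previous =
        state && state.windowStart === windowStart - this.windowMs
          ? state.current
          : 0;
      state = { windowStart, current: 0, previous };
      this.counts.set(clientId, state);
    }
    const elapsedFraction = (now - windowStart) / this.windowMs;
    const effective = state.previous * (1 - elapsedFraction) + state.current;
    if (effective >= this.limit) return true;
    state.current += 1;
    return false;
  }
}
```

Note how a burst right before a boundary still counts against the start of the next window — the weakness of the fixed window disappears.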

Token Bucket

Token bucket is the most common algorithm for production rate limiters. Think of it as a bucket that holds tokens. Each request consumes a token. Tokens refill at a steady rate up to the bucket's capacity.

Bucket capacity: 100 tokens
Refill rate: 10 tokens/second

Client sends 100 requests instantly → bucket drains to 0
Client must wait 10 seconds to have 100 tokens again
Or client can send 10 requests/second indefinitely without waiting

Token bucket allows bursting — a client can use accumulated tokens quickly when needed, then fall back to the sustained rate. That matches how real usage behaves: a user might trigger 20 API calls by loading a dashboard, then be quiet for 30 seconds.

State you need per client: current token count and last refill timestamp. No fixed windows, no clock boundaries.
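That state fits in a few lines. A minimal sketch (names are illustrative; the `now` parameter exists so refills can be computed lazily, on demand, rather than by a timer):

```javascript
// Token bucket: refill lazily based on elapsed time, cap at capacity.
class TokenBucket {
  constructor(capacity, refillPerSecond, now = Date.now()) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity;   // start full, so bursts work immediately
    this.lastRefill = now;
  }

  tryConsume(now = Date.now()) {
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSeconds * this.refillPerSecond
    );
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;  // request allowed
    }
    return false;   // bucket empty: reject
  }
}
```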

Leaky Bucket

Leaky bucket works the other way: requests enter a queue at any rate, but they're drained and processed at a fixed constant rate. Overflow is dropped or rejected.

Queue depth: 10 requests
Process rate: 5 requests/second

Burst of 10 → all queued, processed over 2 seconds
Burst of 20 → 10 queued, 10 rejected

Leaky bucket enforces a smooth outbound rate even when inbound traffic is bursty. It's the right choice when protecting a downstream dependency that can't handle bursts — a payment processor or third-party API with its own rate limit. For protecting your own API from clients, token bucket usually fits better because it's more forgiving about short bursts.
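The admission bookkeeping can be sketched much like the token bucket, just inverted (names illustrative; a real implementation would also hold the queued requests and process them at the drain rate, which this counter-only sketch omits):

```javascript
// Leaky bucket admission check: the "level" is how many requests are queued;
// it drains at a fixed rate, and overflow is rejected.
class LeakyBucket {
  constructor(queueDepth, drainPerSecond, now = Date.now()) {
    this.queueDepth = queueDepth;
    this.drainPerSecond = drainPerSecond;
    this.level = 0;          // starts empty
    this.lastDrain = now;
  }

  tryEnqueue(now = Date.now()) {
    const elapsedSeconds = (now - this.lastDrain) / 1000;
    this.level = Math.max(0, this.level - elapsedSeconds * this.drainPerSecond);
    this.lastDrain = now;
    if (this.level < this.queueDepth) {
      this.level += 1;
      return true;  // queued
    }
    return false;   // overflow: reject
  }
}
```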

What a 429 Response Should Look Like

When you reject a request due to rate limiting, the correct HTTP status is 429 Too Many Requests. The response should include a Retry-After header so clients know when they can try again.

HTTP/1.1 429 Too Many Requests
Retry-After: 30
Content-Type: application/json

{
  "error": "rate_limit_exceeded",
  "message": "Too many requests. Retry after 30 seconds.",
  "retry_after": 30
}

Retry-After accepts either a number of seconds or an HTTP date. Seconds is simpler and less error-prone. See HTTP Status Codes Guide for context on the 4xx range.
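As a sketch, the rejection path in an Express-style handler might look like this (`sendRateLimited` is a hypothetical helper, not a library function):

```javascript
// Send a 429 with Retry-After and a matching JSON body.
// `retryAfterSeconds` would come from your limiter's state.
function sendRateLimited(res, retryAfterSeconds) {
  res
    .status(429)
    .set('Retry-After', String(retryAfterSeconds))
    .json({
      error: 'rate_limit_exceeded',
      message: `Too many requests. Retry after ${retryAfterSeconds} seconds.`,
      retry_after: retryAfterSeconds,
    });
}
```

Duplicating the delay in the body is deliberate: not every client library exposes response headers conveniently.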

Rate Limit Headers

Beyond 429 responses, proactive headers let well-behaved clients throttle themselves before hitting the limit. The emerging standard (the IETF httpapi draft "RateLimit header fields for HTTP") uses:

RateLimit-Limit: 100
RateLimit-Remaining: 73
RateLimit-Reset: 27

Many APIs use X- prefixed variants — they predate the standardization effort:

X-RateLimit-Limit: 100
X-RateLimit-Remaining: 73
X-RateLimit-Reset: 1715200800

Note the difference in Reset semantics: the draft's RateLimit-Reset is a delta — the number of seconds until the quota resets — while the older X- variants typically carry a Unix timestamp of the reset moment. Some APIs include Retry-After on non-429 responses too, as a hint when the client is getting close to the limit.

GitHub, Stripe, and Cloudflare all use variations of this pattern. Whatever you pick, be consistent and document it.
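As a sketch, a middleware helper can attach the X- variants from whatever state your limiter tracks (the `{ limit, remaining, resetUnix }` shape is illustrative, not from any library):

```javascript
// Attach X-RateLimit-* headers to an Express-style response.
// resetUnix is a Unix timestamp, matching the common X- convention.
function setRateLimitHeaders(res, { limit, remaining, resetUnix }) {
  res.set('X-RateLimit-Limit', String(limit));
  res.set('X-RateLimit-Remaining', String(remaining));
  res.set('X-RateLimit-Reset', String(resetUnix));
}
```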

Client-Side Strategies

If you're writing a client that calls a rate-limited API, the most important behavior to implement is exponential backoff with jitter.

When you get a 429 (or a 503), don't immediately retry. Wait, then retry. If it fails again, wait longer:

async function fetchWithRetry(url, options, attempt = 0) {
  const response = await fetch(url, options);

  if (response.status === 429 || response.status === 503) {
    if (attempt >= 5) throw new Error('Max retries exceeded');

    // Retry-After may be delta-seconds or an HTTP date; Number() handles the
    // seconds form and yields NaN for a date, which falls through to backoff.
    const retryAfter = Number(response.headers.get('Retry-After'));
    const baseDelay = retryAfter > 0 ? retryAfter * 1000 : 1000 * Math.pow(2, attempt);

    // Add jitter: randomize within ±25% of base delay
    const jitter = baseDelay * 0.25 * (Math.random() * 2 - 1);
    const delay = Math.round(baseDelay + jitter);

    await new Promise(resolve => setTimeout(resolve, delay));
    return fetchWithRetry(url, options, attempt + 1);
  }

  return response;
}

Why jitter? Without it, a burst of 1,000 clients all hitting a rate limit simultaneously will all retry after the same delay — producing another synchronized burst. Jitter spreads the retries over time and prevents the thundering herd.

Rate Limiting by IP vs API Key vs User

The unit of rate limiting matters.

By IP address is the baseline — no auth required, easy to implement. Weakness: shared IP addresses (NAT, corporate proxies, VPNs) mean many legitimate users share a limit. Also relatively easy to circumvent with proxies.

By API key is the standard for developer APIs. Keys are cheap to issue and revoke, and you can set different limits per tier (free/paid/enterprise). Extract the key from the Authorization header before applying the limit:

const apiKey = req.headers['authorization']?.replace('Bearer ', '');
const limitKey = apiKey || req.ip;  // fall back to IP if no key

By authenticated user ID works well for logged-in products. One user's usage doesn't affect others sharing a NAT. Can be combined with IP limiting for unauthenticated endpoints.

By endpoint — different limits for different operations is common and reasonable. A search endpoint that hits the database should have a tighter limit than a health check. Mutation endpoints (POST/PUT/DELETE) often warrant a lower limit than read endpoints.
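As a sketch, endpoint-sensitive limits can be as simple as a lookup the limiter consults per request (the routes and numbers here are illustrative):

```javascript
// Per-endpoint limit selection: tighter limits for expensive or mutating
// operations, looser for cheap reads. Values are requests per minute.
function limitFor(method, path) {
  if (path.startsWith('/api/search')) return 30;              // DB-heavy reads
  if (['POST', 'PUT', 'DELETE'].includes(method)) return 20;  // mutations
  if (path === '/healthz') return 1000;                       // cheap check
  return 100;                                                 // default reads
}
```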

Implementation Notes

For a Node.js/Express backend, the express-rate-limit package covers most common use cases. Pair it with rate-limit-redis if you're running multiple server instances — otherwise each instance has its own counter and the limits are effectively multiplied by your instance count.

const rateLimit = require('express-rate-limit');

const apiLimiter = rateLimit({
  windowMs: 60 * 1000,  // 1 minute
  max: 100,
  standardHeaders: true,  // Return RateLimit-* headers
  legacyHeaders: false,
  message: { error: 'rate_limit_exceeded', retry_after: 60 },
});

app.use('/api/', apiLimiter);

Set trust proxy correctly if you're behind a load balancer or reverse proxy. Without it, req.ip returns the proxy's IP and you'll rate-limit your entire user base as a single client.

Formatting API Responses

When debugging rate limit behavior, inspecting the JSON error responses and headers is the fastest way to understand what's happening. The JSON Formatter helps when working with nested rate limit error bodies from third-party APIs. If you're crafting query strings that include rate limit parameters or debugging callback URLs, the URL Encoder saves time on encoding edge cases.

For a broader view of RESTful API design including error response conventions, see REST API Design Best Practices.