API Rate Limiting Explained

The complete reference for API rate limiting — algorithms, HTTP headers, Redis implementation, real-world limits from major APIs, debugging techniques, and production best practices.

TL;DR — Key Points

Rate limiting: Controlling the number of requests a client can make to an API within a specific time window. Protects servers from abuse, DDoS, and ensures fair resource allocation.

Rate limit window: The time period over which requests are counted. Common windows are per second, per minute, per hour, or per day.

Quota: The maximum number of requests allowed during a rate limit window. For example, 100 requests per minute means a quota of 100 per 60-second window.

Throttling vs blocking: Throttling delays requests gracefully; blocking outright rejects them. Throttling is friendlier for legitimate clients with temporary spikes.

Rate limit headers: HTTP response headers (RateLimit-Limit, RateLimit-Remaining, RateLimit-Reset) that inform clients of their quota and reset time.

Token bucket: Algorithm where tokens accumulate in a bucket at a fixed rate. Each request consumes tokens. When empty, requests are rejected until tokens regenerate.

Sliding window: Algorithm that counts requests in a moving time window, rejecting requests when the count exceeds the threshold.

Leaky bucket: Algorithm that processes requests at a fixed rate, queuing excess requests. Ensures smooth traffic flow even with bursts.

Distributed rate limiting: Rate limiting across multiple servers using shared state (Redis, Memcached). Essential for microservices and load-balanced systems.

What Is API Rate Limiting?

API rate limiting is a technique that controls how many requests a client can make to an API within a defined time window. At its core, it is a counter: every time a client makes a request, a counter increments. When the counter reaches the limit, the API rejects further requests until the window resets. This simple mechanism protects your infrastructure from being overwhelmed, prevents any single client from monopolizing shared resources, and provides the foundation for usage-based billing.

Rate limiting serves three distinct purposes simultaneously. From a security perspective, it limits the blast radius of DDoS attacks, brute-force attempts on authentication endpoints, and scraping attacks — even if an attacker controls thousands of IPs, per-IP limits contain the damage per source. From a fairness perspective, it ensures that one customer with a runaway script cannot degrade the API experience for thousands of other customers on the same infrastructure. From a business perspective, it provides the enforcement mechanism for usage tiers — free users get 1,000 requests/day, paid users get 100,000, enterprise gets custom limits — creating a natural upgrade path.

Without rate limiting, a single misconfigured client — not even a malicious one — can bring down your entire API. A developer accidentally running a script in an infinite loop, a mobile app with a bug that retries every 100ms, or a batch job that parallelizes 500 simultaneous requests can exhaust your database connection pool, spike CPU, and create cascading failures affecting every other user on the platform. Rate limiting is not optional infrastructure — it is the circuit breaker that prevents these scenarios from becoming production incidents.

The challenge is implementing rate limiting in a way that is fair, transparent, and developer-friendly. Heavy-handed rate limiting that blocks legitimate traffic and provides no useful feedback creates a terrible developer experience. Well-designed rate limiting — with clear headers, informative error responses, and reasonable limits — is a feature rather than a constraint. Developers appreciate knowing their quota status and being able to design their applications to stay within limits proactively.

Rate Limiting Algorithms

Four main algorithms implement rate limiting, each with different trade-offs in memory usage, burst handling, and implementation complexity. Choosing the right one depends on your traffic patterns, infrastructure capabilities, and how strictly you need to enforce limits.

Algorithm	Complexity	Best For	Main Trade-off
Token Bucket	O(1) per request	APIs with variable traffic patterns and legitimate burst needs	Requires careful tuning of token generation rate
Sliding Window	O(n) per request	Small time windows (per-second or per-minute limits)	Memory intensive for large time windows or high request rates
Leaky Bucket	O(1) per request	Strict traffic shaping and bandwidth control	Can cause excessive queueing if overloaded
Fixed Window	O(1) per request	Simple implementations where boundary exploits are acceptable	Boundary condition exploits (spike at window edge)

Token Bucket is the most widely used algorithm for REST APIs. Tokens are added to a bucket at a fixed rate — say, 1.67 tokens per second for a 100-requests-per-minute limit. Each request consumes one token. When the bucket is full, excess tokens are discarded (the bucket has a maximum capacity). When the bucket is empty, the next request is rejected. The key advantage is that a client who has been idle accumulates tokens and can make several requests quickly — a legitimate mobile app that wakes from the background and needs to sync data. Token bucket is the right default for most APIs.
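The refill-and-consume cycle described above can be sketched in a few lines. This is a minimal in-memory illustration, not a production implementation: the class name `TokenBucket`, the `allow` method, and the injectable `clock` parameter are assumptions made for the example.

```python
import time


class TokenBucket:
    """In-memory token bucket (illustrative names, single process only)."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity          # maximum tokens the bucket can hold
        self.refill_rate = refill_rate    # tokens added per second
        self.clock = clock                # injectable so the logic can be tested
        self.tokens = float(capacity)     # start with a full bucket
        self.last_refill = clock()

    def allow(self, cost=1):
        """Consume `cost` tokens and return True, or return False if too few remain."""
        now = self.clock()
        elapsed = now - self.last_refill
        # Credit tokens earned since the last call; anything above capacity is discarded.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With `capacity=10` and `refill_rate=100/60`, an idle client can send 10 requests back to back and is then held to roughly one request every 600 ms.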

Sliding Window counts requests in a moving time range. For a 60-requests-per-minute limit with sliding window, the server counts how many requests the client made in the last 60 seconds from the current moment, not from the last minute boundary. This eliminates the boundary exploit where a client sends requests at 11:59:59 and 12:00:00 to get double quota. The downside is memory: you must store a timestamp for each request in the window, which can become significant at high request rates or long window durations.
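A sliding window log can be sketched with a queue of timestamps. The example below is a single-process illustration under assumed names (`SlidingWindowLog`, `allow`, and an injectable `clock`); it stores one timestamp per accepted request, which is the memory cost mentioned above.

```python
import time
from collections import deque


class SlidingWindowLog:
    """Exact sliding window: keeps one timestamp per accepted request."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.timestamps = deque()   # oldest accepted request is on the left

    def allow(self):
        now = self.clock()
        # Drop entries that have aged out of the window (now - window, now].
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False
```

Because the count always covers the previous 60 seconds, a client that spends its whole quota at 11:59:59 is still rejected at 12:00:00.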

Leaky Bucket processes requests at a constant rate, queuing excess requests. Unlike token bucket, it does not allow bursts: all requests flow out at the same fixed rate. This makes it suitable for traffic shaping to a downstream service that cannot handle bursts, where you want requests to arrive at a steady 50/minute, not 50 in the first second and none for the next 59 seconds. The risk is queue buildup: if clients keep sending requests faster than the leak rate, the queue fills and further requests are dropped.
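The queue-and-drain behaviour can be sketched as follows. This is one possible shape, assuming a bounded queue that a worker drains periodically; the names `LeakyBucket`, `submit`, and `drain` are illustrative, and idle time deliberately earns no credit so that a burst cannot form after a quiet period.

```python
import time
from collections import deque


class LeakyBucket:
    """Bounded queue drained at a constant rate (illustrative sketch)."""

    def __init__(self, capacity, leak_rate, clock=time.monotonic):
        self.capacity = capacity      # maximum number of queued requests
        self.leak_rate = leak_rate    # requests released per second
        self.clock = clock
        self.queue = deque()
        self.last_leak = clock()

    def submit(self, request):
        """Queue a request; return False if the queue is full and it is dropped."""
        if not self.queue:
            # Idle time earns no credit: the leak timer restarts with the first item.
            self.last_leak = self.clock()
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(request)
        return True

    def drain(self):
        """Return the requests that have become due since the last drain."""
        now = self.clock()
        due = int((now - self.last_leak) * self.leak_rate)
        released = [self.queue.popleft() for _ in range(min(due, len(self.queue)))]
        if due:
            # Advance only by whole release slots so fractional progress is kept.
            self.last_leak += due / self.leak_rate
        return released
```

A worker would call `drain()` on a timer and forward whatever it returns to the downstream service.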

Fixed Window is the simplest: count requests in the current hour, reset the counter at the top of each hour. It is O(1) in time and space, trivially implemented with a Redis INCR and EXPIRE. Its weakness is the boundary exploit — a determined client can double their effective quota by making maximum requests at 11:59:59 and again at 12:00:00. For most public APIs where the threat model is misconfigured clients rather than adversarial exploitation, fixed window is acceptable. For security-sensitive endpoints (authentication, financial operations), always use sliding window or token bucket.
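A fixed window counter, and the boundary exploit it permits, can be illustrated in memory. The class below is a sketch with assumed names (`FixedWindowCounter`, `allow`); a Redis-backed version would keep the same logic but store the counter in a shared key.

```python
import time


class FixedWindowCounter:
    """One counter per clock-aligned window (illustrative, single client)."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.current_window = None
        self.count = 0

    def allow(self):
        # Windows are aligned to the clock, e.g. window 0 covers seconds 0..3599.
        window_id = int(self.clock() // self.window)
        if window_id != self.current_window:
            self.current_window = window_id   # new window: counter starts over
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


# Boundary exploit: a full quota just before the reset and another just after it.
now = [3599.0]
limiter = FixedWindowCounter(limit=100, window_seconds=3600, clock=lambda: now[0])
before = sum(limiter.allow() for _ in range(100))
now[0] = 3600.0
after = sum(limiter.allow() for _ in range(100))
accepted_in_two_seconds = before + after
```

Here `accepted_in_two_seconds` is twice the nominal hourly limit, which is why the algorithm is a poor fit for security-sensitive endpoints.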

Rate Limiting Strategies

Rate limits can be applied at different granularities. Most production APIs combine multiple strategies simultaneously — per-IP, per-user, per-API-key, and global — because each catches a different attack vector that the others miss.

Strategy	Scope	Best Used For
Per-user rate limit	By authenticated user ID	SaaS platforms, member-only APIs
Per-IP rate limit	By client IP address	Public APIs, containing abuse from a single IP
Per-API-key rate limit	By API key	Developer-friendly APIs, usage tiers
Per-endpoint rate limit	Per API route	Protect sensitive endpoints more strictly
Global rate limit	Total requests across all clients	Prevent server overload and cascade failures
Tiered rate limit	Different limits by subscription level	Freemium models, monetization

The recommended layered approach is to enforce three limits simultaneously: a global limit that prevents server overload regardless of source, a per-IP limit that contains anonymous traffic and abusive individual sources, and a per-API-key limit that enforces individual user quotas and supports tiered pricing. A user with a valid Pro API key gets 10,000 requests/hour while an anonymous IP gets 100 requests/hour; the two counters operate independently, and hitting one limit does not change the other.
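One way to picture the layering is a check that must pass every applicable scope before any counter is incremented. The sketch below counts requests for a single window only and leaves out window rotation; `LayeredLimiter` and its parameters are hypothetical names, and the choice to apply the per-IP limit only to anonymous traffic is an assumption drawn from the example above.

```python
from collections import Counter


class LayeredLimiter:
    """Global + per-IP + per-key limits for one window (rotation omitted)."""

    def __init__(self, global_limit, ip_limit, key_limits, default_key_limit):
        self.global_limit = global_limit
        self.ip_limit = ip_limit                  # applies to anonymous traffic
        self.key_limits = key_limits              # e.g. {"pro-key": 10_000}
        self.default_key_limit = default_key_limit
        self.counts = Counter()

    def allow(self, ip, api_key=None):
        scopes = [("global", self.global_limit)]
        if api_key is None:
            scopes.append((f"ip:{ip}", self.ip_limit))
        else:
            limit = self.key_limits.get(api_key, self.default_key_limit)
            scopes.append((f"key:{api_key}", limit))
        # Check every scope first so a rejected request consumes no quota anywhere.
        if any(self.counts[name] >= cap for name, cap in scopes):
            return False
        for name, _ in scopes:
            self.counts[name] += 1
        return True
```

Checking before incrementing matters: otherwise a request rejected by the per-key limit would still eat into the global budget.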

Per-endpoint limits deserve special attention for sensitive operations. Your search endpoint might allow 100 requests/minute, but your login endpoint should allow only 5 attempts/minute per IP — regardless of whether the user has an authenticated API key. SMS OTP endpoints, password reset endpoints, and financial transaction endpoints should all have their own aggressively conservative limits that operate independently from the global API quota. This prevents credential-stuffing attacks that exploit normal-looking traffic volumes at auth endpoints.

Rate Limit HTTP Headers

Rate limit information must be communicated to clients via HTTP response headers on every request, not just on 429 responses. This allows well-behaved clients to monitor their remaining quota and slow down proactively before hitting the limit, which means a better experience for your users and fewer 429 responses for your servers.

Header	Type	Meaning	Example
RateLimit-Limit	Response	Total requests allowed in the current window	100
RateLimit-Remaining	Response	Requests remaining in the current window	42
RateLimit-Reset	Response	Seconds until the limit resets (per the IETF draft; some APIs send a Unix timestamp instead)	30
X-RateLimit-Limit	Response	Legacy vendor-specific quota header	100
X-RateLimit-Remaining	Response	Legacy remaining requests header	42
X-RateLimit-Reset	Response	Legacy reset time header (seconds or Unix ts)	60
Retry-After	Response	Seconds to wait before retrying (sent with 429)	30
X-Forwarded-For	Request	Client IP when behind proxy/load balancer	203.0.113.42

The IETF HTTPAPI working group maintains an Internet-Draft (draft-ietf-httpapi-ratelimit-headers) for rate limit headers. Its earlier revisions define RateLimit-Limit, RateLimit-Remaining, and RateLimit-Reset, with the reset expressed as seconds remaining; later revisions restructure these into RateLimit and RateLimit-Policy fields, so check the current revision before committing to a format. The older X-RateLimit-* variants (with the X- prefix) are vendor-specific conventions that predate the draft and are used by GitHub and many other major APIs. New APIs should implement the draft headers and can provide the X- variants as aliases for backwards compatibility with existing client SDKs.

When a 429 response is returned, always include Retry-After in seconds, telling the client exactly how long to wait. A response of Retry-After: 30 means the client should wait 30 seconds before its next request. Some APIs set this to the time until the window resets; others use a fixed backoff suggestion. Either is acceptable, but be consistent. Clients that respect Retry-After create dramatically less unnecessary traffic than clients that poll on a fixed interval.
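As an illustration of the headers on a normal response and on a rejection, here is a small framework-independent sketch. The function names, the `rate_limit_exceeded` error code, and the documentation URL are hypothetical. It sends RateLimit-Reset as seconds remaining, which is how the IETF draft defines it; an API that publishes a Unix timestamp would substitute that value.

```python
import math

DOCS_URL = "https://api.example.com/docs/rate-limits"  # hypothetical docs link


def rate_limit_headers(limit, remaining, reset_in_seconds):
    """Headers to attach to every response, successful or not."""
    return {
        "RateLimit-Limit": str(limit),
        "RateLimit-Remaining": str(max(0, remaining)),
        # Round up so clients never retry a fraction of a second too early.
        "RateLimit-Reset": str(max(0, math.ceil(reset_in_seconds))),
    }


def too_many_requests(limit, reset_in_seconds):
    """Build a (status, headers, body) triple for a rejected request."""
    headers = rate_limit_headers(limit, 0, reset_in_seconds)
    headers["Retry-After"] = headers["RateLimit-Reset"]  # seconds until the window resets
    body = {
        "error": "rate_limit_exceeded",  # assumed error code
        "message": f"Limit of {limit} requests exceeded. "
                   f"Retry in {headers['Retry-After']} seconds.",
        "docs": DOCS_URL,
    }
    return 429, headers, body
```

This version ties Retry-After to the window reset; a fixed backoff suggestion would work equally well as long as it is applied consistently.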

Rate Limits in Real APIs

Understanding how major APIs implement rate limiting gives you benchmarks for designing your own limits and helps you build integrations that respect quotas. The approaches vary significantly in granularity, header conventions, and enforcement strategies.

API	Authenticated Limit	Notes
GitHub REST API	5,000 req/hour (authenticated)	Search API: 30 req/min. GraphQL: 5,000 points/hour. Uses X-RateLimit-* headers.
OpenAI API	Varies by model & tier	GPT-4: 10,000 TPM (tokens/min) on Tier 1. Adds x-ratelimit-limit-tokens and x-ratelimit-limit-requests.
Stripe API	100 req/sec (live mode)	Test mode is lower. Returns 429 with error.code = 'rate_limit'. Recommends exponential backoff.
Twilio	100 req/sec per account	Messaging API has separate SMS throughput limits. Returns 429 with Retry-After.
Twitter/X API	500,000 tweets/month (Basic)	Complex tiered system. Read vs write limits differ. 15-min rolling windows.
Cloudflare API	1,200 req/5 min	Zone analytics: 300 req/day. Uses X-RateLimit-* headers.

GitHub's rate limiting is one of the most well-designed in the industry: 5,000 requests/hour for authenticated requests, with the X-RateLimit-Remaining header on every response so clients always know their status. GitHub also uses a separate search API limit (30 requests/minute) and GraphQL point-based limits. Their secondary rate limits (triggered by high concurrency rather than request counts) use 429 with a Retry-After delay. GitHub's documentation is the gold standard for communicating rate limit policies to developers.

OpenAI's rate limiting is model-dependent and multi-dimensional: it limits both requests per minute (RPM) and tokens per minute (TPM). A request that processes a short prompt counts as 1 request but only a few hundred tokens; a request with a large context window counts as 1 request but tens of thousands of tokens. OpenAI returns both x-ratelimit-limit-requests and x-ratelimit-limit-tokens, allowing clients to track which dimension is the binding constraint. This multi-dimensional approach is appropriate whenever a single request has variable resource cost.

Stripe's 100 requests/second limit in live mode is generous for most applications, but Stripe is interesting because it returns error.type: "rate_limit_error" in the JSON body in addition to the 429 status code. Their official recommendation is exponential backoff with initial delay of 1 second. Stripe's test mode has lower limits, which is worth noting during development and load testing — your test environment behavior may not reflect what live mode can handle.

Implementation Patterns

Rate limiting can be implemented at different layers of your architecture, and the right choice depends on your infrastructure scale, latency requirements, and deployment model.

Local in-process

Rate limit state stored in application memory

Pros: Fast, no network latency

Use when: Small deployments, development environments

Redis-backed

Shared rate limit state in Redis with atomic INCR + TTL

Pros: Distributed, fast, supports all algorithms

Use when: Production microservices, multi-server deployments

API Gateway middleware

Rate limiting in reverse proxy or API gateway layer

Pros: Transparent to application, handles at the edge

Use when: Kubernetes + Kong, AWS API Gateway, Nginx

Database-backed

Rate limit counters stored in persistent database

Pros: Durable state, survives restarts

Use when: Legacy systems, database-centric architectures

Edge / CDN-level

Rate limiting at CDN edge nodes (Cloudflare Workers)

Pros: Lowest latency, closest to client, massive scale

Use when: Global APIs, DDoS protection, high-scale public APIs

Redis-Based Rate Limiting

Redis is the de facto standard for distributed rate limiting because it provides atomic operations, sub-millisecond latency, and native TTL support. The core pattern uses two Redis commands: INCR (atomic increment) and EXPIRE (set TTL on a key). The implementation is elegant and fast.

For a fixed window implementation, the Redis key encodes the client identifier and the current time window: rate:user:123:minute:202605211430 — user 123, minute window starting at 14:30 UTC on May 21, 2026. On each request, you call INCR on this key. If the key didn't exist before, also call EXPIRE with 60 seconds to ensure it cleans up automatically. If the returned value is greater than the limit, return 429. This entire operation should be wrapped in a Lua script that executes atomically on Redis to avoid race conditions between INCR and EXPIRE.
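The fixed window steps above might look like this in Python. It is a sketch that assumes a client object exposing `eval(script, numkeys, *keys_and_args)`, as redis-py does; the helper names `window_key` and `is_allowed` are illustrative. The Lua script performs INCR and the conditional EXPIRE as one atomic unit.

```python
from datetime import datetime, timezone

# Runs atomically inside Redis: increment, and set the TTL only on first use.
FIXED_WINDOW_LUA = """
local count = redis.call('INCR', KEYS[1])
if count == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return count
"""


def window_key(user_id, now):
    """Build a key such as rate:user:123:minute:202605211430 (UTC minute)."""
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%d%H%M")
    return f"rate:user:{user_id}:minute:{stamp}"


def is_allowed(client, user_id, limit, now=None):
    """Return True if this request is within the per-minute limit."""
    now = now or datetime.now(timezone.utc)
    count = client.eval(FIXED_WINDOW_LUA, 1, window_key(user_id, now), 60)
    return int(count) <= limit
```

`now` is expected to be a timezone-aware datetime; passing it in explicitly keeps the key generation easy to test.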

For sliding window in Redis, you use a sorted set. Each request adds a member with the current Unix timestamp as both member and score: ZADD rate:user:123 1716278400.123 1716278400.123. If two requests can arrive with the same timestamp, append a unique suffix to the member so that the second does not overwrite the first. Before counting, remove all members older than the window: ZREMRANGEBYSCORE rate:user:123 0 (now-window). Then count remaining members with ZCARD. This approach stores every request timestamp and gives exact sliding window semantics, but uses more memory for high-volume APIs.
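The sorted-set steps can be sketched as below, assuming a client with redis-py style `zremrangebyscore`, `zcard`, `zadd`, and `expire` methods. The function name is illustrative, and the random suffix on each member is an addition that keeps two requests with an identical timestamp from collapsing into one entry. The calls here are issued one by one for readability; a production version would run them inside a Lua script or transaction so concurrent requests cannot interleave.

```python
import time
import uuid


def sliding_window_allow(client, key, limit, window_seconds, now=None):
    """Return True and record the request if fewer than `limit` are in the window."""
    now = time.time() if now is None else now
    # 1. Evict entries that have aged out of the window.
    client.zremrangebyscore(key, 0, now - window_seconds)
    # 2. Count what is left.
    if client.zcard(key) >= limit:
        return False
    # 3. Record this request; score is the timestamp, member is made unique.
    member = f"{now}:{uuid.uuid4().hex}"
    client.zadd(key, {member: now})
    # 4. Let idle keys clean themselves up.
    client.expire(key, int(window_seconds) + 1)
    return True
```

Rejected requests are not recorded in this sketch, so a client that keeps hammering the API regains capacity as soon as its accepted requests age out.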

For token bucket in Redis, you store two values per client: the current token count and the last refill timestamp. On each request, calculate how many tokens have been added since the last refill (elapsed_time × refill_rate), add them to the stored count up to the bucket maximum, subtract one for the current request if a token is available, and store the updated values. Because this is a read-modify-write sequence, it must run atomically to prevent race conditions in concurrent environments. A Lua script is the usual choice; a plain MULTI/EXEC block cannot use the values it reads to decide what to write, so a transaction-based version needs WATCH with a retry loop.

The key operational concern with Redis-backed rate limiting is what happens when Redis is unavailable. You have two choices: fail open (allow all requests when Redis is down, accepting that limits aren't enforced temporarily) or fail closed (reject all requests when Redis is down, accepting availability impact). For most APIs, fail open is correct — a short Redis outage should not cause a user-facing outage. For security-critical endpoints like authentication, fail closed is safer. Whichever you choose, alert immediately on Redis connectivity failures so the issue is resolved before it becomes significant.
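The fail-open versus fail-closed decision can be isolated in a small wrapper. The sketch below is illustrative: `check_with_fallback` is a hypothetical helper, and it catches Python's built-in `ConnectionError` and `TimeoutError` as stand-ins. With a real Redis client you would pass that library's own connection exception types through the `errors` parameter.

```python
def check_with_fallback(check, fail_open=True, on_error=None,
                        errors=(ConnectionError, TimeoutError)):
    """Run a rate-limit check; if the backing store is unreachable, apply the policy.

    check     -- zero-argument callable returning True (allow) or False (reject)
    fail_open -- True: allow requests during an outage; False: reject them
    on_error  -- optional callback for alerting or metrics
    errors    -- exception types that count as "store unavailable" (assumed set)
    """
    try:
        return check()
    except errors as exc:
        if on_error is not None:
            on_error(exc)        # alert immediately; limits are not being enforced
        return fail_open
```

A general API route might use the default, while a login route would pass `fail_open=False`.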

Client-Side Rate Limit Handling

Well-designed API clients are as important as well-designed rate limiting. A client that doesn't handle 429 responses gracefully can amplify the problem: if 1,000 clients all hit the limit simultaneously and all retry after exactly the same delay, they create a synchronized spike that immediately exhausts the quota again. This is the thundering herd problem.

Exponential backoff is the standard solution. On the first 429, wait 1 second before retrying. On the second consecutive 429, wait 2 seconds. On the third, 4 seconds. On the fourth, 8 seconds, up to a maximum cap (typically 64 seconds or the Retry-After value, whichever is larger). The doubling ensures that a cluster of clients naturally spreads out — some retry sooner and succeed, reducing load, allowing later retriers to succeed with remaining quota.

Jitter enhances exponential backoff by adding random variation to the delay. Instead of exactly 4 seconds, you wait between 2 and 6 seconds (4 seconds ± 50%). This prevents clients that started simultaneously from becoming synchronized again after backoff. The two common jitter strategies are full jitter (delay = random(0, base_delay)) and equal jitter (delay = base_delay/2 + random(0, base_delay/2)). In the comparison AWS published on its Architecture Blog, full jitter did the least total client work, while equal jitter did slightly more work and took longer to complete; what equal jitter offers in return is that it never retries almost immediately.
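Both jitter formulas fit in one helper. The sketch below uses assumed names (`backoff_delay`, `retry_delay`) and defaults taken from the text: a 1-second base, doubling per attempt, and a 64-second cap. When the server supplies Retry-After, the larger of the two values is used.

```python
import random


def backoff_delay(attempt, base=1.0, cap=64.0, strategy="full", rng=random):
    """Delay in seconds before retry number `attempt` (0-based)."""
    exp = min(cap, base * 2 ** attempt)          # 1, 2, 4, 8, ... capped at `cap`
    if strategy == "full":
        return rng.uniform(0, exp)               # anywhere in [0, exp]
    if strategy == "equal":
        return exp / 2 + rng.uniform(0, exp / 2) # anywhere in [exp/2, exp]
    return exp                                   # any other value: no jitter


def retry_delay(attempt, retry_after=None, **kwargs):
    """Honor the server's Retry-After when it asks for a longer wait."""
    delay = backoff_delay(attempt, **kwargs)
    return delay if retry_after is None else max(delay, retry_after)
```

A retry loop would sleep for `retry_delay(attempt, retry_after)` after each 429 and give up after a fixed number of attempts.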

Proactive rate limiting is even better than reacting to 429s. Read the RateLimit-Remaining header on every response and slow down requests as the remaining count decreases. If remaining drops below 10% of the quota, reduce your request rate to spread the remaining quota over the full reset window. This avoids hitting the limit entirely, meaning no requests are ever rejected — the best user experience possible.
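The pacing rule described above reduces to a short calculation. This sketch assumes the client has already parsed the headers into numbers and that the reset value is expressed as seconds remaining; `pacing_delay` and its `threshold` parameter are illustrative names.

```python
def pacing_delay(remaining, limit, reset_in_seconds, threshold=0.10):
    """Seconds to wait before the next request, based on the latest headers."""
    reset_in_seconds = max(0.0, reset_in_seconds)
    if remaining <= 0:
        return reset_in_seconds                 # quota is gone: wait for the reset
    if remaining >= limit * threshold:
        return 0.0                              # plenty of quota left: no pacing
    # Below the threshold: spread what is left evenly over the rest of the window.
    return reset_in_seconds / remaining
```

For example, with 5 of 100 requests left and 30 seconds until reset, the client would wait 6 seconds between requests.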

For batch processing jobs that need to make many API calls, implement a request queue with a rate limiter on the client side. Instead of sending all requests in parallel and relying on the server to reject the overflow, pre-limit your own send rate to stay within the quota. A token bucket on the client side, seeded with the API's quota, naturally paces requests at or below the limit without any 429 responses. This approach is significantly more efficient for batch jobs than the retry-on-rejection pattern.

Common Scenarios and Solutions

Rate limiting problems in production follow predictable patterns. Recognizing the scenario quickly leads you to the right solution.

Legitimate spikes from a mobile app

Challenge: Short bursts exceed limit but average volume is reasonable

→ Use token bucket with burst allowance — allows up to N requests instantly, then throttles to average rate

Bot or automated scraper

Challenge: Sustained high request rate from single source

→ Aggressive per-IP limit, CAPTCHA on anonymous endpoints, API key requirement, IP blocklist

Window boundary exploit

Challenge: Client sends max requests at 11:59:59 and again at 12:00:00, getting double quota

→ Replace fixed windows with sliding window or token bucket which have no hard reset boundaries

Microservice internal calls hitting public limit

Challenge: Service-to-service calls consume user-facing quota

→ Separate API keys with elevated limits for internal traffic; IP-allowlist internal subnets

Distributed DDoS with many IPs

Challenge: Attack spreads across many IPs, per-IP limit is ineffective

→ Global rate limit, geographic filtering, WAF, anomaly detection on request patterns

Legitimate developer hitting limit during testing

Challenge: Developer stress-tests integration and exhausts daily quota

→ Separate sandbox/test environment with higher limits; provide quota increase request process

Rate limit state not synced across servers

Challenge: Three servers each allow 100 req/min, effectively giving 300 req/min total

→ Centralize state in Redis; use atomic INCR + EXPIRE to maintain single source of truth

Thundering herd after service recovery

Challenge: All clients retry simultaneously after an outage, causing immediate overload again

→ Retry-After header + client-side exponential backoff with jitter spreads retries over time

Debugging Rate Limit Issues

When rate limiting behaves unexpectedly, systematic diagnosis beats guessing. Most issues fall into one of these patterns.

🔍 All requests return 429

Likely cause: Global rate limit hit, Redis down, or misconfigured limit value

What to check: Check total traffic volume, Redis connection health, and limit configuration value

🔍 Only one IP getting 429

Likely cause: Per-IP limit is too low for that client's traffic pattern

What to check: Verify IP identification (check X-Forwarded-For), review per-IP limit value

🔍 Rate limit resets at wrong time

Likely cause: Limit window duration misconfigured or server clock skew

What to check: Verify RateLimit-Reset timestamp, check server NTP synchronization

🔍 Rate limit works locally but not in production

Likely cause: Distributed state not synced, or limits configured differently per environment

What to check: Confirm Redis connection in production, compare config across environments

🔍 429 but RateLimit-Remaining shows quota available

Likely cause: Algorithm error, Redis state stale, or per-endpoint limit hit that differs from global

What to check: Check specific endpoint limits, inspect Redis data, verify clock sync

🔍 Clients not respecting Retry-After

Likely cause: Client library outdated, Retry-After format incorrect, or client ignoring headers

What to check: Verify Retry-After format (an integer number of seconds, or an HTTP-date), update client SDK

🔍 Rate limit check latency is high

Likely cause: Redis latency, network partition, or slow synchronous rate-limit check code

What to check: Monitor Redis P99 response times, check network latency, consider pipelining

🔍 Limit enforcement inconsistent across requests

Likely cause: Multiple servers with local in-memory state, or Redis replication lag

What to check: Confirm all servers connect to the same Redis, check replication status

Monitoring and Alerting

Proactive monitoring detects rate limiting issues before they impact users. These are the metrics that matter most in production.

Metric	Healthy Target	Action if Off
Rate limit hit rate	< 5% for healthy APIs	If high, investigate traffic patterns or adjust limits
429 responses per endpoint	Should correlate with expected traffic	Monitor for unusual spikes or legitimate user patterns
Retry success rate	> 80% for legitimate clients	If low, increase Retry-After window or adjust limits
Rate limit check latency	< 1ms per request check	Optimize Redis access, use pipelining, check network
Distributed sync delay	< 100ms	Check Redis latency and network conditions
False positive rate	< 1%	Review limit thresholds and client traffic patterns

Best Practices

Always send rate-limit headers

Include RateLimit-Limit, RateLimit-Remaining, and RateLimit-Reset in every response, not just 429s

Impact: Enables proactive client backoff before hitting the limit

Include Retry-After on 429

Tell clients exactly how long to wait. Use seconds format for simplicity.

Impact: Reduces retry storms and wasted request attempts

Use tiered limits

Different limits for free, paid, and enterprise tiers. Separate limits for internal services.

Impact: Monetizes API access and protects premium users from free-tier abuse

Layer per-IP and per-key limits

Enforce both simultaneously — per-IP catches anonymous abuse, per-key enforces user quotas

Impact: Defense-in-depth against varied attack patterns

Use Redis for distributed state

Atomic INCR with EXPIRE ensures accurate counts across all server instances

Impact: Prevents bypassing limits by targeting different servers

Log every 429 response

Record client IP, user ID, endpoint, timestamp, and request count for each rejection

Impact: Early detection of attacks, quota tuning data, client debugging

Whitelist internal services

Use separate higher-limit API keys for service-to-service calls

Impact: Prevents internal traffic from triggering limits designed for external clients

Provide clear error messages

429 body should state the limit, when it resets, and link to docs

Impact: Reduces support load and developer frustration significantly

Common Mistakes to Avoid

❌ Fixed window with boundary exploit

Problem: Client requests at window boundaries (11:59 and 12:00) to get double quota

Fix: Use sliding window or token bucket instead of hard resets

❌ No distributed rate limit state

Problem: Each server tracks limits independently; client calls different servers and bypasses limits entirely

Fix: Use Redis INCR + EXPIRE for shared counter across all instances

❌ Limiting by proxy IP

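Problem: Limits are enforced on the load balancer IP instead of the real client IP, so all clients share one counter and a single heavy client can exhaust it for everyone

Fix: Extract the real client IP from X-Forwarded-For or CF-Connecting-IP, and trust those headers only when they are set by your own proxy or CDN, since clients can forge them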

❌ No RateLimit headers on non-429 responses

Problem: Clients don't see quota status and retry too early, wasting requests on every call

Fix: Always include RateLimit-* headers in every API response, not just errors

❌ Limits too aggressive for real usage

Problem: Legitimate applications exceed limits during normal page loads or batch jobs

Fix: Analyze real traffic histograms before setting limits; provide burst allowances

❌ No monitoring of 429 spikes

Problem: DDoS or abuse goes unnoticed until damage is severe or service is degraded

Fix: Alert on any spike in 429 response rate exceeding baseline

❌ Unclear 429 error messages

Problem: Clients don't know why they're rate-limited, which limit was hit, or when to retry

Fix: Return structured JSON with limit type, current count, reset time, and retry guidance

❌ Same limits for internal and external traffic

Problem: Microservice calls hit public rate limits, causing cascading failures inside your own infrastructure

Fix: Separate API key tiers with elevated limits for internal service-to-service calls

Worked Examples

Token Bucket — 100 requests/min with 10 req/sec burst

Inputs

Sustained quota: 100 tokens per minute • Refill rate: 1.67 tokens/sec (100/60) • Bucket capacity (burst): 10 tokens • Cost: 1 token per request

Scenario

Client sends 15 requests at once against a full bucket. The bucket holds 10 tokens (the burst limit), so the first 10 succeed immediately and requests 11–15 are rejected with 429. After 0.6 seconds, 1 more token has arrived, allowing one further request.

Calculation

Refill = 100 ÷ 60 ≈ 1.67 tokens/sec, so one token arrives every 1 ÷ 1.67 ≈ 0.6 seconds. Bucket capacity = 10, which is the largest possible burst. After the initial burst, the client must space requests at least 600ms apart to stay within the refill rate.

Sliding Window — 60 requests per minute

Inputs

Window: 60 seconds • Limit: 60 requests • Current time: 12:34:50 UTC

Scenario

Between 12:34:00 and 12:34:10 UTC, the client made 40 requests. At 12:34:50 UTC, 25 more requests arrive. The window (12:33:50, 12:34:50] already holds 40 requests, so only 20 more are allowed; the other 5 are rejected with 429.

Calculation

Window = last 60 seconds from now. Count = 40 existing. Remaining = 60 − 40 = 20. Capacity starts to return when the oldest request in the window ages out, 60 seconds after it was made: 12:34:00 + 60s = 12:35:00 UTC.

Tiered limits — Free vs Pro user

Inputs

Free tier: 1,000 requests/day • Pro tier: 100,000 requests/day • Reset: 24 hours from first request

Scenario

Free user has made 950 requests today. They receive RateLimit-Remaining: 50. After 50 more requests, they get 429 with Retry-After pointing to their 24-hour reset. Pro user at 80,000 requests has 20,000 remaining.

Calculation

Remaining = tier_quota − used_today. Reset = user_start_of_day + 86400 seconds. Each response includes current remaining count so clients can pace themselves.

Frequently Asked Questions

What is the difference between rate limiting and throttling?

Rate limiting rejects requests when quota is exceeded — it returns HTTP 429 immediately. Throttling delays requests to spread them over time, queuing them rather than rejecting. Throttling is gentler on legitimate clients because their requests eventually succeed; rate limiting is absolute. Many production systems use both: throttle short legitimate spikes to avoid false 429s, then rate-limit sustained abuse that throttling can't contain.

Which rate-limiting algorithm should I use?

Token bucket is most popular for general-purpose APIs because it handles burst traffic naturally: a client that has been idle accumulates tokens and can make several requests quickly, then gets throttled to the average rate. This matches how real applications behave. Use sliding window when you need precise enforcement with no boundary effects. Use leaky bucket when you need strict, predictable output rates to a downstream service. For most REST APIs, token bucket with per-user and per-IP tiers is the right starting point.

How do I handle rate limits in a distributed system?

Use Redis with atomic INCR + EXPIRE. When a request arrives, execute INCR on a key like rate:user:123:minute:202605211500 and set its TTL to your window duration. All servers read and write from the same Redis instance, ensuring the count is accurate regardless of which server handles a given request. For high-traffic APIs, use Redis Cluster or Redis Sentinel for availability. A Lua script running INCR + EXPIRE atomically prevents race conditions.

What HTTP status code should I return when rate-limiting?

Always HTTP 429 Too Many Requests. Include RateLimit-Remaining: 0, RateLimit-Reset: [unix timestamp], and Retry-After: [seconds until reset] in the response headers. In the JSON body, include an error code, a human-readable message explaining which limit was exceeded, and a link to your rate-limiting documentation. Never return 503 for rate limiting — that signals server unavailability rather than client quota exhaustion.

What is exponential backoff and why is it important?

Exponential backoff is a retry strategy where the wait time doubles with each failed attempt: 1 second after the first failure, 2 after the second, 4 after the third, 8 after the fourth, up to a maximum cap (e.g., 64 seconds). It prevents the thundering herd problem: if 1,000 clients all hit a rate limit simultaneously and all retry after exactly 30 seconds, they create another spike at T+30 that immediately exhausts the quota again. Exponential backoff with random jitter (multiplying the delay by a random factor 0.5–1.5) spreads retries across a time range, giving the API time to recover.

How do I prevent rate-limit boundary exploits with fixed windows?

Fixed windows reset at exact boundaries (e.g., every hour at :00). A client can exploit this by sending 1,000 requests at 12:59:59 and another 1,000 at 13:00:00, effectively sending 2,000 requests in 2 seconds. Sliding windows eliminate this because they count requests in the last N seconds from now, not from a fixed clock boundary. Token bucket is even better: the burst capacity is bounded by the bucket size, not by when the clock ticks over.

Can rate limits be bypassed?

Yes, in several ways: (1) If you use in-process memory instead of shared Redis, clients can hit different servers and bypass limits entirely. (2) If you identify clients by load-balancer IP instead of the real client IP from X-Forwarded-For, all clients appear as one IP with no per-client limit. (3) Fixed windows allow the boundary exploit. (4) If your Redis TTLs are misconfigured, counters may expire early, effectively resetting limits. Mitigation: always use Redis for distributed state, extract real IPs correctly, use sliding windows or token bucket.

How do I monitor if rate limits are working?

Track these metrics: total requests per minute, 429 response rate (alert if > 5% of traffic), 429 distribution by IP and user (sudden spikes indicate abuse), Retry-After header accuracy (do clients actually back off?), and false positive rate (are any legitimate users hitting limits during normal operations?). Set alerts for: any IP generating > 10% of total 429s in a 5-minute window, 429 rate spiking > 3x baseline, and Redis connection errors causing fail-open behavior.

Should internal service-to-service calls be rate-limited?

Internal calls should have different limits, not be exempt. Give each internal service its own API key with elevated quotas that reflect its actual traffic needs. This prevents a misbehaving internal service from consuming the entire quota and taking down external-facing functionality. It also gives you visibility into internal call volumes. Never use the same rate limit tier for both public clients and internal microservices — the traffic profiles are completely different.

What is the IETF RateLimit header standard?

The IETF Internet-Draft draft-ietf-httpapi-ratelimit-headers proposes standardized header names: RateLimit-Limit, RateLimit-Remaining, RateLimit-Reset, and RateLimit-Policy. It is still a draft, not a published RFC, and the exact fields have changed between revisions. The older X-RateLimit-* variants (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset) are vendor-specific conventions with no governing specification. GitHub and many other major APIs currently use the X- variants. New APIs should implement the draft headers and provide X- variants as aliases for backwards compatibility.

How do I set the right rate limit values for my API?

Start by measuring, not guessing. Collect real traffic data for one to two weeks across different user cohorts. Find the P95 request rate for your typical user — this becomes your base limit. Set your limit at 2–3x P95 to give headroom for legitimate spikes without allowing runaway clients. For sensitive endpoints like login, authentication, or password reset, use much lower limits (5–10 requests per minute) regardless of traffic data. Revisit limits every quarter and adjust based on actual 429 rates and customer feedback.
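The P95-plus-headroom rule can be computed directly from collected traffic data. The sketch below assumes you already have one request-rate figure per user (for example, each user's peak requests per minute); `suggest_limit` and the default `headroom` of 2.5 are illustrative choices within the 2–3x range mentioned above.

```python
import statistics


def suggest_limit(per_user_rates, headroom=2.5):
    """Return (p95, suggested_limit) from observed per-user request rates."""
    # quantiles(n=100) yields 99 cut points; index 94 is the 95th percentile.
    p95 = statistics.quantiles(per_user_rates, n=100)[94]
    return p95, round(p95 * headroom)
```

The result is a starting point to revisit against real 429 rates, not a value to set once and forget.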