API Rate Limiting Explained
The complete reference for API rate limiting — algorithms, HTTP headers, Redis implementation, real-world limits from major APIs, debugging techniques, and production best practices.
What Is API Rate Limiting?
API rate limiting is a technique that controls how many requests a client can make to an API within a defined time window. At its core, it is a counter: every time a client makes a request, a counter increments. When the counter reaches the limit, the API rejects further requests until the window resets. This simple mechanism protects your infrastructure from being overwhelmed, prevents any single client from monopolizing shared resources, and provides the foundation for usage-based billing.
Rate limiting serves three distinct purposes simultaneously. From a security perspective, it limits the blast radius of DDoS attacks, brute-force attempts on authentication endpoints, and scraping attacks — even if an attacker controls thousands of IPs, per-IP limits contain the damage per source. From a fairness perspective, it ensures that one customer with a runaway script cannot degrade the API experience for thousands of other customers on the same infrastructure. From a business perspective, it provides the enforcement mechanism for usage tiers — free users get 1,000 requests/day, paid users get 100,000, enterprise gets custom limits — creating a natural upgrade path.
Without rate limiting, a single misconfigured client — not even a malicious one — can bring down your entire API. A developer accidentally running a script in an infinite loop, a mobile app with a bug that retries every 100ms, or a batch job that parallelizes 500 simultaneous requests can exhaust your database connection pool, spike CPU, and create cascading failures affecting every other user on the platform. Rate limiting is not optional infrastructure — it is the circuit breaker that prevents these scenarios from becoming production incidents.
The challenge is implementing rate limiting in a way that is fair, transparent, and developer-friendly. Heavy-handed rate limiting that blocks legitimate traffic and provides no useful feedback creates a terrible developer experience. Well-designed rate limiting — with clear headers, informative error responses, and reasonable limits — is a feature rather than a constraint. Developers appreciate knowing their quota status and being able to design their applications to stay within limits proactively.
Rate Limiting Algorithms
Four main algorithms implement rate limiting, each with different trade-offs in memory usage, burst handling, and implementation complexity. Choosing the right one depends on your traffic patterns, infrastructure capabilities, and how strictly you need to enforce limits.
| Algorithm | Complexity | Best For | Main Trade-off |
|---|---|---|---|
| Token Bucket | O(1) per request | APIs with variable traffic patterns and legitimate burst needs | Requires careful tuning of token generation rate |
| Sliding Window | O(n) per request | Small time windows (per-second or per-minute limits) | Memory intensive for large time windows or high request rates |
| Leaky Bucket | O(1) per request | Strict traffic shaping and bandwidth control | Can cause excessive queueing if overloaded |
| Fixed Window | O(1) per request | Simple implementations where boundary exploits are acceptable | Boundary condition exploits (spike at window edge) |
Token Bucket is the most widely used algorithm for REST APIs. Tokens are added to a bucket at a fixed rate — say, 1.67 tokens per second for a 100-requests-per-minute limit. Each request consumes one token. When the bucket is full, excess tokens are discarded (the bucket has a maximum capacity). When the bucket is empty, the next request is rejected. The key advantage is that a client who has been idle accumulates tokens and can make several requests quickly — a legitimate mobile app that wakes from the background and needs to sync data. Token bucket is the right default for most APIs.
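The refill-and-consume cycle described above can be sketched in a few lines. This is a minimal in-memory illustration, not a production limiter: the class name `TokenBucket`, its fields, and the choice to pass `now` explicitly (so the behavior is deterministic) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Illustrative in-memory token bucket (hypothetical names)."""
    capacity: float           # maximum tokens the bucket can hold (burst size)
    refill_rate: float        # tokens added per second
    tokens: float = 0.0       # current token count
    last_refill: float = 0.0  # time of the last refill, in seconds

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Add tokens earned since the last call; overflow is discarded.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# 100 requests per minute, starting with a full bucket.
bucket = TokenBucket(capacity=100, refill_rate=100 / 60, tokens=100)
burst = [bucket.allow(now=0.0) for _ in range(101)]  # the 101st finds the bucket empty
```

In a real service `now` would typically come from a monotonic clock, and each client would get its own bucket.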
Sliding Window counts requests in a moving time range. For a 60-requests-per-minute limit with sliding window, the server counts how many requests the client made in the last 60 seconds from the current moment, not from the last minute boundary. This eliminates the boundary exploit where a client sends requests at 11:59:59 and 12:00:00 to get double quota. The downside is memory: you must store a timestamp for each request in the window, which can become significant at high request rates or long window durations.
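As a rough sketch of the timestamp-log variant described above, the class below keeps one timestamp per accepted request and evicts those that have aged out. `SlidingWindowLog` and its method names are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
from collections import deque


class SlidingWindowLog:
    """Illustrative sliding-window limiter that stores one timestamp per request."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window = window_seconds
        self.timestamps = deque()  # accepted request times, oldest first

    def allow(self, now: float) -> bool:
        # Evict requests that are no longer inside the last `window` seconds.
        cutoff = now - self.window
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False


limiter = SlidingWindowLog(limit=60, window_seconds=60)
late_burst = [limiter.allow(now=59.0) for _ in range(60)]  # just before the minute mark
right_after = limiter.allow(now=60.0)  # crossing the clock boundary does not reset the count
```

The memory cost mentioned above is visible here: the deque grows to `limit` entries per client.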
Leaky Bucket processes requests at a constant rate, queuing excess requests. Unlike token bucket, it does not allow bursts — all requests flow out at the same fixed rate. This makes it suitable for traffic shaping to a downstream service that cannot handle bursts: you want requests to arrive at exactly 50/minute, not 100 in the first second and 0 for the next 59 seconds. The risk is queue buildup: if clients send requests faster than the leak rate indefinitely, the queue grows and requests are eventually dropped.
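One way to model this is to track the queue depth as a level that drains at the leak rate, without simulating the queue itself. The sketch below does only the admission decision (queue or drop); the class name `LeakyBucket`, the `queue_size` parameter, and the explicit `now` argument are assumptions for illustration.

```python
class LeakyBucket:
    """Illustrative leaky bucket: admits requests into a bounded queue
    that drains at a constant rate."""

    def __init__(self, leak_rate: float, queue_size: int) -> None:
        self.leak_rate = leak_rate    # requests drained per second
        self.queue_size = queue_size  # maximum number of queued requests
        self.level = 0.0              # current queue depth
        self.last_leak = 0.0          # time of the last drain calculation

    def offer(self, now: float) -> bool:
        """Return True if the request is queued, False if it is dropped."""
        # Drain whatever has leaked out since the last call.
        elapsed = max(0.0, now - self.last_leak)
        self.level = max(0.0, self.level - elapsed * self.leak_rate)
        self.last_leak = now
        if self.level + 1 <= self.queue_size:
            self.level += 1
            return True
        return False


# Drain at 50 requests/minute with room for 5 queued requests.
shaper = LeakyBucket(leak_rate=50 / 60, queue_size=5)
offers = [shaper.offer(now=0.0) for _ in range(7)]  # 5 queued, 2 dropped
```

A complete implementation would also need a worker that dequeues and forwards requests at the leak rate; that part is omitted here.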
Fixed Window is the simplest: count requests in the current hour, reset the counter at the top of each hour. It is O(1) in time and space, trivially implemented with a Redis INCR and EXPIRE. Its weakness is the boundary exploit — a determined client can double their effective quota by making maximum requests at 11:59:59 and again at 12:00:00. For most public APIs where the threat model is misconfigured clients rather than adversarial exploitation, fixed window is acceptable. For security-sensitive endpoints (authentication, financial operations), always use sliding window or token bucket.
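The boundary exploit is easy to reproduce with a small in-memory counter. The sketch below is illustrative only: `FixedWindowCounter` is a hypothetical name, old windows are never cleaned up, and timestamps are plain seconds so the hour boundary falls at 3600.

```python
from collections import defaultdict


class FixedWindowCounter:
    """Illustrative fixed-window limiter keyed by (client, window index)."""

    def __init__(self, limit: int, window_seconds: int) -> None:
        self.limit = limit
        self.window = window_seconds
        self.counts = defaultdict(int)  # old windows are not evicted in this sketch

    def allow(self, client_id: str, now: float) -> bool:
        window_index = int(now // self.window)
        key = (client_id, window_index)
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True


limiter = FixedWindowCounter(limit=100, window_seconds=3600)
# 100 requests one second before the boundary, then 100 more right on it.
before = sum(limiter.allow("client-a", now=3599.0) for _ in range(100))
after = sum(limiter.allow("client-a", now=3600.0) for _ in range(100))
# before + after == 200 accepted requests within about one second.
```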
Rate Limiting Strategies
Rate limits can be applied at different granularities. Most production APIs combine multiple strategies simultaneously — per-IP, per-user, per-API-key, and global — because each catches a different attack vector that the others miss.
| Strategy | Scope | Best Used For |
|---|---|---|
| Per-user rate limit | By authenticated user ID | SaaS platforms, member-only APIs |
| Per-IP rate limit | By client IP address | Public APIs, prevent DDoS from single IP |
| Per-API-key rate limit | By API key | Developer-friendly APIs, usage tiers |
| Per-endpoint rate limit | Per API route | Protect sensitive endpoints more strictly |
| Global rate limit | Total requests across all clients | Prevent server overload and cascade failures |
| Tiered rate limit | Different limits by subscription level | Freemium models, monetization |
The recommended layered approach is to enforce three limits simultaneously: a global limit that prevents server overload regardless of source, a per-IP limit that contains anonymous traffic and abuse from any single address, and a per-API-key limit that enforces individual user quotas and supports tiered pricing. A user with a valid Pro API key gets 10,000 requests/hour while an anonymous IP gets 100 requests/hour. The two limits operate independently, and hitting one does not change the other. Attacks spread across many IPs slip under per-IP limits, which is why the global limit is part of the stack.
Per-endpoint limits deserve special attention for sensitive operations. Your search endpoint might allow 100 requests/minute, but your login endpoint should allow only 5 attempts/minute per IP — regardless of whether the user has an authenticated API key. SMS OTP endpoints, password reset endpoints, and financial transaction endpoints should all have their own aggressively conservative limits that operate independently from the global API quota. This prevents credential-stuffing attacks that exploit normal-looking traffic volumes at auth endpoints.
Rate Limit HTTP Headers
Rate limit information should be communicated to clients via HTTP response headers on every request, not just on 429 responses. This allows well-behaved clients to monitor their remaining quota and slow down proactively before hitting the limit, which means a better experience for your users and fewer 429 responses for your servers.
| Header | Type | Meaning | Example |
|---|---|---|---|
| RateLimit-Limit | Response | Total requests allowed in the current window | 100 |
| RateLimit-Remaining | Response | Requests remaining in the current window | 42 |
| RateLimit-Reset | Response | Seconds until the limit resets in the IETF draft (some APIs send a Unix timestamp instead) | 30 |
| X-RateLimit-Limit | Response | Legacy vendor-specific quota header | 100 |
| X-RateLimit-Remaining | Response | Legacy remaining requests header | 42 |
| X-RateLimit-Reset | Response | Legacy reset time header (seconds or Unix ts) | 60 |
| Retry-After | Response | Seconds to wait before retrying (sent with 429) | 30 |
| X-Forwarded-For | Request | Client IP when behind proxy/load balancer | 203.0.113.42 |
The IETF has published a draft standard (draft-ietf-httpapi-ratelimit-headers) that defines RateLimit-Limit, RateLimit-Remaining, and RateLimit-Reset as the canonical header names, with RateLimit-Reset expressed as the number of seconds until the window resets. It is still an Internet-Draft, so check the current revision before depending on exact field names. The older X-RateLimit-* variants (with the X- prefix) are vendor-specific extensions: they were introduced before the draft existed and are used by GitHub and many other established APIs, often with a Unix timestamp in X-RateLimit-Reset. New APIs should implement the IETF draft headers and provide the X- variants as aliases for backwards compatibility with existing client SDKs.
When a 429 response is returned, always include Retry-After in seconds, telling the client exactly how long to wait. A response of Retry-After: 30 means the client should wait 30 seconds before its next request. Some APIs set this to the time until the window resets; others use a fixed backoff suggestion. Either is acceptable, but be consistent. Clients that respect Retry-After create dramatically less unnecessary traffic than clients that poll on a fixed interval.
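On the client side, reading Retry-After could look like the sketch below. HTTP allows the header to carry either a number of seconds or an HTTP-date, so the helper handles both; the function name `retry_after_seconds` and the explicit `now` parameter are assumptions for the example, and malformed values are left to raise.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value: str, now: datetime) -> float:
    """Convert a Retry-After header value into a wait time in seconds."""
    value = header_value.strip()
    if value.isdigit():
        # Delay form, e.g. "30".
        return float(value)
    # Date form, e.g. "Wed, 21 Oct 2015 07:28:30 GMT".
    retry_at = parsedate_to_datetime(value)
    return max(0.0, (retry_at - now).total_seconds())


# `now` must be timezone-aware so it can be compared with the parsed date.
reference = datetime(2015, 10, 21, 7, 28, 0, tzinfo=timezone.utc)
wait = retry_after_seconds("Wed, 21 Oct 2015 07:28:30 GMT", reference)
```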
Rate Limits in Real APIs
Understanding how major APIs implement rate limiting gives you benchmarks for designing your own limits and helps you build integrations that respect quotas. The approaches vary significantly in granularity, header conventions, and enforcement strategies.
| API | Authenticated Limit | Notes |
|---|---|---|
| GitHub REST API | 5,000 req/hour (authenticated) | Search API: 30 req/min. GraphQL: 5,000 points/hour. Uses X-RateLimit-* headers. |
| OpenAI API | Varies by model & tier | GPT-4: 10,000 TPM (tokens/min) on Tier 1. Adds x-ratelimit-limit-tokens and x-ratelimit-limit-requests. |
| Stripe API | 100 req/sec (live mode) | Test mode is lower. Returns 429 with error.code = 'rate_limit'. Recommends exponential backoff. |
| Twilio | 100 req/sec per account | Messaging API has separate SMS throughput limits. Returns 429 with Retry-After. |
| Twitter/X API | 500,000 tweets/month (Basic) | Complex tiered system. Read vs write limits differ. 15-min rolling windows. |
| Cloudflare API | 1,200 req/5 min | Zone analytics: 300 req/day. Uses X-RateLimit-* headers. |
GitHub's rate limiting is one of the best-designed in the industry: 5,000 requests/hour for authenticated requests, with the X-RateLimit-Remaining header on every response so clients always know their status. GitHub also uses a separate search API limit (30 requests/minute) and GraphQL point-based limits. Their secondary rate limits (triggered by high concurrency rather than request counts) respond with 403 or 429, and include a Retry-After delay when the client should wait a specific time. GitHub's documentation is the gold standard for communicating rate limit policies to developers.
OpenAI's rate limiting is model-dependent and multi-dimensional: it limits both requests per minute (RPM) and tokens per minute (TPM). A request that processes a short prompt counts as 1 request but only a few hundred tokens; a request with a large context window counts as 1 request but tens of thousands of tokens. OpenAI returns both x-ratelimit-limit-requests and x-ratelimit-limit-tokens, allowing clients to track which dimension is the binding constraint. This multi-dimensional approach is appropriate whenever a single request has variable resource cost.
Stripe's 100 requests/second limit in live mode is generous for most applications. Stripe is also worth studying because it returns an error object with code rate_limit in the JSON body in addition to the 429 status code, so clients can detect the condition from either. Their official recommendation is exponential backoff with initial delay of 1 second. Stripe's test mode has lower limits, which matters during development and load testing: your test environment behavior may not reflect what live mode can handle.
Implementation Patterns
Rate limiting can be implemented at different layers of your architecture, and the right choice depends on your infrastructure scale, latency requirements, and deployment model.
Local in-process
Rate limit state stored in application memory
Redis-backed
Shared rate limit state in Redis with atomic INCR + TTL
API Gateway middleware
Rate limiting in reverse proxy or API gateway layer
Database-backed
Rate limit counters stored in persistent database
Edge / CDN-level
Rate limiting at CDN edge nodes (Cloudflare Workers)
Redis-Based Rate Limiting
Redis is the de facto standard for distributed rate limiting because it provides atomic operations, sub-millisecond latency, and native TTL support. The core pattern uses two Redis commands: INCR (atomic increment) and EXPIRE (set TTL on a key). The implementation is elegant and fast.
For a fixed window implementation, the Redis key encodes the client identifier and the current time window: rate:user:123:minute:202605211430 — user 123, minute window starting at 14:30 UTC on May 21, 2026. On each request, you call INCR on this key. If the key didn't exist before, also call EXPIRE with 60 seconds to ensure it cleans up automatically. If the returned value is greater than the limit, return 429. This entire operation should be wrapped in a Lua script that executes atomically on Redis to avoid race conditions between INCR and EXPIRE.
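A minimal sketch of that pattern, assuming the redis-py client (whose `eval(script, numkeys, *keys_and_args)` method runs a Lua script on the server). The helper names `window_key` and `is_allowed` and the constant names are hypothetical; the key layout follows the example above.

```python
from datetime import datetime, timezone

WINDOW_SECONDS = 60

# Lua runs atomically inside Redis, so no other command can slip in
# between INCR and EXPIRE.
FIXED_WINDOW_LUA = """
local count = redis.call('INCR', KEYS[1])
if count == 1 then
  redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return count
"""


def window_key(user_id: str, now: datetime) -> str:
    """Build a per-minute key such as rate:user:123:minute:202605211430."""
    return f"rate:user:{user_id}:minute:{now.strftime('%Y%m%d%H%M')}"


def is_allowed(redis_client, user_id: str, limit: int, now: datetime) -> bool:
    # redis_client is assumed to be a redis.Redis instance (or compatible).
    count = redis_client.eval(
        FIXED_WINDOW_LUA, 1, window_key(user_id, now), WINDOW_SECONDS
    )
    return int(count) <= limit
```

A caller would pass `datetime.now(timezone.utc)` as `now`; it is a parameter here only to keep the key derivation testable.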
For sliding window in Redis, you use a sorted set. Each request adds a member with the current Unix timestamp as both member and score: ZADD rate:user:123 1716278400.123 1716278400.123. Before counting, remove all members older than the window: ZREMRANGEBYSCORE rate:user:123 0 (now-window). Then count remaining members with ZCARD. This approach stores every request timestamp and gives exact sliding window semantics, but uses more memory for high-volume APIs.
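The sorted-set steps above might be combined into one pipelined round trip as sketched below, assuming a redis-py client. Two deliberate deviations from the description: the member is the timestamp plus a random suffix (so two requests in the same instant do not collapse into one entry), and the request is recorded before counting, which means rejected requests also occupy the window. The function name is hypothetical.

```python
import uuid


def sliding_window_allow(redis_client, key: str, limit: int,
                         window_seconds: int, now: float) -> bool:
    """Record one request and report whether it fits in the sliding window."""
    # redis_client is assumed to be a redis.Redis instance (or compatible).
    pipe = redis_client.pipeline(transaction=True)
    pipe.zremrangebyscore(key, 0, now - window_seconds)  # drop expired entries
    pipe.zadd(key, {f"{now}:{uuid.uuid4().hex}": now})    # record this request
    pipe.zcard(key)                                      # count entries in the window
    pipe.expire(key, window_seconds)                     # let idle keys clean themselves up
    _, _, count, _ = pipe.execute()
    return count <= limit
```

If rejected requests should not count against the client, the check and the insert need to happen conditionally, which usually means moving the logic into a Lua script.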
For token bucket in Redis, you store two values per client: the current token count and the last refill timestamp. On each request, calculate how many tokens have been added since the last refill (elapsed_time × refill_rate), add them to the stored count up to the bucket maximum, subtract one for the current request, and store the updated values. Because this involves read-modify-write, it must run atomically to prevent race conditions in concurrent environments. A Lua script is the simplest way to do that; MULTI/EXEC alone is not enough because the new token count depends on a value read inside the transaction, so it would need WATCH and a retry loop.
The key operational concern with Redis-backed rate limiting is what happens when Redis is unavailable. You have two choices: fail open (allow all requests when Redis is down, accepting that limits aren't enforced temporarily) or fail closed (reject all requests when Redis is down, accepting availability impact). For most APIs, fail open is correct — a short Redis outage should not cause a user-facing outage. For security-critical endpoints like authentication, fail closed is safer. Whichever you choose, alert immediately on Redis connectivity failures so the issue is resolved before it becomes significant.
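The fail-open versus fail-closed decision can be isolated in a small wrapper, sketched below. Everything here is illustrative: `check_with_fallback` is a hypothetical helper, `check` stands for whatever function queries Redis, and the default `errors` tuple uses Python's built-in exceptions, so with a real client you would pass that library's own connection and timeout error types.

```python
from typing import Callable, Tuple, Type


def check_with_fallback(
    check: Callable[[], bool],
    fail_open: bool,
    on_error: Callable[[Exception], None],
    errors: Tuple[Type[Exception], ...] = (ConnectionError, TimeoutError),
) -> bool:
    """Run a rate-limit check; if the backing store is unreachable,
    report the failure and fall back to the configured policy."""
    try:
        return check()
    except errors as exc:
        on_error(exc)      # e.g. increment a metric or trigger an alert
        return fail_open   # True = allow the request, False = reject it


def _store_is_down() -> bool:
    raise ConnectionError("rate-limit store unreachable")


alerts = []
search_allowed = check_with_fallback(_store_is_down, fail_open=True, on_error=alerts.append)
login_allowed = check_with_fallback(_store_is_down, fail_open=False, on_error=alerts.append)
```

Keeping the policy as a per-endpoint argument makes it straightforward to fail open for general traffic and fail closed for authentication.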
Client-Side Rate Limit Handling
Well-designed API clients are as important as well-designed rate limiting. A client that doesn't handle 429 responses gracefully can amplify the problem: if 1,000 clients all hit the limit simultaneously and all retry after exactly the same delay, they create a synchronized spike that immediately exhausts the quota again. This is the thundering herd problem.
Exponential backoff is the standard solution. On the first 429, wait 1 second before retrying. On the second consecutive 429, wait 2 seconds. On the third, 4 seconds. On the fourth, 8 seconds, up to a maximum cap (typically 64 seconds or the Retry-After value, whichever is larger). The doubling ensures that a cluster of clients naturally spreads out — some retry sooner and succeed, reducing load, allowing later retriers to succeed with remaining quota.
Jitter enhances exponential backoff by adding a random variation to the delay. Instead of exactly 4 seconds, you wait between 2 and 6 seconds (4 seconds ± 50%). This prevents clients that started simultaneously from becoming synchronized again after backoff. The two common jitter strategies are full jitter (delay = random(0, base_delay)) and equal jitter (delay = base_delay/2 + random(0, base_delay/2)), where base_delay is the current exponential backoff value. In the AWS Architecture Blog's analysis of these strategies, full jitter completed with less total work than equal jitter, while equal jitter guarantees a minimum wait of half the backoff. Either is far better than backoff with no jitter.
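The two jitter formulas can be written out directly. In this sketch the function names, the 1-second base, and the 64-second cap are assumptions taken from the figures used in this section; a `random.Random` instance is passed in so results can be reproduced with a seed.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 64.0) -> float:
    """Exponential backoff without jitter: 1, 2, 4, 8, ... seconds, capped."""
    return min(cap, base * (2 ** attempt))


def full_jitter(attempt: int, rng: random.Random) -> float:
    """Full jitter: anywhere between 0 and the current backoff value."""
    return rng.uniform(0.0, backoff_delay(attempt))


def equal_jitter(attempt: int, rng: random.Random) -> float:
    """Equal jitter: half the backoff is fixed, the other half is random."""
    half = backoff_delay(attempt) / 2
    return half + rng.uniform(0.0, half)


rng = random.Random(0)  # seeded only to make the example repeatable
schedule = [backoff_delay(n) for n in range(8)]
```

When the server sends Retry-After, a client would normally wait at least that long regardless of what the backoff schedule suggests.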
Proactive rate limiting is even better than reacting to 429s. Read the RateLimit-Remaining header on every response and slow down requests as the remaining count decreases. If remaining drops below 10% of the quota, reduce your request rate to spread the remaining quota over the full reset window. This avoids hitting the limit entirely, meaning no requests are ever rejected — the best user experience possible.
For batch processing jobs that need to make many API calls, implement a request queue with a rate limiter on the client side. Instead of sending all requests in parallel and relying on the server to reject the overflow, pre-limit your own send rate to stay within the quota. A token bucket on the client side, seeded with the API's quota, naturally paces requests at or below the limit without any 429 responses. This approach is significantly more efficient for batch jobs than the retry-on-rejection pattern.
Common Scenarios and Solutions
Rate limiting problems in production follow predictable patterns. Recognizing the scenario quickly leads you to the right solution.
Legitimate spikes from a mobile app
Challenge: Short bursts exceed limit but average volume is reasonable
→ Use token bucket with burst allowance — allows up to N requests instantly, then throttles to average rate
Bot or automated scraper
Challenge: Sustained high request rate from single source
→ Aggressive per-IP limit, CAPTCHA on anonymous endpoints, API key requirement, IP blocklist
Window boundary exploit
Challenge: Client sends max requests at 11:59:59 and again at 12:00:00, getting double quota
→ Replace fixed windows with sliding window or token bucket which have no hard reset boundaries
Microservice internal calls hitting public limit
Challenge: Service-to-service calls consume user-facing quota
→ Separate API keys with elevated limits for internal traffic; IP-allowlist internal subnets
Distributed DDoS with many IPs
Challenge: Attack spreads across many IPs, per-IP limit is ineffective
→ Global rate limit, geographic filtering, WAF, anomaly detection on request patterns
Legitimate developer hitting limit during testing
Challenge: Developer stress-tests integration and exhausts daily quota
→ Separate sandbox/test environment with higher limits; provide quota increase request process
Rate limit state not synced across servers
Challenge: Three servers each allow 100 req/min, effectively giving 300 req/min total
→ Centralize state in Redis; use atomic INCR + EXPIRE to maintain single source of truth
Thundering herd after service recovery
Challenge: All clients retry simultaneously after an outage, causing immediate overload again
→ Retry-After header + client-side exponential backoff with jitter spreads retries over time
Debugging Rate Limit Issues
When rate limiting behaves unexpectedly, systematic diagnosis beats guessing. Most issues fall into one of these patterns.
🔍 All requests return 429
Likely cause: Global rate limit hit, Redis down, or misconfigured limit value
What to check: Check total traffic volume, Redis connection health, and limit configuration value
🔍 Only one IP getting 429
Likely cause: Per-IP limit is too low for that client's traffic pattern
What to check: Verify IP identification (check X-Forwarded-For), review per-IP limit value
🔍 Rate limit resets at wrong time
Likely cause: Limit window duration misconfigured or server clock skew
What to check: Verify RateLimit-Reset timestamp, check server NTP synchronization
🔍 Rate limit works locally not in production
Likely cause: Distributed state not synced, or limits configured differently per environment
What to check: Confirm Redis connection in production, compare config across environments
🔍 429 but RateLimit-Remaining shows quota available
Likely cause: Algorithm error, Redis state stale, or per-endpoint limit hit that differs from global
What to check: Check specific endpoint limits, inspect Redis data, verify clock sync
🔍 Clients not respecting Retry-After
Likely cause: Client library outdated, Retry-After format incorrect, or client ignoring headers
What to check: Verify Retry-After format (an integer number of seconds or an HTTP-date), update client SDK
🔍 Rate limit check latency is high
Likely cause: Redis latency, network partition, or slow synchronous rate-limit check code
What to check: Monitor Redis P99 response times, check network latency, consider pipelining
🔍 Limit enforcement inconsistent across requests
Likely cause: Multiple servers with local in-memory state, or Redis replication lag
What to check: Confirm all servers connect to the same Redis, check replication status
Monitoring and Alerting
Proactive monitoring detects rate limiting issues before they impact users. These are the metrics that matter most in production.
| Metric | Healthy Target | Action if Off |
|---|---|---|
| Rate limit hit rate | < 5% for healthy APIs | If high, investigate traffic patterns or adjust limits |
| 429 responses per endpoint | Should correlate with expected traffic | Monitor for unusual spikes or legitimate user patterns |
| Retry success rate | > 80% for legitimate clients | If low, increase Retry-After window or adjust limits |
| Rate limit check latency | < 1ms per request check | Optimize Redis access, use pipelining, check network |
| Distributed sync delay | < 100ms | Check Redis latency and network conditions |
| False positive rate | < 1% | Review limit thresholds and client traffic patterns |
Best Practices
Always send rate-limit headers
Include RateLimit-Limit, RateLimit-Remaining, and RateLimit-Reset in every response, not just 429s
Impact: Enables proactive client backoff before hitting the limit
Include Retry-After on 429
Tell clients exactly how long to wait. Use seconds format for simplicity.
Impact: Reduces retry storms and wasted request attempts
Use tiered limits
Different limits for free, paid, and enterprise tiers. Separate limits for internal services.
Impact: Monetizes API access and protects premium users from free-tier abuse
Layer per-IP and per-key limits
Enforce both simultaneously — per-IP catches anonymous abuse, per-key enforces user quotas
Impact: Defense-in-depth against varied attack patterns
Use Redis for distributed state
Atomic INCR with EXPIRE ensures accurate counts across all server instances
Impact: Prevents bypassing limits by targeting different servers
Log every 429 response
Record client IP, user ID, endpoint, timestamp, and request count for each rejection
Impact: Early detection of attacks, quota tuning data, client debugging
Whitelist internal services
Use separate higher-limit API keys for service-to-service calls
Impact: Prevents internal traffic from triggering limits designed for external clients
Provide clear error messages
429 body should state the limit, when it resets, and link to docs
Impact: Reduces support load and developer frustration significantly
Common Mistakes to Avoid
❌ Fixed window with boundary exploit
Problem: Client requests at window boundaries (11:59 and 12:00) to get double quota
Fix: Use sliding window or token bucket instead of hard resets
❌ No distributed rate limit state
Problem: Each server tracks limits independently; client calls different servers and bypasses limits entirely
Fix: Use Redis INCR + EXPIRE for shared counter across all instances
❌ Limiting by proxy IP
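Problem: Limits are enforced on the load balancer IP instead of the real client IP, so all clients share one counter and a single heavy client can exhaust the quota for everyone
Fix: Extract real client IP from X-Forwarded-For or CF-Connecting-IP headers, and trust those headers only when your own proxy sets them, since clients can forge them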
❌ No RateLimit headers on non-429 responses
Problem: Clients can't see their quota status, so they cannot pace themselves and only discover the limit by hitting it
Fix: Always include RateLimit-* headers in every API response, not just errors
❌ Limits too aggressive for real usage
Problem: Legitimate applications exceed limits during normal page loads or batch jobs
Fix: Analyze real traffic histograms before setting limits; provide burst allowances
❌ No monitoring of 429 spikes
Problem: DDoS or abuse goes unnoticed until damage is severe or service is degraded
Fix: Alert on any spike in 429 response rate exceeding baseline
❌ Unclear 429 error messages
Problem: Clients don't know why they're rate-limited, which limit was hit, or when to retry
Fix: Return structured JSON with limit type, current count, reset time, and retry guidance
❌ Same limits for internal and external traffic
Problem: Microservice calls hit public rate limits, causing cascading failures inside your own infrastructure
Fix: Separate API key tiers with elevated limits for internal service-to-service calls
Worked Examples
Token Bucket — 100 requests/min with 10 req/sec burst
Inputs
Quota: 100 requests/min • Refill rate: 1.67 tokens/sec (100/60) • Bucket capacity (burst): 10 tokens • Cost: 1 token per request
Scenario
Client makes 15 requests in the first second. The bucket holds 10 tokens (the burst limit), so the first 10 succeed immediately and requests 11–15 are rejected with 429. After 0.6 seconds, 1 more token arrives, allowing one more request.
Calculation
Refill = 100 ÷ 60 ≈ 1.67 tokens/sec, so one token takes 1 ÷ 1.67 ≈ 0.6 s to arrive. Bucket capacity = 10 (the burst). After the initial burst, the client must space requests at least 600 ms apart to stay within the refill rate.
Sliding Window — 60 requests per minute
Inputs
Window: 60 seconds • Limit: 60 requests • Current time: 12:34:50 UTC
Scenario
At 12:33:50 UTC, client made 40 requests. At 12:34:50 UTC, 25 more requests arrive. The window [12:33:50, 12:34:50] already has 40 requests, so only 20 more are allowed before the limit is hit: 20 of the new requests succeed and 5 are rejected with 429.
Calculation
Window = last 60 seconds from now. Count = 40 existing. Remaining = 60 − 40 = 20. The 40 earlier requests sit on the trailing edge of the window and age out immediately after 12:34:50, so RateLimit-Reset = 12:34:50 UTC.
Tiered limits — Free vs Pro user
Inputs
Free tier: 1,000 requests/day • Pro tier: 100,000 requests/day • Reset: 24 hours from first request
Scenario
Free user has made 950 requests today. They receive RateLimit-Remaining: 50. After 50 more requests, they get 429 with Retry-After pointing to their 24-hour reset. Pro user at 80,000 requests has 20,000 remaining.
Calculation
Remaining = tier_quota − used_today. Reset = user_start_of_day + 86400 seconds. Each response includes current remaining count so clients can pace themselves.
Frequently Asked Questions
What is the difference between rate limiting and throttling?
Rate limiting rejects requests when quota is exceeded — it returns HTTP 429 immediately. Throttling delays requests to spread them over time, queuing them rather than rejecting. Throttling is gentler on legitimate clients because their requests eventually succeed; rate limiting is absolute. Many production systems use both: throttle short legitimate spikes to avoid false 429s, then rate-limit sustained abuse that throttling can't contain.
Which rate-limiting algorithm should I use?
Token bucket is most popular for general-purpose APIs because it handles burst traffic naturally: a client that has been idle accumulates tokens and can make several requests quickly, then gets throttled to the average rate. This matches how real applications behave. Use sliding window when you need precise enforcement with no boundary effects. Use leaky bucket when you need strict, predictable output rates to a downstream service. For most REST APIs, token bucket with per-user and per-IP tiers is the right starting point.
How do I handle rate limits in a distributed system?
Use Redis with atomic INCR + EXPIRE. When a request arrives, execute INCR on a key like rate:user:123:minute:202605211500 and set its TTL to your window duration. All servers read and write from the same Redis instance, ensuring the count is accurate regardless of which server handles a given request. For high-traffic APIs, use Redis Cluster or Redis Sentinel for availability. A Lua script running INCR + EXPIRE atomically prevents race conditions.
What HTTP status code should I return when rate-limiting?
Always HTTP 429 Too Many Requests. Include RateLimit-Remaining: 0, RateLimit-Reset: [unix timestamp], and Retry-After: [seconds until reset] in the response headers. In the JSON body, include an error code, a human-readable message explaining which limit was exceeded, and a link to your rate-limiting documentation. Never return 503 for rate limiting — that signals server unavailability rather than client quota exhaustion.
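A 429 response carrying those headers and body fields might be assembled as in the sketch below. The function name `build_429`, the error code string, the JSON field names, and the documentation URL are all hypothetical. RateLimit-Reset is sent as a Unix timestamp here to match the answer above; the IETF draft expresses it as seconds remaining, so adjust to whichever convention your API documents.

```python
import json


def build_429(limit: int, reset_epoch: int, now_epoch: int, docs_url: str):
    """Return (status, headers, body) for a rate-limited response."""
    retry_after = max(0, reset_epoch - now_epoch)
    headers = {
        "Content-Type": "application/json",
        "RateLimit-Limit": str(limit),
        "RateLimit-Remaining": "0",
        "RateLimit-Reset": str(reset_epoch),  # Unix timestamp in this sketch
        "Retry-After": str(retry_after),      # seconds until the client may retry
    }
    body = json.dumps({
        "error": {
            "code": "rate_limit_exceeded",    # hypothetical error code
            "message": f"Limit of {limit} requests exceeded. "
                       f"Retry in {retry_after} seconds.",
            "docs": docs_url,
        }
    })
    return 429, headers, body


status, headers, body = build_429(
    limit=100,
    reset_epoch=1767225600,
    now_epoch=1767225570,
    docs_url="https://example.com/docs/rate-limits",  # placeholder URL
)
```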
What is exponential backoff and why is it important?
Exponential backoff is a retry strategy where the wait time doubles with each failed attempt: 1 second after the first failure, 2 after the second, 4 after the third, 8 after the fourth, up to a maximum cap (e.g., 64 seconds). It prevents the thundering herd problem: if 1,000 clients all hit a rate limit simultaneously and all retry after exactly 30 seconds, they create another spike at T+30 that immediately exhausts the quota again. Exponential backoff with random jitter (multiplying the delay by a random factor 0.5–1.5) spreads retries across a time range, giving the API time to recover.
How do I prevent rate-limit boundary exploits with fixed windows?
Fixed windows reset at exact boundaries (e.g., every hour at :00). A client can exploit this by sending 1,000 requests at 12:59:59 and another 1,000 at 13:00:00, effectively sending 2,000 requests in 2 seconds. Sliding windows eliminate this because they count requests in the last N seconds from now, not from a fixed clock boundary. Token bucket is even better: the burst capacity is bounded by the bucket size, not by when the clock ticks over.
Can rate limits be bypassed?
Yes, in several ways: (1) If you use in-process memory instead of shared Redis, clients can hit different servers and bypass limits entirely. (2) If you identify clients by load-balancer IP instead of the real client IP from X-Forwarded-For, all clients appear as one IP with no per-client limit. (3) Fixed windows allow the boundary exploit. (4) If your Redis TTLs are misconfigured, counters may expire early, effectively resetting limits. Mitigation: always use Redis for distributed state, extract real IPs correctly, use sliding windows or token bucket.
How do I monitor if rate limits are working?
Track these metrics: total requests per minute, 429 response rate (alert if > 5% of traffic), 429 distribution by IP and user (sudden spikes indicate abuse), Retry-After header accuracy (do clients actually back off?), and false positive rate (are any legitimate users hitting limits during normal operations?). Set alerts for: any IP generating > 10% of total 429s in a 5-minute window, 429 rate spiking > 3x baseline, and Redis connection errors causing fail-open behavior.
Should internal service-to-service calls be rate-limited?
Internal calls should have different limits, not be exempt. Give each internal service its own API key with elevated quotas that reflect its actual traffic needs. This prevents a misbehaving internal service from consuming the entire quota and taking down external-facing functionality. It also gives you visibility into internal call volumes. Never use the same rate limit tier for both public clients and internal microservices — the traffic profiles are completely different.
What is the IETF RateLimit header standard?
The IETF Internet-Draft draft-ietf-httpapi-ratelimit-headers defines standardized header names: RateLimit-Limit, RateLimit-Remaining, RateLimit-Reset, and RateLimit-Policy. The older X-RateLimit-* variants (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset) are vendor-specific extensions without a standard. Many established APIs, GitHub among them, still use the X- variants. New APIs should implement the IETF draft headers and provide X- variants as aliases for backwards compatibility.
How do I set the right rate limit values for my API?
Start by measuring, not guessing. Collect real traffic data for one to two weeks across different user cohorts. Find the P95 request rate for your typical user — this becomes your base limit. Set your limit at 2–3x P95 to give headroom for legitimate spikes without allowing runaway clients. For sensitive endpoints like login, authentication, or password reset, use much lower limits (5–10 requests per minute) regardless of traffic data. Revisit limits every quarter and adjust based on actual 429 rates and customer feedback.
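The P95-plus-headroom rule can be computed with the standard library. This is a sketch under assumptions: `suggest_limit` is a hypothetical helper, the input is one peak requests-per-minute figure per user, and the default headroom of 2.5 is simply the midpoint of the 2–3x range mentioned above.

```python
import math
import statistics


def suggest_limit(per_user_requests_per_minute: list, headroom: float = 2.5) -> int:
    """Suggest a per-minute limit: the P95 of observed per-user rates times headroom."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    cut_points = statistics.quantiles(
        per_user_requests_per_minute, n=100, method="inclusive"
    )
    p95 = cut_points[94]
    return math.ceil(p95 * headroom)


# Toy data: 100 users whose peak rates are 1, 2, ..., 100 requests/minute.
observed = list(range(1, 101))
suggested = suggest_limit(observed)
```

With real traffic you would compute this per cohort (free, paid, enterprise) rather than over all users at once.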