TL;DR: Rate limiting controls how many requests a user or IP address can make to your API in a given time window — like 100 requests per 15 minutes. Without it, anyone can hammer your server into the ground, rack up your cloud bill, or brute-force their way through your login system. AI tools almost never include it when they generate API code. The fix takes about 3 lines with express-rate-limit in Node.js — you just need to know to ask for it.
Why AI Coders Need to Understand Rate Limiting
Let's be honest about what happens when you ask Claude or ChatGPT to build an API. You get clean routes, proper error handling, maybe even some input validation if you're lucky. What you don't get is anything that protects your API from being abused once it's live on the internet.
Rate limiting is one of those things that doesn't matter at all during development — your API handles your 5 test requests just fine — and then becomes the most critical thing in the world the moment a real person discovers your endpoint.
Here's what can happen to an unprotected API in production:
- Cost explosion. On platforms like Vercel, AWS Lambda, or Railway, you pay per request or per compute second. A bot hitting your API 100,000 times in an hour can generate a bill that makes you physically ill.
- Server crash. Even on a fixed-cost VPS, your server has finite memory and CPU. Enough requests will overwhelm it and take down your entire app — not just the API, but your frontend too.
- Brute-force attacks. Without rate limiting on your login endpoint, an attacker can try thousands of username/password combinations per second. If any of your users have weak passwords, they're compromised.
- Data scraping. If your API returns user data, product listings, or any valuable content, someone can write a script that downloads everything in minutes. Some developers are fighting back with tools like Miasma, which traps AI scrapers in endless loops of poisoned content.
- Database overload. Every API request typically triggers one or more database queries. Flood the API, and you flood the database. When the database goes down, everything goes down.
The scariest part? None of this requires a sophisticated attacker. A 14-year-old with a while True loop in Python can take down an unprotected API. That's the reality of putting code on the internet without rate limiting.
The Real Scenario: Your First API Goes Live
The prompt you gave AI:
"Build me a REST API with Express.js that has user registration, login, and a /api/posts endpoint that returns all blog posts from my PostgreSQL database. Include proper error handling."
Totally reasonable prompt. AI gives you a solid API with clean routes, async/await, proper try/catch blocks, maybe even JWT authentication. You deploy it to Railway or Vercel. You share the link. It works beautifully.
Two days later, you check your dashboard. Your database has processed 2.3 million queries. Your Railway bill shows $47 in overage charges. Your error logs show the same IP address making 500 requests per second to your /api/posts endpoint.
Someone found your API (they're easy to find — just look at network requests in the browser) and wrote a script to scrape all your content. They didn't hack anything. They didn't exploit a vulnerability. They just… asked nicely. A lot. Very fast.
Rate limiting would have shut this down almost immediately — after the first hundred requests instead of the first two million.
What AI Generated (and What's Missing)
Here's a simplified version of what AI typically gives you when you ask for an Express API:
```javascript
const express = require('express');
const app = express();

app.use(express.json());

// Login endpoint
app.post('/api/login', async (req, res) => {
  const { email, password } = req.body;
  // ... authentication logic
  res.json({ token: generatedToken });
});

// Get all posts
app.get('/api/posts', async (req, res) => {
  const posts = await db.query('SELECT * FROM posts');
  res.json(posts);
});

// Create a post
app.post('/api/posts', authenticateToken, async (req, res) => {
  const { title, content } = req.body;
  const newPost = await db.query(
    'INSERT INTO posts (title, content) VALUES ($1, $2) RETURNING *',
    [title, content]
  );
  res.status(201).json(newPost);
});

app.listen(3000, () => console.log('Server running on port 3000'));
```
This code is functionally correct. The endpoints work. The error handling is there. But here's what's missing from a security perspective:
- No rate limiting at all. Every endpoint accepts unlimited requests from anyone.
- The login endpoint is wide open. An attacker can try 10,000 password combinations per second.
- The GET endpoint has no throttle. Someone can scrape your entire database by requesting /api/posts?page=1, /api/posts?page=2, and so on, as fast as their connection allows.
- No distinction between endpoint sensitivity. Your login endpoint should have much stricter limits than your public posts endpoint, but AI treats them all the same — unlimited.
This is the gap between "code that works" and "code that survives the internet." AI excels at the first part. You need to handle the second.
Understanding Rate Limiting: The Concepts in Plain English
Rate limiting sounds technical, but the concept is dead simple. Think of it like a bouncer at a bar.
The rule: "Each person can order a maximum of 2 drinks every 30 minutes."
If you walk up and order your third drink in 15 minutes, the bartender says "slow down, come back in a bit." They don't kick you out permanently. They don't call the cops. They just say not right now.
That's rate limiting. Your API says: "Each IP address can make 100 requests every 15 minutes. Request 101? Come back later."
What Happens Behind the Scenes
When a request comes in, your rate limiter checks:
- Who is this? Usually identified by IP address, but can also be API key or user ID.
- How many requests have they made recently? The limiter keeps a counter.
- Are they over the limit? If yes, respond with a 429 Too Many Requests status code. If no, process the request normally and increment the counter.
That 429 status code is the key. It's HTTP's official way of saying "you're asking too much, slow down." Well-behaved clients (including browsers and API libraries) understand this code and will wait before retrying.
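To make that bookkeeping concrete, here is a minimal sketch of the check as plain JavaScript — a fixed-window counter keyed by IP, kept in a Map. The names (`isAllowed`, `WINDOW_MS`) are invented for this illustration; in real code you would reach for a library like express-rate-limit.

```javascript
const WINDOW_MS = 15 * 60 * 1000; // 15-minute window
const MAX_REQUESTS = 100;         // allowance per window
const counters = new Map();       // ip -> { count, windowStart }

function isAllowed(ip, now = Date.now()) {
  const entry = counters.get(ip);
  // First request from this IP, or its old window has expired: start fresh
  if (!entry || now - entry.windowStart >= WINDOW_MS) {
    counters.set(ip, { count: 1, windowStart: now });
    return true;
  }
  // Over the limit: this is where a real server would send a 429
  if (entry.count >= MAX_REQUESTS) return false;
  entry.count += 1;
  return true;
}
```

A real limiter also has to expire old entries so the Map doesn't grow forever — one of several details a library handles for you.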
The Three Common Approaches (No Math, Promise)
When you read about rate limiting, you'll see terms like "fixed window," "sliding window," and "token bucket." Here's what each one means in human language:
Fixed Window
The simplest approach. Pick a time window — say, 15 minutes. Each user gets 100 requests per window. At the start of each new 15-minute window, everyone's counter resets to zero.
The catch: A user could make 100 requests at 2:14:59 (end of one window) and another 100 at 2:15:01 (start of the next window) — that's 200 requests in 2 seconds. It's fine for most use cases, but good to know about.
Sliding Window
Instead of fixed 15-minute blocks, the window follows each user around. Made a request at 2:07? Your window runs from 2:07 to 2:22. This eliminates the "burst at the window boundary" problem. Slightly more resource-intensive but smoother.
Token Bucket
Imagine each user has a bucket that holds 100 tokens. Every request costs one token. Tokens refill at a steady rate — say, one token every 9 seconds (which works out to about 100 per 15 minutes). If the bucket is empty, the request is rejected. This is nice because it allows short bursts (if your bucket is full, you can make 100 requests quickly) but enforces a steady rate over time.
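Using the numbers above (capacity 100, one token refilled every 9 seconds), the bucket arithmetic can be sketched like this — `makeBucket` and `tryConsume` are hypothetical names for illustration only:

```javascript
const CAPACITY = 100;
const REFILL_MS = 9000; // one token every 9 seconds ≈ 100 per 15 minutes

function makeBucket(now = Date.now()) {
  return { tokens: CAPACITY, lastRefill: now };
}

function tryConsume(bucket, now = Date.now()) {
  // Refill tokens earned since the last refill, capped at capacity
  const refilled = Math.floor((now - bucket.lastRefill) / REFILL_MS);
  if (refilled > 0) {
    bucket.tokens = Math.min(CAPACITY, bucket.tokens + refilled);
    bucket.lastRefill += refilled * REFILL_MS;
  }
  if (bucket.tokens === 0) return false; // bucket empty: reject
  bucket.tokens -= 1;                    // each request costs one token
  return true;
}
```

Notice the burst behavior: a full bucket allows 100 back-to-back requests, but after that the caller is held to the steady refill rate.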
Which one should you use? If you're using express-rate-limit (the most common Node.js library), you're using fixed window by default — and that's perfectly fine for 90% of applications. Don't overthink this. The difference between "no rate limiting" and "any rate limiting" is infinitely more important than the difference between algorithms.
What Your Code Looks Like With Rate Limiting
Here's the same Express API from earlier, but now with rate limiting added. The changes are highlighted:
```javascript
const express = require('express');
const rateLimit = require('express-rate-limit'); // ← NEW
const app = express();

app.use(express.json());

// General rate limit: 100 requests per 15 minutes per IP
const generalLimiter = rateLimit({ // ← NEW
  windowMs: 15 * 60 * 1000, // 15 minutes
  max: 100, // limit each IP to 100 requests per window
  standardHeaders: true, // Return rate limit info in headers
  legacyHeaders: false,
  message: {
    error: 'Too many requests. Please try again in a few minutes.'
  }
});

// Strict rate limit for auth endpoints: 10 requests per 15 minutes
const authLimiter = rateLimit({ // ← NEW
  windowMs: 15 * 60 * 1000,
  max: 10, // Much stricter for login attempts
  message: {
    error: 'Too many login attempts. Please try again later.'
  }
});

app.use('/api/', generalLimiter); // ← NEW: Apply to all API routes

// Login endpoint — uses the STRICTER limiter
app.post('/api/login', authLimiter, async (req, res) => {
  const { email, password } = req.body;
  // ... authentication logic
  res.json({ token: generatedToken });
});

// Get all posts — uses the general limiter (applied above)
app.get('/api/posts', async (req, res) => {
  const posts = await db.query('SELECT * FROM posts');
  res.json(posts);
});

// Create a post
app.post('/api/posts', authenticateToken, async (req, res) => {
  const { title, content } = req.body;
  const newPost = await db.query(
    'INSERT INTO posts (title, content) VALUES ($1, $2) RETURNING *',
    [title, content]
  );
  res.status(201).json(newPost);
});

app.listen(3000, () => console.log('Server running on port 3000'));
```
That's it. Two rate limiter objects and one app.use() call. Your entire API is now protected. Let's break down each piece:
| Code | What it does |
|---|---|
| `windowMs: 15 * 60 * 1000` | Sets the time window to 15 minutes (in milliseconds) |
| `max: 100` | Each IP address gets 100 requests per window |
| `standardHeaders: true` | Sends `RateLimit-Remaining` and similar headers so clients know their limits |
| `message: { error: '...' }` | What the client sees when they hit the limit (instead of a cryptic server error) |
| `app.use('/api/', generalLimiter)` | Applies the general limiter to every route starting with `/api/` |
| `authLimiter` on login | A second, stricter limiter that only allows 10 attempts per 15 minutes on the login endpoint |
Why two different limiters? Your /api/posts endpoint is relatively harmless — someone reading blog posts 100 times in 15 minutes is annoying but not dangerous. Your /api/login endpoint is a completely different story. 10 login attempts per 15 minutes is generous for a real user but devastating for a brute-force attack that needs thousands of attempts to guess a password.
The Prompt to Give Your AI
Copy this prompt:
"Add rate limiting to my Express API using express-rate-limit. I want:
- A general limiter: 100 requests per 15 minutes per IP for all /api/ routes
- A strict limiter: 10 requests per 15 minutes for /api/login and /api/register
- A very strict limiter: 3 requests per hour for /api/forgot-password
- Return a clear JSON error message when the limit is hit
- Include the standard rate limit headers in responses
- Use the recommended settings for production behind a reverse proxy (trust proxy)"
That prompt gives AI the specific requirements it needs. Without specifics, you'll get the bare minimum or nothing at all.
What AI Gets Wrong About Rate Limiting
When you ask AI to add rate limiting, it usually gets the basics right. But there are several common mistakes that can leave you exposed:
1. Same Limit for Every Endpoint
AI typically creates one limiter and applies it globally. But your login endpoint needs much stricter limits than your public data endpoints. A /api/search endpoint might reasonably need 200 requests per minute, while your /api/login should be capped at 10. One-size-fits-all rate limiting is better than nothing, but it's not enough for production.
2. Forgetting Trust Proxy
This one is sneaky. If your app runs behind a reverse proxy (nginx, Cloudflare, or any cloud platform's load balancer — which is almost always the case in production), every request appears to come from the proxy's IP address, not the actual user's IP. That means your rate limiter treats ALL users as one user and blocks everyone after 100 total requests.
```javascript
// AI often forgets this critical line:
app.set('trust proxy', 1);

// Without it, every user shares the same rate limit counter
// because all requests appear to come from the proxy's IP
```
This is one of the most common production bugs with rate limiting. Your API works perfectly in development (where there's no proxy), then breaks immediately in production because all traffic looks like it's from one IP.
3. In-Memory Storage Only
The default express-rate-limit setup stores request counts in your server's memory. This has two problems:
- Server restart = all limits reset. Every time your app restarts, everyone's counter goes back to zero.
- Multiple servers = separate counters. If you scale to 2+ server instances, each one has its own counter. A user could make 100 requests to server A and 100 to server B, effectively doubling their limit.
For a single-server app that doesn't restart often, in-memory is fine. But if you're scaling or deploying to serverless (Vercel, AWS Lambda), you'll need Redis or a similar external store. Tell your AI: "Use Redis to store rate limit counters so they persist across server restarts and multiple instances."
4. No Custom Response Format
AI often leaves the default error response, which can be a plain text string. If your API returns JSON everywhere else, the rate limit response should also be JSON. Inconsistent response formats break client-side error handling.
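One way to guarantee a JSON 429 is express-rate-limit's `handler` option, which replaces the default response entirely. A sketch — the exact error shape here is an assumption; match it to the rest of your API:

```javascript
const rateLimit = require('express-rate-limit');

// Custom 429 response: always JSON, in the same shape as the API's other errors
const limiter = rateLimit({
  windowMs: 15 * 60 * 1000,
  max: 100,
  handler: (req, res) => {
    res.status(429).json({
      error: 'Too many requests. Please try again in a few minutes.',
    });
  },
});
```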
5. Missing the Headers
Good rate limiting includes response headers that tell the client their limits: RateLimit-Limit, RateLimit-Remaining, and RateLimit-Reset. This lets frontend code show users "you have 23 requests remaining" instead of just hitting a wall with no warning. AI frequently omits these.
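On the client side, those headers are easy to surface. A small sketch — `readRateLimitInfo` is a hypothetical helper, and it assumes the server sends the draft-standard `RateLimit-*` headers (i.e. `standardHeaders: true` on the limiter):

```javascript
// Hypothetical helper: pull the rate limit headers off a response.
// Works with anything that exposes a .get(name) method, such as the
// Headers object on a fetch Response.
function readRateLimitInfo(headers) {
  return {
    limit: Number(headers.get('RateLimit-Limit')),
    remaining: Number(headers.get('RateLimit-Remaining')),
    resetSeconds: Number(headers.get('RateLimit-Reset')),
  };
}

// Usage with fetch:
// const res = await fetch('/api/posts');
// const { remaining } = readRateLimitInfo(res.headers);
// if (remaining < 5) showWarning('You are close to the rate limit');
```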
How to Debug Rate Limiting Issues
Rate limiting can cause confusing problems if it's misconfigured. Here are the most common issues and how to fix them:
Problem: "My own requests are getting blocked during development"
If you're testing your API and suddenly every request returns 429 Too Many Requests, you've hit your own rate limit. During development, you make lots of rapid requests while testing.
```javascript
// Quick fix: higher limits in development
const limiter = rateLimit({
  windowMs: 15 * 60 * 1000,
  max: process.env.NODE_ENV === 'production' ? 100 : 1000,
});
```
Problem: "Rate limiting blocks all users at once"
This is the trust proxy issue described above. Check if you're behind a reverse proxy and add app.set('trust proxy', 1). You can verify by logging req.ip — if every request shows the same IP (like 127.0.0.1 or a private address), you need the trust proxy setting.
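A quick way to do that check is a throwaway logging middleware — `logClientIp` is a hypothetical helper, registered before the limiters and removed once you're done debugging:

```javascript
// Hypothetical debug middleware: print the IP Express resolves for each
// request. Behind a correctly configured proxy this should be the real
// client IP, not the proxy's address.
function logClientIp(req, res, next) {
  console.log('rate limiter sees IP:', req.ip);
  next();
}

// app.use(logClientIp); // register before the rate limiters
```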
Problem: "Limits reset every time I deploy"
You're using the default in-memory store, and deployments restart the process. Move to Redis or another external store for persistent counters:
```javascript
const RedisStore = require('rate-limit-redis');
const Redis = require('ioredis');

const client = new Redis(process.env.REDIS_URL);

const limiter = rateLimit({
  windowMs: 15 * 60 * 1000,
  max: 100,
  store: new RedisStore({
    sendCommand: (...args) => client.call(...args),
  }),
});
```
Problem: "I'm getting rate limited by a third-party API"
This is the other side of rate limiting — not your API, but someone else's. If you're calling the OpenAI API, Stripe, or any external service from your backend, they have rate limits too. The fix here is implementing retry logic with exponential backoff:
```javascript
// When a third-party API returns 429, wait and retry
async function callWithRetry(fn, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (error.status === 429 && i < maxRetries - 1) {
        const waitTime = Math.pow(2, i) * 1000; // 1s, 2s, 4s
        await new Promise(r => setTimeout(r, waitTime));
      } else {
        throw error;
      }
    }
  }
}
```
Debugging tip: When you hit a 429 error, check the response headers. Most rate-limited APIs include Retry-After (how many seconds to wait), RateLimit-Remaining (how many requests you have left), and RateLimit-Reset (when the window resets). These headers tell you exactly what's happening.
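A fixed backoff schedule works, but honoring Retry-After when the server sends it is slightly better. A sketch — `getWaitMs` is a hypothetical helper, and it assumes the caught error object carries the response headers:

```javascript
// Hypothetical helper: prefer the server's Retry-After header (in seconds);
// fall back to exponential backoff when it's absent or unparsable.
function getWaitMs(error, attempt) {
  const retryAfter = Number(error.headers?.get?.('Retry-After'));
  if (Number.isFinite(retryAfter) && retryAfter > 0) {
    return retryAfter * 1000; // the server told us exactly how long to wait
  }
  return Math.pow(2, attempt) * 1000; // 1s, 2s, 4s, ...
}

// In a retry loop you would compute the delay per attempt:
// const waitTime = getWaitMs(error, i);
```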
Rate Limiting Beyond Express
We've focused on Express.js because it's what AI generates most often, but rate limiting applies everywhere:
- Next.js API routes: Use next-rate-limit or implement middleware in your /api/ route handlers.
- Python/Flask: Use flask-limiter — same concept, different syntax: @limiter.limit("100 per 15 minutes").
- Python/FastAPI: Use slowapi, which wraps the same logic for async Python.
- At the infrastructure level: Cloudflare, nginx, and AWS API Gateway all offer rate limiting before requests even hit your code. This is the most efficient approach for high-traffic APIs. If you're using Docker with nginx as a reverse proxy, nginx can handle rate limiting at the network layer.
- Stripe and payment APIs: When you're working with Stripe webhooks, be aware that Stripe has its own rate limits on API calls. Your webhook handlers should process events efficiently to avoid hitting those limits.
The layer where you implement rate limiting matters. Infrastructure-level (Cloudflare, nginx) stops bad traffic before it reaches your application server. Application-level (express-rate-limit) gives you fine-grained control per endpoint. The best approach is both — a broad limit at the infrastructure level and specific limits at the application level.
Rate Limiting in the Bigger Security Picture
Rate limiting is one layer in a stack of protections your API needs. Think of it like a house:
- Authentication is the front door lock — it verifies who's allowed in.
- Rate limiting is the security camera and the "no loitering" sign — it limits how much anyone can do, even if they're allowed in.
- API security best practices (input validation, HTTPS, CORS) are the walls and windows — they prevent people from sneaking in through other openings.
- Monitoring and logging are the alarm system — they tell you when something suspicious is happening.
No single layer is enough. Rate limiting won't stop an attacker with a valid API key from making authorized-but-malicious requests within the limit. Authentication won't stop a distributed attack from thousands of IPs. You need all the layers working together.
What's Next
Now that you understand rate limiting, here's what to learn next:
- API Authentication Guide — Rate limiting controls how much someone can do. Authentication controls who can do it. These two are always paired together.
- API Security Best Practices — The complete picture of securing your AI-built API, including HTTPS, input validation, CORS, and more.
- What Is Docker? — If you're deploying your rate-limited API, Docker is likely involved. Understanding containers helps you configure things like trust proxy correctly.
- What Are Stripe Webhooks? — Payment processing is one of the most common use cases where rate limiting and API security intersect. Stripe's own rate limits affect how you build webhook handlers.
The one thing to remember: The difference between a hobby project and a production API isn't features — it's protection. Rate limiting is the single easiest protection to add and the most common one to forget. Next time you ask AI to build an API, add six words to your prompt: "Include rate limiting for all endpoints." That's it. That's the whole lesson.
Frequently Asked Questions
What is rate limiting in simple terms?
Rate limiting is a rule that says "each user (or IP address) can only make X requests to my API in Y amount of time." For example, 100 requests per minute. If someone exceeds that limit, the server stops responding to them temporarily instead of processing every single request. It's like a bouncer at a bar — once you've had enough, you're cut off for a while.
Why doesn't AI add rate limiting when it builds my API?
AI tools like ChatGPT and Claude are optimized to give you working code that does what you asked. When you say "build me an API," they focus on making the endpoints work correctly. Rate limiting is a production concern — it's about what happens when real users (or attackers) hit your API at scale. AI doesn't think about abuse scenarios unless you specifically ask. That's why you need to explicitly prompt: "Add rate limiting to protect against abuse."
What happens if I don't add rate limiting to my API?
Without rate limiting, anyone can send thousands of requests per second to your API. This can crash your server, spike your cloud hosting bill (especially on pay-per-request platforms like Vercel or AWS Lambda), let attackers brute-force passwords or API keys, scrape all your data, and overwhelm your database. A single person with a simple script can take down an unprotected API in seconds.
What's the difference between rate limiting and authentication?
Authentication checks who is making the request (are you logged in? do you have a valid API key?). Rate limiting checks how many requests someone is making, regardless of who they are. You need both. Authentication without rate limiting means a logged-in user can still abuse your API. Rate limiting without authentication means you can throttle requests but can't identify who's making them. They work together as layers of API security.
How do I choose the right rate limit numbers?
Start conservative and adjust based on real usage. A good starting point for most AI-built APIs: 100 requests per 15 minutes for general endpoints, 5–10 requests per 15 minutes for login/auth endpoints, and 20–30 requests per minute for API-key-authenticated routes. Monitor your logs for a week. If legitimate users are getting blocked (429 errors), increase the limits. If you're seeing suspicious traffic patterns, tighten them. There's no universal perfect number — it depends on what your API does.