Published on

Rate Limiting / API Throttling


What is Rate Limiting?

The job of a Rate Limiter is to decide if a request will be served, or declined.

If the request is accepted, then it’ll be passed to the API servers.

In most cases, a rate limiter is used as a defensive mechanism.

It relies on some rules to decide when to limit the traffic.

Limit the number of requests a user can make in a given period of time.

If this limit is exceeded, the user or IP needs to be throttled.

This means that, any further requests that are made to the API, by that user, will fail or will added to a queue to be processed later.

Without this mechanism, any user can blast your server with requests, leading to high latencies, or even unavailable services.

Why use a Rate Limiter?

Rate limiting is crucial for maximizing throughput and minimizing end-to-end latency across large distributed systems.

Shared services need to protect themselves from excessive use—whether intended or unintended—to maintain service availability.

  • Preventing resource starvation
  • Managing policies and quotas - rate limiting per user to provide fair and reasonable use, without affecting other users.
  • Controlling flow - distribute work more evenly between workers
  • Avoiding excess costs
  • Security - avoid DoS attacks

Who Uses API Rate Limiter?

Large scale applications, like LinkedIn, GitHub, Twitter, Facebook, and so on, they’re all employ some kind rate limiters on the API, as a safeguard tool.

So, it’s no wonder that “Designing an API rate limiter” is a popular question, in many software design interviews.

Plan of Action

As we’ll see, there are multiple ways and algorithms to build a rate limiter, and each implementation has some strength and some weaknesses.

This is why is essential to know what we want in the beginning.

1. Requirements and Goals of the System
    - Scope the problem
    - Functional Requirements
    - Non-Functional Requirements
2. High-level Design
    - Discuss the components
    - How the components fit together
    - Data Handling
3. Detailed Design
    - Data Modeling
    - Exception Handling
    - Parallel Computing
    - Distributed Environment
4. Algorithms
    - Leaking bucket
    - Token bucket
    - Fixed window
    - Sliding window counter
    - Sliding window log

Functional Requirements

For the rate limiter, the Functional Requirements seem straightforward.

We limit the number of calls that an entity can make to an API, within a given time frame.

Only 2 requests per second allowed, or 10 accounts per day.

However, it’s also important to establish how we communicate to the clients that their requests are being throttled, and how we display the status of the system.

Non-Functional Requirements

  • Performance: The system should be able to handle a high volume of requests and apply rate-limiting rules efficiently, without adding a lot of latency

  • Scalability: The system should be able to scale horizontally to accommodate an increase in traffic or user demand.

  • Reliability: The system should be able to continue functioning even in the face of failures, such as network outages or hardware failures.

  • Configurability: The system should be easily configurable, allowing administrators to define and enforce rate-limiting rules.

  • Monitoring: The system should provide visibility into its performance, allowing administrators to monitor and optimize its behavior.

  • Robustness: The system should be able to handle malformed or malicious requests, protecting against attack vectors such as denial of service (DoS) attacks.

Rate Limiter Components 🧱

Request counter: This component counts the number of requests received within a specified time period, typically a sliding window of time.

Rule engine: This component defines the rate-limiting rules, such as the maximum number of requests allowed per second, minute, or hour.

Throttling mechanism: This component enforces the rate-limiting rules, blocking or delaying requests that exceed the specified rate limit.

Request classification: This component classifies incoming requests into different categories based on factors such as the client's IP address, user agent, or URL.

Storage: This component stores the request counter data and other information needed for rate limiting, such as the current state of the system. This data can be stored in-memory, on disk, or in a database.


Leaking bucket 🚰


Leaking bucket algorithm is very simple and elegant.

It’s used widely used in computer networks, and it’s also the default rate limiter for the NGINX web server.

Here, the analogy is with a bucket, where water is poured in, at the top, and water also leaks from the bottom;

If the rate of pouring in water exceeds the rate of water leaking, then the bucket overflows.

This simple engineer concept will smooth out traffic, and it will allow it to pass only at a constant rate.

Basically, the bucket is a queue where requests wait to be processed, according to a first‑in‑first‑out (FIFO) scheduling algorithm.

When a request arrives the system first checks if the queue is not full before adding it.

  • At regular intervals - Requests are pulled from the queue and processed.
  • In case of overflow, the requests are dropped.

This algorithm takes two parameters:

  • Bucket, or queue size. It defines the number of maximum requests, that can wait in the queue.
  • and, the Process rate: which specifies how many requests can be treated at a fixed rate; Usually, in seconds.

Let’s see what are pros and cons, of this algorithm.

👍 Pros:

  • first, it’s Memory efficient - given the limited queue size.
  • then, this algorithm, It’s suitable for use cases, when a stable processing rate is necessary.

👎 Cons:

  • if the queue is full of old requests that are not processed in time, then newer requests will be rate limited.
  • the two parameters, process rate and bucket size, which are not always easy to adjust properly

Token bucket algorithm 🪙

The token bucket algorithm is similar to leaky bucket one, but because it uses tokens, it can process some bursts of requests.

In contrast, leaky bucket algorithm always process requests at constant rate.


This is not good or bad. It will depend on what are the needs.

Amazon Elastic Cloud and Amazon API Gateway, both are using this algorithm to throttle their API requests.

Let’s see how it works.

As the name suggests, we have a token bucket, or a container that has pre-defined capacity.


Then we add tokens into the bucket, at periodic moments of time.

The bucket can store tokens until it gets full.

If extra tokens are added then they will overflow.

Then, when some requests come they will consume one token each.

If there are tokens in the bucket, then the requests will be processed.

If they bucket is empty, then new requests will be dropped.

So, this algorithm will take two parameters:

  • the Bucket size: which represents the maximum number of tokens allowed in the bucket
  • and, the Refill rate: which is the number of tokens put into the bucket periodically

How many buckets do we need?

This depends on our rate-limiting rules. Here are a few examples. It is usually necessary to have different buckets, for different API endpoints. For instance, if a user is allowed to make 3 accounts per day, add 5 articles per hour, and comment 50 times per hours, then 3 buckets are required for each user.

Let’s see which are the pros and cons of this algorithm.

👍 Pros

  • it requires Tiny memory usage, since, we have only two numbers per counter
  • this algorithm allows a burst of traffic for short periods. The burst size will depend on the numbers of tokens available.

👎 Cons:

  • the two parameters, refill rate and bucket size, are not always easy to adjust properly

Fixed window

Fixed window algorithm, is one of the most basic rate limiting mechanisms.

How does it work?

The algorithm divides the timeline into fix-sized time windows.

Then we assign the same capacity for each window.

For example, we can allow maximum 4 request per second.


When a request will come it will increment the counter by one.

This algorithm accepts requests until the capacity is reached, for the given time window.

Further requests will be dropped until a new time window starts.


However, this algorithm has a major flow.

Let’s zoom in, on the timeline.


Now, if 4 requests are coming in the last half, of a second, and another 4 requests will come, in the first half of the next second, then this algorithm will accept them.

However, in this time interval of 1 second, 8 requests were allowed.

That’s twice the allowed capacity for a time window.

Let’s see the pros and cons:

👍 Pros:

  • This algorithm it’s Easy to implement
  • It has Small memory footprint, ’cause all that is being done, is storing the counts.

👎 Cons

  • However, it’s not accurate. It can permit more requests to go through, than the configured quota.

Sliding window counter

The natural thing to do to improve, the fixed-size window algorithm, is to use a sliding window.

Again, we configure the rate limiter, to allow a maximum of 4 requests per second.


Now, let’s consider that in the current windows of second, there are already 2 requests; and in the previous second, we had 1 request.

Let’s say that a new request comes in at a 50% position in the current window.

Now, we’ll have to consider a sliding window of 1 second that ends just at the new request timestamp.

Then, the number of requests in the rolling window is calculated using the following formula:


Using this formula, we get 3.5 requests for the sliding window.

In this case we get a decimal result which we can round up if we want to be more rigid,

Or we could round it down to allow 1 more request per window.

Since the rate limiter allows a maximum of 4 requests per second, per sliding window, the last request can go through.

However, this algorithm assumes a constant rate of requests during the previous sampling period.

For instance, here it was assumed a constant rate of 1 for the previous second.

However, if that 1 request happened in the first part of the second; Then, a more accurate result of the sliding would have been 3.

The algorithm can be improved, but in practice, the simple formula we used, proved to be good enough, for most cases.

A slightly improved version of this, is used at Cloudfare to handles efficiently billions of requests per day.

Let’s see the pros and cons, of this algorithm:

👍 Pros

  • solves the accuracy problems of fixed window implementation.
  • this algorithm is nice on resources, since it has a small memory footprint.
  • it avoids the starvation problem of the leaky bucket algorithm

👎 Cons

  • it does not calculate an exact rate for the sliding window. It assumes that requests in the previous window, are evenly distributed. However, as mentioned, the difference in actual result is so small it’s negligible.

Sliding window log

The algorithm of sliding window log fixes the issue of perfect accuracy.

In this algorithm, we use log or a register.

Here, we define the capacity as the number of allowed entries in the log per timeframe.

Let’s say the rate limiter allows 2 requests per second.


When a new request comes in, the first thing the algorithm does is to remove outdated timestamps.

Outdated timestamp, means those older than current time window, in this case older than 1 second.

We have nothing to remove yet.

Second, it adds the timestamp of the request, to the log (1669200000100).

Since the log is empty when the new request arrives, the request is allowed to pass.

Then, a new request arrives 100 milliseconds later. (1669200000200).

We have no timestamp to remove yet.

And, the new timestamp is inserted into the log.

After the insertion, the log size is 2, not larger than the allowed count.

Thus, the request is allowed.

Then, another request arrives 100 milliseconds later.

Still nothing to remove, all timestamps are no older than 1 second.

After the insertion, the log size becomes 3, larger than the allowed size of 2.

Therefore, this request is rejected, even though the timestamp remains in the log.

Finally, a request arrives 900 milliseconds later. (1669200001200)

If we look at the time, the last 2 requests are in the range of last 1 second, but the previous requests are outdated.

So, the first two outdated timestamps are removed from the log.

Then, we insert the new timestamp.

After the remove operation, the log size becomes 2; therefore, the request is accepted.

Let’s see the pros and cons of this algorithm.

👍 Pros:

  • a rate limiter implemented with this algorithm is very accurate. In any rolling window, requests will not exceed the rate limit.
  • this implementation, also avoids the starvation problem of the leaky bucket.

👎 Cons:

  • this algorithm is not memory efficient. It eats a lot of memory. Even in the case when the request is rejected, its timestamp is still stored in memory.

What algorithm is best?

Even though we didn’t get really deep into the algorithm implementations, it helped to understand them at high-level, in order to choose the right one for a right use case.

As we saw, each algorithm has its pros and cons.

However, 2 algorithms stood out in all categories.


The token bucket algorithm is good all around, but it needs strong transactional support, when high concurrency is needed.

However, sliding window counter offers the same benefits, although it requires a basic transactional support for high concurrency.

Detailed Design

Configuring the Rate Limiter

Now, it would be great if the system would allow to configure different rate liming rules, for different use cases.

For example, one rule might be to limit the number of accounts a user can create to 5 per day.

Then, with another rule we might limit a user to create no more than 5 posts per hour.

  - action: create_account
      unit: day
      requests_per_unit: 5
  - action: create_post
      unit: hour
      requests_per_unit: 5

First, where do we store these rules?

Since we probably want to change the rules often, it would make sense to define them outside the code.

A good option is to save these rules on a configuration file like YAML and save them on disk.

Here, we have an example of the previous rules.

Furthermore, for fast access we can put these rules in a cache, and write-back to disk only at periodic intervals of time.

Rate Limiter Exception Handling

Traditionally, web applications respond with status code 429, too many requests if the rate limit is surpassed.

For other use cases, for instance when a system is overloaded we might add the extra calls to a queue, like Kafka, so that we could process the requests later.

In this case we might respond, with a status code HTTP 202, Accepted.

Furthermore, to communicate better the status of the rate limiter, we can add more, than just the HTTP status code.

For example, for Rate Limiting, Twitter and Github, are using the following HTTP headers in the response.

For different resources, they’re using at least the following 3 headers:

X-Ratelimit-LimitLet the client know how many calls he can make per time window.
X-Ratelimit-RemainingPut information about the remaining number of allowed requests, within the time window.
X-Ratelimit-ResetShows the timestamp of the current time window.

Race condition 🏁

In an environment where we have high concurrency, we almost always encounter race conditions.🏁

In general, a rate limiter does 2 things for each requests:


First, it checks the current counter value to see if the requests can be served;

Second, it increments the counter by 1, and writes the value back into memory.

Now, let’s consider that two requests read the counter at about the same time.

They both read the same value, let’s say it’s 4.

Then, they both increase the value read with 1.

And, finally they try to write back the new value.

However, these requests don’t know about each other without synchronization.

In this case, both threads believe they write back the correct counter with value 5.

But the correct value of the counter should be 6.

One strategy to solve this problem, is to use Locks.

We lock on a certain resource, while it’s being used by another request.

Then, we wait until the lock is released for that resource.

However, this kind of synchronization may create blocking situations which might slow down the system.

Therefore, when we use locks we have to be careful to reduce the area that needs synchronization.


In a nutshell, a rate limiting implementation is a traffic control mechanism.

It has rules and policies set by the API’s operator, or owner, to ensure that the service has constant performance and always-on availability.

This should be valid even when the service is hit by high spikes in the number of requests.

Therefore, a rate limiter is a good addition for both security, and quality control.