REST API - Best Practices - Performance
By Lucian Oprea (@LucianDSA_)
Introduction
Fast responses are crucial for user satisfaction on the internet. Users quickly abandon slow-loading sites, much like a kangaroo bouncing away.
With countless ways to enhance API performance, we'll focus on the most impactful strategies that offer the greatest benefits with minimal implementation effort.
Compression 🗜️
The What?
Data compression can significantly speed up response times. The improvement depends on the compression ratio, which varies by content type:
- Text-based responses (JSON, XML, HTML, plain text): Up to 10x size reduction, potentially 10x faster
- Binary files (images, videos): Up to 50% size reduction, potentially 2x faster
The Why?
HTTP compression is especially beneficial for:
- High-latency connections (long-distance or satellite)
- Scenarios where multiple server locations aren't feasible
- Large file transfers
Note: While compression requires more CPU resources, the performance gains often outweigh this cost for network-bound applications.
The How?
To implement HTTP-level compression:
Server-side (API Provider):
Enable compression on the server. Most web servers and frameworks offer built-in features or plugins for this. For instance, in Nginx:
gzip on;
gzip_types text/plain text/html text/css application/javascript;
Set the proper HTTP headers to indicate compression, like the "Content-Encoding" header specifying the compression algorithm (e.g., "gzip" or "deflate").
Client-side (API Consumer):
Add an "Accept-Encoding" header in the HTTP request to specify the desired compression (e.g., "gzip, deflate"). Omitting this means the client prefers uncompressed responses.
Ensure the client handles decompression. Browsers handle this automatically, but for custom applications, configure your HTTP client library to manage it.
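As a concrete illustration, here is a minimal client-side sketch in Python, assuming the `requests` library and a hypothetical endpoint:

```python
import requests

# Hypothetical endpoint, used for illustration only.
url = "https://api.example.com/v1/reports"

# requests sends "Accept-Encoding: gzip, deflate" by default;
# setting it explicitly just makes the intent visible.
response = requests.get(url, headers={"Accept-Encoding": "gzip, deflate"})

# The library decompresses the body transparently, so response.content
# is already the plain payload regardless of the encoding used.
print(response.headers.get("Content-Encoding"))   # e.g. "gzip"
print(len(response.content), "bytes after decompression")
```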
Async communication ✉️
The What?
Asynchronous requests greatly enhance throughput and scalability by allowing multiple requests to be processed simultaneously without waiting. This makes applications more efficient and capable of handling more traffic.
In a synchronous system, a user waits for one request to finish before making another. In an asynchronous system, the user can make multiple requests at once, creating the appearance of parallel processing, even though only one thread is usually involved. The result is faster processing.
Let’s explore how this works in an asynchronous system.
The Why?
An e-commerce website typically has a shopping cart feature. When a user proceeds to checkout, the site needs to calculate the total, apply discounts, and check inventory.
These tasks might be handled by separate services. In a synchronous approach, we can process them in two ways:
- Sequentially, where each operation waits for the previous one to finish, causing delays.
- In parallel, using multiple threads. However, managing many threads can consume significant memory and CPU, making this approach inefficient for non-intensive tasks like these. Parallel threads are better suited to heavy computational work such as cryptocurrency mining or machine learning.
The How?
Now, let’s see how this would work in an asynchronous approach:
For each item the user adds to the cart, we make asynchronous HTTP requests for price calculations and inventory checks.
We do this simultaneously without waiting for each response. Here's what happens:
- Concurrent requests are made to the pricing and inventory services, for all items in the cart (Item 1, Item 2, etc.).
- We see that the website doesn't wait for any specific response, and continues to provide a responsive user interface.
- As responses arrive, the website updates the cart's total price and inventory status for each item.
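Here is a rough sketch of that flow in Python, assuming `aiohttp` for asynchronous HTTP; the pricing and inventory URLs are invented for the example:

```python
import asyncio
import aiohttp

# Hypothetical service URLs, used for illustration only.
PRICING_URL = "https://pricing.example.com/price"
INVENTORY_URL = "https://inventory.example.com/stock"

async def fetch_json(session: aiohttp.ClientSession, url: str) -> dict:
    async with session.get(url) as resp:
        return await resp.json()

async def check_item(session: aiohttp.ClientSession, item_id: str) -> dict:
    # The price calculation and inventory check for one item run concurrently.
    price, stock = await asyncio.gather(
        fetch_json(session, f"{PRICING_URL}/{item_id}"),
        fetch_json(session, f"{INVENTORY_URL}/{item_id}"),
    )
    return {"item": item_id, "price": price, "stock": stock}

async def check_cart(item_ids: list[str]) -> list[dict]:
    async with aiohttp.ClientSession() as session:
        # All cart items are processed at once on a single thread; the event
        # loop switches between requests while each one waits on the network.
        return await asyncio.gather(*(check_item(session, i) for i in item_ids))

results = asyncio.run(check_cart(["item-1", "item-2", "item-3"]))
```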
Rate Limiting 🚫
The What?
A REST API with a rate limiter acts like a club bouncer, controlling access and preventing overload.
Most companies, like GitHub, use rate limiting to protect their APIs.
For example, GitHub limits how many requests a user can make per hour, ensuring fair usage and preventing a flood of requests from taking down the API.
The Why?
When a server is overloaded, it can become slow or unstable. Rate limiting helps prevent this by controlling the number of requests, ensuring stable service quality.
For example, it can guarantee a certain response time or allow clients a set number of requests per time period.
It also ensures fairness. Without rate limiting, one client could monopolize resources, affecting others. By treating all clients equally, rate limiting maintains consistent performance for everyone.
The How?
To implement a rate limiter on the server side, follow these steps:
Server-side:
- Define Rate Limit Rules: Set the maximum number of requests allowed per client, endpoint, or time window.
- Choose a Rate Limiting Algorithm: Select an appropriate algorithm, like Token Bucket or Sliding Window, based on your needs.
- Track and Store Request Metrics: Monitor the number of requests per client and store this data in a cache or database.
- Enforce Rate Limits: Before processing requests, check if the client has exceeded the limit. If so, respond with an appropriate status code (e.g., 429 Too Many Requests).
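To make the server-side steps concrete, here is a minimal in-memory Token Bucket sketch in Python; the limits and client identifiers are made up for the example:

```python
import time

class TokenBucket:
    """Allows `rate` requests per second with bursts of up to `capacity`."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens in proportion to the time elapsed since the last check.
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per client. In production this state would live in a shared
# store (e.g. Redis) so that every server instance enforces the same limits.
buckets: dict[str, TokenBucket] = {}

def handle_request(client_id: str) -> int:
    bucket = buckets.setdefault(client_id, TokenBucket(rate=5, capacity=10))
    if not bucket.allow():
        return 429  # Too Many Requests
    return 200      # process the request normally
```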
Client-side:
- Respect Rate Limit Headers: Adjust request behavior based on rate limit information in response headers (e.g., X-RateLimit-Limit, X-RateLimit-Remaining).
- Implement Backoff Strategies: Handle 429 responses gracefully, using retry logic based on the Retry-After header if available.
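On the client side, a simple backoff loop might look like the following (Python with `requests`; the sketch assumes `Retry-After` is given in seconds):

```python
import time
import requests

def get_with_backoff(url: str, max_retries: int = 5) -> requests.Response:
    delay = 1.0
    for _ in range(max_retries):
        response = requests.get(url)
        if response.status_code != 429:
            return response
        # Prefer the server's Retry-After hint; otherwise back off exponentially.
        retry_after = response.headers.get("Retry-After")
        time.sleep(float(retry_after) if retry_after else delay)
        delay *= 2
    raise RuntimeError("Rate limit still exceeded after retries")
```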
HTTP Caching ☁️
The What?
Caching speeds up API responses by temporarily storing and reusing processed data, reducing redundant server requests.
This not only boosts performance but also lightens the load on the server, preventing it from handling the same requests repeatedly.
The Why?
Implementing caching can significantly improve performance by:
- Reducing Response Time: Cached resources are served directly, avoiding database queries or complex computations, which speeds up response times and lowers API latency.
- Saving Bandwidth: Reusing cached responses minimizes data transfer between server and client, especially for large responses or across distant regions.
For REST APIs, caching can be done using HTTP logic without additional systems. The client, like a browser or app, checks if its cached copy is up-to-date.
If unchanged, the server replies with HTTP 304 (Not Modified), allowing the client to use the cached copy, saving transfer time and resources.
The How?
To implement a caching mechanism on the server side, follow these steps:
Server-side:
- Identify Cacheable Resources: Determine which API endpoints can benefit from caching, especially static or infrequently changing resources.
- Set Cache-Control Header: Use the `Cache-Control` header in API responses to enable caching. You can set it to `public` (cacheable by browsers and proxies) or `private` (cacheable only by the client). Define the cache duration using the `max-age` directive (e.g., 1 year is `31536000` seconds).
- Choose a Caching Strategy: Use time-based expiration with `Last-Modified` headers or content-based validation with `ETag` hashes to decide if cached resources are still valid (an ETag-based sketch follows this list).
- Implement Cache Storage: Choose a caching mechanism (in-memory, distributed cache, or a caching server like Redis) to store and retrieve cached responses.
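To tie these steps together, here is a minimal ETag-based sketch using Flask, which is just one possible framework choice; the `/products` endpoint and its data are invented for the example:

```python
import hashlib
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical catalogue data; in practice this would come from a database.
PRODUCTS = [{"id": 1, "name": "Keyboard"}, {"id": 2, "name": "Mouse"}]

@app.get("/products")
def list_products():
    response = jsonify(PRODUCTS)
    # Content-based validator: a hash of the payload acts as the ETag.
    etag = hashlib.md5(response.get_data()).hexdigest()

    # If the client's cached copy matches, skip sending the body entirely.
    if request.headers.get("If-None-Match") == etag:
        return "", 304

    response.headers["ETag"] = etag
    response.headers["Cache-Control"] = "private, max-age=60"
    return response
```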
Client-side:
- Inspect Cache-Control Headers: When receiving a resource for the first time, check the `Cache-Control` header to know whether it can be cached and for how long. Store the resource and save headers like `Last-Modified` or `ETag` for future validation.
- Use Conditional Requests: When requesting the resource again, send conditional headers such as `If-Modified-Since` (for time-based validation) or `If-None-Match` (for content-based validation with ETag). If the server responds with `304 Not Modified`, use the cached version; if it responds with `200 OK`, update the cache (see the sketch below).
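A matching client-side sketch with Python's `requests`, against a hypothetical endpoint:

```python
import requests

url = "https://api.example.com/products"  # hypothetical endpoint

# First request: keep the body and its validators.
first = requests.get(url)
cached_body = first.content
etag = first.headers.get("ETag")
last_modified = first.headers.get("Last-Modified")

# Later request: send the validators back as conditional headers.
headers = {}
if etag:
    headers["If-None-Match"] = etag
if last_modified:
    headers["If-Modified-Since"] = last_modified

second = requests.get(url, headers=headers)
if second.status_code == 304:
    body = cached_body            # nothing changed, reuse the cached copy
else:
    body = second.content         # fresh data, refresh the cache
    etag = second.headers.get("ETag")
```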
Efficient Serialization 📝
The What?
When it comes to serialization, JSON and XML are the most popular formats, supported across all programming languages and platforms.
Serialization is the process of converting application objects (like those from Java or Python) into JSON or XML for storage or network transmission.
Deserialization is the reverse: transforming the serialized representation back into application objects.
The Why?
The downside of JSON and XML is that they are not the most efficient encoding methods. In terms of performance,
Protocol Buffers generally outperform JSON and XML in serialization/deserialization speed and message size. Their binary format allows for significantly reduced payload sizes, improving network transfer speeds, especially when message sizes are large and bandwidth is limited.
Protocol Buffers are similar to HTTP compression in that both reduce payload size, but compression only masks the verbosity of text formats, whereas Protocol Buffers replace it with a compact binary encoding.
If compatibility with existing systems or human readability is a priority, text formats with compression can still be viable options.
The How?
To send data between two applications using Protocol Buffers, follow these steps:
Server-side:
- Define the Protocol Buffers Schema: Create a `.proto` file that defines the message types and their fields using Protocol Buffers syntax. This schema serves as a contract between server and client, with a unique number assigned to each field to identify it in the binary output.
- Generate Code: Use the `protoc` compiler with the `--python_out` argument to generate Python-specific code from the `.proto` file. The generated code provides classes for the defined message types and handles serialization and deserialization.
- Serialize the Data: In the server code, use the generated classes to create and serialize Protocol Buffer messages for transmission over the network, often sending them as request or response content in a REST API (see the sketch after this list).
- Set Appropriate Content-Type Header: Indicate that the response data is in Protocol Buffers format by using the content type "application/x-protobuf".
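A rough sketch of these steps follows; the `CartItem` message, its field numbers, and the module names are illustrative assumptions rather than part of any real API:

```protobuf
// cart_item.proto - hypothetical schema shared by server and client
syntax = "proto3";

message CartItem {
  int32 id = 1;       // unique field numbers identify fields in the binary output
  string name = 2;
  double price = 3;
}
```

Running `protoc --python_out=. cart_item.proto` generates `cart_item_pb2.py`, which the server can use to build and serialize a message:

```python
import cart_item_pb2  # module generated by protoc from cart_item.proto

item = cart_item_pb2.CartItem(id=42, name="Keyboard", price=59.99)

# Serialize to the compact binary wire format; in a REST handler this payload
# would be returned as the response body with
# "Content-Type: application/x-protobuf".
payload = item.SerializeToString()
```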
Client-side:
- Generate Client Code: Use the same `.proto` file with the `protoc` compiler to generate client-specific code in the desired programming language.
- Set Appropriate Accept Header: Specify that the expected response data is in Protocol Buffers format by setting the header to "application/x-protobuf".
- Deserialize Response Data: When the client receives a response, use the generated code to deserialize the binary Protocol Buffers data into the corresponding message object, allowing access to the data through the defined accessors or properties (see the sketch below).
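On the client side, the same generated module can parse the binary response (hypothetical endpoint; `requests` assumed):

```python
import requests
import cart_item_pb2  # generated from the same cart_item.proto as the server

response = requests.get(
    "https://api.example.com/items/42",            # hypothetical endpoint
    headers={"Accept": "application/x-protobuf"},
)

# Parse the binary body back into a message object.
item = cart_item_pb2.CartItem()
item.ParseFromString(response.content)
print(item.id, item.name, item.price)
```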
Long-running requests 📤
The What?
A request that might take a long time to process, such as a complex calculation or bulk processing, should be handled without blocking the client that submitted it.
The Why?
Making clients wait for a response can cause problems:
Reduced Scalability: Long timeouts can decrease performance and resource utilization on both the server and client. As requests hold resources for extended periods, it can lead to resource exhaustion and reduced scalability.
For example, most web servers have a limited number of allowed connections. Long timeouts can occupy connection pools for too long, limiting the availability of connections for new incoming requests. This may result in connection pool exhaustion and prevent new requests from being processed.
The How?
Let’s look at a simple example involving a dataset provided to a server for calculations like mean, median, min, and max, which is a long-running process.
- Return HTTP Status Code 202 (Accepted): This indicates that the request has been accepted for processing but is not yet completed. The web API performs initial checks to validate the request, initiates a separate task for processing, and returns the 202 status code.
- Offload to a Background Task: The request is processed in the background while the client receives the initial response.
- Expose a Status Endpoint: In addition to the main endpoint, create an endpoint for checking the status of the asynchronous request. The client can poll this endpoint at regular intervals to monitor progress.
- Check Results: Once the status shows that processing is done, the client can check the original endpoint for the results.
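Putting the steps together, a minimal sketch with Flask and an in-process background thread might look like this; the endpoints and the in-memory job store are invented for the example (a real service would use a task queue and a database):

```python
import statistics
import threading
import uuid
from flask import Flask, jsonify, request

app = Flask(__name__)
jobs = {}  # job_id -> status/result; illustration only

def compute_statistics(job_id: str, numbers: list[float]) -> None:
    # The long-running work happens outside the request/response cycle.
    jobs[job_id] = {
        "status": "done",
        "result": {
            "min": min(numbers),
            "max": max(numbers),
            "mean": statistics.mean(numbers),
            "median": statistics.median(numbers),
        },
    }

@app.post("/statistics")
def submit_dataset():
    numbers = request.get_json()["numbers"]
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "processing"}
    threading.Thread(target=compute_statistics, args=(job_id, numbers)).start()
    # 202 Accepted: the request is valid and work has started, but is not done yet.
    return jsonify({"status_url": f"/statistics/{job_id}"}), 202

@app.get("/statistics/<job_id>")
def job_status(job_id: str):
    # The client polls this endpoint until the status flips to "done".
    return jsonify(jobs.get(job_id, {"status": "unknown"}))
```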
A common question is: what counts as a long response time? Processing times for long-running tasks can range from seconds to hours or even days. As a best practice, any process taking more than a couple of seconds should be implemented using the asynchronous mechanism described above.