Payment System Design

How does a payment system work?

Before diving into the interview questions, it’s useful to have a high-level view of the payment systems used globally.

Normally, a customer places an order on a merchant website. To complete it, the customer has to provide the payment information.
Next, the merchant sends the customer to the payment form page to enter the payment details. Normally, this form page is provided by the Payment Gateway.

To operate legally, the Payment Gateway has to satisfy many compliance rules, including PCI DSS and GDPR. The Gateway also has several other functions.

For instance, it can forward the request to advanced verification services for risk and fraud prevention. We’ll discuss this later in more detail.

So, the main function of the payment gateway is to validate the customer’s financial credentials and pass them on to the merchant’s bank.

Now, the cardholder’s information is transmitted to the acquiring bank. This is the bank that processes card payments on behalf of a merchant. At this point, we can also define a Payment Service Provider (PSP). This is a broader term for a third-party company that helps businesses accept payments safely and securely. A PSP usually offers services such as risk management, reconciliation tools, and sometimes even order management. A PSP can also be the acquiring bank, but not necessarily.
Next, the acquiring bank captures the transaction information, performs basic validation, and routes the request through the appropriate card network to the cardholder’s issuing bank for approval.
Finally, the issuing bank (the customer’s bank) receives the transaction information and responds by approving or declining the transaction. It checks that the transaction information is valid, that the cardholder has sufficient balance to make the purchase, and that the account is in good standing.
Then, the transaction follows the same route back to the merchant. The merchant receives the status of the transaction, which is also displayed to the customer.
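
As a rough illustration, the issuing bank’s decision can be sketched as three checks. This is a minimal sketch, not how any real issuer works: the `Account` class, the `authorize` function and the "APPROVED" / "DECLINED" strings are names made up for this example.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Account:
    """Hypothetical view of a cardholder account held by the issuing bank."""
    card_number: str
    balance: Decimal
    in_good_standing: bool = True


def authorize(account: Account, card_number: str, amount: Decimal) -> str:
    """Return "APPROVED" or "DECLINED" using the three checks described above."""
    # 1. The transaction information must be valid.
    if card_number != account.card_number or amount <= 0:
        return "DECLINED"
    # 2. The account must be in good standing.
    if not account.in_good_standing:
        return "DECLINED"
    # 3. The cardholder must have sufficient balance for the purchase.
    if account.balance < amount:
        return "DECLINED"
    return "APPROVED"
```

The result then travels back through the card network and the acquiring bank to the merchant.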

Non-functional Requirements

A payment system is easy to understand at the functional level.

It needs to move money from account A to account B.

However, what’s more difficult is making the system reliable, especially when unexpected situations arise.

A small slip could potentially cause significant revenue loss.

And there are a lot of things to consider when building a reliable payment system.

In this article, we’ll focus more on the technical concepts as they are applicable to almost every system.

The business part would depend on each particular system.

We’ll also see how to handle a large throughput of payment requests.

Payment System Components 🧱

Let’s say we need to build a payment system for an online store.

Then, we should provide at least the following core features.

First, when a user clicks the “place order” button, a payment event is generated and sent to the payment service. This service coordinates the payment process. It starts by storing the payment event in the database.
Second, the payment service calls an external payment service provider (PSP) to process the card payment. When we call the PSP, we should provide at least the monetary amount and the currency. These are normally captured from the checkout page.
The user will see the payment page. This is where the payment details are collected. The common way of collecting and forwarding payment data is via a form page, provided by the PSP.
Then, the main function of the PSP is to send card details to banks or card schemes.
After the PSP has successfully processed the payment, the payment service updates the wallet. We use the concept of a wallet to keep track of the merchant’s account balance.
Then, the updated balance information of the wallet is also stored in the database.
Then, after the wallet service has successfully updated the merchant balance, the payment service updates the ledger. This involves logging all the financial transactions record by record.
The ledger service appends the new information to a database. This is important in post-payment analysis when calculating the total revenue of the e-commerce website or to support auditing.
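
The steps above can be sketched as a small coordinator. This is only an illustration under simplifying assumptions: the database, wallet and ledger are in-memory structures, the PSP is a callable passed in by the caller, and all names (`PaymentEvent`, `PaymentService`, the status strings) are invented for the example.

```python
import uuid
from dataclasses import dataclass, field
from decimal import Decimal
from typing import Callable


@dataclass
class PaymentEvent:
    order_id: str
    merchant_id: str
    amount: Decimal
    currency: str
    payment_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    status: str = "NOT_STARTED"


class PaymentService:
    """Coordinates: store event, call PSP, update wallet, append to ledger."""

    def __init__(self, psp_charge: Callable[[Decimal, str], bool]):
        self.psp_charge = psp_charge   # stand-in for the external PSP call
        self.events = {}               # stand-in for the payment database
        self.wallets = {}              # merchant_id -> balance
        self.ledger = []               # append-only list of records

    def handle(self, event: PaymentEvent) -> str:
        # 1. Store the payment event before doing anything else.
        self.events[event.payment_id] = event
        event.status = "EXECUTING"
        # 2. Ask the PSP to process the payment (amount and currency).
        if not self.psp_charge(event.amount, event.currency):
            event.status = "FAILED"
            return event.status
        # 3. Update the merchant's wallet balance.
        current = self.wallets.get(event.merchant_id, Decimal("0"))
        self.wallets[event.merchant_id] = current + event.amount
        # 4. Append a record to the ledger for auditing and revenue analysis.
        self.ledger.append(
            (event.payment_id, event.merchant_id, event.amount, event.currency)
        )
        event.status = "SUCCESS"
        return event.status
```

In a real system each step would be a separate service with its own storage, and the wallet and ledger updates would need to be made consistent with each other; the sketch ignores that.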

Why Asynchronous Payments❓

Scalability: Asynchronous communication allows the system to handle a large number of requests and transactions without blocking the main thread or causing a bottleneck in the system. This allows the system to scale up and down as needed to handle varying traffic levels.

Performance: By using asynchronous communication, the system can process transactions more efficiently and quickly, without waiting for responses from external components. This can improve the overall performance of the system and reduce latency.

Fault tolerance: Asynchronous communication allows the system to handle errors and failures in a more robust and fault-tolerant way. If an external component fails to respond or times out, the system can retry the request or send it to a backup component.

Loose coupling: Asynchronous communication promotes loose coupling between internal and external components, making it easier to modify or replace components without affecting the rest of the system. This can make the system more flexible and adaptable to changing business requirements.

Asynchronous processing: Asynchronous communication enables the system to perform background processing of payments and transactions, freeing up resources for other tasks. This can reduce the amount of time it takes to complete a transaction and improve the overall efficiency of the system.

However, in some use cases, we cannot proceed without a response from the upstream service.

For instance, in-store payments require real-time authorization from the API. We need to know immediately whether the payment succeeded or failed.

But, we should use synchronous communication only if there is no other way.

In most cases, we should prefer asynchronous communication.
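
To make the idea concrete, here is a minimal sketch of asynchronous handling using only the Python standard library. The request handler puts the payment on a queue and returns right away, and a background worker does the slow part. The names `submit_payment`, `worker` and `run_demo`, and the `process` callable that stands in for the external call, are assumptions made for this example.

```python
import queue
import threading

payment_queue = queue.Queue()
results = {}   # payment_id -> latest known status


def submit_payment(payment_id: str, amount_cents: int) -> str:
    """Enqueue the request and return at once, without waiting for the PSP."""
    results[payment_id] = "PENDING"
    payment_queue.put((payment_id, amount_cents))
    return "ACCEPTED"


def worker(process) -> None:
    """Background consumer; `process` stands in for the slow external call."""
    while True:
        item = payment_queue.get()
        if item is None:               # sentinel: stop the worker
            payment_queue.task_done()
            break
        payment_id, amount_cents = item
        try:
            ok = process(amount_cents)
            results[payment_id] = "SUCCESS" if ok else "FAILED"
        except Exception:
            # A crashing call must not kill the worker or block the queue.
            results[payment_id] = "FAILED"
        finally:
            payment_queue.task_done()


def run_demo() -> dict:
    """Start one worker, submit two payments, wait for the results."""
    t = threading.Thread(
        target=worker, args=(lambda cents: cents > 0,), daemon=True
    )
    t.start()
    submit_payment("p-1", 1999)
    submit_payment("p-2", 0)
    payment_queue.join()               # wait until queued work is done
    payment_queue.put(None)
    t.join()
    return dict(results)
```

The caller only learns that the request was accepted. The final status has to be delivered later, for example by polling or a notification, which is exactly why this style does not fit the in-store case above.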

Dealing with Payment Failures

In a payment system, we can encounter at least the following kinds of issues:

System failures 💥 - the usual network and server failures
Poison pill errors 💊 - when an inbound message cannot be processed or consumed
Functional bugs 🐞 - when there are no technical errors, but the results are invalid

So, implementing a reliable payment system comes with a lot of challenges.

The good news is that we have many tools at our disposal to deal with these impediments.

Guarantee transaction completion ✅

To guarantee transaction completion, we can use a message queue such as Apache Kafka.

For every order placed or paid, we also create an order event in Kafka.

This component will help us persist communication messages, so that they are not lost even when things don’t go as planned.

In this case, the payment operation is not considered complete until the event is safely stored in the message queue.

Now, we might consider that Kafka can also fail.

However, its job at this point is simple: it only stores messages. So its availability is normally much higher than that of business-related services (for example, 99.999%).
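
The rule "no completion without a stored event" can be sketched as follows. To keep the example self-contained, it does not talk to Kafka at all: `InMemoryEventLog` is a stand-in invented for this sketch, and with a real Kafka producer the equivalent step would be waiting for the broker to acknowledge the write.

```python
class EventLogUnavailable(Exception):
    """Raised when the (hypothetical) event log cannot acknowledge a write."""


class InMemoryEventLog:
    """Stand-in for a Kafka topic, used only for illustration."""

    def __init__(self, available: bool = True):
        self.available = available
        self.events = []

    def append(self, event: dict) -> int:
        if not self.available:
            raise EventLogUnavailable("event log did not acknowledge the write")
        self.events.append(event)
        return len(self.events) - 1    # offset of the stored event


def complete_payment(log: InMemoryEventLog, order_id: str, amount_cents: int) -> str:
    """Report completion only after the order event has been stored."""
    event = {"type": "order_paid", "order_id": order_id, "amount_cents": amount_cents}
    try:
        log.append(event)
    except EventLogUnavailable:
        # The event was not persisted, so the operation is not complete.
        return "NOT_COMPLETED"
    return "COMPLETED"
```

Downstream services can then read the stored events at their own pace, and a crash in one of them does not lose the payment.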

Dealing with Transient Failures

Retry Strategies

A customer may try to make a payment, but the request fails due to an unstable network connection.

In those cases, it makes sense to retry the operation because network problems are usually temporary and on the second or third attempt the request might succeed.

Retry Strategies:

Immediate: The most basic retry implementation is to retry immediately after a failure. However, it is unlikely that the issue has been solved in such a short amount of time.

Furthermore, it’s important to give the called service some time to recover if it was down.

Otherwise, we can waste computing resources, and also overload the system.

So, we can retry at fixed intervals of time or better yet at incremental intervals of time.

This gives the system some time to recover.

Still, a more advanced retry strategy is exponential backoff. Here, we double the waiting time after each failed retry, so the wait before retry n (counting from 0) is the base delay multiplied by 2^n.
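
A minimal sketch of exponential backoff could look like this. The function name, the default values and the choice of `ConnectionError` as the "transient" error are assumptions for the example; the `sleep` parameter exists only so the waiting can be replaced in tests.

```python
import time


def retry_with_backoff(operation, max_retries=4, base_delay=0.5, sleep=time.sleep):
    """Call `operation` until it succeeds or `max_retries` retries are used up.

    The wait before retry n (counting from 0) is base_delay * 2**n seconds.
    """
    attempt = 0
    while True:
        try:
            return operation()
        except ConnectionError:
            if attempt == max_retries:
                raise                  # give up and surface the error
            sleep(base_delay * 2 ** attempt)
            attempt += 1
```

With `base_delay=0.5`, the waits are 0.5, 1, 2 and 4 seconds. Only errors that are likely to be temporary should be retried this way; a declined card will not succeed on the next attempt.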

Timeout Pattern ⏱

First, we should set timeouts at a level that allows for slower responses while still avoiding waiting indefinitely.

However, there are other challenges when dealing with timeouts in payment systems.

When a request times out, it is treated as failed, even though the payment may actually have gone through. This can lead to double charging or an incorrect status.

To avoid these issues we can use idempotency together with retry strategies.
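
The sketch below shows why the combination helps. `FakePsp` is a made-up PSP that charges the customer but loses its first response, so the caller sees a timeout. Because the retry carries the same idempotency key, the money moves only once. All names here are hypothetical.

```python
class FakePsp:
    """Hypothetical PSP that remembers idempotency keys.

    Its first response is lost, which the caller sees as a timeout.
    """

    def __init__(self):
        self.charges = {}              # idempotency_key -> amount_cents
        self.drop_next_response = True

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        if idempotency_key not in self.charges:
            self.charges[idempotency_key] = amount_cents   # money moves once
        if self.drop_next_response:
            self.drop_next_response = False
            raise TimeoutError("no response within the deadline")
        return "SUCCESS"


def pay(psp: FakePsp, idempotency_key: str, amount_cents: int, max_attempts: int = 3) -> str:
    """Treat a timeout as "outcome unknown" and retry with the SAME key."""
    for _ in range(max_attempts):
        try:
            return psp.charge(idempotency_key, amount_cents)
        except TimeoutError:
            continue
    # Still unresolved: the status has to be checked before reporting failure.
    return "UNKNOWN"
```

The important detail is that the key is created once per payment, not once per attempt. A new key on every retry would defeat the purpose.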

Fallbacks

The Fallback pattern allows a service to continue execution even if requests to another service fail, by filling in a fallback value.

This compromise between risk and customer satisfaction can be used to avoid losing customers, but if a fallback value is not acceptable, other solutions must be considered.
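
As an illustration, suppose the payment flow asks a risk service for a score and that service is down. The sketch below fills in a default score, but refuses to guess for large amounts. The default score, the threshold and the amount limit are arbitrary values chosen for the example, and the function names are hypothetical.

```python
DEFAULT_RISK_SCORE = 0.5   # assumed neutral score; picking it is a business decision


def get_risk_score(fetch_score, fallback=DEFAULT_RISK_SCORE):
    """Return (score, used_fallback); fall back if the risk service fails."""
    try:
        return fetch_score(), False
    except (ConnectionError, TimeoutError):
        return fallback, True


def should_accept(fetch_score, amount_cents, threshold=0.7, fallback_limit_cents=5000):
    """Accept low-risk payments; with a fallback score, only small ones."""
    score, used_fallback = get_risk_score(fetch_score)
    if used_fallback and amount_cents > fallback_limit_cents:
        # A fallback value is not acceptable here, so do not guess.
        return False
    return score < threshold
```

Returning a flag next to the value keeps the decision visible, so the caller can log it or apply stricter rules when a fallback was used.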

Dealing with Persistent Failures

If the error is due to incompatible information, it should be saved for later debugging. This can be done by isolating problematic messages in a dead letter queue.

If the error is due to a service being down, the transactions can be stored in a persistent queue until the service recovers and can process them.
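
A dead letter queue can be sketched in a few lines. Here, messages are JSON strings and the queue is any object with an `append` method; both are assumptions for the example, as are the function names and the message fields.

```python
import json


def apply_payment(totals: dict, payment: dict) -> None:
    """Hypothetical handler: add the amount to the merchant's total."""
    amount = int(payment["amount_cents"])
    if amount <= 0:
        raise ValueError("amount must be positive")
    merchant = payment["merchant_id"]
    totals[merchant] = totals.get(merchant, 0) + amount


def consume(messages, handle, dead_letters) -> int:
    """Process each raw message; park the ones that cannot be processed."""
    processed = 0
    for raw in messages:
        try:
            handle(json.loads(raw))
            processed += 1
        except (json.JSONDecodeError, KeyError, TypeError, ValueError) as exc:
            # Poison pill: retrying will not help, so keep it for debugging.
            dead_letters.append({"raw": raw, "error": repr(exc)})
    return processed
```

The consumer keeps going after a bad message instead of getting stuck on it, and the parked messages keep the original payload plus the error for later inspection.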

Idempotency

An idempotent operation has no additional effect if it is called more than once with the same input parameters.

To avoid double payments, an idempotency key is generated at the client and added to the HTTP header.

UUIDs are commonly used as idempotency keys, and the same key is used as the ID of the payment order.

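To support idempotency, we can rely on a unique key constraint in the database. If the idempotency key is stored in a column with a unique constraint, a second insert with the same key is rejected, even when requests with the same key arrive concurrently. In effect, this gives an "exactly once" guarantee for the payment order.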
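
Here is a minimal sketch of this idea using SQLite from the Python standard library. The table name, the column names and the `create_payment` function are invented for the example; any database with unique constraints would work in a similar way.

```python
import sqlite3
import uuid


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with one payment order table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE payment_order ("
        " idempotency_key TEXT PRIMARY KEY,"   # the unique constraint
        " amount_cents INTEGER NOT NULL,"
        " status TEXT NOT NULL)"
    )
    return conn


def create_payment(conn: sqlite3.Connection, idempotency_key: str, amount_cents: int) -> bool:
    """Insert the payment order; return False if the key was already used."""
    try:
        with conn:   # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO payment_order (idempotency_key, amount_cents, status)"
                " VALUES (?, ?, 'CREATED')",
                (idempotency_key, amount_cents),
            )
        return True
    except sqlite3.IntegrityError:
        return False   # duplicate request: do not charge again
```

A client would generate the key once, for example with `str(uuid.uuid4())`, and send the same value on every retry of that payment.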

Security

Enforce Encryption for Data-at-Rest and Data-in-Transit

One of the most effective ways to protect data is encryption.

Encrypt data at rest, i.e. convert it into a secure format that needs a key to be read. This can be done with software tools for disk or database encryption.
Encrypt data in transit over a network, such as the internet. Use a VPN for secure and encrypted connections between a device and a network, and TLS for confidentiality, data integrity, and authentication between two parties.
Use access controls to restrict data access only to authorized users. Use methods such as two-factor authentication to verify user identities.
Regularly update software, libraries, and the operating system to avoid known software vulnerabilities.
Back up data regularly to ensure it can be recovered in case of loss or damage.
Use long, complex passwords that are difficult to guess or crack. Avoid common passwords like "password", which attackers can easily recover using a rainbow table, a precomputed table for reversing password hashes.

Data Integrity Monitoring

It's important to monitor data integrity by regularly checking for changes in vulnerable data and generating security alerts.

This technique can help detect malware and other security threats.

However, it can be resource-intensive, so it's important to focus on monitoring the most vulnerable and confidential data, such as user credentials and encryption key stores.
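
One simple way to check for changes is to compare checksums against a known-good baseline. The sketch below does this for a small set of named byte strings; in practice these would be files or database records. The function names and the alert format are assumptions for the example.

```python
import hashlib


def snapshot(records: dict) -> dict:
    """Map each monitored item name to the SHA-256 digest of its bytes."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in records.items()}


def detect_changes(baseline: dict, records: dict) -> list:
    """Return alerts for items that were modified, added or removed."""
    current = snapshot(records)
    alerts = []
    for name in sorted(set(baseline) | set(current)):
        if name not in current:
            alerts.append((name, "removed"))
        elif name not in baseline:
            alerts.append((name, "added"))
        elif baseline[name] != current[name]:
            alerts.append((name, "modified"))
    return alerts
```

The baseline itself has to be stored somewhere an attacker cannot change it, otherwise the check proves nothing.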

Conclusions

For a payment system, reliability and fault tolerance are key requirements.

To tackle these requirements, we discussed how to use the following tools, among others:

Redundancy - to enable resilience during internal system failures
Patterns for payment guarantee - by using Kafka capabilities to persist messages, so that they are not lost even if the messaging system crashes
Strategies for retry, timeouts and fallbacks - to make the systems robust and predictable
Message Queues - to avoid overloading the services
Idempotent message handling - to allow clients to retry requests as needed. Without it, retries could leave data in an inconsistent state or, worse, cause double payments.