5 Database Models

What database should I choose?

Choosing the data model for an application is essential, as it can have a profound impact on performance, scalability, and maintainability.

While relational databases are still the most popular, alternative models such as graph or wide-column databases may be better suited for certain scenarios.

The document aims to break down the different types of databases and their pros and cons to help readers make an informed decision.

Key-Value Database 🐇

Core Properties

Flexible for Unstructured Data

Imagine we have a large amount of unstructured data.

We can assign each piece of data a unique identifier, which gives us a collection of key-value pairs that can be accessed quickly.

This model is flexible enough to adapt to different types of data and still be efficient.

Fast Lookup

This type of database implements a hash table to store unique keys, along with the pointers to the corresponding data values.

Since this data structure is essentially an index, it is very fast and efficient for data retrieval.

It uses a hash function to quickly calculate the location for storage, based on the key.

Then, it uses the same key to locate the corresponding value in memory, in constant time on average.
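The lookup described above can be sketched with a toy hash table. This is a minimal illustration, not how any particular product is implemented: the class name `TinyKeyValueStore`, the bucket count, and the use of Python's built-in `hash` are all assumptions made for the example.

```python
class TinyKeyValueStore:
    """Toy hash table: a fixed number of buckets, each a list of (key, value) pairs."""

    def __init__(self, bucket_count=8):
        self.buckets = [[] for _ in range(bucket_count)]

    def _bucket_for(self, key):
        # The hash function maps the key to a bucket index.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket_for(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite the value of an existing key
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        # Same key, same hash, same bucket: only one bucket is inspected.
        for existing_key, value in self._bucket_for(key):
            if existing_key == key:
                return value
        return default


store = TinyKeyValueStore()
store.put("user:42", {"name": "Ada"})                # a structured value
store.put("session:abc", b"\x00\x01 opaque bytes")   # an opaque blob
print(store.get("user:42"))
```

The store never inspects the values, which is why it can hold a dictionary and a byte string side by side.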

In-Memory Database

Key-value databases are well-suited for in-memory storage due to their simple data structures, while more complex models like relational and document-based databases may require more memory and processing power.

However, even in-memory storage has limitations, and mission-critical apps may still need to persist data on disk.

Additionally, the size of the data determines how fast it can be retrieved, so smaller data sets are better suited for in-memory storage.

Antipatterns

Not for Complex Data Structures

Simply put, key-value stores are not designed for complex data structures.

If you need to execute dynamic queries or perform complex aggregations across multiple tables, look at a document or relational database instead.

Not for ACID transactions

Key-value databases like Redis are designed for high performance and horizontal scalability, rather than strong transactional consistency.

That said, Redis can execute multiple commands as a single atomic unit using multi-command transactions (MULTI/EXEC) or Lua scripting.

However, it does not provide full ACID guarantees by default.

Reaching ACID properties requires extra configuration and workarounds, which usually come with trade-offs.

Not for Historical Data

And finally, key-value stores are not well suited for data warehousing.

This is because they are not designed to store large amounts of historical data.

They also don’t normally provide features such as data compression and indexing.

Superpower 💪

Caching

Traditional SQL databases were designed for functionality, rather than speed at scale.

So, a cache is often used to store the results of costly queries against the relational database, which reduces latency and significantly increases throughput.

Caching is all about quickly accessing frequently used data, and key-value stores are perfectly designed to do just that.

They retrieve data directly by its unique key, rather than searching through a large dataset.

Also, many key-value stores accept rich data types as values, including lists and hash tables.

Wide-column Database 🎡

Core Properties

Column layout

These databases store data in column families.

Although they look similar to the tables of a relational database, they are not actually tables.

This becomes apparent when we try to query on an arbitrary attribute and find that we can't.

This is because we can search only by the primary key (partition key), similar to key-value stores.

So this model is not optimized for analytical queries that require filtering across multiple columns, joins, or aggregation.

Primary Keys

A primary key consists of a partition key, made up of one or more columns, and zero or more clustering keys, sometimes called sort keys.

For instance, in Cassandra each table is partitioned by a partition key, which is a combination of one or more columns.

In other words, the data model has a built-in tool for splitting the dataset and distributing it across multiple nodes.

The partition key determines which partition, and therefore which node, a row is stored on.

The clustering key determines the sort order of rows within a partition.
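The two roles of the primary key can be sketched in a few lines. This is a simplified model rather than Cassandra's actual implementation: the three-node layout, the MD5-based placement with a plain modulo, and the sensor data are assumptions made for illustration.

```python
import bisect
import hashlib

NODE_COUNT = 3
# Each node maps a partition key to a list of rows kept sorted by clustering key.
nodes = [dict() for _ in range(NODE_COUNT)]


def node_for(partition_key):
    # Hash the partition key to decide which node owns the partition.
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NODE_COUNT


def insert(partition_key, clustering_key, value):
    partition = nodes[node_for(partition_key)].setdefault(partition_key, [])
    bisect.insort(partition, (clustering_key, value))  # keeps rows in sorted order


def read_partition(partition_key):
    # A read by partition key goes to exactly one node.
    return nodes[node_for(partition_key)].get(partition_key, [])


insert("sensor-1", "2024-01-02", 21.5)
insert("sensor-1", "2024-01-01", 20.0)
insert("sensor-2", "2024-01-01", 18.3)
print(read_partition("sensor-1"))
```

Rows for `sensor-1` come back ordered by date even though they were inserted out of order, and the read touches a single node.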

Wide-column databases are highly partitionable, and allow for horizontal scaling at a magnitude that other types of databases cannot achieve.

Denormalized

Wide-column databases store data in denormalized form.

This means that all data related to a particular item is stored together in a single row, rather than being spread out across multiple tables.

This allows for faster data retrieval and easier querying.

You don't have to flip back and forth between multiple tables, and do joins to get all the information you need.

It's all in one place.

However, this comes at the cost of potentially having some duplicated data.

And duplicated data is the root of many data inconsistencies, among other problems we’ll see next.

Antipatterns

Not for random filtering and Rich queries

Looking for a specific piece of data in a large cluster can be like finding a needle in a haystack.

Queries can be slow when scanning a full table, let alone hundreds of tables.

To avoid this, we can create an additional table that uses the search column as its partition key.

The result is one table per query pattern, and therefore duplicated data.
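The cost of keeping one table per query pattern can be shown with two dictionaries standing in for two tables. The table names `users_by_id` and `users_by_email` and the helper `save_user` are hypothetical, and the example deliberately forgets to update one copy.

```python
# Two "tables" holding the same rows, each keyed for a different query pattern.
users_by_id = {}
users_by_email = {}


def save_user(user_id, email, name):
    row = {"user_id": user_id, "email": email, "name": name}
    users_by_id[user_id] = row
    users_by_email[email] = dict(row)   # a second, duplicated copy of the row


save_user(1, "ada@example.com", "Ada")

# An update that touches only one of the two tables...
users_by_id[1]["name"] = "Ada L."

# ...leaves the other copy stale.
print(users_by_id[1]["name"])
print(users_by_email["ada@example.com"]["name"])
```

Both lookups are fast, but the application is now responsible for writing every change to every copy.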

Although wide-column databases are fast for writes, they are not the best option for filtering or analytics queries.

Not for Transaction Processing

For Transaction Processing, consistency is crucial.

However, wide-column databases are only eventually consistent by default.

This means that the data will eventually be consistent across all nodes, but not at the same time, making transactions more expensive in terms of latency and availability.

That's why Cassandra and other wide-column databases offer lightweight transactions, which are still quite expensive in multi-node environments.

Therefore, wide-column databases are not the best option for ACID transactions.

Superpower 💪

High scalability

The superpower of wide-column databases is how easily they scale horizontally by adding new nodes.

Cassandra uses consistent hashing and virtual nodes to distribute data evenly across the cluster when new nodes are added, minimizing the amount of data that needs to be moved.
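The idea behind consistent hashing with virtual nodes can be sketched as a sorted ring of tokens. This is a teaching model, not Cassandra's code: the `HashRing` class, the MD5-based `token` function, the 64 virtual nodes per machine, and the node names are assumptions made for the example.

```python
import bisect
import hashlib


def token(value):
    # Map any string to a position on the ring.
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, vnodes_per_node=64):
        self.vnodes_per_node = vnodes_per_node
        self.ring = []  # sorted list of (token, node_name) virtual nodes

    def add_node(self, name):
        # Each physical node owns many small slices of the ring.
        for i in range(self.vnodes_per_node):
            bisect.insort(self.ring, (token(f"{name}#{i}"), name))

    def owner(self, key):
        # The owner is the first virtual node clockwise from the key's token.
        index = bisect.bisect_right(self.ring, (token(key), ""))
        return self.ring[index % len(self.ring)][1]


ring = HashRing()
for name in ("node-a", "node-b", "node-c"):
    ring.add_node(name)

keys = [f"key-{i}" for i in range(10_000)]
before = {key: ring.owner(key) for key in keys}

ring.add_node("node-d")
moved = [key for key in keys if ring.owner(key) != before[key]]
print(f"{len(moved) / len(keys):.0%} of keys moved, all to node-d")
```

Only the keys that now fall into a slice owned by the new node change owner. Everything else stays where it was, which is what keeps data movement low when the cluster grows.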

This makes it possible to scale horizontally as much as needed, as demonstrated by Apple's use of thousands of Cassandra clusters with hundreds of thousands of nodes and petabytes of data storage for multiple use cases.

Optimized for Writes

Wide-column databases use a write-optimized storage architecture that allows for fast writes, and their partitioned design allows writes to be executed in parallel on multiple nodes.

Document Database 📑

Core Properties

Denormalized

A document database stores all the information related to an entity in a single document, avoiding the need to join data across multiple tables.

This approach is known as denormalization and is used in databases like MongoDB.

While it's a more convenient way to handle data, it can lead to data duplication and inconsistencies if not managed carefully.

To avoid issues, it's important to choose the right use case for a document database.

If you have many relations between different entities, this approach may not be the best option.
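To make the single-document layout concrete, here is a sketch of an order stored as one document. The field names and values are hypothetical, and a plain Python dictionary stands in for the stored document.

```python
import json

# Everything about the order lives in one document: no joins needed to read it.
order = {
    "_id": "order-1001",
    "customer": {"name": "Ada", "email": "ada@example.com"},
    "items": [
        # Product names are copied into the order, so they are duplicated
        # in every order that contains the same product.
        {"sku": "KB-01", "name": "Keyboard", "qty": 1, "price": 49.0},
        {"sku": "MS-02", "name": "Mouse", "qty": 2, "price": 19.5},
    ],
}

total = sum(item["qty"] * item["price"] for item in order["items"])
print(json.dumps(order, indent=2))
print(total)
```

Reading the order is a single lookup, but renaming a product means updating every order document that embeds it.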

Handle Unstructured Data

Storing data in any format speeds up development by eliminating the need to define schemas and create tables.

However, without proper constraints, maintaining consistency in data across different documents can be difficult, limiting the types of queries that can be performed.

For complex use cases, it's important to carefully consider data modeling and ensure the presence of appropriate indexes and constraints.

Indexing and Rich Query

Document databases have advanced indexing capabilities, supporting different types of secondary indexes such as simple, compound, geospatial, unique, or full-text indexing.

Without proper indexes, MongoDB may suffer from poor performance, especially when working with large sets of data.

Indexes also enable complex queries on large amounts of data, making it a powerful datamodel.

Antipatterns

Not for Complex joins and relationships

If you deal with many complex relationships, a document database may not be the best choice.

Document databases such as MongoDB generally recommend embedding related data in a single document for one-to-one and one-to-many relationships, rather than referencing separate documents, unless there is a compelling reason not to do so.

Document databases were not designed to handle complex joins and relationships like relational databases, so even if you can model some relationships in a document data model, you will not have the same level of features and integrity.

Joining data from multiple collections can be resource-intensive, and join operations become more expensive as the data grows.

This can have a significant performance impact on the entire database, even in a highly scalable system like MongoDB.

Referential integrity

Furthermore, maintaining data consistency between related entities can be difficult in a document database, as there’s no enforced referential integrity, and changes to one document may not be reflected in other related documents.

Superpower 💪

Most intuitive

Document-oriented databases are a great fit for object-oriented programming.

They store data in formats such as JSON or BSON, which resemble the data structures used in object-oriented programming languages and allow easy integration between the database and application code with little translation.

This is not the case with relational databases, which have long struggled to integrate with object-oriented code.

Relational Database 👑

Core Properties

Mature and formalized datamodel

Relational databases have been the top choice for data storage for decades and remain popular despite the emergence of NoSQL alternatives.

They are especially prevalent in finance and e-commerce due to the following advantages.

Ubiquity of relationships

Most application data is relational.

Customers place orders, orders contain products, products are found in stores, and so on.

Furthermore, the relational model, with its tables, rows, and columns, provides a clear and straightforward way to model data, making it easy for developers to work with.

Normalization

To use relational databases, you must first model your data according to strict normalization rules.

This involves organizing data into separate tables to reduce redundancy and improve integrity.

It's like organizing your closet! However, data integrity is not always easy to achieve, especially when dealing with hundreds of concurrent transactions.
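A small sketch of what normalization buys, using Python's built-in `sqlite3` module with an in-memory database. The `customers` and `orders` tables and their contents are hypothetical. The customer's name is stored once and referenced by key, and the database itself rejects an order that points at a customer who does not exist.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        product TEXT NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Ada');
    INSERT INTO orders VALUES (10, 1, 'Keyboard'), (11, 1, 'Mouse');
""")

# The name is stored once; a join brings it back alongside each order.
rows = conn.execute("""
    SELECT c.name, o.product
    FROM orders o JOIN customers c ON c.id = o.customer_id
    ORDER BY o.id
""").fetchall()
print(rows)

# An order for a non-existent customer violates the foreign key.
try:
    conn.execute("INSERT INTO orders VALUES (12, 99, 'Monitor')")
except sqlite3.IntegrityError as error:
    print("rejected:", error)
```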

Antipatterns

Difficult to scale horizontally

Scaling a relational database horizontally can be difficult.

Solutions such as replication and sharding exist, but they add complexity in terms of infrastructure and administration.

Partitioning the data is necessary to scale a database, but this can break the relationships between tables, making it difficult to maintain data consistency and integrity.

If you need to store large amounts of less structured data, a NoSQL database may be more appropriate.

Superpower 💪

ACID

ACID guarantees are an important factor when choosing a database, and relational databases are still considered the best option for them thanks to their structure of tables and relationships.

While other database models may also support ACID properties, they may struggle to ensure consistency and isolation, especially in a distributed system.

The trade-off for strong consistency is the difficulty in scaling.
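Atomicity, the "A" in ACID, can be demonstrated on a single node with Python's built-in `sqlite3` module. The `accounts` table, the `transfer` helper, and the balances are hypothetical. The point of the sketch is that a transfer either applies both updates or neither.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(amount):
    """Move money from alice to bob as one all-or-nothing transaction."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = 'bob'", (amount,)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = 'alice'", (amount,)
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint failed, so nothing was applied


print(transfer(500))  # alice cannot cover this amount
print(dict(conn.execute("SELECT name, balance FROM accounts")))
```

The credit to bob runs first and succeeds, but it is rolled back when the debit from alice violates the constraint, so both balances are unchanged afterwards.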

Graph Database 🍇

Core Properties

No need to compute the relationships at query time

A graph database stores data as a connected graph, with nodes representing entities like tweets, users, and tags, and edges representing relationships like "follows" or "mentions".

To get the top 10 tags used in all messages by a certain user, a relational database would need to join the Tags and Tweets tables and build an intermediate result at query time.

However, in graph databases, relationships between nodes are stored directly on the nodes, so there's no need to compute relationships during query time.

This makes queries with densely connected data much faster and eliminates the need for expensive JOIN operations, making data maintenance easier.
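The top-tags query can be sketched with adjacency lists, where each node directly holds references to its neighbours. The sample users, tweets, and tags are hypothetical, and plain dictionaries stand in for the graph storage.

```python
from collections import Counter

# Relationships are stored with the nodes themselves (hypothetical sample data).
posted = {                       # user -> tweets they posted
    "alice": ["t1", "t2", "t3"],
    "bob": ["t4"],
}
tagged = {                       # tweet -> tags it uses
    "t1": ["python", "db"],
    "t2": ["db"],
    "t3": ["db", "graph"],
    "t4": ["python"],
}


def top_tags(user, limit=10):
    # Follow user -> tweet -> tag edges directly; no table-wide join is needed.
    counts = Counter(
        tag for tweet in posted.get(user, []) for tag in tagged.get(tweet, [])
    )
    return counts.most_common(limit)


print(top_tags("alice"))
```

The work done is proportional to the number of edges reachable from the user, not to the total number of tweets or tags in the system.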

Handles Complex Data Structures

This model is powerful enough to cover complex data structures. Neo4j was used to build a Knowledge Graph at NASA.

Managing and maintaining a graph database requires expertise, and learning the model can be challenging, especially with large and intricate graphs.

Be prepared to invest time and effort to get up to speed.

Antipatterns

Difficult to scale

Graph databases are difficult to scale beyond a single node.

When distributing the graph on multiple nodes, several factors need to be considered, such as distributing edges and balancing the graph data evenly.

Additionally, concerns about node failure, dynamic node addition/removal, data consistency, transactional integrity, maintaining the connected structure of the graph, and query performance across partitions must be addressed.

Not for Write-heavy workloads

Graph databases are optimized for querying relationships, but may not be ideal for write-heavy workloads, because writes may need to update multiple nodes in parallel.

Maintaining the graph structure across nodes adds overhead that slows down scaling and reduces write throughput, while also increasing the risk of data inconsistency and conflicts.

Consider using key-value or wide-column databases for write-heavy loads instead.

Graph databases can become unwieldy when dealing with complex relationships.

Be prepared to invest in hardware resources to use them effectively.

Superpower 💪

Multi-hop relationships

Graph databases have performance benefits, especially when dealing with complex, multi-hop relationships between entities.

In a data center scenario, for example, finding all switches and interfaces may require traversing several relationships.

A graph database can achieve this in a single traversal, making queries faster and more efficient compared to traditional relational databases.

The latter store relationships between entities as foreign keys in separate tables, requiring expensive join operations to traverse them.

This often results in slow and complex queries, especially when dealing with densely connected data.
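A multi-hop traversal of this kind can be sketched as a breadth-first search over an adjacency list. The topology below (one data center, two racks, two switches, three interfaces) and all node names are hypothetical, and the function is an illustration rather than how a graph database executes queries internally.

```python
from collections import deque

# Hypothetical data-center topology: node -> directly connected nodes.
edges = {
    "dc-1": ["rack-1", "rack-2"],
    "rack-1": ["switch-1"],
    "rack-2": ["switch-2"],
    "switch-1": ["if-1", "if-2"],
    "switch-2": ["if-3"],
}


def reachable(start):
    """Return every node reachable from start, however many hops away."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in edges.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    seen.discard(start)
    return seen


found = reachable("dc-1")
print(sorted(n for n in found if n.startswith(("switch-", "if-"))))
```

One traversal covers the data center to rack, rack to switch, and switch to interface hops. The relational equivalent would chain one join per hop.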