A while ago I got the chance to get lessons on NoSQL. I already had lessons and some experience with SQL in the past, so here I’d like to share in short my current knowledge about these two concepts. Note that I liberally use the word query in this article both for querying data (i.e. select), as well as changing data (i.e. insert, update, delete).
SQL
SQL is short for Structured Query Language, and is often pronounced “sequel”. It’s a language to query a so called Relational Database. When people talk about a SQL database, they mean a Relational Database that you can query using the SQL language. The database model is mostly aimed at being consistent. The database consists of tables with columns and rows. Columns tell you what type of data we have, and new data is added by adding rows. You can visualise it by thinking of spreadsheets. There can be relations between different tables, they are defined by saying what columns in different tables correspond with each other. There are so called one-to-one relations, one-to-many relations, and many-to-many relations. Typically you can group tables together in a so called schema. It’s not always useful to use this, so in many cases people just put all tables from the database in a default schema. In PostgreSQL the default schema is called public
.
To give a simplified example of how a SQL database can look;
Let’s say we have a social network platform with users who make posts and they can like posts. We want to store that in a SQL database. We can have a table USERS
, a table POSTS
, and a table LIKES
. The USERS
table can have a column username
. Maybe there’s also other columns, like profile
for profile information, email
and passwordhash
for authentication, etc.
Then we have the POSTS
table who needs a column content
, user
, and created_time
.
A table will generally have a so called “primary key”. This is a unique value that we can use to uniquely identify the correct row. Often this is a single field, but it’s possible to use a combination of fields to form a so called composite primary key. Let’s say we use username
as the primary key for USERS
. We currently don’t have a unique key for POSTS
, but you can always add an extra column for that, so we will add a column id
for POSTS
.
Finally we have the LIKES
table who needs the user who did the like in a user
column, and a post
column containing the id
of the post that was liked.
Columns also have a type and it’s possible to set constraints on columns like saying that they must be unique or can’t be NULL
.
A post is always made by a user, that’s a one-to-many relation between the USERS
and POSTS
table. We can define this relation by stating that POSTS.user
is a foreign key corresponding to USERS.username
. Meanwhile a like is made by a user, and corresponds to a post. These relations can be defined by saying that LIKES.user
is a foreign key corresponding to USERS.username
, and LIKES.post
is a foreign key corresponding to POSTS.id
. The primary key for the LIKES
table can be a composite primary key of user and liked post.
To create these, it could look something like
create table if not exists public.users(
username varchar primary key,
profile varchar
);
create table if not exists public.posts(
id serial primary key,
username varchar,
content varchar,
created_time timestamp default current_timestamp not null,
foreign key (username)
references public.users (username)
);
create table if not exists public.likes(
username varchar,
post serial,
foreign key (username)
references public.users (username),
foreign key (post)
references public.posts (id),
primary key (username, post)
);
NoSQL
Depending on implementation, a NoSQL database can also use the SQL language for querying. The difference is not so much the language to query the database, but rather the underlying technology. While the data is strongly defined in SQL by using columns and types, a NoSQL database doesn’t have that. It consists of so called containers that store JSON objects, and there is no check on the content of the JSON object. This JSON object is also called a document in this context, which is why people sometimes refer to this type of database as document storage. It’s not about documents like you have in a file system, but storage of JSON objects. In a way you can see the containers as being similar to SQL tables and the documents similar to the rows in SQL. You can also define a partition key per container. The partition key is a field that you expect in each of the JSON documents of the container, and allows the underlying database system to split the container up over different servers. This can reduce load when querying on a specific partition key.
If we take the previous example, and we decide that we generally want to fetch all the likes corresponding to a post, then the users can be put into a container users
, and posts and likes can be put together in another container post_actions
(feel free to come up with a better name). We have to make sure that the field we query on has the same name, so a post can look something like {post_id: 123, content: "blablabla", created_time: "2023-08-16 10:06:23+00:00", user: "Alice"}
while a user can look like {username: "Alice", profile: "Hey, I'm Alice, let's be friends <3"}
, and if Alice likes her own post, it could look like {user: "Alice", post_id: 123}
.
Now we can query select * from post_actions where post_id = 123
and we immediately have the post and all likes corresponding to it. And if we add a partition key on post_id
, or some function based on it, we won’t even need to query all partitions, only the one where this post is stored. When done right, this can be a very powerful setup.
Scaling
While SQL was mostly aimed at consistency, NoSQL is mostly aimed at horizontal scaling.
When we talk about vertical scaling, it often means that, to allow more load, you have to use a heavier server. More RAM, faster CPU, larger and faster disks… This is the case for a SQL database. Horizontal scaling on the other hand allows you to scale by adding more servers instead of having to make them heavier.
In NoSQL you will mostly group data according to how it’s queried, and, unlike with SQL who wants to be consistent, duplication of data is allowed. As we saw in the example, one simple select can often give you a full set of data of different types, and a partition key allows you to split up the container so that one container can be split up and stored across multiple servers, while only having to query the one partition, and thus spreading load over the different servers. Another reason why NoSQL can scale better, is because it does less. For example, it doesn’t check for correct data types or consistency. It’s up to the application to do that, and it’s up to the developer to make sure their application can properly scale.
SQL normalisation
When you have an application and you need a database, you need a way to represent the data correctly. There’s a whole study on how to properly do that. In SQL we call this process normalisation. Here’s a short list of steps to take, but note that for most applications these steps are not followed to the letter. Sometimes because a developer sees a reason to deviate from it, but a big reason is that frameworks take over a lot of the logic, and they may have reasons to do things in slightly different ways.
- For each form/page you have, list the data. Choose good names for each piece of data, group repeating parts (e.g. a profile with posts has the data of the profile, but posts can be grouped as repeating parts).
- Now we isolate the repeating parts in their own separate entity. One attribute should be the primary key of the parent (e.g. for posts, we require the username of the person who made the post as attribute) and you should find a second unique attribute. Together they can form the primary key.
- Check for attributes with composite primary keys for attributes who are not functionally dependent from the primary key. These should also be made their own entity with their own primary key. (E.g. login data can be considered functionally different from the username they post with)
- Bring attributes together who are functionally dependent from non-key attributes. These should also be brought into it’s own entity, and the attribute that identified this entity becomes the primary key.
- Integrate the normalised data groups. This is mostly choosing consistent naming for everything and group entities together where you’re talking about the same underlying data.
- Then we determine the relations between the entities. A relation in SQL is always 1-to-1, 1-to-many, or many-to-many. A user has multiple posts, so that’s a 1-to-many relation. In SQL we show this by making sure the primary key from the user is a foreign key attribute in the post. For a many-to-many relation, we add a new table with records containing the primary key from one entity and the primary key from the other.
Note that the example we used earlier was mostly to show some principles, and may not be a good example of a properly normalised model.
NoSQL modeling
While NoSQL is much newer, there are similar steps you can take to create a proper model of your data. Here it’s less important to force consistency, but more important to keep queries cheap. You can take the word cheap very literally. Each query takes cpu, memory, time… All of this costs money. As such, it’s possible to estimate how much a certain execution costs. This cost is what we want to optimise for. Here are steps you can take to model your database;
- look at the normalised (i.e. SQL) model, including the relations
- In a normalised model, we want to put things in different tables if they are functionally different. In NoSQL that isn’t the case. We want to group things according to what will be cheapest when querying. When we have a relation, we check if we should embed or reference
- Typical cases where you embed are 1:1 or 1:few relations who are read and updated together
- Referencing (i.e. put the data in different containers) is typically done with 1:many or many:many relations when you read or update separate
- Note that there may be cases where we both reference and embed, see later
- Choose a partition Key
- This must obviously be a name of a key in our document
- You should not have too many documents per partition and you should have a nice spread regarding both storage and request. When the partitions are not nicely balanced like that, you have a so called “hot partition”.
- Depending on technology, the maximum JSON object and partition size may both have a limit.
- Maybe counterintuitive, but there is no limit regarding the amount of partitions, so one document per partitions is OK. Many partitions like this is also referred to as having a “high cardinality”.
- What partition key to use, depends on the most important requests.
- If the data you generally select on is not part of the data, because you normally join, you can embed the data but also keep the data in a separate container, and make sure that updates propagate.
- When you don’t really have a key you query on, you can add a new key-value with fixed value, e.g. {type: “”}. This is OK, because it’s probably only intermediate, see next step.
- When different containers have the same partitionKey, move the data to one container. This step shows how completely different data can be stored in one container.
- Lastly, add fields to keep queries cheap. Expensive (e.g. cross partition) queries should only run rarely. Think once per week, rather than multiple times per day. This is typical for things like
count(*)
and you need to make sure your software updates these fields in the same transaction of the corresponding change.
I’m an expert now!
This is just a short introduction, and there are many peculiarities that you only learn to see and properly handle by experience. I think the important difference is that in SQL you try to optimise for consistency, while in NoSQL you try to optimise for keeping the queries cheap, even for huge amounts of data. If you optimise for the correct property, you should be fine. I think it’s also important to note that while you’re supposed to optimise for being cheap in NoSQL, it is therefor not always cheaper than SQL. NoSQL is good for huge amounts of data where vertical scaling becomes hard or impossible. But it also requires more work from your program to keep things consistent. Whether SQL or NoSQL makes more sense for your project, is something you’ll need to decide yourself.
Good luck!
Comments
No comments yet. Be the first to react!