SQL databases: normalization vs. performance?

For a project, I was asked to look at an existing SQL database and to see if it could be improved. It was basically a customer database with a bunch of different types of data per customer. This is (basically) how it was organized:

Each customer had a row in the customer table with a customer ID. Then for each type of data, each customer had its own table. So, for instance, there would not be one central table for "jobs" with a customer ID in each row; instead, each customer would have their own jobs table called "jobs1234" (1234 being the customer ID).
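
To make the layout concrete, here is roughly what the two approaches look like in SQL (table and column names are my own invention for illustration, not the actual schema):

    -- Per-customer layout (what the existing database does):
    -- one jobs table per customer, named after the customer ID.
    CREATE TABLE jobs1234 (
        job_id      INT PRIMARY KEY,
        title       VARCHAR(100),
        created_at  DATETIME
    );

    -- Normalized layout: one central jobs table, with the
    -- customer ID as a column in each row.
    CREATE TABLE jobs (
        job_id      INT PRIMARY KEY,
        customer_id INT NOT NULL,
        title       VARCHAR(100),
        created_at  DATETIME,
        FOREIGN KEY (customer_id) REFERENCES customer (customer_id)
    );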

Now, my first response was confusion as to why you would organize it like that. I've always been taught that it's better to normalize, without really thinking beyond that point. But when I discussed it with people, a few pointed out it may have been for performance reasons. They said that if there were too many rows for "jobs", it would be better to have them split up per customer than to have them all in one table.

Something about indexing and the customer ID being the identifier. I'm confused as to why this approach would improve performance and haven't really gotten a very clear answer so far. Can anyone explain to me why that's the case and if it's even true that this approach is better in some cases?

asked Jul 27 '13 by vanamerongen
1 Answer

I find this statement rather shocking:

They said that if there were too many rows for "jobs", it would be better to have them split up per customer than to have them all in one table.

Databases are designed to have tables that have lots and lots of rows -- millions of rows should be no problem. You don't specify what the volume of data is, but with a name like jobs, I'd be surprised if the total volume exceeds a few million rows in total. For this volume of data, a single table with suitable indexes should be fine.
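
For example, an index that leads with the customer ID keeps per-customer lookups fast even when the table holds millions of rows (names here are illustrative):

    -- A composite index leading with customer_id lets MySQL
    -- jump straight to one customer's rows.
    CREATE INDEX idx_jobs_customer ON jobs (customer_id, created_at);

    -- This query then scans only the matching index range,
    -- not the whole table.
    SELECT *
    FROM jobs
    WHERE customer_id = 1234
    ORDER BY created_at DESC;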

There are cases where splitting data by customer would make sense. The strongest case is when it is an explicit requirement, typically for security reasons. In other words, the clients are promised that "their data is never mixed with anyone else's data". And, in most databases (MySQL included), it is easier to deal with security at the table level than at the row level.
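
To illustrate the security point: MySQL's GRANT works naturally at the table level, whereas restricting a user to their own rows in a shared table typically needs an extra layer such as a per-customer view (account, database, and table names below are made up):

    -- Table-level: trivial with per-customer tables.
    GRANT SELECT ON customerdb.jobs1234 TO 'client1234'@'%';

    -- Row-level: one common workaround is a view per customer
    -- over the shared table, with the grant on the view instead.
    CREATE VIEW customerdb.jobs_for_1234 AS
        SELECT * FROM customerdb.jobs WHERE customer_id = 1234;
    GRANT SELECT ON customerdb.jobs_for_1234 TO 'client1234'@'%';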

Another possible reason would be when the tables have different formats, reflecting different data for each customer. In this case, you would really be dealing with separate applications, and each customer should have their own database.

Are there any downsides to splitting the customer data into multiple tables per customer? Yes. Here are some:

  1. You cannot write generic queries/views to access the data. Basically, all queries in the code need to be dynamic, so you can put in the right table name (see the sketch after this list).
  2. Maintaining the data becomes cumbersome. Instead of updating a single table, you have to update multiple tables.
  3. Answering questions such as "How many jobs does each customer have?" or "What is the growth in the number of jobs over time?" becomes so difficult that people probably won't even bother asking them.
  4. Performance is a mixed bag. Although you might save the overhead of storing the customer id in each table, you incur another cost. Having lots of smaller tables means lots of tables with partially filled pages. Depending on the number of jobs per customer and number of overall customers, you might actually be multiplying the amount of space used. In the worst case of one job per customer where a page contains -- say -- 100 jobs, you would be multiplying the required space by about 100.
  5. The last point also applies to the page cache in memory. So, data in one table that would fit into memory might not fit into memory when split among many tables.
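
To illustrate points 1 and 3: a question like "how many jobs per customer" is a one-liner against a single table, but requires dynamically generated SQL against per-customer tables (a sketch only, with invented names):

    -- Single normalized table: one ordinary query.
    SELECT customer_id, COUNT(*) AS job_count
    FROM jobs
    GROUP BY customer_id;

    -- Per-customer tables: the table name itself has to be built
    -- at runtime, e.g. with a prepared statement per customer,
    -- and the results stitched together in application code.
    SET @sql = CONCAT('SELECT COUNT(*) FROM jobs', 1234);
    PREPARE stmt FROM @sql;
    EXECUTE stmt;
    DEALLOCATE PREPARE stmt;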

Partitioning is one way to implement something similar. However, this would work best when the query load is focused on one customer at a time. If all customers are accessing the data at the same time, then partitioning is going to be less of a win, and indexing should be sufficient.
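
If per-customer access really did become a bottleneck, MySQL's built-in partitioning gets a similar physical split without giving up the single logical table (a rough sketch; the partition count and columns are illustrative):

    -- One logical jobs table, physically split by customer ID.
    -- Note: MySQL requires the partitioning column to be part of
    -- every unique key, hence the composite primary key here.
    CREATE TABLE jobs (
        job_id      INT NOT NULL,
        customer_id INT NOT NULL,
        title       VARCHAR(100),
        PRIMARY KEY (job_id, customer_id)
    )
    PARTITION BY HASH (customer_id)
    PARTITIONS 16;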

Unless there is a really good reason for splitting the data into separate tables (a requirement, cumbersome security for each client, or custom formats for each client), you simply would not take that approach. Even when there are reasons for doing it, there are often other solutions (such as partitioning) that solve the same problem.

answered by Gordon Linoff