
Why are joins bad when considering scalability?

Tags:

sql

join

Scalability is all about pre-computing (caching), spreading out, or paring down the repeated work to the bare essentials, in order to minimize resource use per work unit. To scale well, you don't do anything you don't need to in volume, and the things you actually do you make sure are done as efficiently as possible.

In that context, of course joining two separate data sources is relatively slow, at least compared to not joining them, because it's work you need to do live at the point where the user requests it.

But remember the alternative is no longer having two separate pieces of data at all; you have to put the two disparate data points in the same record. You can't combine two different pieces of data without a consequence somewhere, so make sure you understand the trade-off.
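To make that trade-off concrete, here is a minimal sketch (customers/orders are hypothetical tables chosen for illustration, not from the question; generic SQL):

-- Normalized: the customer name lives in one place and is read back with a join.
CREATE TABLE customers (
    customer_id   INT PRIMARY KEY,
    customer_name VARCHAR(100) NOT NULL
);

CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    customer_id INT NOT NULL REFERENCES customers (customer_id),
    order_total DECIMAL(10, 2) NOT NULL
);

SELECT  o.order_id, c.customer_name, o.order_total
FROM    orders o
JOIN    customers c
ON      c.customer_id = o.customer_id;

-- Denormalized alternative: no join at read time, but the name is duplicated
-- on every order row and must be kept in sync whenever it changes.
CREATE TABLE orders_denormalized (
    order_id      INT PRIMARY KEY,
    customer_id   INT NOT NULL,
    customer_name VARCHAR(100) NOT NULL,
    order_total   DECIMAL(10, 2) NOT NULL
);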

The good news is modern relational databases are good at joins. You shouldn't really think of joins as slow with a good database used well. There are a number of scalability-friendly ways to take raw joins and make them much faster:

  • Join on a surrogate key (autonumber/identity column) rather than a natural key. This means smaller (and therefore faster) comparisons during the join operation (see the sketch just after this list).
  • Indexes
  • Materialized/indexed views (think of this as a pre-computed join or managed de-normalization)
  • Computed columns. You can use this to hash or otherwise pre-compute the key columns of a join, such that what would be a complicated comparison for a join is now much smaller and potentially pre-indexed.
  • Table partitions (helps with large data sets by spreading the load out to multiple disks, or limiting what might have been a table scan down to a partition scan)
  • OLAP (pre-computes results of certain kinds of queries/joins. It's not quite true, but you can think of this as generic denormalization)
  • Replication, Availability Groups, Log shipping, or other mechanisms to let multiple servers answer read queries for the same database, and thus scale your workload out among several servers.
  • Use of a caching layer like Redis to avoid re-running queries which need complex joins.
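As a minimal sketch of the first two points (customers/accounts are hypothetical tables, not from the question; generic SQL):

-- Assuming customers has a surrogate integer key customer_id, and accounts
-- carries it in an account_customer column:
CREATE INDEX ix_accounts_customer ON accounts (account_customer);

SELECT  c.customer_id, a.account_id
FROM    customers c
JOIN    accounts a
ON      a.account_customer = c.customer_id;  -- int-to-int comparison, index-assisted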

I would go as far as saying the main reason relational databases exist at all is to allow you to do joins efficiently*. It's certainly not just to store structured data (you could do that with flat file constructs like CSV or XML). A few of the options I listed will even let you completely build your join in advance, so the results are already done before you issue the query — just as if you had denormalized the data (admittedly at the cost of slower write operations).
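One concrete form of "building the join in advance" is a materialized view. A hedged, Oracle-flavoured sketch (customers/accounts are hypothetical tables; other databases offer indexed views or similar features):

CREATE MATERIALIZED VIEW customer_accounts_mv
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
AS
SELECT  c.customer_id, c.customer_name, a.account_id
FROM    customers c
JOIN    accounts a
ON      a.account_customer = c.customer_id;

-- Readers query the pre-joined result; the join cost was paid at build/refresh time.
SELECT customer_name, account_id FROM customer_accounts_mv;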

If you have a slow join, you're probably not using your database correctly.

De-normalization should be done only after these other techniques have failed. And the only way you can truly judge "failure" is to set meaningful performance goals and measure against those goals. If you haven't measured, it's too soon to even think about de-normalization.

* That is, exist as entities distinct from mere collections of tables. An additional reason for a real rdbms is safe concurrent access.


Joins can be slower than avoiding them through de-normalisation, but if used correctly (joining on columns with appropriate indexes and so on) they are not inherently slow.

De-normalisation is one of many optimisation techniques you can consider if your well designed database schema exhibits performance problems.


The article says that they are slow when compared to the absence of joins. That absence can be achieved with denormalization, so there is a trade-off between speed and normalization. Don't forget about premature optimization, too :)


First of all, a relational database's raison d'etre (reason for being) is to be able to model relationships between entities. Joins are simply the mechanisms by which we traverse those relationships. They certainly do come at a nominal cost, but without joins, there really is no reason to have a relational database.

In the academic world we learn of things like the various normal forms (1st, 2nd, 3rd, Boyce-Codd, etc.), and we learn about different types of keys (primary, foreign, alternate, unique, etc.) and how these things fit together to design a database. And we learn the rudiments of SQL as well as manipulating both structure and data (DDL & DML).

In the corporate world, many of the academic constructs turn out to be substantially less viable than we had been led to believe. A perfect example is the notion of a primary key. Academically it is that attribute (or collection of attributes) that uniquely identifies one row in the table. So in many problem domains, the proper academic primary key is a composite of 3 or 4 attributes. However, almost everyone in the modern corporate world uses an auto-generated, sequential integer as a table's primary key. Why? Two reasons. The first is because it makes the model much cleaner when you're migrating FKs all over the place. The second, and most germane to this question, is that retrieving data through joins is faster and more efficient on a single integer than it is on 4 varchar columns (as already mentioned by a few folks).
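For illustration, a hedged sketch of that difference (order_lines is a hypothetical table; generic SQL):

-- Academic composite primary key: every child table must carry, and every
-- join must compare, all four columns.
CREATE TABLE order_lines_natural (
    company_code VARCHAR(10) NOT NULL,
    order_number VARCHAR(20) NOT NULL,
    line_number  INT         NOT NULL,
    product_code VARCHAR(20) NOT NULL,
    PRIMARY KEY (company_code, order_number, line_number, product_code)
);

-- Surrogate key: children carry, and joins compare, a single integer instead.
CREATE TABLE order_lines_surrogate (
    order_line_id INT PRIMARY KEY,        -- auto-generated in practice
    company_code  VARCHAR(10) NOT NULL,
    order_number  VARCHAR(20) NOT NULL,
    line_number   INT         NOT NULL,
    product_code  VARCHAR(20) NOT NULL
);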

Let's dig a little deeper now into two specific subtypes of real world databases. The first type is a transactional database. This is the basis for many e-commerce or content management applications driving modern sites. With a transactional DB, you're optimizing heavily toward transaction throughput. Most commerce or content apps have to balance query performance (from certain tables) with insert performance (in other tables), though each app will have its own unique business-driven issues to solve.

The second type of real world database is a reporting database. These are used almost exclusively to aggregate business data and to generate meaningful business reports. They are typically shaped differently than the transaction databases where the data is generated and they are highly optimized for speed of bulk data loading (ETLs) and query performance with large or complex data sets.

In each case, the developer or DBA needs to carefully balance both the functionality and performance curves, and there are lots of performance enhancing tricks on both sides of the equation. In Oracle you can do what's called an "explain plan" so you can see specifically how a query gets parsed and executed. You're looking to maximize the DB's proper use of indexes. One really nasty no-no is to put a function in the where clause of a query. Whenever you do that, you guarantee that Oracle will not use any indexes on that particular column and you'll likely see a full or partial table scan in the explain plan. That's just one specific example of how a query could be written that ends up being slow, and it doesn't have anything to do with joins.
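For example, a hedged Oracle-flavoured sketch (orders is a hypothetical table; the actual plan depends on your schema and statistics):

-- Wrapping the indexed column in a function means a plain index on order_date
-- cannot be used, so a full (or partial) table scan is likely.
SELECT  order_id
FROM    orders
WHERE   TRUNC(order_date) = DATE '2024-01-01';

-- Sargable rewrite: compare the raw column against a range instead.
SELECT  order_id
FROM    orders
WHERE   order_date >= DATE '2024-01-01'
AND     order_date <  DATE '2024-01-02';

-- Inspect how Oracle intends to execute a statement:
EXPLAIN PLAN FOR
SELECT order_id FROM orders WHERE TRUNC(order_date) = DATE '2024-01-01';

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);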

And while we're talking about table scans, they obviously impact the query speed proportionally to the size of the table. A full table scan of 100 rows isn't even noticeable. Run that same query on a table with 100 million rows, and you'll need to come back next week for the return.

Let's talk about normalization for a minute. This is another largely positive academic topic that can get over-stressed. Most of the time when we talk about normalization we really mean the elimination of duplicate data by putting it into its own table and migrating an FK. Folks usually skip over the whole dependence thing described by 2NF and 3NF. And yet in an extreme case, it's certainly possible to have a perfect BCNF database that's enormous and a complete beast to write code against because it's so normalized.

So where do we balance? There is no single best answer. All of the better answers tend to be some compromise between ease of structure maintenance, ease of data maintenance and ease of code creation/maintenance. In general, the less duplication of data, the better.

So why are joins sometimes slow? Sometimes it's bad relational design. Sometimes it's ineffective indexing. Sometimes it's a data volume issue. Sometimes it's a horribly written query.

Sorry for such a long-winded answer, but I felt compelled to provide a meatier context around my comments rather than just rattle off a 4-bullet response.


People with terabyte-sized databases still use joins; if they can get them to work performance-wise, then so can you.

There are many reasons not to denormalize. First, speed of select queries is not the only or even the main concern with databases. Integrity of the data is the first concern. If you denormalize, then you have to put techniques in place to keep the data consistent as the parent data changes. So suppose you take to storing the client name in all tables instead of joining to the client table on the client_Id. Now when the name of the client changes (100% chance some of the names of clients will change over time), you need to update all the child records to reflect that change. If you do this with a cascade update and you have a million child records, how fast do you suppose that is going to be, and how many users are going to suffer locking issues and delays in their work while it happens? Further, most people who denormalize because "joins are slow" don't know enough about databases to properly make sure their data integrity is protected, and often end up with databases that have unusable data because the integrity is so bad.
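A hedged sketch of that maintenance cost (clients/orders are hypothetical tables; in the normalized design the name would not exist on orders at all):

-- Normalized: a rename touches exactly one row.
UPDATE clients
SET    client_name = 'New Name Ltd'
WHERE  client_id = 42;

-- Denormalized: the same rename must also be pushed into every child row,
-- potentially rewriting and locking millions of records.
UPDATE orders
SET    client_name = 'New Name Ltd'
WHERE  client_id = 42;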

Denormalization is a complex process that requires a thorough understanding of database performance and integrity if it is to be done correctly. Do not attempt to denormalize unless you have such expertise on staff.

Joins are quite fast enough if you do several things. First, use a surrogate key; an int join is almost always the fastest join. Second, always index the foreign key. Use derived tables or join conditions to create a smaller dataset to filter on (see the sketch below). If you have a large, very complex database, then hire a professional database person with experience in partitioning and managing huge databases. There are plenty of techniques to improve performance without getting rid of joins.
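A hedged sketch of the derived-table point (orders/order_lines are hypothetical tables; generic SQL):

SELECT  o.order_id, ol.line_total
FROM    (SELECT order_id
         FROM   orders
         WHERE  order_date >= DATE '2024-01-01') o   -- shrink one side first
JOIN    order_lines ol
ON      ol.order_id = o.order_id;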

If you just need query capability, then yes, you can design a data warehouse, which can be denormalized and is populated through an ETL tool (optimized for speed), not user data entry.


Joins are slow if

  • the data is improperly indexed
  • the results are poorly filtered
  • the joining query is poorly written
  • the data sets are very large and complex

So, true, the bigger your data sets, the more processing you'll need for a query, but checking and working on the first three options above will often yield great results.

Your source gives denormalization as an option. This is fine only as long as you've exhausted better alternatives.


Joins can be slow if large portions of the records on each side need to be scanned.

Like this:

SELECT  SUM(transaction)
FROM    customers
JOIN    accounts
ON      account_customer = customer_id

Even if an index is defined on account_customer, all records from the latter still need to be scanned.

For a query like this, a decent optimizer probably won't even consider the index access path, doing a HASH JOIN or a MERGE JOIN instead.

Note that for a query like this:

SELECT  SUM(transaction)
FROM    customers
JOIN    accounts
ON      account_customer = customer_id
WHERE   customer_last_name = 'Stellphlug'

the join will most probably be fast: first, an index on customer_last_name will be used to filter out all the Stellphlugs (who are, of course, not very numerous), then an index scan on account_customer will be issued for each Stellphlug to find their transactions.

Despite the fact that there can be billions of records in accounts and customers, only a few will actually need to be scanned.