 

Where to do joins - in the database server or in the application server?

I am facing a performance issue at the moment (it could become a scaling issue later on). The application I am working on is quite complex and it's running on SQL Server 2005. I need to join 6 - 7 tables to get the desired data. Each table contains more than 100,000 rows of data so far. The database schema cannot be changed (must stay as is), so I can only try to optimize as much as possible. Two things come to mind:

  • Try not to join in the database and let the application server do the filtering using LINQ:

    • Pros: will be able to scale easily by adding more app servers.
    • Cons: more effort; I am not sure whether responsiveness will suffer.
  • Keep the application server as is and optimize the SQL query as much as possible (more indexes, rebuild indexes frequently, etc.; a rough sketch follows this list):

    • Pros: minimum effort
    • Cons: when the tables grow larger, the problem will come back again
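
For the second option, I am thinking of something along these lines (table and index names are made up; the schema itself stays as is, but indexes can be added and rebuilt):

```sql
-- Hypothetical index on a join column of one of the large tables.
CREATE NONCLUSTERED INDEX IX_BigTable_JoinKey
    ON dbo.BigTable (JoinKey);

-- Periodic maintenance ("rebuild index frequently").
ALTER INDEX IX_BigTable_JoinKey ON dbo.BigTable REBUILD;
```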

Basically, caching isn't a solution for me at the moment (hardware issues, hosting issues, etc.), which is why I didn't bring it up originally. But I do know the benefits caching would bring, and I have used it many times.

asked Dec 22 '22 by Jeff

2 Answers

Generally speaking, do the joining in the DBMS. If you do it in the application server, you are betting that you can do a better job of optimizing joins than the people who wrote the DBMS, and (further) that you can out-perform their best efforts by enough to offset the cost of transferring the unjoined data across the wire.

Now, if you are going to do a cross-product of two wide tables (let's say that they are T1, with N1 rows of width W1, and T2 with N2 rows of width W2) with no filtering, then the DBMS is obliged to create and send N1 * N2 * (W1 + W2) bytes of data over the wire, whereas you could suck down the tables separately as N1 * W1 + N2 * W2 bytes of data. If N1 = N2 = 1M and W1 = W2 = 100, then that's 200 TB vs 200 MB of data transfer in favour of doing the cross-product in the app server. But that's not exactly fair to the DBMS. Most queries are not that silly - they join on columns and apply conditions, and the DBMS optimizer will struggle mightily (and automatically) to minimize the work done. Further, it will only send the pertinent data back to you; it does not have to send all the rows that don't match your criteria.
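
To put that contrast in query form (table and column names here are hypothetical, purely for illustration):

```sql
-- Worst case: no join condition, so the server must build and ship
-- N1 * N2 rows of width (W1 + W2) -- the 200 TB scenario above.
SELECT t1.*, t2.*
FROM T1 AS t1
CROSS JOIN T2 AS t2;

-- Typical query: join on a column and filter, so only the matching
-- rows ever cross the wire.
SELECT t1.col_a, t2.col_b
FROM T1 AS t1
JOIN T2 AS t2 ON t2.t1_id = t1.id
WHERE t2.some_flag = 1;
```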

To show an alternative scenario (in favour of the DBMS), consider a case where T1 has N1 = 1M rows of width W1 = 100, but T2 has N2 = 100K rows of width W2 = 50. There is a join between the two tables on an integer column, so there are 10 rows in T1 for each one in T2. Suppose that you suck down all of T1 and T2 to the app server: that requires N1 * W1 + N2 * W2 = 105 MB of data. But the filter conditions limit the result to 1/10 of the rows in T2, and for each qualifying T2 row, only 2 of its 10 matching T1 rows satisfy the filter conditions. Now the DBMS only has to transfer N2 * (W1 + W2) / 5 = 3 MB, a saving of over 100 MB of data transfer compared with pulling the tables down yourself. Now, if you manage to be clever and download just the N2 * W2 / 10 = 500 KB of data that corresponds to the qualifying values in T2, you still have to get the DBMS to do a 'semi-join' of T1 against those values to get the right rows of T1 to the app server. If you only need a subset of the columns, there can be another set of savings. And DBMSs tend to have rather clever sort packages; you'll need a good sort package in your app server to present data in the correct order.
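
A minimal sketch of that 'semi-join', again with hypothetical table and column names, would look something like this:

```sql
-- Fetch only the T1 rows (and only the columns the app needs) whose key
-- appears in the filtered subset of T2, instead of pulling all of T1.
SELECT t1.id, t1.col_a
FROM T1 AS t1
WHERE EXISTS (
    SELECT 1
    FROM T2 AS t2
    WHERE t2.t1_id = t1.id
      AND t2.some_flag = 1   -- the filter that keeps 1/10 of T2
);
```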

It should normally be a hands-down win for joins in the DBMS. If it isn't, it is because you are asking the server to do more work than it can handle. In that case, you need to look at whether replicating the database server makes sense, or whether adding more cores, or more network bandwidth, or more main memory will do the job.

answered Dec 27 '22 by Jonathan Leffler

In general, I consider a number of points when talking about scale:

  1. How often does it get executed? For less frequently accessed queries you may be able to accept some degradation in performance.

  2. What's the rate of growth / change? If the records in some of those tables are relatively static, you may want to consider caching the contents externally in a dbm-type file (or whatever the Windows equivalent is). There are also things like memcache that might be worth looking at. This may or may not be possible, though, and it is predicated on performing "joins" in application code.

  3. Profile. If you're joining on indexed columns (and you are, aren't you?), you're not necessarily going to degrade as the number of rows grows. This is going to depend heavily on whether you're dealing with 1:1 or 1:N relationships, what the average size of N is, how much memory you have available on the database server, how often your table statistics are computed, and the types of columns and indexes. If you're dealing with a 1:1 relationship and it's unique, the database is going to be able to do a simple hash lookup. (A hypothetical index sketch follows this list.)
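
As a rough illustration of point 3, with invented table names, supporting indexes on the join columns plus fresh statistics are usually the first things to check:

```sql
-- Indexes on the join columns let the optimizer seek instead of scan.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
    ON dbo.Orders (CustomerId);

CREATE NONCLUSTERED INDEX IX_OrderLines_OrderId
    ON dbo.OrderLines (OrderId);

-- Fresh statistics help the optimizer estimate row counts sensibly.
UPDATE STATISTICS dbo.Orders;
UPDATE STATISTICS dbo.OrderLines;
```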

Make sure you limit the columns fetched to no more than what you actually need, especially when joining many tables, because if everything a query requires from a table is held in its indexes, the database may not need to touch the table at all; the join can be performed using just the indexes. This reduces contention and improves the performance of less optimal queries that do need to read the actual table contents, because fewer queries are pulling on the table.
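On SQL Server 2005 and later, one way to get this effect is a "covering" index with INCLUDE columns; here is a rough sketch with made-up names:

```sql
-- If the index carries every column the query touches from OrderLines,
-- the join can be answered from the index without reading the table.
CREATE NONCLUSTERED INDEX IX_OrderLines_OrderId_Covering
    ON dbo.OrderLines (OrderId)
    INCLUDE (Quantity, UnitPrice);

-- Select only the columns you actually need so the index can cover them.
SELECT o.OrderDate, ol.Quantity, ol.UnitPrice
FROM dbo.Orders AS o
JOIN dbo.OrderLines AS ol ON ol.OrderId = o.OrderId
WHERE o.OrderDate >= '20220101';
```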

All relational databases have a tool or feature to view the query execution plan for a given query. Use it. If the output doesn't make sense to you, learn it. This is your primary window into understanding what the database will do with a given query: which indexes will be used, the estimated (or actual) number of rows encountered at each step of execution, and other fun stuff.
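
In SQL Server, besides the graphical plan in Management Studio, the SHOWPLAN and STATISTICS session options give you the same information from plain T-SQL (the query below is hypothetical):

```sql
-- Estimated plan as XML, without executing the query.
SET SHOWPLAN_XML ON;
GO
SELECT o.OrderDate, ol.Quantity
FROM dbo.Orders AS o
JOIN dbo.OrderLines AS ol ON ol.OrderId = o.OrderId;
GO
SET SHOWPLAN_XML OFF;
GO

-- Actual I/O and timing figures while the query runs.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
```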

Once you have information about what the query optimizer is actually doing with the query, and you have your indexes / statistics / column selection straight, you'll have a better idea of where to go from there. If you've done all you can in the database, you'll have to look at caching the data and doing things like hitting fewer tables with a more specific / better WHERE clause.

Disclaimer: I have no direct experience with SQL Server, but I have a lot of experience on other RDBMSs (Oracle, MySQL, PostgreSQL, etc.) and architecture in general.

answered Dec 27 '22 by Eric Sammer