SQL CASE vs JOIN efficiency

Tags:

sql-server

In my MS SQL Server database I am pulling transaction data based on a variety of different codes that are in one column.

Would it be more efficient to:

  • join the same table over and over for each code in a WHERE clause

  • do multiple case statements on the whole table (shown below)

  • do multiple case statements on the whole table but limit it by a WHERE Subsid_Cde IN ('AA','BA','BB', etc.) clause

We have so many queries running per second that even though I have tried all 3 methods I get no definitive results.

SELECT
    ISNULL(SUM(CASE WHEN Subsid_Cde = 'AA' THEN Trans_Amt END), 0) AS [AA],
    ISNULL(SUM(CASE WHEN Subsid_Cde = 'BA' THEN Trans_Amt END), 0) AS [BA],
    ISNULL(SUM(CASE WHEN Subsid_Cde = 'BB' THEN Trans_Amt END), 0) AS [BB]
FROM
    Transactions

--  There are 8 more rows like this, using a different code for each line
kd7iwp asked Jun 16 '09 18:06


4 Answers

If you're summing all possible (or most) values of the Subsid_Cde field, then CASE is faster, because it aggregates all the sums in a single pass instead of scanning the table multiple times. If you're only looking for a small subset of the possible Subsid_Cde values, then separate selects / joins (along with an index on Subsid_Cde) will work faster.

You need to learn to read execution plans; then you'll be able to figure such things out by yourself.
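On a busy server, one way to compare candidates is to look at logical reads and CPU time per statement rather than wall-clock time. The following is a minimal sketch, assuming the table and column names from the question; the filtered SUM is only a placeholder for whichever candidate query you are testing.

```sql
-- Report I/O and timing figures for each statement run in this session.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Placeholder candidate: swap in each variant you want to compare.
SELECT SUM(Trans_Amt) AS sum_aa
FROM   Transactions
WHERE  Subsid_Cde = 'AA';

-- Switch the reporting off again when done.
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

The figures appear in the Messages output. Logical reads tend to be a steadier basis for comparison than elapsed time when other queries are competing for the server.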

Alternatively, you could do a GROUP BY on Subsid_Cde wrapped in a PIVOT clause (search for PIVOT in SQL Server 2005).
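The GROUP BY part of that suggestion might look like the sketch below, assuming the table and column names from the question. It returns one row per code rather than one column per code, which is the shape that PIVOT then rotates into columns.

```sql
-- One pass over the matching rows, one output row per code.
SELECT   Subsid_Cde,
         SUM(Trans_Amt) AS Total_Amt
FROM     Transactions
WHERE    Subsid_Cde IN ('AA', 'BA', 'BB')
GROUP BY Subsid_Cde;
```

Note that a code with no matching transactions produces no row at all here, whereas the CASE form in the question would show 0 for it.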

Stop Putin Stop War answered Oct 02 '22 12:10


Option 3 is your best bet. It's easy to read, it's easy to modify later on, and it should use the indexes that you've defined and expect to be using (still, check).

Option 1: Sometimes you have to join to the same table. But this isn't one of those times, and joining every time you need to include a new Subsid_Cde makes for less readable SQL without really gaining anything.

Option 2: Transaction tables tend to grow very large, so you NEVER want to scan the entire table. So option 2 is definitely out, unless the codes you'll be using in your query give you back all of the rows anyway.
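As a rough sketch of option 3, assuming the table and column names from the question: the index name below is hypothetical, and whether the optimizer actually uses the index depends on how selective the codes are, so check the plan.

```sql
-- Hypothetical index; INCLUDE lets it cover the query without touching the base table.
CREATE NONCLUSTERED INDEX IX_Transactions_Subsid_Cde
    ON Transactions (Subsid_Cde)
    INCLUDE (Trans_Amt);

-- Option 3: conditional sums, restricted to the codes of interest.
SELECT
    ISNULL(SUM(CASE WHEN Subsid_Cde = 'AA' THEN Trans_Amt END), 0) AS [AA],
    ISNULL(SUM(CASE WHEN Subsid_Cde = 'BA' THEN Trans_Amt END), 0) AS [BA],
    ISNULL(SUM(CASE WHEN Subsid_Cde = 'BB' THEN Trans_Amt END), 0) AS [BB]
FROM
    Transactions
WHERE
    Subsid_Cde IN ('AA', 'BA', 'BB');
```

The IN list has to be kept in step with the CASE expressions; a code that appears in a CASE but not in the WHERE clause will always sum to 0.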

hythlodayr answered Oct 02 '22 13:10


Use this:

SELECT  (
        SELECT  SUM(Trans_Amt)
        FROM    Transactions
        WHERE   Subsid_Cde = 'AA'
        ) AS sum_aa,
        (
        SELECT  SUM(Trans_Amt)
        FROM    Transactions
        WHERE   Subsid_Cde = 'BB'
        ) AS sum_bb

Note that there is no outer FROM or WHERE clause.

In SQL Server 2005+, use this:

SELECT  [AA], [BB]
FROM    (
        SELECT  trans_amt, subsid_cde
        FROM    transactions
        ) q
PIVOT   (
        SUM(trans_amt)
        FOR subsid_cde IN ([AA], [BB])
        ) AS p
Quassnoi answered Oct 02 '22 13:10


There are some very important considerations when you perform a task like this.

  • Limitations of CASE
  • Ownership that comes with CASE

CASE is an expression and not a statement.

The code is going to re-evaluate 'everything relevant' for each WHEN.

In your instance of CASE, it's looking against a raw table and doesn't evaluate beyond one WHEN, so it isn't going to re-query. It's going to do one scan or seek, depending on your index, and return a result. It performs that for each CASE, but it would have to perform the same operation for the JOIN. The fact that your values are coded into the CASE means that the system doesn't have to search for the value to check.

That last point is huge because it has some pretty major implications for both the business and you as the coder.

Aaron Bertrand wrote a really great article that goes into detail about the "dirty secrets" of the CASE expression: https://sqlperformance.com/2014/06/t-sql-queries/dirty-secrets-of-the-case-expression

The summary is that CASE could potentially re-evaluate for each WHEN. That doesn't sound too bad, but if your query already scales poorly, your CASE solution won't be scalable either. If you use ANY FUNCTION that is re-evaluated at invocation (RAND() being a culprit), it is going to produce inconsistent results in some cases. It could also have a huge impact on your output if your query has a 'Big O notation that involves powers of N', which is a very technical way of saying it uses a Sort, Union, Distinct or other forms of implicit sort, or worse, is running an outdated plan without up-to-date statistics. This means CASE could potentially double or triple (or, as I like to call it, '(N)WHEN') the query time of a monster that is already suffering from an exponential iteration problem.
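The re-evaluation issue can be sketched as follows. This is an illustration along the lines of the linked article, not code from the question; the variable name `@n` is made up, and the inline `DECLARE ... =` initialisation assumes SQL Server 2008 or later.

```sql
-- A simple CASE is expanded into one comparison per WHEN, so the RAND()
-- expression may be evaluated again for each branch. The three branches can
-- all miss, in which case this returns 'other' even though the expression
-- only ever produces 1, 2 or 3.
SELECT CASE CONVERT(SMALLINT, 1 + RAND() * 3)
           WHEN 1 THEN 'one'
           WHEN 2 THEN 'two'
           WHEN 3 THEN 'three'
           ELSE 'other'
       END AS unstable_result;

-- Evaluating the expression once into a variable avoids the problem.
DECLARE @n SMALLINT = CONVERT(SMALLINT, 1 + RAND() * 3);

SELECT CASE @n
           WHEN 1 THEN 'one'
           WHEN 2 THEN 'two'
           WHEN 3 THEN 'three'
           ELSE 'other'
       END AS stable_result;
```

The CASE in the question compares a plain column against constants, so it is not exposed to this particular inconsistency.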

Your CASE is going to require 'TECHNICAL MAINTENANCE': whenever a new code is implemented, YOU are going to have to recode to factor that in. A JOIN is going to require 'USER MAINTENANCE': the user alters the lookup table, and your query is updated without you needing to take any ownership.

'TECHNICAL MAINTENANCE' puts YOU at RISK of becoming entangled in the consequences of business mistakes.

You always want to code your solutions to minimise your ownership of the data throughput. You don't want to be in court talking about your little slice of involvement because you touched business logic. You want to be at home with your family, or on vacation, without getting phone calls about the 'CC' code they just added and want summed in your report.

In the instance of CASE, you have coded yourself into a situation where you are performing business logic that the business should be deciding in a look up table that SHOULD be a JOIN.
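A lookup-table version might look like the sketch below. The `Subsid_Codes` table, its `Include_In_Report` column and the CHAR(2) type are all hypothetical; the question does not say whether such a table exists.

```sql
-- Hypothetical lookup table that the business maintains.
CREATE TABLE Subsid_Codes (
    Subsid_Cde        CHAR(2) NOT NULL PRIMARY KEY,
    Include_In_Report BIT     NOT NULL
);

-- One row per reportable code; a code added to the lookup table
-- shows up here without any change to the query.
SELECT     c.Subsid_Cde,
           ISNULL(SUM(t.Trans_Amt), 0) AS Total_Amt
FROM       Subsid_Codes AS c
LEFT JOIN  Transactions AS t
       ON  t.Subsid_Cde = c.Subsid_Cde
WHERE      c.Include_In_Report = 1
GROUP BY   c.Subsid_Cde;
```

The trade-off is that the totals come back as rows rather than as the fixed columns in the question, so the consuming report has to handle a variable number of codes.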

In short, when the business comes up with a new code, your report is not going to see it, and that is going to cost you production time, especially if you go on holiday and you are the only person who knows your CASE exists. You might not see it as much of an overhead, but you're not thinking long term. Memory gets hazy, and in the future you may find yourself searching for your CASE and then adding to it, which you won't thank yourself for.

To save 4 seconds, you're personally inconveniencing yourself.

Sacrificing yourself at the altar of speed is not good coding practice.

Izalias answered Oct 02 '22 14:10