Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Slow performing SQL query with triple self-join

I have a legacy database with the following table (note: no primary key)

It defines each a record for each accommodation "unit" and date, and the price for that date.

CREATE TABLE [single_date_availability](
    [accommodation_id] [int],
    [accommodation_unit_id] [int],
    [arrival_date] [datetime],
    [price] [decimal](18, 0),
    [offer_discount] [decimal](18, 0),
    [num_pax] [int],
    [rooms_remaining] [int],
    [eta_available] [int],
    [date_correct] [datetime],
    [max_occupancy] [int],
    [max_adults] [int],
    [min_stay_nights] [int],
    [max_stay_nights] [int],
    [nights_remaining_count] [numeric](2, 0)
) ON [PRIMARY]

The table contains roughly 16,500 records.

But I need to multiply out the data in a completely different format, like such:

  • Accommodation
  • Date
  • Duration
  • Total price

Up to a max duration for each arrival date.

I'm using the following query to achieve this:

SELECT
    MIN(units.MaxAccommodationAvailabilityPax) AS MaxAccommodationAvailabilityPax,
    MIN(units.MaxAccommodationAvailabilityAdults) AS MaxAccommodationAvailabilityAdults,
    StartDate AS DepartureDate,
    EndDate AS ReturnDate,
    DATEDIFF(DAY, StartDate, EndDate) AS Duration,
    MIN(units.accommodation_id) AS AccommodationID, 
    x.accommodation_unit_id AS AccommodationUnitID,
    SUM(Price) AS Price,
    MAX(num_pax) AS Occupancy,
    SUM(offer_discount) AS OfferSaving,
    MIN(date_correct) AS DateTimeCorrect,
    MIN(rooms_remaining) AS RoomsRemaining,
    MIN(CONVERT(int, dbo.IsGreaterThan(ISNULL(eta_available, 0)+ISNULL(nights_remaining_count, 0), 0))) AS EtaAvailable
FROM single_date_availability fp
INNER JOIN (
    /* This gets max availability for the whole accommodation on the arrival date */
    SELECT accommodation_id, arrival_date,
        CASE EtaAvailable WHEN 1 THEN 99 ELSE MaxAccommodationAvailabilityPax END AS MaxAccommodationAvailabilityPax,
        CASE EtaAvailable WHEN 1 THEN 99 ELSE MaxAccommodationAvailabilityAdults END AS MaxAccommodationAvailabilityAdults
    FROM (SELECT accommodation_id, arrival_date, SUM(MaximumOccupancy) MaxAccommodationAvailabilityPax, SUM(MaximumAdults) MaxAccommodationAvailabilityAdults,
            CONVERT(int, WebData.dbo.IsGreaterThan(SUM(EtaAvailable), -1)) AS EtaAvailable                 
            FROM (SELECT accommodation_id, arrival_date, MIN(rooms_remaining*max_occupancy) as MaximumOccupancy,
                    MIN(rooms_remaining*max_adults) as MaximumAdults, MIN(ISNULL(eta_available, 0) + ISNULL(nights_remaining_count, 0) - 1) as EtaAvailable
                    FROM single_date_availability
                    GROUP BY accommodation_id, accommodation_unit_id, arrival_date) a 
            GROUP BY accommodation_id, arrival_date) b
) units ON fp.accommodation_id = units.accommodation_id AND fp.arrival_date = units.arrival_date
INNER JOIN (
    /* This gets every combination of StartDate and EndDate for each Unit/Occupancy */
    SELECT DISTINCT a.accommodation_unit_id, StartDate = a.arrival_date,
        EndDate = b.arrival_date+1, Duration = DATEDIFF(DAY, a.arrival_date, b.arrival_date)+1
        FROM single_date_availability AS a
        INNER JOIN (SELECT accommodation_unit_id, arrival_date FROM single_date_availability) AS b
        ON a.accommodation_unit_id = b.accommodation_unit_id
            AND DATEDIFF(DAY, a.arrival_date, b.arrival_date)+1 >= a.min_stay_nights
            AND DATEDIFF(DAY, a.arrival_date, b.arrival_date)+1 <= (CASE a.max_stay_nights WHEN 0 THEN 28 ELSE a.max_stay_nights END)
) x ON fp.accommodation_unit_id = x.accommodation_unit_id AND fp.arrival_date >= x.StartDate AND fp.arrival_date < x.EndDate
GROUP BY x.accommodation_unit_id, StartDate, EndDate
/* This ensures that all dates between StartDate and EndDate are actually available */
HAVING COUNT(*) = DATEDIFF(DAY, StartDate, EndDate)

This works and gives me about 413,000 records. The results of this query I'm using to update another table.

But the query performs quite badly, as you might expect with so many self-joins. It takes about 15 secs to run locally, but on our test server takes over 1:30 mins, and on our live SQL server takes over 30 secs; and in all cases it maxes out the CPU while it's performing the larger of the joins.

No other processes are accessing the table at the same time, and that can be assumed.

I don't really mind the length of the query so much as the demand on the CPU, which can cause problems for other queries trying to access other databases / tables at the same time.

I have run the query through query optimizer and followed all the recommendations for indexes and statistics.

Any help on making this query faster or at least less CPU intensive would be much appreciated. If it needs to be broken down into different stages, that's acceptable.

To be honest speed is not so important as it's a bulk operation performed on a table that's not being touched by other processes.

I'm not particularly looking for comments on how terrible and un-normalized this structure is... that, I already know :-)

like image 495
Chris Haines Avatar asked Nov 03 '10 14:11

Chris Haines


People also ask

Do joins make queries slow?

Joins: If your query joins two tables in a way that substantially increases the row count of the result set, your query is likely to be slow. There's an example of this in the subqueries lesson. Aggregations: Combining multiple rows to produce a result requires more computation than simply retrieving those rows.

Does join affect query performance?

The order in which the tables in your queries are joined can have a dramatic effect on how the query performs. If your query happens to join all the large tables first and then joins to a smaller table later this can cause a lot of unnecessary processing by the SQL engine.

Which is the fastest join in SQL?

Includes the matching rows as well as some of the non-matching rows between the two tables. In case there are a large number of rows in the tables and there is an index to use, INNER JOIN is generally faster than OUTER JOIN.


2 Answers

This site is for professional programmers, right.

It is stultifying to try and operate on a "table" without a primary key. Fine, it is a workspace, not a real table (but it is large, and you are trying to perform relational table operations on it). Fine, you know it is unnormalised. Actually the database is unnormalised, and this "table" is a product of it: an exponential unnormalised product.

This works and gives me about 413,000 records. The results of this query I'm using to update another table.

That is even more crazy. All this (a) temp worktables an (b) temp worktables for the temp worktables business are classic symptoms of an unnormalised database. OR inability to understand the data as it is, how to get the data out, and creating unnecessary worktables to supply your need. I am not trying to get you to change that, which would be the first option, and which would eliminate the need for this entire mess.

The second option would be, see if you can produce the final result from the original tables, either:
- using no worktables
- using one worktable
instead of the two worktables (16,500 and 413,000 "records"; that's two levels of exponential unnormalisation)

The third option is, improve the mess you have ... but first you need to understand where the performance hogs are ...

But the query performs quite badly, as you might expect with so many self-joins

Nonsense, joins and self-joins cost nothing. The problems are, the cost is in:

  • you are operating on a Heap

  • without a PK

    • those two item alone mean performance has not been considered and cannot be expected
  • using operators and functions (rather than pure "=") in joins means the server cannot make reasonable decisions on the search values, so you are table scanning all the time

  • table size (maybe different on Dev/Test/Prod)

  • valid, useable indices (or not)

  • the cost is in those four items, the heaps being brutishly slow in every aspect, and the operators not identifying anything to narrow the searches; not the fact there is or is not a join operation.

The next series of issues is the way you are doing it.

  • Do you NOT realise that the "joins" are materialised tables; you are not "joining" you are materialising TABLES on the fly ??? Nothing is free: materialisation has an enormous cost. You are so focused on materialising without any idea of the cost, that you think the joins are the problem. Why is that ?

  • Before you can make any reasonable coding decisions, you need to SET SHOWPLAN and STATISTICS IO ON. Do this while you are developing (it is nowhere near ready for "testing"). That will give you an idea of the tables; the joins (what you expect vs what it determined, from the mess); the worktables (materialised). The high CPU usage is nothing, wait until you see the insane I/O your code uses. If you want to argue about the cost of materialising on the fly, be my guest, but post the SHOWPLAN first.

  • note that the materialised tables have no indices, so it table scans every time, for the um "joins".

  • The select as is, is doing tens of times (maybe hundreds) more work than it needs to. Since the table is there, and it has not moved, materialising another version of it is a very silly thing to do. So, the true question is:

How Come My SQL query with One table and Six Materialised Versions of Itself is Slow ?

.
In case you are not sure, that means eliminate the six materialised tables and replace them with pure joins to the main table.

  • If you can accept breaking it up, then do so. Create and load temp tables that this query is going to use FIRST (that means 3 temp tables for aggregates only). Make sure you place indices on the correct columns.

  • So the 6 materialised tables with be replaced with 3 joins to the main table, and 3 joins to temp aggregate tables.

  • Somewhere along the line, you have identified that you have cartesian products and duplicates; instead of fixing the cause (developing code that produces the set you need) you have avoided all that, left it full of dupes, and pulled out the DISTINCT rows. That causes an additional worktable. Fix that. You have to get each of the temp tables (worktables, materialised tables, whatever) correct FIRST, before the select that uses them can be reasonably expected to be correct.

  • THEN try the select.

  • I presume this is all running in WebData. If not, place IsGreaterThan() in this db.


  1. Please provide DDL for UDF IsGreaterThan. If that is using tables, we need to know about it.

  2. Please provide the alleged Indices with the CREATE TABLE statement. They could be incorrect or worse, doubled up and not required.

  3. Forget the Identity or forced values, what is the actual, real, natural, logical PK for this heap of a worktable ?

  4. Ensure you have no datatype mismatches on the join columns

  5. Personally, I would be too ashamed to post code such as you have. It is completely unreadbable. All I did, in order to identify the problems here, is format it, and make it readable. There are reasons for making code readable, such as, it allows you to spot problems quickly. It doesn't matter what formatting you use, but you have to format, and you have to do it consistently. Please clean it up before you post again, along with ALL related DDL.

It is no wonder that you have not been getting answers. You need to do some basic work first (showplan, etc) and prepare the code so that human beings can read it, so that they can provide answers.

SELECT
        MIN(units.MaxAccommodationAvailabilityPax) AS MaxAccommodationAvailabilityPax,
        MIN(units.MaxAccommodationAvailabilityAdults) AS MaxAccommodationAvailabilityAdults,
        StartDate AS DepartureDate,
        EndDate AS ReturnDate,
        DATEDIFF(DAY, StartDate, EndDate) AS Duration,
        MIN(units.accommodation_id) AS AccommodationID, 
        x.accommodation_unit_id AS AccommodationUnitID,
        SUM(Price) AS Price,
        MAX(num_pax) AS Occupancy,
        SUM(offer_discount) AS OfferSaving,
        MIN(date_correct) AS DateTimeCorrect,
        MIN(rooms_remaining) AS RoomsRemaining,
        MIN(CONVERT(int, dbo.IsGreaterThan(ISNULL(eta_available, 0)+ISNULL(nights_remaining_count, 0), 0))) 
            AS EtaAvailable
    FROM single_date_availability fp INNER JOIN (
        -- This gets max availability for the whole accommodation on the arrival date
        SELECT  accommodation_id, arrival_date,
                CASE EtaAvailable 
                    WHEN 1 THEN 99
                    ELSE MaxAccommodationAvailabilityPax 
                    END AS MaxAccommodationAvailabilityPax,
                CASE EtaAvailable
                    WHEN 1 THEN 99
                    ELSE MaxAccommodationAvailabilityAdults
                    END AS MaxAccommodationAvailabilityAdults
            FROM ( 
                SELECT  accommodation_id, arrival_date,
                        SUM(MaximumOccupancy) 
                        MaxAccommodationAvailabilityPax,
                        SUM(MaximumAdults) MaxAccommodationAvailabilityAdults,
                        CONVERT(int, WebData.dbo.IsGreaterThan(SUM(EtaAvailable), -1))
                            AS EtaAvailable                 
                    FROM ( 
                        SELECT  accommodation_id,
                                arrival_date,
                                MIN(rooms_remaining*max_occupancy) as MaximumOccupancy,
                                MIN(rooms_remaining*max_adults) as MaximumAdults, 
                                MIN(ISNULL(eta_available, 0) + ISNULL(nights_remaining_count, 0) - 1)
                                    as EtaAvailable
                            FROM single_date_availability
                            GROUP BY accommodation_id, accommodation_unit_id, arrival_date
                            ) a 
                    GROUP BY accommodation_id, arrival_date
                    ) b
            ) units 
        ON fp.accommodation_id = units.accommodation_id 
        AND fp.arrival_date = units.arrival_date INNER JOIN (
            -- This gets every combination of StartDate and EndDate for each Unit/Occupancy
            SELECT  D.I.S.T.I.N.C.T a.accommodation_unit_id,
                    StartDate = a.arrival_date,
                    EndDate = b.arrival_date+1,
                    Duration = DATEDIFF(DAY, a.arrival_date, b.arrival_date)+1
                FROM single_date_availability AS a INNER JOIN ( 
                    SELECT  accommodation_unit_id,
                            arrival_date 
                        FROM single_date_availability
                        ) AS b
                ON a.accommodation_unit_id = b.accommodation_unit_id
                AND DATEDIFF(DAY, a.arrival_date, b.arrival_date)+1 >= a.min_stay_nights
                AND DATEDIFF(DAY, a.arrival_date, b.arrival_date)+1 <= (
                    CASE a.max_stay_nights 
                        WHEN 0 THEN 28 
                        ELSE a.max_stay_nights 
                        END
                )
        ) x ON fp.accommodation_unit_id = x.accommodation_unit_id 
        AND fp.arrival_date >= x.StartDate 
        AND fp.arrival_date < x.EndDate
    GROUP BY x.accommodation_unit_id, StartDate, EndDate
    -- This ensures that all dates between StartDate and EndDate are actually available
    HAVING COUNT(*) = DATEDIFF(DAY, StartDate, EndDate)
like image 105
PerformanceDBA Avatar answered Nov 02 '22 04:11

PerformanceDBA


this most likely won't fix all of your issues, but try switching

AND DATEDIFF(DAY , a.arrival_date , b.arrival_date) + 1 >= a.min_stay_nights
AND DATEDIFF(DAY , a.arrival_date , b.arrival_date) + 1 <= (CASE a.max_stay_nights WHEN 0 THEN 28 ELSE a.max_stay_nights END)

to

and a.min_stay_nights<=DATEDIFF(DAY , a.arrival_date , b.arrival_date)
and (CASE a.max_stay_nights WHEN 0 THEN 28 ELSE a.max_stay_nights END)>=DATEDIFF(DAY , a.arrival_date , b.arrival_date) + 1

the reason being is that, as far as i can recall, sql server doesn't like functions on the left side of the = sign in where clauses

like image 45
DForck42 Avatar answered Nov 02 '22 05:11

DForck42