Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Adding simple AND after JOIN kills performance

I have a table containing about 500 points and am looking for duplicates within a tolerance. This takes less than a second and gives me 500 rows. Most have a distance of zero because it gives the same point (PointA = PointB)

DECLARE @TOL AS REAL
SET @TOL = 0.05

SELECT 
    PointA.ObjectId as ObjectIDa,
    PointA.Name as PTNameA,
    PointA.[Description] as PTdescA,
    PointB.ObjectId as ObjectIDb,
    PointB.Name as PTNameB,
    PointB.[Description] as PTdescB,
    ROUND(PointA.Geometry.STDistance(PointB.Geometry),3) DIST
FROM CadData.Survey.SurveyPoint PointA
  JOIN [CadData].Survey.SurveyPoint PointB
    ON PointA.Geometry.STDistance(PointB.Geometry) < @TOL
   -- AND
   -- PointA.ObjectId <> PointB.ObjectID
ORDER BY ObjectIDa

If I use the commented out lines near the bottom, I get 14 rows but the execution time goes up to 14 seconds. Not that big a deal until my point table expands to 10's of thousands.

I apologize in advance if the answer is already out there. I did look, but being new I get lost reading posts which are way over my head.

ADDENDUM: ObjectID is a bigint and the PK for the table, so I realized that I could change the statement to

AND PointA.ObjectID > PointB.ObjectID

This now takes half the time and gives me half the results (7 rows in 7 seconds). I now don't get duplicates (as in Point 4 is close to Point 8 followed by Point 8 is close to Point 4). However the performance still concerns me as the table will be very large, so any performance issues will become problems.

ADDENDUM 2: Changing the order of the JOIN and AND (or WHERE as suggested) as below makes no difference either.

DECLARE @TOL AS REAL
SET @TOL = 0.05

SELECT 
    PointA.ObjectId as ObjectIDa,
    PointA.Name as PTNameA,
    PointA.[Description] as PTdescA,
    PointB.ObjectId as ObjectIDb,
    PointB.Name as PTNameB,
    PointB.[Description] as PTdescB,
    ROUND(PointA.Geometry.STDistance(PointB.Geometry),3) DIST
FROM CadData.Survey.SurveyPoint PointA
  JOIN [CadData].Survey.SurveyPoint PointB
    ON PointA.ObjectId < PointB.ObjectID
    WHERE
    PointA.Geometry.STDistance(PointB.Geometry) < @TOL
ORDER BY ObjectIDa

I find it fascinating that I can change the @Tol value to something large that returns over 100 rows with no change in performance even though it requires many computations. But then adding a simple A

like image 833
Land Surveyor Avatar asked Dec 30 '13 04:12

Land Surveyor


People also ask

Do joins affect performance?

Basically, join order DOES matter because if we can join two tables that will reduce the number of rows needed to be processed by subsequent steps, then our performance will improve.

Do joins slow down query?

Joins: If your query joins two tables in a way that substantially increases the row count of the result set, your query is likely to be slow. There's an example of this in the subqueries lesson. Aggregations: Combining multiple rows to produce a result requires more computation than simply retrieving those rows.

What are the factors affecting the performance of join operation?

Factors that affect the performance of joins are join order, indexes, and join strategy.


2 Answers

This is a fun question.

It's not unrealistic that you get a large performance improvement by changing from "<>" to ">".

As others have mentioned, the trick is to get the most out of your indexes. Certainly by using ">", you should easily get the server to limit to that specific range on your PK - avoiding looking "backwards" when you've already checked looking "forwards".

This improvement will scale - will help as you add rows. But you're right to worry it won't help prevent any increase in work. As you're correctly thinking, as long as you have to scan a larger number of rows, it will take longer. And that's the case here because we always want to compare everything.

If the first part is looking good, just the TOL check, have you thought about splitting out the second part entirely?

Change the first part to dump into a temp table as

SELECT 
    PointA.ObjectId as ObjectIDa,
    PointA.Name as PTNameA,
    PointA.[Description] as PTdescA,
    PointB.ObjectId as ObjectIDb,
    PointB.Name as PTNameB,
    PointB.[Description] as PTdescB,
    ROUND(PointA.Geometry.STDistance(PointB.Geometry),3) DIST

into #AllDuplicatesWithRepeats

FROM CadData.Survey.SurveyPoint PointA
  JOIN [CadData].Survey.SurveyPoint PointB
    ON 
    PointA.Geometry.STDistance(PointB.Geometry) < @TOL
ORDER BY ObjectIDa

And they you can write the direct query that skips duplicates, below. It isn't special, but against that small set in the temp table it should be perfectly speedy.

Select
    *
from    
    #AllDuplicatesWithRepeats d1
        left join #AllDuplicatesWithRepeats d2 on (
                        d1.objectIDa = d2.objectIDb
                        and
                        d1.objectIDb = d2.objectIDa
                        )
where
    d2.objectIDb is null
like image 88
Mike M Avatar answered Sep 18 '22 13:09

Mike M


The execution plan is probably doing something behind the scenes when you add in the ObjectID comparison. Check the execution plan to see if the two different versions of the query are, for example, using an index seek vs. a table scan. If so, consider experimenting with query hints.

As a workaround, you could always use a subquery:

DECLARE @TOL AS REAL
SET @TOL = 0.05

SELECT 
    ObjectIDa,
    PTNameA,
    PTdescA,
    ObjectIDb,
    PTNameB,
    PTdescB,
    DIST
FROM
(
SELECT 
  PointA.ObjectId as ObjectIDa,
    PointA.Name as PTNameA,
    PointA.[Description] as PTdescA,
    PointB.ObjectId as ObjectIDb,
    PointB.Name as PTNameB,
    PointB.[Description] as PTdescB,
    ROUND(PointA.Geometry.STDistance(PointB.Geometry),3) DIST
FROM CadData.Survey.SurveyPoint PointA
  JOIN [CadData].Survey.SurveyPoint PointB
    ON PointA.Geometry.STDistance(PointB.Geometry) < @TOL
   -- AND
   -- PointA.ObjectId <> PointB.ObjectID
) Subquery
WHERE ObjectIDa <> ObjectIDb
ORDER BY ObjectIDa
like image 20
Mike Avatar answered Sep 22 '22 13:09

Mike