Adding simple AND after JOIN kills performance

Tags:

sql

sql-server

geospatial

spatial-query

I have a table containing about 500 points and am looking for duplicates within a tolerance. The query below takes less than a second and gives me 500 rows. Most have a distance of zero because each point is matched with itself (PointA = PointB).

DECLARE @TOL AS REAL
SET @TOL = 0.05

SELECT 
    PointA.ObjectId as ObjectIDa,
    PointA.Name as PTNameA,
    PointA.[Description] as PTdescA,
    PointB.ObjectId as ObjectIDb,
    PointB.Name as PTNameB,
    PointB.[Description] as PTdescB,
    ROUND(PointA.Geometry.STDistance(PointB.Geometry),3) DIST
FROM CadData.Survey.SurveyPoint PointA
  JOIN [CadData].Survey.SurveyPoint PointB
    ON PointA.Geometry.STDistance(PointB.Geometry) < @TOL
   -- AND
   -- PointA.ObjectId <> PointB.ObjectID
ORDER BY ObjectIDa

If I use the commented-out lines near the bottom, I get 14 rows but the execution time goes up to 14 seconds. Not that big a deal until my point table expands to tens of thousands of rows.

I apologize in advance if the answer is already out there. I did look, but being new I get lost reading posts which are way over my head.

ADDENDUM: ObjectID is a bigint and the PK for the table, so I realized that I could change the statement to

AND PointA.ObjectID > PointB.ObjectID

This now takes half the time and gives me half the results (7 rows in 7 seconds). I now don't get duplicates (as in Point 4 is close to Point 8 followed by Point 8 is close to Point 4). However the performance still concerns me as the table will be very large, so any performance issues will become problems.

ADDENDUM 2: Changing the order of the JOIN and AND (or WHERE as suggested) as below makes no difference either.

DECLARE @TOL AS REAL
SET @TOL = 0.05

SELECT 
    PointA.ObjectId as ObjectIDa,
    PointA.Name as PTNameA,
    PointA.[Description] as PTdescA,
    PointB.ObjectId as ObjectIDb,
    PointB.Name as PTNameB,
    PointB.[Description] as PTdescB,
    ROUND(PointA.Geometry.STDistance(PointB.Geometry),3) DIST
FROM CadData.Survey.SurveyPoint PointA
  JOIN [CadData].Survey.SurveyPoint PointB
    ON PointA.ObjectId < PointB.ObjectID
    WHERE
    PointA.Geometry.STDistance(PointB.Geometry) < @TOL
ORDER BY ObjectIDa

I find it fascinating that I can change the @Tol value to something large that returns over 100 rows with no change in performance even though it requires many computations. But then adding a simple A


asked Dec 30 '13 04:12

Land Surveyor

2 Answers

This is a fun question.

It's not unrealistic that you get a large performance improvement by changing from "<>" to ">".

As others have mentioned, the trick is to get the most out of your indexes. Certainly by using ">", you should easily get the server to limit to that specific range on your PK - avoiding looking "backwards" when you've already checked looking "forwards".
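To see what the two predicates do on a small case, here is a minimal sketch using a table variable with three made-up points. The table variable `@P`, its coordinates, and SRID 0 are assumptions for illustration, not taken from the question.

```sql
-- Hypothetical sample data: points 4 and 8 are 0.01 apart, point 9 is far away.
DECLARE @TOL AS REAL;
SET @TOL = 0.05;

DECLARE @P TABLE (ObjectId BIGINT PRIMARY KEY, [Geometry] GEOMETRY);
INSERT INTO @P (ObjectId, [Geometry]) VALUES
    (4, geometry::Point(10.00, 10, 0)),
    (8, geometry::Point(10.01, 10, 0)),
    (9, geometry::Point(50.00, 50, 0));

-- "<>" drops the self-matches but still returns both (4, 8) and (8, 4).
SELECT A.ObjectId AS ObjectIDa, B.ObjectId AS ObjectIDb
FROM @P A
  JOIN @P B
    ON A.[Geometry].STDistance(B.[Geometry]) < @TOL
   AND A.ObjectId <> B.ObjectId;

-- ">" returns each close pair once: only (8, 4).
SELECT A.ObjectId AS ObjectIDa, B.ObjectId AS ObjectIDb
FROM @P A
  JOIN @P B
    ON A.[Geometry].STDistance(B.[Geometry]) < @TOL
   AND A.ObjectId > B.ObjectId;
```

With no ID predicate at all, the same join would also return the three self-matches, which is where the 500 zero-distance rows in the question come from.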

This improvement will scale: it keeps helping as you add rows. But you're right to worry that it won't stop the work from growing. As you're correctly thinking, the more rows you have to scan, the longer it will take, and that's the case here because every point still has to be compared against the others.
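As a rough illustration of how that work grows, here is a back-of-the-envelope sketch. It assumes every distinct pair is compared exactly once, and the 20,000-row figure is just an example of "tens of thousands".

```sql
-- The number of distinct pairs among n points is n * (n - 1) / 2.
SELECT
    CAST(500 AS BIGINT) * 499 / 2     AS PairsAt500Rows,    -- 124750
    CAST(20000 AS BIGINT) * 19999 / 2 AS PairsAt20000Rows;  -- 199990000
```

So 40 times the rows means roughly 1,600 times the distance comparisons, even with the ">" predicate in place.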

If the first part (just the TOL check) performs well on its own, have you thought about splitting out the second part entirely?

Change the first part to dump into a temp table as

SELECT 
    PointA.ObjectId as ObjectIDa,
    PointA.Name as PTNameA,
    PointA.[Description] as PTdescA,
    PointB.ObjectId as ObjectIDb,
    PointB.Name as PTNameB,
    PointB.[Description] as PTdescB,
    ROUND(PointA.Geometry.STDistance(PointB.Geometry),3) DIST

into #AllDuplicatesWithRepeats

FROM CadData.Survey.SurveyPoint PointA
  JOIN [CadData].Survey.SurveyPoint PointB
    ON 
    PointA.Geometry.STDistance(PointB.Geometry) < @TOL
ORDER BY ObjectIDa

And then you can write the direct query that skips duplicates, below. It isn't special, but against that small set in the temp table it should be perfectly speedy.

Select
    *
from
    #AllDuplicatesWithRepeats d1
where
    -- Each pair is stored twice, as (a, b) and (b, a), and every point also
    -- pairs with itself, so keep only the rows where the first ID is smaller.
    d1.ObjectIDa < d1.ObjectIDb

answered Sep 18 '22 13:09

Mike M

The optimizer is probably choosing a different execution plan once you add the ObjectID comparison. Compare the plans for the two versions of the query to see whether, for example, one uses an index seek and the other a table scan. If so, consider experimenting with query hints.
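One way to make that comparison is sketched below, assuming the table from the question. The trimmed select list is only for brevity and could itself change the plan, so run it with your real column list as well.

```sql
-- Report CPU/elapsed time and logical reads for each statement in this session.
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

DECLARE @TOL AS REAL;
SET @TOL = 0.05;

-- Version 1: distance predicate only.
SELECT PointA.ObjectId AS ObjectIDa, PointB.ObjectId AS ObjectIDb
FROM CadData.Survey.SurveyPoint PointA
  JOIN CadData.Survey.SurveyPoint PointB
    ON PointA.Geometry.STDistance(PointB.Geometry) < @TOL;

-- Version 2: distance predicate plus the ObjectID comparison.
SELECT PointA.ObjectId AS ObjectIDa, PointB.ObjectId AS ObjectIDb
FROM CadData.Survey.SurveyPoint PointA
  JOIN CadData.Survey.SurveyPoint PointB
    ON PointA.Geometry.STDistance(PointB.Geometry) < @TOL
   AND PointA.ObjectId <> PointB.ObjectID;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```

In Management Studio, turning on "Include Actual Execution Plan" before running the batch shows the two plans side by side, and the Messages tab shows the timings and reads.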

As a workaround, you could always use a subquery:

DECLARE @TOL AS REAL
SET @TOL = 0.05

SELECT 
    ObjectIDa,
    PTNameA,
    PTdescA,
    ObjectIDb,
    PTNameB,
    PTdescB,
    DIST
FROM
(
SELECT 
  PointA.ObjectId as ObjectIDa,
    PointA.Name as PTNameA,
    PointA.[Description] as PTdescA,
    PointB.ObjectId as ObjectIDb,
    PointB.Name as PTNameB,
    PointB.[Description] as PTdescB,
    ROUND(PointA.Geometry.STDistance(PointB.Geometry),3) DIST
FROM CadData.Survey.SurveyPoint PointA
  JOIN [CadData].Survey.SurveyPoint PointB
    ON PointA.Geometry.STDistance(PointB.Geometry) < @TOL
   -- AND
   -- PointA.ObjectId <> PointB.ObjectID
) Subquery
WHERE ObjectIDa <> ObjectIDb
ORDER BY ObjectIDa

answered Sep 22 '22 13:09

Mike
