I have a table containing about 500 points and am looking for duplicates within a tolerance. This takes less than a second and gives me 500 rows. Most have a distance of zero because it gives the same point (PointA = PointB)
DECLARE @TOL AS REAL
SET @TOL = 0.05
SELECT
PointA.ObjectId as ObjectIDa,
PointA.Name as PTNameA,
PointA.[Description] as PTdescA,
PointB.ObjectId as ObjectIDb,
PointB.Name as PTNameB,
PointB.[Description] as PTdescB,
ROUND(PointA.Geometry.STDistance(PointB.Geometry),3) DIST
FROM CadData.Survey.SurveyPoint PointA
JOIN [CadData].Survey.SurveyPoint PointB
ON PointA.Geometry.STDistance(PointB.Geometry) < @TOL
-- AND
-- PointA.ObjectId <> PointB.ObjectID
ORDER BY ObjectIDa
If I use the commented out lines near the bottom, I get 14 rows but the execution time goes up to 14 seconds. Not that big a deal until my point table expands to 10's of thousands.
I apologize in advance if the answer is already out there. I did look, but being new I get lost reading posts which are way over my head.
ADDENDUM: ObjectID is a bigint and the PK for the table, so I realized that I could change the statement to
AND PointA.ObjectID > PointB.ObjectID
This now takes half the time and gives me half the results (7 rows in 7 seconds). I now don't get duplicates (as in Point 4 is close to Point 8 followed by Point 8 is close to Point 4). However the performance still concerns me as the table will be very large, so any performance issues will become problems.
ADDENDUM 2: Changing the order of the JOIN and AND (or WHERE as suggested) as below makes no difference either.
DECLARE @TOL AS REAL
SET @TOL = 0.05
SELECT
PointA.ObjectId as ObjectIDa,
PointA.Name as PTNameA,
PointA.[Description] as PTdescA,
PointB.ObjectId as ObjectIDb,
PointB.Name as PTNameB,
PointB.[Description] as PTdescB,
ROUND(PointA.Geometry.STDistance(PointB.Geometry),3) DIST
FROM CadData.Survey.SurveyPoint PointA
JOIN [CadData].Survey.SurveyPoint PointB
ON PointA.ObjectId < PointB.ObjectID
WHERE
PointA.Geometry.STDistance(PointB.Geometry) < @TOL
ORDER BY ObjectIDa
I find it fascinating that I can change the @Tol value to something large that returns over 100 rows with no change in performance even though it requires many computations. But then adding a simple A
Basically, join order DOES matter because if we can join two tables that will reduce the number of rows needed to be processed by subsequent steps, then our performance will improve.
Joins: If your query joins two tables in a way that substantially increases the row count of the result set, your query is likely to be slow. There's an example of this in the subqueries lesson. Aggregations: Combining multiple rows to produce a result requires more computation than simply retrieving those rows.
Factors that affect the performance of joins are join order, indexes, and join strategy.
This is a fun question.
It's not unrealistic that you get a large performance improvement by changing from "<>" to ">".
As others have mentioned, the trick is to get the most out of your indexes. Certainly by using ">", you should easily get the server to limit to that specific range on your PK - avoiding looking "backwards" when you've already checked looking "forwards".
This improvement will scale - will help as you add rows. But you're right to worry it won't help prevent any increase in work. As you're correctly thinking, as long as you have to scan a larger number of rows, it will take longer. And that's the case here because we always want to compare everything.
If the first part is looking good, just the TOL check, have you thought about splitting out the second part entirely?
Change the first part to dump into a temp table as
SELECT
PointA.ObjectId as ObjectIDa,
PointA.Name as PTNameA,
PointA.[Description] as PTdescA,
PointB.ObjectId as ObjectIDb,
PointB.Name as PTNameB,
PointB.[Description] as PTdescB,
ROUND(PointA.Geometry.STDistance(PointB.Geometry),3) DIST
into #AllDuplicatesWithRepeats
FROM CadData.Survey.SurveyPoint PointA
JOIN [CadData].Survey.SurveyPoint PointB
ON
PointA.Geometry.STDistance(PointB.Geometry) < @TOL
ORDER BY ObjectIDa
And they you can write the direct query that skips duplicates, below. It isn't special, but against that small set in the temp table it should be perfectly speedy.
Select
*
from
#AllDuplicatesWithRepeats d1
left join #AllDuplicatesWithRepeats d2 on (
d1.objectIDa = d2.objectIDb
and
d1.objectIDb = d2.objectIDa
)
where
d2.objectIDb is null
The execution plan is probably doing something behind the scenes when you add in the ObjectID
comparison. Check the execution plan to see if the two different versions of the query are, for example, using an index seek vs. a table scan. If so, consider experimenting with query hints.
As a workaround, you could always use a subquery:
DECLARE @TOL AS REAL
SET @TOL = 0.05
SELECT
ObjectIDa,
PTNameA,
PTdescA,
ObjectIDb,
PTNameB,
PTdescB,
DIST
FROM
(
SELECT
PointA.ObjectId as ObjectIDa,
PointA.Name as PTNameA,
PointA.[Description] as PTdescA,
PointB.ObjectId as ObjectIDb,
PointB.Name as PTNameB,
PointB.[Description] as PTdescB,
ROUND(PointA.Geometry.STDistance(PointB.Geometry),3) DIST
FROM CadData.Survey.SurveyPoint PointA
JOIN [CadData].Survey.SurveyPoint PointB
ON PointA.Geometry.STDistance(PointB.Geometry) < @TOL
-- AND
-- PointA.ObjectId <> PointB.ObjectID
) Subquery
WHERE ObjectIDa <> ObjectIDb
ORDER BY ObjectIDa
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With