Optimize self join on millions of rows

Question

I have a table which is a link table from objects in my SQL Server 2012 database, (annonsid, annonsid2). This table is used to create chains of triangle or even rectangles to see who can swap with who.

This is the query I use on the table Matching_IDs which has 1,5 million rows in it, producing 14 million possible chains using this query:

SELECT COUNT(*)
FROM Matching_IDs AS m
  INNER JOIN Matching_IDs AS m2
     ON m.annonsid2 = m2.annonsid
  INNER JOIN Matching_IDs AS m3
     ON m2.annonsid2 = m3.annonsid
       AND m.annonsid = m3.annonsid2

I must improve performance to take maybe 1 second or less, Is there a faster way to do this? The query takes about 1 minute on my computer. I normally use a WHERE m.annonsid=x, but it takes just the same amount of time, cause it has to go through all possible combinations anyway.

Update: the latest query plan

|--Compute Scalar(DEFINE:([Expr1006]=CONVERT_IMPLICIT(int,[globalagg1011],0)))
   |--Stream Aggregate(DEFINE:([globalagg1011]=SUM([partialagg1010])))
        |--Parallelism(Gather Streams)
             |--Stream Aggregate(DEFINE:([partialagg1010]=Count(*)))
                  |--Hash Match(Inner Join, HASH:([m2].[annonsid2], [m2].[annonsid])=([m3].[annonsid], [m].[annonsid2]), RESIDUAL:([MyDatabase].[dbo].[Matching_IDs].[annonsid2] as [m2].[annonsid2]=[MyDatabase].[dbo].[Matching_IDs].[annonsid] as [m3].[annonsid] AND [MyDatabase].[dbo].[Matching_IDs].[annonsid2] as [m].[annonsid2]=[MyDatabase].[dbo].[Matching_IDs].[annonsid] as [m2].[annonsid]))
                       |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([m2].[annonsid2], [m2].[annonsid]))
                       |    |--Index Scan(OBJECT:([MyDatabase].[dbo].[Matching_IDs].[NonClusteredIndex-20121229-133207] AS [m2]))
                       |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([m3].[annonsid], [m].[annonsid2]))
                            |--Merge Join(Inner Join, MANY-TO-MANY MERGE:([m].[annonsid])=([m3].[annonsid2]), RESIDUAL:([MyDatabase].[dbo].[Matching_IDs].[annonsid] as [m].[annonsid]=[MyDatabase].[dbo].[Matching_IDs].[annonsid2] as [m3].[annonsid2]))
                                 |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([m].[annonsid]), ORDER BY:([m].[annonsid] ASC))
                                 |    |--Index Scan(OBJECT:([MyDatabase].[dbo].[Matching_IDs].[NonClusteredIndex-20121229-133152] AS [m]), ORDERED FORWARD)
                                 |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([m3].[annonsid2]), ORDER BY:([m3].[annonsid2] ASC))
                                      |--Index Scan(OBJECT:([MyDatabase].[dbo].[Matching_IDs].[NonClusteredIndex-20121229-133207] AS [m3]), ORDERED FORWARD)

alzaimar · Accepted Answer

Some ideas:

Try two indexes (annonsid,annonsid2) and (annonsid2,annonsid)

Have you tried a column store index? It makes the table read only but it might improve performance.

Also, some variations of the query could help. Examples:

SELECT COUNT(*)
FROM Matching_IDs AS m
  INNER JOIN Matching_IDs AS m2
     ON m.annonsid2 = m2.annonsid
  INNER JOIN Matching_IDs AS m3
     ON m2.annonsid2 = m3.annonsid
where m.annonsid = m3.annonsid2

or

SELECT COUNT(*)
FROM Matching_IDs AS m, Matching_IDs AS m2, Matching_IDs AS m3
where m2.annonsid2 = m3.annonsid
  and m.annonsid2 = m2.annonsid
  and m.annonsid = m3.annonsid2

Did you check the CPU/IO-Load? If IO-Load is high, then the server is not crunching numbers but swapping => more RAM solves the problem.

How fast is this query?

SELECT COUNT(*)
FROM Matching_IDs AS m
  INNER JOIN Matching_IDs AS m2
     ON m.annonsid2 = m2.annonsid

If this is very fast but adding the next join slows thing down then you propably need more RAM.

usr · Answer

It seems like you already indexed this quite well. You can try converting the hash to a merge join by adding the right multi-column index, but it won't give you the desired speedup of 60x.

I think this index would be on annonsid, annonsid2 although I might have made a mistake here.

It would be nice to materialize all of this but indexed views do not support self-joins. You can try to materialize this query (unaggregated) into a new table. Whenever you execute DML against the base table, also update the second table (using either application logic or triggers). That would allow you to query blazingly fast.

Optimize self join on millions of rows

Tags:

sql

optimization

sql-server

sql-server-2012

infinity1975

2 Answers

alzaimar

usr

Recent Activity

Donate For Us

Optimize self join on millions of rows

Tags:

sql

optimization

sql-server

sql-server-2012

infinity1975

2 Answers

alzaimar

usr

Related questions

Recent Activity

Donate For Us