Hypothetically, in a SQL Server database: if I have a table with two int fields (say, a many-to-many relation) that participates in joins with two other tables, at approximately what size does the table become large enough that the performance benefit of indexes on the two int fields outweighs the overhead those indexes impose?
Are there differences in architecture between different versions of SQL Server that would substantially change this answer?
For queries involving a small portion of the table's rows, indexes are always beneficial, be there 100 rows or 1,000,000.
See this entry in my blog for examples with plans and performance details.
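For concreteness, here is a minimal hypothetical schema the example queries below could run against (the table and column names are only the placeholders used in this answer, plus an extra name column for illustration):

-- Hypothetical tables; columns and names are illustrative only.
CREATE TABLE table1
(
    id       int IDENTITY PRIMARY KEY,
    col      int NOT NULL,
    othercol int NOT NULL
);

CREATE TABLE table2
(
    id   int IDENTITY PRIMARY KEY,
    col  int NOT NULL,
    name varchar(50) NOT NULL
);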
Queries like this:
SELECT *
FROM table1 t1
JOIN table2 t2
ON t2.col = t1.col
will most probably use a HASH JOIN: a hash table is built from the smaller table, and the rows from the larger table are used to probe it. No index is needed for this.
However, this query:
SELECT *
FROM table1 t1
JOIN table2 t2
ON t2.col = t1.col
WHERE t1.othercol = @value
will use NESTED LOOPS: the rows from the outer table (table1) will be searched using an index on table1.othercol, and the rows from the inner table (table2) will be searched using an index on table2.col.
If you don't have an index on col, a HASH JOIN will be used instead, which requires scanning all rows from both tables and extra resources to build the hash table.
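A sketch of the indexes that would let the optimizer choose that nested loops plan, using the hypothetical schema above (index names are assumptions):

-- Seek the outer table on the filter column, then seek the inner table on the join column.
CREATE INDEX IX_table1_othercol ON table1 (othercol);
CREATE INDEX IX_table2_col ON table2 (col);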
Indexes are also useful for queries like this:
SELECT t2.col
FROM table1 t1
JOIN table2 t2
ON t2.col = t1.col
In this case the engine doesn't need to read table2 itself at all: everything needed for this query can be found in the index, which can be much smaller than the table itself and more efficient to read.
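The hypothetical IX_table2_col index sketched above already covers that query, since col is its key. If the select list grew to include other table2 columns, the query could stay covered by carrying those columns at the index leaf level with INCLUDE, for example:

-- Covers e.g. SELECT t2.col, t2.name without ever reading the base table.
CREATE INDEX IX_table2_col_name ON table2 (col) INCLUDE (name);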
And, of course, if you need your data sorted and have indexes on both table1.col and table2.col, then the following query:
SELECT *
FROM table1 t1
JOIN table2 t2
ON t2.col = t1.col
ORDER BY
t2.col
will probably use the MERGE JOIN method, which is very fast if both input rowsets are sorted. Its output is also sorted, which means the ORDER BY comes for free.
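For the merge join to be that cheap, both inputs must arrive presorted on the join column. The hypothetical IX_table2_col index from the earlier sketch already orders table2.col; an equivalent index on table1.col would supply the other sorted input (again, names are assumptions):

-- Both inputs are now presorted on col, so the merge join reads them in step
-- and the final ORDER BY t2.col needs no extra sort operator.
CREATE INDEX IX_table1_col ON table1 (col);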
Note that even if you don't have an index, the optimizer may choose to Eager Spool your small table, which means building a temporary index for the duration of the query and dropping it after the query finishes.
If the table is small, this will be very fast, but again, an index won't hurt (for SELECT queries, I mean). If the optimizer doesn't need it, it simply won't be used.
Note, though, that creating an index may affect DML performance, but that's another story.
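One rough way to see that DML cost with the hypothetical tables and indexes above is to compare the I/O reported for an insert before and after the indexes exist:

SET STATISTICS IO ON;

-- With more nonclustered indexes on table2, each insert has to touch more
-- index pages, which shows up as additional reads in the STATISTICS IO output.
INSERT INTO table2 (col, name)
VALUES (42, 'example');

SET STATISTICS IO OFF;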
The penalty for insertion will be negligible until long after the benefit of the indexes appears, and the optimizer is smart enough to ignore the indexes until that point kicks in. So just index the table from the start.
Regardless of size, there is always a performance benefit to using an index when doing a lookup.
Regarding overhead, the question becomes: what overhead do you mean, and how do you relate it to the value of a lookup? The two are separate values, after all.
There are two forms of overhead for an index: space (which is usually negligible, depending on how the index is structured), and re-index on insert (the server must recalculate an index after every insert).
As I mentioned, the space issue probably isn't that big a deal. But re-indexing is. Fortunately, you need to be doing lots of near-continuous inserting before that form of overhead becomes a problem.
So bottom line: You're almost always better off having an index. Start from that position and wait until re-indexing becomes a bottleneck. Then you can look into alternatives.
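Applied to the many-to-many table from the original question, a common starting point (only a sketch; the table and column names are made up) is a composite clustered primary key plus a reversed nonclustered index, so joins driven from either side have a covering index:

-- Hypothetical junction table between two parent tables.
CREATE TABLE TableA_TableB
(
    a_id int NOT NULL,
    b_id int NOT NULL,
    CONSTRAINT PK_TableA_TableB PRIMARY KEY CLUSTERED (a_id, b_id)
);

-- Reversed index so joins that start from the b_id side are also covered.
CREATE INDEX IX_TableA_TableB_b_a ON TableA_TableB (b_id, a_id);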