Role of selectivity in index scan/seek

I have read in many SQL books and articles that selectivity is an important factor when creating an index. If a column has low selectivity, an index seek does more harm than good. But none of the articles explain why. Can anybody explain why this is so, or provide a link to a relevant article?

SexyBeast asked Aug 28 '12 08:08

2 Answers

I'll try to give a very simple explanation (based on my current knowledge of SQL Server):

If an index has low selectivity, a single key value matches a large percentage of the total rows (for example, 200 of the table's 500 rows share the same value in the indexed column).

Usually, if the index does not contain all the columns your query needs, each index entry holds a pointer to where the corresponding row is physically stored. In a second step, the engine then has to read that row.

So, as you can see, a search like this takes two steps. This is where selectivity comes in:

The more rows you get back because of low selectivity, the more of this double work the engine has to do. Because of this, there are cases where even a table scan is more efficient than an index seek with very low selectivity.
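To make the two-step cost concrete, here is a minimal sketch of a toy cost model in Python. It is not how SQL Server's optimizer actually computes costs; the function names, the index depth of 3, the 100 rows per page, and the assumption that every lookup reads a different page are all hypothetical values chosen only for illustration.

```python
import math

def seek_with_lookups_cost(matching_rows, index_depth=3):
    # Step 1: walk the index from root to leaf (index_depth page reads).
    # Step 2: follow one pointer per matching row to fetch the full row.
    # Pessimistic assumption: every lookup reads a different page.
    return index_depth + matching_rows

def table_scan_cost(total_rows, rows_per_page=100):
    # A scan reads every data page exactly once, however many rows match.
    return math.ceil(total_rows / rows_per_page)

def cheaper_plan(total_rows, matching_rows):
    # Compare estimated page reads and name the cheaper access method.
    seek = seek_with_lookups_cost(matching_rows)
    scan = table_scan_cost(total_rows)
    return "seek" if seek < scan else "scan"

# Highly selective predicate: 50 of 1,000,000 rows match.
print(cheaper_plan(1_000_000, 50))

# Low selectivity: 40% of the rows share the same key value.
print(cheaper_plan(1_000_000, 400_000))
```

Under these assumed numbers the first call picks the seek (53 page reads against 10,000 for the scan) and the second picks the scan (400,003 page reads against 10,000), which is the effect described above: the per-row lookups dominate once many rows match.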

András Ottó answered Nov 02 '22 11:11


From SimpleTalk article by Robert Sheldon: 14 SQL Server Indexing Questions You Were Too Shy To Ask

The ratio of unique values within a key column is referred to as index selectivity. The more unique the values, the higher the selectivity, which means that a unique index has the highest possible selectivity. The query engine loves highly selective key columns, especially if those columns are referenced in the WHERE clause of your frequently run queries. The higher the selectivity, the faster the query engine can reduce the size of the result set. The flipside, of course, is that a column with relatively few unique values is seldom a good candidate to be indexed.
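As a rough illustration of the ratio described in that quote, the sketch below computes selectivity as distinct values divided by total rows. The function name and the sample columns are made up for the example; a real engine derives this figure from its statistics rather than by scanning the column like this.

```python
def selectivity(values):
    # Ratio of distinct key values to total rows, between 0 and 1.
    values = list(values)
    if not values:
        raise ValueError("selectivity is undefined for an empty column")
    return len(set(values)) / len(values)

# A unique column: every value is distinct, so the ratio is the maximum.
customer_ids = [101, 102, 103, 104]
print(selectivity(customer_ids))

# A column with few distinct values: the ratio shrinks as the table grows.
genders = ["M", "F"] * 500
print(selectivity(genders))
```

The unique column gives 1.0, the highest possible value, while the two-valued column over 1,000 rows gives 0.002, which is why such a column is seldom a good candidate for an index on its own.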

Also check these articles:

  • Check this post by Pinal Dave
  • this other on SQL Serverpedia
  • This forum post on SqlServerCentral can help you too.
  • This article on SqlServerCentral also

From the SqlServerCentral article:

In general, a nonclustered index should be selective. That is, the values in the column should be fairly unique and queries that filter on it should return small portions of the table.

The reason for this is that key/RID lookups are expensive operations and if a nonclustered index is to be used to evaluate a query it needs to be covering or sufficiently selective that the costs of the lookups aren’t deemed to be too high.

If SQL considers the index (or the subset of the index keys that the query would be seeking on) insufficiently selective then it is very likely that the index will be ignored and the query executed as a clustered index (table) scan.

It is important to note that this does not just apply to the leading column. There are scenarios where a very unselective column can be used as the leading column, with the other columns in the index making it selective enough to be used.
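The last point can be sketched by measuring the distinct ratio of a whole key instead of a single column. This is only an illustration with invented data: the `status` and `order_id` columns and the `key_selectivity` helper are hypothetical, and the ratio is a simplification of what the optimizer's statistics track.

```python
def key_selectivity(rows, columns):
    # Distinct combinations of the given columns divided by total rows.
    if not rows:
        raise ValueError("selectivity is undefined for an empty table")
    keys = {tuple(row[col] for col in columns) for row in rows}
    return len(keys) / len(rows)

# 1,000 hypothetical orders: status has two values, order_id is unique.
rows = [
    {"status": "open" if i % 2 else "closed", "order_id": i}
    for i in range(1000)
]

# Leading column alone: very unselective.
print(key_selectivity(rows, ["status"]))

# Leading column plus the second key column: fully selective.
print(key_selectivity(rows, ["status", "order_id"]))
```

With this data the leading column alone has a ratio of 0.002, while the two-column key reaches 1.0, so a query that filters on both columns can still seek efficiently.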

Yaroslav answered Nov 02 '22 12:11
