
Why are indices on columns with very few unique values not effective?

Tags:

database

So, most database experts say it's not effective to create an index on a column with very few unique values relative to the size of the table.

Based on how databases work internally (I know most databases store indices using a B-tree), why does a B-tree over a column with few unique values make searching inefficient?

Henley asked Jun 26 '13


2 Answers

First, you need to understand how an index on a column works. In simple terms, it is nothing more than

an ordered list of the values in the given column, each with a pointer back to the actual record in the database.

Since it is ordered, a binary search can be used against it, rather than a linear search, which improves performance over a large dataset.
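To make that concrete, here is a minimal Python sketch (not how any real database implements its B-tree, and using a made-up table purely for illustration) that models an index as a sorted list of (value, row_id) pairs and answers an equality lookup with a binary search:

```python
import bisect

# Hypothetical table: row_id -> last_name. The rows themselves are unordered.
table = {1: "Jones", 2: "Smith", 3: "Adams", 4: "Smith"}

# The "index": (value, row_id) pairs kept sorted by value, so a binary
# search can locate matches instead of scanning every row.
index = sorted((last_name, row_id) for row_id, last_name in table.items())

def lookup(value):
    """Return the row ids whose indexed value equals `value`."""
    lo = bisect.bisect_left(index, (value,))                # first possible match
    hi = bisect.bisect_right(index, (value, float("inf")))  # just past the last match
    return [row_id for _, row_id in index[lo:hi]]

print(lookup("Smith"))  # -> [2, 4]
```

The binary search only narrows things down to the block of equal values; how useful that is depends on how large that block turns out to be, which is exactly where the number of unique values comes in.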

Imagine, then, your index as a phone book ordered by a column, say last name; but within the set of records sharing the same last name there is no common pattern or meaningful order: the records are in purely random order. And say we need to search for this record:

Ike Smith 4783 Random Ave. Seattle, WA 98117

Since the phone book is ordered by last name, we need only go to the S, then the m, then the i, and so on until we find Smith. And (hopefully) there are only a couple of records under Smith, so we find the one we want fairly quickly.

Now, imagine you have a phone book ordered by city instead of last name, and within the records that match a given city there is no particular order. We try our search again. However, once we find Seattle (using an extremely sophisticated binary search) we are left with close to 620,778 records, which we have to check sequentially because they are in completely random order. We're stuck checking every single entry for the record we want.

This is what happens when you use a column with very few unique values as the base of your index: the binary search narrows things down to a very large set of candidate records, within which the database can't make any assumptions beyond the indexed column value, so it needs to check each candidate sequentially for a matching record.

If the phone book were instead ordered by zip code (a more selective column), then you might find yourself searching through only the 18,623 records residing in 98117.
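As a rough illustration, the sketch below reuses the same toy sorted-list "index" with made-up city and zip-code data (the distributions are invented, not real) to show how many candidate rows an equality lookup leaves behind when the column has only a few distinct values versus roughly a thousand:

```python
import bisect
import random

random.seed(0)

# Hypothetical table of 100,000 people with a city (3 distinct values)
# and a zip code (~1,000 distinct values).
cities = ["Seattle", "Portland", "Spokane"]
zip_codes = [str(z) for z in range(98001, 99001)]
rows = [(i, random.choice(cities), random.choice(zip_codes)) for i in range(100_000)]

def match_count(index, value):
    """How many rows an equality lookup on a sorted (value, row_id) index returns."""
    lo = bisect.bisect_left(index, (value,))
    hi = bisect.bisect_right(index, (value, float("inf")))
    return hi - lo

city_index = sorted((city, i) for i, city, _ in rows)
zip_index = sorted((z, i) for i, _, z in rows)

# The binary search is cheap either way; the difference is the size of the
# candidate set that still has to be walked row by row afterwards.
print(match_count(city_index, "Seattle"))  # roughly a third of the table
print(match_count(zip_index, "98117"))     # roughly a thousandth of the table
```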

Furthermore, a true phone book usually resembles a composite index: instead of ordering by just a single column (e.g. last name), the values within the resulting set are then ordered by another column (say first name), and then another (middle name?), so the search can be done sub-linearly at every step until you find the record needed. It is basically an index within an index, where even if the first column is not very selective, the combination with the second one provides a specific enough criterion that only a small set of records needs to be searched linearly.
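A composite index can be sketched the same way, as a list sorted on a tuple of columns; the rows and column names below are hypothetical:

```python
import bisect

# Hypothetical phone-book rows: (row_id, last_name, first_name).
rows = [
    (1, "Smith", "Ann"),
    (2, "Smith", "Ike"),
    (3, "Smith", "Zoe"),
    (4, "Jones", "Bob"),
]

# A composite index orders by last_name first, then by first_name within
# each last_name, so both columns can be narrowed with binary searches.
composite = sorted((last, first, rid) for rid, last, first in rows)

def lookup(last, first):
    """Find row ids matching both columns without a linear scan."""
    lo = bisect.bisect_left(composite, (last, first))
    hi = bisect.bisect_right(composite, (last, first, float("inf")))
    return [rid for _, _, rid in composite[lo:hi]]

print(lookup("Smith", "Ike"))  # -> [2]
```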

rae1 answered Sep 19 '22

In general, the goal of an index is to provide faster-than-linear search by avoiding a scan through a significant portion of the table's data (see http://en.wikipedia.org/wiki/Database_index). If many of the would-be indexed values are identical, the database has to scan a significant portion of the table even after a successful index lookup.

So an index over a column with few unique values provides very little performance benefit, regardless of how it is implemented.
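A back-of-the-envelope way to see this, assuming (purely for illustration) that values are spread evenly across rows:

```python
def rows_per_lookup(total_rows, distinct_values):
    """Rough number of rows an equality lookup still has to visit,
    assuming values are evenly distributed across the table."""
    return total_rows / distinct_values

print(rows_per_lookup(1_000_000, 2))       # near-boolean column: ~500,000 rows to check
print(rows_per_lookup(1_000_000, 50_000))  # highly selective column: ~20 rows to check
```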

James answered Sep 19 '22