
Why are indices on columns with very few unique values not effective?

Tags:

database

So, most database experts say it's not effective to create an index on a column with very few unique values relative to the size of the table.

Based on how databases work internally (I know most databases store indices using a B-tree), why does a B-tree over a column with few unique values make searching inefficient?

Henley asked Jun 26 '13


2 Answers

First, you need to understand how an index on a column works. In simple terms, it is nothing more than

an ordered list of the values in the given column, each with a pointer back to the actual record in the database.

Since it is ordered, a binary search can be used against it, rather than a linear search, which improves performance over a large dataset.
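To make that concrete, here is a minimal Python sketch (not how any real database implements its B-tree, and using a made-up table purely for illustration) that models an index as a sorted list of (value, row_id) pairs and answers an equality lookup with a binary search:

```python
import bisect

# Hypothetical table: row_id -> last_name. The rows themselves are unordered.
table = {1: "Jones", 2: "Smith", 3: "Adams", 4: "Smith"}

# The "index": (value, row_id) pairs kept sorted by value, so a binary
# search can locate matches instead of scanning every row.
index = sorted((last_name, row_id) for row_id, last_name in table.items())

def lookup(value):
    """Return the row ids whose indexed value equals `value`."""
    lo = bisect.bisect_left(index, (value,))                # first possible match
    hi = bisect.bisect_right(index, (value, float("inf")))  # just past the last match
    return [row_id for _, row_id in index[lo:hi]]

print(lookup("Smith"))  # -> [2, 4]
```

The binary search only narrows things down to the block of equal values; how useful that is depends on how large that block turns out to be, which is exactly where the number of unique values comes in.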

Imagine, then, your index as a phone book ordered by a column, say last name; but within the set of records sharing the same last name there is no common pattern or meaningful order: the records are in purely random order. And say we need to search for this record:

Ike Smith 4783 Random Ave. Seattle, WA 98117

Since the phone book is ordered by last name, we need only go to the S, then the m, then the i, and so on until we find Smith. And (hopefully) there are only a couple of records under Smith, so we find the one we want fairly quickly.

Now, imagine you have a phone book ordered by city instead of last name, and within the records that match a given city there is no particular order. We try our search again. However, once we find Seattle (using an extremely sophisticated binary search) we are left with close to 620,778 records, which we have to check sequentially because they are in completely random order. We're stuck checking every single entry for the record we want.

This is what happens when you use a column with very few unique values as the base of your index: the binary search narrows things down to a very large set of candidate records, within which the database can't make any assumptions beyond the indexed column value, so it needs to check each candidate sequentially for a matching record.

If the phone book were instead ordered by zip code (a more selective column), then you might find yourself searching through only the 18,623 records residing in 98117.
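As a rough illustration, the sketch below reuses the same toy sorted-list "index" with made-up city and zip-code data (the distributions are invented, not real) to show how many candidate rows an equality lookup leaves behind when the column has only a few distinct values versus roughly a thousand:

```python
import bisect
import random

random.seed(0)

# Hypothetical table of 100,000 people with a city (3 distinct values)
# and a zip code (~1,000 distinct values).
cities = ["Seattle", "Portland", "Spokane"]
zip_codes = [str(z) for z in range(98001, 99001)]
rows = [(i, random.choice(cities), random.choice(zip_codes)) for i in range(100_000)]

def match_count(index, value):
    """How many rows an equality lookup on a sorted (value, row_id) index returns."""
    lo = bisect.bisect_left(index, (value,))
    hi = bisect.bisect_right(index, (value, float("inf")))
    return hi - lo

city_index = sorted((city, i) for i, city, _ in rows)
zip_index = sorted((z, i) for i, _, z in rows)

# The binary search is cheap either way; the difference is the size of the
# candidate set that still has to be walked row by row afterwards.
print(match_count(city_index, "Seattle"))  # roughly a third of the table
print(match_count(zip_index, "98117"))     # roughly a thousandth of the table
```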

Furthermore, a true phone book usually resembles a composite index: instead of ordering by just a single column (e.g. last name), the values within the resulting set are then ordered by another column (say first name), and then another (middle name?), so the search can be done sub-linearly at every step until you find the record needed. It is basically an index within an index, where even if the first column is not very selective, the combination with the second one provides a specific enough criterion that only a small set of records needs to be searched linearly.
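A composite index can be sketched the same way, as a list sorted on a tuple of columns; the rows and column names below are hypothetical:

```python
import bisect

# Hypothetical phone-book rows: (row_id, last_name, first_name).
rows = [
    (1, "Smith", "Ann"),
    (2, "Smith", "Ike"),
    (3, "Smith", "Zoe"),
    (4, "Jones", "Bob"),
]

# A composite index orders by last_name first, then by first_name within
# each last_name, so both columns can be narrowed with binary searches.
composite = sorted((last, first, rid) for rid, last, first in rows)

def lookup(last, first):
    """Find row ids matching both columns without a linear scan."""
    lo = bisect.bisect_left(composite, (last, first))
    hi = bisect.bisect_right(composite, (last, first, float("inf")))
    return [rid for _, _, rid in composite[lo:hi]]

print(lookup("Smith", "Ike"))  # -> [2]
```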

rae1 answered Sep 19 '22

In general, the goal of an index is to provide faster-than-linear search by avoiding a scan through a significant portion of the table's data (see http://en.wikipedia.org/wiki/Database_index). If many of the would-be indexed values are identical, the database has to scan a significant portion of the table even after a successful index lookup.

So an index over a column with few unique values provides very little performance benefit, regardless of how it is implemented.
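A back-of-the-envelope way to see this, assuming (purely for illustration) that values are spread evenly across rows:

```python
def rows_per_lookup(total_rows, distinct_values):
    """Rough number of rows an equality lookup still has to visit,
    assuming values are evenly distributed across the table."""
    return total_rows / distinct_values

print(rows_per_lookup(1_000_000, 2))       # near-boolean column: ~500,000 rows to check
print(rows_per_lookup(1_000_000, 50_000))  # highly selective column: ~20 rows to check
```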

James answered Sep 19 '22