
MySQL: low cardinality/selectivity columns = how to index?

Tags: database, mysql

I need to add indexes to my table (columns) and stumbled across this post:

How many database indexes is too many?

Quote: “Having said that, you can clearly add a lot of pointless indexes to a table that won't do anything. Adding B-Tree indexes to a column with 2 distinct values will be pointless since it doesn't add anything in terms of looking the data up. The more unique the values in a column, the more it will benefit from an index.”

Is an index really pointless if there are only two distinct values? Consider the following table (MySQL, InnoDB):

  • Id (BIGINT)
  • fullname (VARCHAR)
  • address (VARCHAR)
  • status (VARCHAR)

Further conditions:

  • The table contains 300 million records
  • Status can only be “enabled” or “disabled”
  • 150 million records have status = enabled and 150 million records have status = disabled

My understanding is that, without an index on status, a select with where status = 'enabled' would result in a full table scan with 300 million records to process. Is that correct?

How efficient is the lookup when I use a BTREE index on status?

Should I index this column or not?

What alternatives (maybe other kinds of index) does MySQL InnoDB provide to efficiently look up records with the where status = 'enabled' clause in the given example, where the values have very low cardinality/selectivity?

Jan asked Mar 05 '10 13:03




2 Answers

The index that you describe is pretty much pointless. An index is best used when you need to select a small number of rows in comparison to the total rows.

The reason for this is related to how a database accesses a table. A table can be accessed either by a full table scan, where each block is read and processed in turn, or by a rowid or key lookup, where the database has a key/rowid and reads the exact row it requires.

When you use a where clause based on the primary key or another unique index, e.g. where id = 1, the database can use the index to get an exact reference to where the row's data is stored. This is clearly more efficient than doing a full table scan and processing every block.

Now back to your example: with a where clause of where status = 'enabled', the index will return 150 million rows, and the database will have to read each row in turn using separate small reads. A full table scan, by contrast, lets the database use larger, more efficient reads.

There is a point at which it is better to just do a full table scan rather than use the index. With MySQL you can add FORCE INDEX (idx_name) to your query to compare the two table access methods.
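
The comparison could look roughly like the following sketch. The table name people and the index name idx_status are made up for illustration; the columns are the ones from the question. It is best tried on a copy of the data, since building an index on 300 million rows takes a long time.

```sql
-- Hypothetical names: table `people`, index `idx_status`.
CREATE INDEX idx_status ON people (status);

-- Variant 1: push the optimizer toward the index on status.
-- COUNT(address) is used instead of COUNT(*) so that every matching row
-- has to be read; COUNT(*) could be answered from the index alone.
SELECT COUNT(address)
FROM people FORCE INDEX (idx_status)
WHERE status = 'enabled';

-- Variant 2: take the index out of consideration, which leaves a full scan.
SELECT COUNT(address)
FROM people IGNORE INDEX (idx_status)
WHERE status = 'enabled';
```

Run each variant several times before comparing the timings, because the first run also pays for loading data into the buffer pool.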

Reference: http://dev.mysql.com/doc/refman/5.5/en/how-to-avoid-table-scan.html

a'r answered Sep 22 '22 13:09


I'm sorry to say that I do not agree with Mike. An index is meant to limit the number of full records MySQL has to read, thereby limiting I/O, which is usually the bottleneck.

This indexing is not free; you pay for it on inserts/updates, when the index has to be updated, and in the search itself, as MySQL now needs to load the index (the complete index for 300M records is probably not in memory). So it might well be that you get extra I/O instead of limiting it.
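
Whether the index would fit in memory can be estimated once it exists. The sketch below assumes MySQL 5.6 or later with InnoDB persistent statistics enabled; the schema name mydb and the table name people are placeholders.

```sql
-- Size of each index on the (hypothetical) table mydb.people.
-- stat_name = 'size' holds the number of pages in the index.
SELECT index_name,
       stat_value AS pages,
       ROUND(stat_value * @@innodb_page_size / 1024 / 1024) AS size_mb
FROM mysql.innodb_index_stats
WHERE database_name = 'mydb'
  AND table_name = 'people'
  AND stat_name = 'size';

-- Memory InnoDB has available for caching data and index pages.
SELECT ROUND(@@innodb_buffer_pool_size / 1024 / 1024) AS buffer_pool_mb;
```

These figures come from sampled statistics, so treat them as estimates rather than exact sizes.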

I do agree with the statement that a binary variable is best stored as one, i.e. as a bool or tinyint, as that decreases the length of a row and can thereby limit disk I/O. Comparisons on numbers are also faster.
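
One possible way to make that change is sketched below. The table name people and the new column name is_enabled are made up for illustration. On a table of this size each statement is expensive, so it should be tried on a copy first.

```sql
-- Hypothetical names: table `people`, new column `is_enabled`.
-- Step 1: add a one-byte flag column (BOOL is a synonym for TINYINT(1)).
ALTER TABLE people
  ADD COLUMN is_enabled TINYINT(1) NOT NULL DEFAULT 0;

-- Step 2: fill it from the existing text column.
UPDATE people
SET is_enabled = IF(status = 'enabled', 1, 0);

-- Step 3: drop the text column once nothing reads it any more.
ALTER TABLE people
  DROP COLUMN status;
```

Queries would then filter with where is_enabled = 1 instead of comparing strings.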

If you need speed and you seldom use the disabled records, you may wish to have two tables, one for enabled and one for disabled records, and move the records when the status changes. As it increases complexity and risk, this would be my very last choice, of course. Definitely do the move in one transaction if you happen to go for it.
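
A minimal sketch of such a move, assuming two hypothetical InnoDB tables people_enabled and people_disabled with the same columns, and using the row with Id 42 as an example:

```sql
-- Hypothetical tables: people_enabled, people_disabled (identical columns).
START TRANSACTION;

-- Copy the record to the other table ...
INSERT INTO people_disabled (Id, fullname, address)
SELECT Id, fullname, address
FROM people_enabled
WHERE Id = 42;

-- ... and remove it from the original one.
DELETE FROM people_enabled
WHERE Id = 42;

-- Both changes become visible together; issue ROLLBACK instead if either step fails.
COMMIT;
```

Because both statements run inside one transaction, other sessions never see the record in both tables or in neither.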

It just popped into my head that you can check whether an index is actually used by using the EXPLAIN statement. That should show you how MySQL is optimizing the query. I don't really know how MySQL optimizes queries, but from PostgreSQL I do know that you should explain a query on a database approximately the same (in size and data) as the real database. So if you have a copy of the database, create an index on the table and see whether it's actually used. As I said, I doubt it, but I most definitely don't know everything :)
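
In MySQL that check could look like this sketch; the table name people is a placeholder, and the columns are those from the question.

```sql
-- Ask MySQL how it would execute the query, without running it.
EXPLAIN
SELECT Id, fullname, address
FROM people
WHERE status = 'enabled';

-- Columns of the result worth reading:
--   type  ALL means a full table scan; ref means an index lookup
--   key   the index actually chosen, or NULL if none is used
--   rows  the optimizer's estimate of how many rows it will examine
```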

extraneon answered Sep 21 '22 13:09