Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Indexing & alternatives for low-selectivity columns

What are the range of tactics available for selecting records on low selectivity columns?

An example might be an orders table where, over many years, you build up a large number of completed orders but often need to select active orders. An order might go through a lifecycle such as placed, stock-allocated, picked from warehouse, despatched to customer, invoiced and paid. An order might additionally be cancelled, held, etc. The majority of records will eventually be in the final state (e.g. paid) but you might often need to select, say, allocated orders. In this case a sequential read would be slow.

Similar questions on indexing
MySQL: low cardinality/selectivity columns = how to index?
Do indexes suck in SQL?
What are indexes and how can I use them to optimize queries in my database?
Defining indexes: Which Columns, and Performance Impact?
and numerous others decreasingly related.

The approaches I have read about (in stackoverflow and elsewhere) include

  • Use a bitmap index
  • Use a partial index (create index x on t(c2) where c1='a')
  • Use a clustered index?
  • Don't index low selectivity columns, use sequential read
  • Partition the data (e.g. into several tables with identical schema)
  • Use a supplementary table (e.g. active_customers(customer_id)

My current DBMS doesn't support the first three options listed above and the remainder seem problematic - are there any other commonly used approaches?

Update: I've seen - index your low-selectivity column, but only ever select for high-selectivity values.

like image 914
RedGrittyBrick Avatar asked Nov 15 '10 14:11

RedGrittyBrick


1 Answers

I agree with Unreason's However branch. But there are some things to know about this case.

This is called skew and skew kills. This is a perfect use for a partial index where you'd exclude the 95% of paid invoices and only index the more interesting and selective stats. But you don't have that. You can horizontally partition all the rows into separate table/partitions but then you need to account for row migration (moving from one status to another) and that's expensive. The DBMS has to perform an Update, a Delete and an insert to change the status. If you're a high volume system that will hurt.

Forget what you said about whether or not to index based on selectivity because putting an index on a rapidly changing column is also usually a bad idea. Your index will have hot blocks where all the step 1's are being removed and another where all the step 2's are being inserted and oh btw, some step 2's are being removed at the same time into step 3's. This won't scale well.

I would recommend vertically partitioning your status into a separate table(s).

Your invoice table will have a PK and all the columns except status.

Your status you can handle two ways. That table will have the PK value as an FK back to the invoice table, the Status and a timestamp for when you entered that status. The best is a horizontally partitioned table on status. You'll have a partition for each status possible. So finding all or one "Placed" status will partition prune and read only the partition it needs - which is a very small number of blocks. Because the row is so narrow, you might get 400 invoice statuses on a single block. Looking up that status of any one invoice is easy since there's a global index on the PK.

If your RDBMS doesn't support partitioning with row migration, you'll need to manage these partitions as tables and delete from one and insert into another. You'll encapsulate these movements in a transaction in a procedure, so you keep the data clean. Every invoice is in one and only one status table. The harder part is querying by invoice ID, you'll have to check every table to see where it is.

You have another choice You can either write paid statuses or not. If it's a partitioned table, you can just delete the invoice from the invoice status table when it moves to paid. (Of course you'll write a paid record to the history table mentioned in the bonus material). Then you'll do an outer join to the status table and nulls mean paid. If you almost never query for paid status, there's really no reason to make that a fast query.

Bonus Material

in either case you'll want to keep track of these movements in a reporting table. Everytime you update a status, you'll want to write that to a history table. Eventually you'll want to analyze what I call transit times. What's the average time from filled to paid, by month? Is that increasing as a result of the bad economy? what's the transit time from placed to filled, by month. Do the summer months take longer because of missing bodies on vacation? you get the point. By updating that column you're losing those answers, so you'll need to embed that history log into your procedures.

like image 171
Stephanie Page Avatar answered Oct 25 '22 00:10

Stephanie Page