Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

how cross column correlation is handled by PostgreSQL Optimizer

There is often a relationship or correlation between the data stored in different columns of the same table. For example, in the customers table, the values in the c_state column are influenced by the values in the country_id column, as the state of XYZ is only going to be found in the country ABC.

I think PostgreSQL assumes predicates are mutually independent and selectivities for each predicate on the same relation are multiplied together. Hence selectivity estimates will be much smaller than actuals and non-optimal access paths may be chosen when data is highly dependent and skewed. How to avoid this in PostgreSQL?

Can we create some kind of multi-column statistics on a group of columns in PostgreSQL 9.3.5. Is there any support for Multi-dimensional histograms?

like image 947
CRM Avatar asked Feb 13 '23 01:02

CRM


1 Answers

You're correct, Pg assumes independence, and this can be a problem when there's a correlation. The analyser doesn't know how to find correlations and the optimiser doesn't know how to use any such information even if the analyser could find it. There's no support for multi-dimensional histograms or storing cross-column correlation information.

There have been many discussions about this on the pgsql-hackers and pgsql-general lists, but no firm conclusions about how to deal with it have been reached. Additionally, almost nobody who has issues with this is willing to invest the time (or funding) into actually solving the problem.

Here's a relevant recent article. The optimizer hints wiki page also covers some correlation issues.

Some mailing list discussions (an extremely non-exhaustive list) include:

  • how to implement selectivity injection in postgresql - recent
  • Cross column correlation revisited
  • How to specify/mock the statistic data of tables in PostgreSQL
like image 83
Craig Ringer Avatar answered Feb 15 '23 09:02

Craig Ringer