There is often a relationship or correlation between the data stored in different columns of the same table. For example, in the customers table, the values in the c_state column are influenced by the values in the country_id column, as the state of XYZ is only going to be found in the country ABC.
I think PostgreSQL assumes predicates are mutually independent and selectivities for each predicate on the same relation are multiplied together. Hence selectivity estimates will be much smaller than actuals and non-optimal access paths may be chosen when data is highly dependent and skewed. How to avoid this in PostgreSQL?
Can we create some kind of multi-column statistics on a group of columns in PostgreSQL 9.3.5. Is there any support for Multi-dimensional histograms?
You're correct, Pg assumes independence, and this can be a problem when there's a correlation. The analyser doesn't know how to find correlations and the optimiser doesn't know how to use any such information even if the analyser could find it. There's no support for multi-dimensional histograms or storing cross-column correlation information.
There have been many discussions about this on the pgsql-hackers and pgsql-general lists, but no firm conclusions about how to deal with it have been reached. Additionally, almost nobody who has issues with this is willing to invest the time (or funding) into actually solving the problem.
Here's a relevant recent article. The optimizer hints wiki page also covers some correlation issues.
Some mailing list discussions (an extremely non-exhaustive list) include:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With