Suppose I have a table called Projects
with a column called Budget
with a standard B-Tree index. The table has 50,000 projects, and only 1% of them have a Budget of over one million. If I ran the SQL Query:
SELECT * From Projects WHERE Budget > 1000000;
The planner will use an index range scan on Budget
to get the rows off the heap table. However, if I use the query:
SELECT * From Projects WHERE Budget > 50;
The planner will most likely do a sequential scan on the table, as it will know this query will end up returning most or all rows anyway and there's no reason to load all the pages of the index into memory.
Now, let's say I run the query:
SELECT * From Projects WHERE Budget > :budget;
Where :budget
is a bind parameter passed into my database. From what I've read, the query as above will be cached, and no data on cardinality can be inferred. In fact, most databases will just assume an even distribution and the cached query plan will reflect that. This surprised me, as usually when you read about the benefits of bind parameters it's on the subject of preventing SQL injection attacks.
Obviously, this could improve performance if the resulting query plan would be the same, as a new plan wouldn't have to be compiled, but could also hurt performance if the values of :budget
greatly varied.
My Question: Why are bind parameters not resolved before the query plan is generated and cached? Shouldn't modern databases strive to generate the best plan for the query, which should mean looking at the value for each parameter and getting accurate index stats?
Note: This question probably doesn't apply to mySql as mySql doesn't cache SQL plans. However, I'm interested in why this is the case on Postgres, Oracle and MS SQL.
For Oracle specifically, it depends.
For quite some time (at least 9i), Oracle has supported bind variable peeking. That means that the first time a query is executed, the optimizer peeks at the value of the bind variable and bases its cardinality estimates on the value of that first bind variable. That makes sense in cases where most of the executions of a query are going to have bind variable values that return similarly sized results. If 99% of the queries are using small budget values, it is highly likely that the first execution will use a small value and thus the cached query plan will be appropriate for small bind variable values. Of course, that means that when you do specify a large bind variable value (or, worse, if you get lucky and the first execution is with a large value) you'll get less than optimal query plans.
If you are using 11g, Oracle can use adaptive cursor sharing. This allows the optimizer to maintain multiple query plans for a single query and to pick the appropriate plan based on the bind variable values. That can get rather complicated over time, though. If you have a query with N bind variables, the optimizer has to figure out how to partition that N-dimensional space into different query plans for different bind variable values in order to figure out when and whether to re-optimize a query for a new set of bind variable values and when to simply reuse an earlier plan. A lot of that work ends up being done at night during the nightly maintenance window in order to avoid incurring those costs during the productive day. But that also brings up issues about how much freedom the DBA wants to give the database to evolve plans over time vs how much the DBA wants to control plans so that the database doesn't suddenly start picking a poor plan that causes some major system to slow to a crawl on a random day.
This surprised me, as usually when you read about the benefits of bind parameters it's on the subject of preventing SQL injection attacks.
Don't confuse parameterized queries with prepared statements. Both offer parameterization, but prepared statements offer the additional caching of the query plan.
Why are bind parameters not resolved before the query plan is generated and cached?
Because sometimes generating the query plan is an expensive step. Prepared statements allow you to amortize the cost of query planning.
However, if all you're looking for is SQL injection protection, don't use prepared statements. Use parameterized queries.
For example, in PHP, you can use http://php.net/pg_query_params to execute a parameterized query WITHOUT caching the query plan; meanwhile http://php.net/pg_prepare and http://php.net/pg_execute are used to cache a plan for a prepared statement and later execute it.
Edit: 9.2 apparently changes the way prepared statements are planned
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With