I'm looking for some "inference rules" (similar to set operation rules or logic rules) which I can use to reduce a SQL query in complexity or size. Does there exist something like that? Any papers, any tools? Any equivalencies that you found on your own? It's somehow similar to query optimization, but not in terms of performance.
To state it different: Having a (complex) query with JOINs, SUBSELECTs, UNIONs is it possible (or not) to reduce it to a simpler, equivalent SQL statement, which is producing the same result, by using some transformation rules?
So, I'm looking for equivalent transformations of SQL statements like the fact that most SUBSELECTs can be rewritten as a JOIN.
Structured Query Language (SQL) is a standardized programming language that is used to manage relational databases and perform various operations on the data in them.
To state it different: Having a (complex) query with JOINs, SUBSELECTs, UNIONs is it possible (or not) to reduce it to a simpler, equivalent SQL statement, which is producing the same result, by using some transformation rules?
That's exactly what optimizers do for a living (not that I'm saying they always do this well).
Since SQL
is a set based language, there are usually more than one way to transform one query to other.
Like this query:
SELECT * FROM mytable WHERE col1 > @value1 OR col2 < @value2
can be transformed into this:
SELECT * FROM mytable WHERE col1 > @value1 UNION SELECT * FROM mytable WHERE col2 < @value2
or this:
SELECT mo.* FROM ( SELECT id FROM mytable WHERE col1 > @value1 UNION SELECT id FROM mytable WHERE col2 < @value2 ) mi JOIN mytable mo ON mo.id = mi.id
, which look uglier but can yield better execution plans.
One of the most common things to do is replacing this query:
SELECT * FROM mytable WHERE col IN ( SELECT othercol FROM othertable )
with this one:
SELECT * FROM mytable mo WHERE EXISTS ( SELECT NULL FROM othertable o WHERE o.othercol = mo.col )
In some RDBMS
's (like PostgreSQL
), DISTINCT
and GROUP BY
use the different execution plans, so sometimes it's better to replace one with the other:
SELECT mo.grouper, ( SELECT SUM(col) FROM mytable mi WHERE mi.grouper = mo.grouper ) FROM ( SELECT DISTINCT grouper FROM mytable ) mo
vs.
SELECT mo.grouper, SUM(col) FROM mytable GROUP BY mo.grouper
In PostgreSQL
, DISTINCT
sorts and GROUP BY
hashes.
MySQL
lacks FULL OUTER JOIN
, so it can be rewritten as folloing:
SELECT t1.col1, t2.col2 FROM table1 t1 LEFT OUTER JOIN table2 t2 ON t1.id = t2.id
vs.
SELECT t1.col1, t2.col2 FROM table1 t1 LEFT JOIN table2 t2 ON t1.id = t2.id UNION ALL SELECT NULL, t2.col2 FROM table1 t1 RIGHT JOIN table2 t2 ON t1.id = t2.id WHERE t1.id IS NULL
, but see this article in my blog on how to do this more efficiently in MySQL
:
FULL OUTER JOIN
in MySQLThis hierarchical query in Oracle
:
SELECT DISTINCT(animal_id) AS animal_id FROM animal START WITH animal_id = :id CONNECT BY PRIOR animal_id IN (father, mother) ORDER BY animal_id
can be transformed to this:
SELECT DISTINCT(animal_id) AS animal_id FROM ( SELECT 0 AS gender, animal_id, father AS parent FROM animal UNION ALL SELECT 1, animal_id, mother FROM animal ) START WITH animal_id = :id CONNECT BY parent = PRIOR animal_id ORDER BY animal_id
, the latter one being more performant.
See this article in my blog for the execution plan details:
To find all ranges that overlap the given range, you can use the following query:
SELECT * FROM ranges WHERE end_date >= @start AND start_date <= @end
, but in SQL Server
this more complex query yields same results faster:
SELECT * FROM ranges WHERE (start_date > @start AND start_date <= @end) OR (@start BETWEEN start_date AND end_date)
, and believe it or not, I have an article in my blog on this too:
SQL Server
also lacks an efficient way to do cumulative aggregates, so this query:
SELECT mi.id, SUM(mo.value) AS running_sum FROM mytable mi JOIN mytable mo ON mo.id <= mi.id GROUP BY mi.id
can be more efficiently rewritten using, Lord help me, cursors (you heard me right: cursors
, more efficiently
and SQL Server
in one sentence).
See this article in my blog on how to do it:
There is a certain kind of query commonly met in financial applications that searches for the effective rate for a currency, like this one in Oracle
:
SELECT TO_CHAR(SUM(xac_amount * rte_rate), 'FM999G999G999G999G999G999D999999') FROM t_transaction x JOIN t_rate r ON (rte_currency, rte_date) IN ( SELECT xac_currency, MAX(rte_date) FROM t_rate WHERE rte_currency = xac_currency AND rte_date <= xac_date )
This query can be heavily rewritten to use an equality condition which allows a HASH JOIN
instead of NESTED LOOPS
:
WITH v_rate AS ( SELECT cur_id AS eff_currency, dte_date AS eff_date, rte_rate AS eff_rate FROM ( SELECT cur_id, dte_date, ( SELECT MAX(rte_date) FROM t_rate ri WHERE rte_currency = cur_id AND rte_date <= dte_date ) AS rte_effdate FROM ( SELECT ( SELECT MAX(rte_date) FROM t_rate ) - level + 1 AS dte_date FROM dual CONNECT BY level <= ( SELECT MAX(rte_date) - MIN(rte_date) FROM t_rate ) ) v_date, ( SELECT 1 AS cur_id FROM dual UNION ALL SELECT 2 AS cur_id FROM dual ) v_currency ) v_eff LEFT JOIN t_rate ON rte_currency = cur_id AND rte_date = rte_effdate ) SELECT TO_CHAR(SUM(xac_amount * eff_rate), 'FM999G999G999G999G999G999D999999') FROM ( SELECT xac_currency, TRUNC(xac_date) AS xac_date, SUM(xac_amount) AS xac_amount, COUNT(*) AS cnt FROM t_transaction x GROUP BY xac_currency, TRUNC(xac_date) ) JOIN v_rate ON eff_currency = xac_currency AND eff_date = xac_date
Despite being bulky as a hell, the latter query is 6
times faster.
The main idea here is replacing <=
with =
, which requires building an in-memory calendar table. to JOIN
with.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With