I have a table with many rows, where I periodically want to set one column to 0
using a cron.
What is faster / less memory consuming, doing an UPDATE
on all rows (ie. no WHERE
clause) or doing an UPDATE
only WHERE mycolumn != 0
?
As noticed in comments on the original post, it depends on several things (index, database engine, type of storage media, available cache memory, etc.).
We could make an educated guess that:
a) We should always have a full-table scan unless we have an index on the column (and I'd not recommend one just for the sake of this query, as you will penalize down your online writes to benefit an offline process);
b) The "cost" for checking the values is highly amortized by the savings of not writing unchanged records (unless the database implicitly does so, which may or may not happen) if the majority of records will be zeroes.
but assumptions start to pile up, so I'd rather measure instead. To play a bit, I've:
Created a test table with a "status" numeric column
Filled it with a few million records (e.g., using a script like the one in https://stackoverflow.com/a/17268740/64635)
Set it up with different values, then tried to UPDATE
the column to 0, with and without the WHERE
.
My results (which may differ from yours) were that the WHERE query was way faster if there were indeed few non-zero records. E.g., after setting up the table with either of
UPDATE myTable SET myColumn = 1; /* All values non-zero (1) */
UPDATE myTable SET myColumn = FLOOR(RAND()*10); /* ~90% values non-zero */
both WHERE
and non-WHERE
updates to 0 were slow (and no noticeable difference between them, implying "a" above is true), whereas after any of
UPDATE myTable SET myColumn = 0; /* All values zero */
UPDATE myTable SET myColumn = IF(id % 500 = 0, 1, 0); /* 99.8% values zero */
the UPDATE
with WHERE
was insanely faster (as implied by "b").
I'd recommend trying these tests (and even others, including the index if you really wish) on your setup (e.g., creating a separate table and running tests like these) and consider your data set (measure/estimate the % of records that will be non-zero when your cron job runs). Keep in mind you likely want to optimize for cost/availability (including your time as a cost) instead of finding the absolutely most performant solution in the universe (which is likely not cost-effective), and you will surely find the best solution. Good luck!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With