Redundant data in update statements

Hibernate generates UPDATE statements that include all columns, regardless of whether I changed the value in those columns, e.g.:

tx.begin();
Item i = em.find(Item.class, 12345);
i.setA("a-value");
tx.commit();

issues this UPDATE statement:

update Item set A = $1, B = $2, C = $3, D = $4 where id = $5

So columns B, C and D are updated even though I didn't change them.

Say, Items are updated frequently and all columns are indexed. Does it make sense to optimize the Hibernate part to something like this?

tx.begin();
em.createQuery("update Item i set i.a = :a where i.id = :id")
    .setParameter("a", "a-value")
    .setParameter("id", 12345)
    .executeUpdate();
tx.commit();

What confuses me most is that the EXPLAIN plans of the 'unoptimized' and the 'optimized' query versions are identical!
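
For illustration, the comparison looks roughly like this (parameter values inlined; Hibernate actually binds the current values of B, C, D, which is equivalent to B = B etc. for planning purposes):

EXPLAIN UPDATE Item SET A = 'a-value', B = B, C = C, D = D WHERE id = 12345;
EXPLAIN UPDATE Item SET A = 'a-value' WHERE id = 12345;

Both plans come out identical.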

Tair asked Oct 18 '11



1 Answer

Due to PostgreSQL's MVCC model, an UPDATE is effectively much like a DELETE plus an INSERT - with the notable exception of toasted values. See:

  • Does Postgres rewrite entire row on update?

(And minor differences for heap-only tuples - a DELETE + INSERT starts a new HOT chain - but that has no bearing on the case at hand.)

To be precise, the "deleted" row is just invisible to any transaction starting after the delete has been committed, and is removed by VACUUM later. Therefore, on the database side, including index manipulation, there is in effect no difference between the two statements. (Exceptions apply, keep reading.) The longer statement increases network traffic a bit (depending on your data) and needs a bit more parsing.
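
You can watch this happen via the system columns ctid (physical row location) and xmin (inserting transaction ID). A minimal sketch, assuming a toy item table:

CREATE TABLE item (id int PRIMARY KEY, a text, b text, c text, d text);
INSERT INTO item VALUES (12345, 'a0', 'b0', 'c0', 'd0');

SELECT ctid, xmin FROM item WHERE id = 12345;
UPDATE item SET a = 'a-value' WHERE id = 12345;
SELECT ctid, xmin FROM item WHERE id = 12345;  -- both changed: a new row version was written

The old row version remains on the page, invisible to later transactions, until VACUUM removes it.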

I studied HOT updates some more after @araqnid's input and ran some tests. Updates on columns that don't actually change the value make no difference whatsoever as far as HOT updates are concerned. My answer holds. See details below.

This also applies to toasted attributes, since those are also not touched unless the values actually change.

However, if you use per-column triggers (introduced with pg 9.0), this may have undesired side effects!

I quote the manual on triggers:

... a command such as UPDATE ... SET x = x ... will fire a trigger on column x, even though the column's value did not change.

Bold emphasis mine.
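
A minimal sketch to demonstrate, reusing the toy item table from above (function and trigger names are made up):

CREATE FUNCTION trg_notify_b() RETURNS trigger AS $$
BEGIN
   RAISE NOTICE 'b trigger fired: % -> %', OLD.b, NEW.b;
   RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER item_b_upd
BEFORE UPDATE OF b ON item
FOR EACH ROW EXECUTE PROCEDURE trg_notify_b();

-- Fires the trigger although b does not change:
UPDATE item SET a = 'a2', b = b WHERE id = 12345;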

Abstraction layers are for convenience. They are useful for SQL-illiterate developers or if the application needs to be portable between different RDBMS. On the downside, they can butcher performance and introduce additional points of failure. I avoid them wherever possible.

HOT (Heap-only tuple) updates

Heap-Only Tuples were introduced with Postgres 8.3, with important improvements in 8.3.4 and 8.4.9.
The release notes for Postgres 8.3:

UPDATEs and DELETEs leave dead tuples behind, as do failed INSERTs. Previously only VACUUM could reclaim space taken by dead tuples. With HOT dead tuple space can be automatically reclaimed at the time of INSERT or UPDATE if no changes are made to indexed columns. This allows for more consistent performance. Also, HOT avoids adding duplicate index entries.

Emphasis mine. And "no changes" includes cases where columns are updated with the same value as they already hold. I actually tested, as I wasn't sure.

Ultimately, the extensive README.HOT in the source code confirms it.

Toasted columns also don't stand in the way of HOT updates. The HOT-updated tuple just links to the same, unchanged tuple(s) in the toast fork of the relation. HOT updates even work with toasted values in the target list (actually changed or not). If toasted values are changed, it entails writes to the toast relation fork, obviously. I tested all of that, too.

Don't take my word for it, see for yourself. Postgres provides a couple of functions to check statistics. Run your UPDATE with and without all columns and check if it makes any difference.

-- Number of rows HOT-updated in table:
SELECT pg_stat_get_tuples_hot_updated('table_name'::regclass::oid)

-- Number of rows HOT-updated in table, in the current transaction:
SELECT pg_stat_get_xact_tuples_hot_updated('table_name'::regclass::oid)

Or use pgAdmin. Select your table and inspect the "Statistics" tab in the main window.

Be aware that HOT updates are only possible when there is room for the new tuple version on the same page of the main relation fork. One simple way to force that condition is to test with a small table that holds only a few rows. Page size is typically 8k, so there must be free space on the page.
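
Putting it together, a quick test might look like this (reusing the toy item table from above; run it in a single transaction so the xact counter applies):

BEGIN;
CREATE INDEX item_b_idx ON item (b);  -- index one of the columns we won't change

SELECT pg_stat_get_xact_tuples_hot_updated('item'::regclass::oid);  -- baseline: 0

UPDATE item SET a = 'a3' WHERE id = 12345;                       -- changed column only
UPDATE item SET a = 'a4', b = b, c = c, d = d WHERE id = 12345;  -- all columns, b/c/d unchanged

SELECT pg_stat_get_xact_tuples_hot_updated('item'::regclass::oid);  -- 2: both updates were HOT
COMMIT;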

Erwin Brandstetter answered Oct 08 '22