Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

In a database, when should you store derived data?

My question is about denormalization. In a database, when should you store derived data in its own column, rather than calculating it every time you need it?

For example, say you have Users who get Upvotes for their Questions. You display a User's reputation on their profile. When a User is Upvoted, should you increment their reputation, or should you calculate it when you retrieve their profile:

SELECT User.id, COUNT(*) AS reputation FROM User
LEFT JOIN Question
  ON Question.User_id = User.id
LEFT JOIN Upvote
  ON Upvote.Question_id = Question.id
GROUP BY User.id

How processor intensive does the query to get a User's reputation have to be before it would be worthwhile to keep track of it incrementally with its own column?

To continue our example, suppose an Upvote has a weight that depends on how many Upvotes (not how much reputation) the User who cast it has. The query to retrieve their reputation suddenly explodes:

SELECT
  User.id AS User_id,
  SUM(UpvoteWeight.weight) AS reputation
FROM User
LEFT JOIN Question
  ON User.id = Question.User_id
LEFT JOIN (
  SELECT
    Upvote.Question_id,
    COUNT(Upvote2.id)+1 AS weight
  FROM Upvote
  LEFT JOIN User
    ON Upvote.User_id = User.id
  LEFT JOIN Question
    ON User.id = Question.User_id
  LEFT JOIN Upvote AS Upvote2
    ON
      Question.id = Upvote2.Question_id
      AND Upvote2.date < Upvote.date
  GROUP BY Upvote.id
) AS UpvoteWeight ON Question.id = UpvoteWeight.Question_id
GROUP BY User.id

This is far out of proportion with the difficulty of an incremental solution. When would normalization be worth it, and when do the benefits of normalization lose to the benefits of denormalization (in this case query difficulty and/or performance)?

like image 794
theazureshadow Avatar asked Nov 08 '10 19:11

theazureshadow


People also ask

What is derived data in database?

A derived data element is a data element derived from other data elements using a mathematical, logical, or other type of transformation, e.g. arithmetic formula, composition, aggregation.

What is derived data used for?

Derived data provides critical insights not readily apparent in the original data. Instead of being limited to the static observations of direct data, derived data moves beyond the raw data to make new connections and extrapolates new use cases.

What should I store in a database?

The purpose of every database is to store information, texts, images, even media files. All dynamic modern websites rely on one or more databases for storing articles and other published content, information about the users, contact information, connections to other websites, ads, etc.

Should I store data on database?

Anything that you need to store permanently should go in a database. Relational databases are typically the most mature databases: they have withstood the test of time and continue to be an industry standard tool for the reliable storage of important data.


2 Answers

How processor intensive does the query to get a User's reputation have to be before it would be worthwhile to keep track of it incrementally with its own column?

There really are two questions here in guise of one: (1) Will this change improve the performace and (2) Will the performance improvement be worth the effort?


As far as whether the performance improvement, this is basically a standard pros/cons analysis.

The benefits of normalization are basically two-fold:

  • Easier data integrity

  • No issues with re-calculation (e.g. if the underlying data changes, the derived column needs to be re-calculated).

If you cover the data integrity with a robustly implemented solution (e.g. trigger, Sstored-proc-only data changes with revoked direct table change perms, etc...), then this becomes a straightforward calculation of whether the cost of verifying whether the source data change warrants the derived data re-calculation vs. recalculating the derived data every time. (NOTE: Another approach to keeping data integrity is to force the recalculation of derived data on schedule, where that data can afford to be inaccurate with some time tolerance. StackExchange takes this approach with some of its numbers).

In a typical scenario (many more retrieval of data and far less changes to the underlying data) the math pretty obviously skews in favor of keeping de-normalized derived data in the table.

In some rare cases where the underlying data changes VERY often yet the derived data is not retrieved that often, doing that might be detrimental.


Now, we are onto the far more important question: Will the performance improvement be worth the effort?

Please note that, as with ALL optimizations, the biggest question is "is the optimization even worth it at all?", and as such is the subject to two main considerations:

  1. Measuring exact performance difference and generally profiling.

  2. Context of this specific optimization in the big picture of your system.

E.g. if the difference in query performace - which as always when optimizing must first be measured - is 2% between cached derived data and computed one, the extra system complexity in implementing the reputation cache column may not be worth it in the first place. But what the threshold of caring vs. not caring is as far as marginal improvement depends on the big picture of your app. If you can take steps to improve query performance 10% in a different place, concentrate on that vs. 2%. If you're Google and extra 2% of query performance carries cost of 2 billion dollars in extra hardware to bear it, it needs to be optimized anyway.

like image 59
DVK Avatar answered Oct 16 '22 17:10

DVK


There is really no clear-cut answer because it depends a lot of factors like the volume of the site and how often you display the reputation (i.e. only on their profile page or next to EVERY instance of their user name, everywhere). The only real answer is "when it gets too slow"; in other words, you would probably need to test both scenarios and get some real-world perfromance stats.

Personally I'd denormalize in this particular situation and have either an insert trigger on the upvote table or a periodic update query that updates the denromalized reputation column. Would it really be the end of the world is someone's rep said "204" instead of "205" until the page refreshes?

like image 20
Paul Abbott Avatar answered Oct 16 '22 16:10

Paul Abbott