SQL: Joins vs Denormalization (lots of data)

Tags: sql, join, bigdata

I know variations of this question have been asked before, but my case may be a little different :-)

So, I am building a site that tracks events. Each event has an id and a value. It is also performed by a user, who has an id, age, gender, city, country and rank (these attributes are all integers, if it matters).

I need to be able to quickly get answers to two queries:

  • get the number of events from users with a certain profile (for example, males aged 18-25 from Moscow, Russia)
  • get the sum (and maybe also the average) of the values of events from users with a certain profile (a sketch of both appears below)
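
For concreteness, a minimal sketch of those two queries, assuming a hypothetical events table that already carries the user attributes (the integer codes 1/42/7 for gender/city/country are placeholders, not real values):

    -- number of events from males aged 18-25 in Moscow, Russia
    SELECT COUNT(*)
    FROM events
    WHERE gender = 1 AND age BETWEEN 18 AND 25
      AND city = 42 AND country = 7;

    -- sum (and average) of event values for the same profile
    SELECT SUM(value), AVG(value)
    FROM events
    WHERE gender = 1 AND age BETWEEN 18 AND 25
      AND city = 42 AND country = 7;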

Also, data is generated by multiple customers, each of which can have multiple source_ids.

Access pattern: data will mostly be written by collector processes, but when queried (infrequently, via the web UI) it has to respond quickly.

I expect LOTS of data, certainly more than a single table or a single server can handle.

I am thinking about grouping events into separate tables per day (e.g. 'events_20111011'). I also want to prefix the table name with the customer id and source id, so that data is isolated and can be trivially discarded (to purge old data) and relatively easily moved around (to distribute load to other machines). This way, every such table will have a limited number of rows: let's say, 10M tops.
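
A sketch of what one such per-customer/per-source/per-day table might look like (the naming pattern and column types are illustrative only):

    -- hypothetical layout: events_<customer_id>_<source_id>_<yyyymmdd>
    CREATE TABLE events_c17_s3_20111011 (
        event_id BIGINT NOT NULL,
        user_id  INT    NOT NULL,
        value    INT    NOT NULL,
        PRIMARY KEY (event_id)
    );

Purging a day is then a single DROP TABLE, and moving a day to another machine is a table-level copy.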

So, the question is: what to do with the user attributes?

Option 1, normalized: store them in a separate table and reference them from the event tables (sketched after this list).

  • (pro) No repetition of data.
  • (con) Joins, which are expensive (or so I heard).
  • (con) This requires the user table and the event tables to be on the same server.
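
In sketch form, with the same illustrative names as above, option 1 keeps only user_id in the event tables and turns every profile filter into a join against a shared users table:

    -- assumes a users(user_id, age, gender, city, country, rank) table
    SELECT COUNT(*)
    FROM events_c17_s3_20111011 e
    JOIN users u ON u.user_id = e.user_id
    WHERE u.gender = 1 AND u.age BETWEEN 18 AND 25
      AND u.city = 42 AND u.country = 7;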

Option 2, redundant: store the user attributes in the event tables and index them (sketched after this list).

  • (pro) Easier load balancing (self-contained tables can be moved around).
  • (pro) Simpler (faster?) queries.
  • (con) Lots of disk space and memory used for the repeated user attributes and the corresponding indexes.
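
Option 2 in sketch form (again, names and the index choice are illustrative, not a recommendation):

    -- user attributes copied into every event row
    CREATE TABLE events_c17_s3_20111011 (
        event_id  BIGINT NOT NULL,
        user_id   INT    NOT NULL,
        value     INT    NOT NULL,
        age       INT    NOT NULL,
        gender    INT    NOT NULL,
        city      INT    NOT NULL,
        country   INT    NOT NULL,
        user_rank INT    NOT NULL,  -- "rank" is a reserved word in some engines
        PRIMARY KEY (event_id)
    );

    -- one possible index for the profile filters
    CREATE INDEX idx_profile
        ON events_c17_s3_20111011 (country, city, gender, age);

This is the layout the sample queries at the top of the question run against.
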
asked Oct 11 '11 by Sergio Tulentsev

People also ask

What are the limitations of having too much denormalization in a data model?

Disadvantages of denormalization: because the data is redundant, update and insert operations are more expensive and take more time. Since normalization is not performed, the result is redundant data, data integrity is not maintained, and the redundancy means data can become inconsistent.

How will you know if you should or shouldn't Denormalize?

You should always start by building a clean, high-performance normalized database. Only if you need the database to perform better at particular tasks (such as reporting) should you opt for denormalization. If you do denormalize, be careful and make sure to document all changes you make to the database.

Are join operations costly?

Joins involving properly selected keys with correctly set up indexes are cheap, not expensive, because they allow significant pruning of the result before the rows are materialised.

Should joins be avoided?

Joins are slow; avoid them if possible. You cannot avoid joins in all cases, though: joins are necessary for some tasks. If you want help with optimizing a query, please provide more details. Everything matters: query, data, indexes, plan, etc.




2 Answers

Your design should be normalized; your physical schema may end up denormalized for performance reasons.

Is it possible to do both? There is a reason why SQL Server ships with Analysis Services. Even if you are not in the Microsoft realm, it is a common design to have a transactional system for data entry and day-to-day processing, while a reporting system is available for the kinds of queries that would place heavy loads on the transactional system.

Doing this means you get the best of both worlds: a normalized system for daily operations and a denormalized system for rollup queries.

In most cases nightly updates are fine for reporting systems, but what works best depends on your hours of operation and other factors. I find most 8-to-5 businesses have more than enough time in the evening to update a reporting system.
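
A minimal sketch of what such a nightly update might look like (the fact_events_daily table and all column names are assumptions for illustration, not something from this answer; the day parameters would be supplied by the job scheduler, since date-arithmetic syntax varies by engine):

    -- hypothetical nightly rollup from the transactional tables
    -- into a pre-aggregated reporting table
    INSERT INTO fact_events_daily
        (event_day, country, city, gender, age, event_count, value_sum)
    SELECT :day, u.country, u.city, u.gender, u.age,
           COUNT(*), SUM(e.value)
    FROM events e
    JOIN users u ON u.user_id = e.user_id
    WHERE e.created_at >= :day_start
      AND e.created_at <  :day_end
    GROUP BY u.country, u.city, u.gender, u.age;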

answered Oct 20 '22 by Godeke


Use an OLAP/data warehousing approach. That is, store your data in the standard normalized way, but also store aggregated versions of the frequently queried data in separate fact tables. The user queries won't run against real-time data, but it is usually worth it for the performance trade-off.
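
A sketch of such a fact table and the web-UI query it serves (same illustrative shape as the rollup sketched under the previous answer; every name here is hypothetical):

    -- one row per day per profile combination, far smaller than the raw events
    CREATE TABLE fact_events_daily (
        event_day   DATE   NOT NULL,
        country     INT    NOT NULL,
        city        INT    NOT NULL,
        gender      INT    NOT NULL,
        age         INT    NOT NULL,
        event_count BIGINT NOT NULL,
        value_sum   BIGINT NOT NULL,
        PRIMARY KEY (event_day, country, city, gender, age)
    );

    -- the UI reads a handful of aggregate rows instead of scanning millions of events
    SELECT SUM(event_count) AS events,
           SUM(value_sum)   AS total,
           SUM(value_sum) * 1.0 / SUM(event_count) AS average
    FROM fact_events_daily
    WHERE gender = 1 AND age BETWEEN 18 AND 25
      AND city = 42 AND country = 7;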

Also, if you are using SQL Server Enterprise, I wouldn't roll your own horizontal partitioning scheme (breaking the data into days). There are tools built into SQL Server that do that for you automatically.
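
For reference, a minimal sketch of that built-in mechanism, a partition function plus a partition scheme (boundary dates and object names are illustrative; see the SQL Server documentation for the details):

    -- T-SQL: one partition per day; new boundary values are added as days roll over
    CREATE PARTITION FUNCTION pf_events_by_day (date)
        AS RANGE RIGHT FOR VALUES ('2011-10-11', '2011-10-12', '2011-10-13');

    CREATE PARTITION SCHEME ps_events_by_day
        AS PARTITION pf_events_by_day ALL TO ([PRIMARY]);

    CREATE TABLE events (
        event_id  BIGINT NOT NULL,
        user_id   INT    NOT NULL,
        value     INT    NOT NULL,
        event_day DATE   NOT NULL
    ) ON ps_events_by_day (event_day);

Old partitions can then be switched out and dropped instead of dropping whole hand-named tables.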

answered Oct 20 '22 by JohnFx