SQL: Joins vs Denormalization (lots of data)

Tags: sql, join, bigdata

I know variations of this question have been asked before, but my case may be a little different :-)

So, I am building a site that tracks events. Each event has an id and a value. It is also performed by a user, who has an id, age, gender, city, country and rank (these attributes are all integers, if it matters).

I need to be able to quickly get answers to two queries:

  • get the number of events from users with a certain profile (for example, males aged 18-25 from Moscow, Russia)
  • get the sum (and maybe also the average) of the values of events from users with a certain profile (a sketch of both appears below)
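
For concreteness, a minimal sketch of those two queries, assuming a hypothetical events table that already carries the user attributes (the integer codes 1/42/7 for gender/city/country are placeholders, not real values):

    -- number of events from males aged 18-25 in Moscow, Russia
    SELECT COUNT(*)
    FROM events
    WHERE gender = 1 AND age BETWEEN 18 AND 25
      AND city = 42 AND country = 7;

    -- sum (and average) of event values for the same profile
    SELECT SUM(value), AVG(value)
    FROM events
    WHERE gender = 1 AND age BETWEEN 18 AND 25
      AND city = 42 AND country = 7;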

Also, data is generated by multiple customers, each of which can have multiple source_ids.

Access pattern: data will mostly be written by collector processes, but when queried (infrequently, via the web UI) it has to respond quickly.

I expect LOTS of data, certainly more than a single table or a single server can handle.

I am thinking about grouping events into separate tables per day (e.g. 'events_20111011'). I also want to prefix the table name with the customer id and source id, so that data is isolated and can be trivially discarded (to purge old data) and relatively easily moved around (to distribute load to other machines). This way, every such table will have a limited number of rows: let's say, 10M tops.
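
A sketch of what one such per-customer/per-source/per-day table might look like (the naming pattern and column types are illustrative only):

    -- hypothetical layout: events_<customer_id>_<source_id>_<yyyymmdd>
    CREATE TABLE events_c17_s3_20111011 (
        event_id BIGINT NOT NULL,
        user_id  INT    NOT NULL,
        value    INT    NOT NULL,
        PRIMARY KEY (event_id)
    );

Purging a day is then a single DROP TABLE, and moving a day to another machine is a table-level copy.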

So, the question is: what to do with the user attributes?

Option 1, normalized: store them in a separate table and reference them from the event tables (sketched after this list).

  • (pro) No repetition of data.
  • (con) Joins, which are expensive (or so I heard).
  • (con) This requires the user table and the event tables to be on the same server.
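
In sketch form, with the same illustrative names as above, option 1 keeps only user_id in the event tables and turns every profile filter into a join against a shared users table:

    -- assumes a users(user_id, age, gender, city, country, rank) table
    SELECT COUNT(*)
    FROM events_c17_s3_20111011 e
    JOIN users u ON u.user_id = e.user_id
    WHERE u.gender = 1 AND u.age BETWEEN 18 AND 25
      AND u.city = 42 AND u.country = 7;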

Option 2, redundant: store the user attributes in the event tables and index them (sketched after this list).

  • (pro) Easier load balancing (self-contained tables can be moved around).
  • (pro) Simpler (faster?) queries.
  • (con) Lots of disk space and memory used for the repeated user attributes and the corresponding indexes.
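
Option 2 in sketch form (again, names and the index choice are illustrative, not a recommendation):

    -- user attributes copied into every event row
    CREATE TABLE events_c17_s3_20111011 (
        event_id  BIGINT NOT NULL,
        user_id   INT    NOT NULL,
        value     INT    NOT NULL,
        age       INT    NOT NULL,
        gender    INT    NOT NULL,
        city      INT    NOT NULL,
        country   INT    NOT NULL,
        user_rank INT    NOT NULL,  -- "rank" is a reserved word in some engines
        PRIMARY KEY (event_id)
    );

    -- one possible index for the profile filters
    CREATE INDEX idx_profile
        ON events_c17_s3_20111011 (country, city, gender, age);

This is the layout the sample queries at the top of the question run against.
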
asked Oct 11 '11 by Sergio Tulentsev

People also ask

What are the limitations of having too much denormalization in a data model?

Disadvantages of denormalization: because the data is redundant, update and insert operations are more expensive and take more time. Since normalization is not performed, the result is redundant data, data integrity is not maintained, and the redundancy means data can become inconsistent.

How will you know if you should or shouldn't Denormalize?

You should always start by building a clean, high-performance normalized database. Only if you need the database to perform better at particular tasks (such as reporting) should you opt for denormalization. If you do denormalize, be careful and make sure to document all changes you make to the database.

Are join operations costly?

Joins involving properly selected keys with correctly set up indexes are cheap, not expensive, because they allow significant pruning of the result before the rows are materialised.

Should joins be avoided?

Joins are slow; avoid them if possible. You cannot avoid joins in all cases, though: joins are necessary for some tasks. If you want help with optimizing a query, please provide more details. Everything matters: query, data, indexes, plan, etc.




2 Answers

Your design should be normalized; your physical schema may end up denormalized for performance reasons.

Is it possible to do both? There is a reason why SQL Server ships with Analysis Services. Even if you are not in the Microsoft realm, it is a common design to have a transactional system for data entry and day-to-day processing, while a reporting system is available for the kinds of queries that would place heavy loads on the transactional system.

Doing this means you get the best of both worlds: a normalized system for daily operations and a denormalized system for rollup queries.

In most cases nightly updates are fine for reporting systems, but what works best depends on your hours of operation and other factors. I find most 8-to-5 businesses have more than enough time in the evening to update a reporting system.
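
A minimal sketch of what such a nightly update might look like (the fact_events_daily table and all column names are assumptions for illustration, not something from this answer; the day parameters would be supplied by the job scheduler, since date-arithmetic syntax varies by engine):

    -- hypothetical nightly rollup from the transactional tables
    -- into a pre-aggregated reporting table
    INSERT INTO fact_events_daily
        (event_day, country, city, gender, age, event_count, value_sum)
    SELECT :day, u.country, u.city, u.gender, u.age,
           COUNT(*), SUM(e.value)
    FROM events e
    JOIN users u ON u.user_id = e.user_id
    WHERE e.created_at >= :day_start
      AND e.created_at <  :day_end
    GROUP BY u.country, u.city, u.gender, u.age;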

answered Oct 20 '22 by Godeke


Use an OLAP/data warehousing approach. That is, store your data in the standard normalized way, but also store aggregated versions of the frequently queried data in separate fact tables. The user queries won't run against real-time data, but it is usually worth it for the performance trade-off.
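
A sketch of such a fact table and the web-UI query it serves (same illustrative shape as the rollup sketched under the previous answer; every name here is hypothetical):

    -- one row per day per profile combination, far smaller than the raw events
    CREATE TABLE fact_events_daily (
        event_day   DATE   NOT NULL,
        country     INT    NOT NULL,
        city        INT    NOT NULL,
        gender      INT    NOT NULL,
        age         INT    NOT NULL,
        event_count BIGINT NOT NULL,
        value_sum   BIGINT NOT NULL,
        PRIMARY KEY (event_day, country, city, gender, age)
    );

    -- the UI reads a handful of aggregate rows instead of scanning millions of events
    SELECT SUM(event_count) AS events,
           SUM(value_sum)   AS total,
           SUM(value_sum) * 1.0 / SUM(event_count) AS average
    FROM fact_events_daily
    WHERE gender = 1 AND age BETWEEN 18 AND 25
      AND city = 42 AND country = 7;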

Also, if you are using SQL Server Enterprise, I wouldn't roll your own horizontal partitioning scheme (breaking the data into days). There are tools built into SQL Server that do that for you automatically.
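
For reference, a minimal sketch of that built-in mechanism, a partition function plus a partition scheme (boundary dates and object names are illustrative; see the SQL Server documentation for the details):

    -- T-SQL: one partition per day; new boundary values are added as days roll over
    CREATE PARTITION FUNCTION pf_events_by_day (date)
        AS RANGE RIGHT FOR VALUES ('2011-10-11', '2011-10-12', '2011-10-13');

    CREATE PARTITION SCHEME ps_events_by_day
        AS PARTITION pf_events_by_day ALL TO ([PRIMARY]);

    CREATE TABLE events (
        event_id  BIGINT NOT NULL,
        user_id   INT    NOT NULL,
        value     INT    NOT NULL,
        event_day DATE   NOT NULL
    ) ON ps_events_by_day (event_day);

Old partitions can then be switched out and dropped instead of dropping whole hand-named tables.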

answered Oct 20 '22 by JohnFx