
MySQL Efficiency Issue - How to find the right balance of normalization...?

I'm fairly new to working with relational databases, but have read a few books and know the basics of good design.

I'm facing a design decision, and I'm not sure how to continue. Here's a very oversimplified version of what I'm building: people can rate photos 1-5, and I need to display the average rating on the picture while keeping track of the individual votes. For example, 12 people voted 1, 7 people voted 2, and so on.

The normalization freak in me initially designed the table structure like this:

Table pictures
id* | picture | userID | 

Table ratings
id* | pictureID | userID | rating

With all the foreign key constraints and everything set as they should be. Every time someone rates a picture, I just insert a new record into ratings and be done with it.
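In rough DDL terms, something like this (simplified; the exact types are just what I'd guess at using):

CREATE TABLE pictures (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  picture VARCHAR(255) NOT NULL,  -- file path of the image
  userID INT UNSIGNED NOT NULL    -- the uploader
) ENGINE=InnoDB;

CREATE TABLE ratings (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  pictureID INT UNSIGNED NOT NULL,
  userID INT UNSIGNED NOT NULL,     -- the voter
  rating TINYINT UNSIGNED NOT NULL, -- 1 to 5
  FOREIGN KEY (pictureID) REFERENCES pictures (id)  -- InnoDB indexes pictureID for the FK
) ENGINE=InnoDB;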

To find the average rating of a picture, I'd just run something like this:

SELECT AVG(rating) FROM ratings WHERE pictureID = 5 GROUP BY pictureID

Having it set up this way lets me run my fancy statistics too. I can easily find who rated a certain picture a 3, and what not.
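For example, something like this lists everyone who rated picture 5 a 3 (the 5 is just a placeholder id):

SELECT userID FROM ratings WHERE pictureID = 5 AND rating = 3;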

Now I'm thinking that if there's a crapload of ratings (which is very possible in what I'm really designing), finding the average will become very expensive and painful.

Using a non-normalized version would seem to be more efficient. e.g.:

Table picture
id | picture | userID | ratingOne | ratingTwo | ratingThree | ratingFour | ratingFive

To calculate the average, I'd just have to select a single row. It seems so much more efficient, but so much uglier.
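Assuming each ratingN column holds the number of votes for that score, the average would come straight out of the one row:

SELECT (1*ratingOne + 2*ratingTwo + 3*ratingThree + 4*ratingFour + 5*ratingFive)
     / (ratingOne + ratingTwo + ratingThree + ratingFour + ratingFive) AS average
FROM picture
WHERE id = 5;
-- note: MySQL returns NULL here for a picture with no votes (division by zero)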

Can someone point me in the right direction of what to do? My initial research shows that I have to "find the right balance", but how do I go about finding that balance? Any articles or additional reading information would be appreciated as well.

Thanks.

Foo asked Mar 21 '10 08:03


3 Answers

Your normalized approach makes a lot of sense; the denormalized one doesn't.


In my experience (Telco Performance Management, hundreds of thousands of datapoints per 1/4 hour) we would do the following:

Table: pictures
id* | picture | userID | avg_rating | rating_count

Table: ratings
id* | pictureID | userID | rating

For the telco, the picture ratings would be re-calculated once daily; you should do it periodically (e.g. hourly) or on every insert (re-calculating for the picture just rated, not the entire table). This depends on the volume of ratings you get.
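A per-picture re-calculation could look roughly like this (an untested sketch; 5 stands in for the id of the picture that was just rated):

UPDATE pictures
SET avg_rating   = (SELECT AVG(rating) FROM ratings WHERE pictureID = 5),
    rating_count = (SELECT COUNT(*) FROM ratings WHERE pictureID = 5)
WHERE id = 5;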


In the telco we also kept the rating date in what is your 'pictures' table, and a 1/4h timestamp in the ratings table, but I don't think you need that level of detail.


The 'denormalization' is to move a calculable fact (COUNT(rating) and AVG(rating)) into the pictures table. This saves CPU cycles, but costs more storage.
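The read side then becomes a cheap single-row lookup instead of an aggregate over all votes:

SELECT avg_rating, rating_count FROM pictures WHERE id = 5;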

lexu answered Oct 01 '22 21:10


This is how I would approach the problem: http://pastie.org/879604

drop table if exists picture;
create table picture
( 
 picture_id int unsigned not null auto_increment primary key,
 user_id int unsigned not null, -- owner of the picture, the user who uploaded it
 tot_votes int unsigned not null default 0, -- total number of votes 
 tot_rating int unsigned not null default 0, -- accumulative ratings 
 avg_rating decimal(5,2) not null default 0, -- tot_rating / tot_votes
 key picture_user_idx(user_id)
)engine=innodb;

insert into picture (user_id) values 
 (1),(2),(3),(4),(5),(6),(7),(1),(1),(2),(3),(6),(7),(7),(5);


drop table if exists picture_vote;
create table picture_vote
( 
 picture_id int unsigned not null,
 user_id int unsigned not null,-- voter
 rating tinyint unsigned not null default 0, -- rating 0 to 5
 primary key (picture_id, user_id)
)engine=innodb;

delimiter #

create trigger picture_vote_before_ins_trig before insert on picture_vote
for each row
begin
 declare total_rating int unsigned default 0;
 declare total_votes int unsigned default 0;

 select tot_rating + new.rating, tot_votes + 1 into total_rating, total_votes 
   from picture where picture_id = new.picture_id;

 -- counts/stats
 update picture set
    tot_votes = total_votes, tot_rating = total_rating, 
    avg_rating = total_rating / total_votes
 where picture_id = new.picture_id;

end#

delimiter ;
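quick sanity check - a few hypothetical voters rating picture 1:

insert into picture_vote (picture_id, user_id, rating) values (1,2,4),(1,3,5),(1,4,3);

-- the trigger keeps the counts current: expect tot_votes = 3, tot_rating = 12, avg_rating = 4.00
select picture_id, tot_votes, tot_rating, avg_rating from picture where picture_id = 1;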

hope this helps :)

Jon Black answered Oct 01 '22 19:10


What would these ratingOne to ratingFive fields contain? The number of votes received? Then you won't know who cast each vote. If you really do need to denormalize, I'd just add an "average rating" field to the picture table and update it whenever a vote is cast (and keep the ratings table as is).

More generally, don't get caught up in premature optimisation. Try writing a test script which creates 100,000 pictures and 1 million ratings (or whatever figures you want to support), and see how long your AVG query takes. Chances are it will still be plenty fast. Make sure your "ratings" table has an index on pictureID, so the DB doesn't need to traverse the million rows.
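For instance (a rough sketch using the table names from the question; the index name is arbitrary):

CREATE INDEX ratings_picture_idx ON ratings (pictureID);

-- verify the index is actually used before trusting your timings
EXPLAIN SELECT AVG(rating) FROM ratings WHERE pictureID = 5;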

Alexander Malfait answered Oct 01 '22 20:10