
Best practices for Denormalizing data from Relational to non-relational DBs

I'm running a website that's starting to grow beyond simple performance tuning. It's a PHP app with MySQL as the backend. MySQL is properly tuned and the code is optimized.

The thing is, I can see that I could use some sort of denormalization to speed things up.

Suppose you have a site similar to eBay or Amazon. You have products in your database with related information (seller, customers who bought the product, city, state, etc.). That would be multiple tables in a relational database, and it's good to keep it that way to allow rich queries. But for the home page, for example, you could have a single denormalized document (say, in MongoDB): a collection with the latest products, denormalized, similar to this:

products = [
   {
      id: 13,
      name: "Some product",
      city: "aCity",
      state: "aState",
      price: 10
   },
   {
      id: 123,
      name: "another product",
      city: "aCity",
      state: "aState",
      price: 10
   }
]

This way, I could query that collection instead of the MySQL database (with all the joins involved), and things could get really fast.
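To make the idea concrete, here is a minimal sketch of what "denormalizing" means at the data level: merging the rows you'd otherwise JOIN at read time into one flat document. The app in the question is PHP/MySQL; Python and these field names are used purely for illustration.

```python
# Hypothetical example: flatten rows from three normalized "tables"
# (products, sellers, locations) into a single document suitable for
# a read-optimized collection like the one shown above.

def denormalize_product(product, seller, location):
    """Merge related rows into one flat, home-page-ready document."""
    return {
        "id": product["id"],
        "name": product["name"],
        "price": product["price"],
        "seller": seller["name"],
        "city": location["city"],
        "state": location["state"],
    }

# Stand-ins for rows fetched from MySQL:
product = {"id": 13, "name": "Some product", "price": 10,
           "seller_id": 7, "location_id": 2}
seller = {"id": 7, "name": "aSeller"}
location = {"id": 2, "city": "aCity", "state": "aState"}

doc = denormalize_product(product, seller, location)
# 'doc' now holds everything the home page needs in a single lookup,
# with no joins at read time.
```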

Now, here is the question. When and how would you denormalize that data? For example, I could decide to denormalize the data when it's inserted.

So, in my "create-product.php" (simply put), I could do all the INSERTs for MySQL, and after that save the document to the Mongo collection.
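That write-through option might look like the sketch below. The helper functions are stand-ins (in the real app they'd be PDO calls and the MongoDB driver); here they write to in-memory stores so the flow is visible, which also highlights a property of this design: if the second write fails, the cache is merely stale, not wrong.

```python
# Sketch of the write-through approach: insert into the source of truth
# first, then upsert the denormalized copy. In-memory stand-ins replace
# the real MySQL/Mongo calls for illustration.

mysql_rows = []       # stand-in for the normalized MySQL table
mongo_products = {}   # stand-in for the denormalized Mongo collection

def mysql_insert(row):
    mysql_rows.append(row)

def mongo_upsert(doc):
    mongo_products[doc["id"]] = doc

def create_product(product_row, denormalized_doc):
    # 1) The normalized tables are the source of truth: write them first.
    mysql_insert(product_row)
    # 2) Then the read-optimized copy. If this step fails, a batch job
    #    can repair the collection later from MySQL.
    mongo_upsert(denormalized_doc)

create_product(
    {"id": 13, "name": "Some product", "price": 10},
    {"id": 13, "name": "Some product", "price": 10,
     "city": "aCity", "state": "aState"},
)
```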

Or, I could just run a program on the server, or set up a cron job to look for the latest products.
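The cron/batch alternative typically keeps a "last synced" marker and rebuilds only documents for rows changed since then. A rough sketch (the timestamps and in-memory "table" are illustrative only):

```python
# Sketch of a periodic sync job: pull rows updated since the last run,
# rebuild their denormalized documents, and advance the sync marker.

last_sync = 100  # e.g. a Unix timestamp persisted between cron runs

products_table = [
    {"id": 1, "name": "old product", "updated_at": 50},
    {"id": 2, "name": "new product", "updated_at": 150},
]
mongo_products = {}  # stand-in for the denormalized collection

def sync_changed_products(since):
    """Rebuild documents for rows changed after 'since'; return new marker."""
    changed = [p for p in products_table if p["updated_at"] > since]
    for p in changed:
        mongo_products[p["id"]] = {"id": p["id"], "name": p["name"]}
    return max([p["updated_at"] for p in changed], default=since)

last_sync = sync_changed_products(last_sync)
# Only product 2 is (re)written; last_sync advances to 150.
```

The trade-off versus the write-through approach is freshness: the collection lags by up to one cron interval, but the insert path stays simple and fast.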

All of these are possibilities. What do you do? What is your experience?

Thanks a lot.

asked Oct 11 '22 by santiagobasulto


1 Answer

Conceptually you are creating some kind of cache, and you're foreseeing that populating it is going to be time-expensive; hence you want it to be persistent, on the reasonable assumption that loading from your persisted cache is going to be faster than going back to the real DB.

There are some variations on your idea: caching HTML page fragments or JSON strings, or using a distributed in-memory cache (not persistent, but fault-tolerant).

The big question with all caching solutions is: "how stale can I afford to be?" For some data, being 24 hours out of date doesn't really matter. For example: the top 10 most popular books, or the latest reviews; for those, a batch update will do.

For more urgent stuff you may well need a more rapid update, but you really want to avoid putting too much extra processing in the mainstream. For example, it would be a shame to give a customer a slow purchasing experience because he's waiting for an update to a cache. In those cases you might just drop a "here's an update" message on a queue, or indeed a "your entry number 23 is now stale" message, let the cache pick that up at its leisure, and if need be refresh itself.
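The queue idea above can be sketched as follows. `queue.Queue` stands in for a real message broker; the product IDs and `fresh_lookup` callback are made up for the example. The point is that the purchase path only enqueues a note and moves on, while a worker refreshes the cache out-of-band.

```python
# Sketch of queue-based cache invalidation: the hot path enqueues a
# staleness message; a background worker drains the queue and refreshes
# entries at its leisure.

import queue

invalidations = queue.Queue()
cache = {23: {"id": 23, "stock": 5}}  # entry 23 is about to go stale

def purchase(product_id):
    # Fast path: no cache rebuild here, just a "this entry is stale" note.
    invalidations.put(product_id)

def cache_worker(fresh_lookup):
    # Slow path, run out-of-band: refresh every flagged entry from the DB.
    while not invalidations.empty():
        pid = invalidations.get()
        cache[pid] = fresh_lookup(pid)

purchase(23)
cache_worker(lambda pid: {"id": pid, "stock": 4})  # pretend the DB now says 4
```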

answered Oct 20 '22 by djna