Should we denormalize database to improve performance?

We have a requirement to store 500 measurements per second, coming from several devices. Each measurement consists of a timestamp, a quantity type, and several vector values. Right now there are 8 vector values per measurement, and we can consider this number constant for the needs of our prototype project. We are using NHibernate. Tests are done in SQLite (disk-file DB, not in-memory), but production will probably be MS SQL Server.

Our Measurement entity class is the one that holds a single measurement, and looks like this:

public class Measurement
{
    public virtual Guid Id { get; private set; }
    public virtual Device Device { get; private set; }
    public virtual Timestamp Timestamp { get; private set; }
    public virtual IList<VectorValue> Vectors { get; private set; }
}

Vector values are stored in a separate table, so that each of them references its parent measurement through a foreign key.
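A simplified sketch of such an entity (the actual class is omitted here; property names are illustrative):

using System;

public class VectorValue
{
    public virtual Guid Id { get; private set; }
    // Foreign key back to the parent measurement
    public virtual Measurement Measurement { get; private set; }
    // Position of the value within the measurement (1..8)
    public virtual int Ordinal { get; private set; }
    public virtual double Value { get; private set; }
}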

We have done a couple of things to ensure that the generated SQL is (reasonably) efficient: we are using Guid.Comb for generating IDs, we are flushing around 500 items in a single transaction, and the ADO.NET batch size is set to 100 (I think SQLite does not support batch updates, but it might be useful later).
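To make the setup concrete, here is a minimal sketch of what this looks like in code (the NHibernate settings are the standard ones; the surrounding class, method, and variable names are just illustrative):

using System.Collections.Generic;
using NHibernate;
using NHibernate.Cfg;

public static class MeasurementWriter
{
    public static ISessionFactory BuildSessionFactory()
    {
        var cfg = new Configuration();
        cfg.Configure();                              // reads hibernate.cfg.xml and mappings
        cfg.SetProperty("adonet.batch_size", "100");  // ADO.NET batch size mentioned above
        // Guid.Comb is configured in the Id mapping, e.g. <generator class="guid.comb" />
        return cfg.BuildSessionFactory();
    }

    public static void SaveBatch(ISessionFactory factory, IList<Measurement> batch)
    {
        // ~500 measurements saved in a single transaction, flushed once on commit
        using (var session = factory.OpenSession())
        using (var tx = session.BeginTransaction())
        {
            foreach (var measurement in batch)
                session.Save(measurement);
            tx.Commit();
        }
    }
}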

The problem

Right now we can insert 150-200 measurements per second (which is not fast enough, although this is SQLite we are talking about). Looking at the generated SQL, we can see that in a single transaction we insert (as expected):

  • 1 timestamp
  • 1 measurement
  • 8 vector values

which means that we are actually doing 10x as many single-table inserts: 1,500-2,000 per second.

If we placed everything (all 8 vector values and the timestamp) into the measurement table (adding 9 dedicated columns), it seems that we could increase our insert speed up to 10 times.
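Roughly, the flattened entity would look something like this (property names are illustrative; I'm assuming the timestamp collapses into a plain DateTime column):

using System;

public class FlatMeasurement
{
    public virtual Guid Id { get; private set; }
    public virtual Device Device { get; private set; }
    // Timestamp stored inline instead of in its own table
    public virtual DateTime Timestamp { get; private set; }
    // The 8 vector values as dedicated columns instead of child rows
    public virtual double V1 { get; private set; }
    public virtual double V2 { get; private set; }
    public virtual double V3 { get; private set; }
    public virtual double V4 { get; private set; }
    public virtual double V5 { get; private set; }
    public virtual double V6 { get; private set; }
    public virtual double V7 { get; private set; }
    public virtual double V8 { get; private set; }
}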

Switching to SQL Server will improve performance, but we would like to know whether there is a way to avoid the unnecessary performance costs related to the way the database is organized right now.

[Edit]

With in-memory SQLite I get around 350 items/sec (3,500 single-table inserts), which I believe is about as good as it gets with NHibernate (taking this post for reference: http://ayende.com/Blog/archive/2009/08/22/nhibernate-perf-tricks.aspx).

But I might as well switch to SQL Server and stop assuming things, right? I will update my post as soon as I test it.

[Update]

I've moved to SQL Server and flattened my hierarchy. I tested it by storing 3,000 measurements/sec for several hours, and it seems to be working fine.

asked May 03 '10 by Groo


2 Answers

Personally, I'd say go for it: denormalize, and then create an ETL process to bring this data into a more normalized format for analysis/regular use.

Basically the ideal situation for you might be to have a separate database (or even just separate tables in the same database if need be) that treats the acquisition of data as an entirely separate matter from having it in the format in which you need to process it.

That doesn't mean that you need to throw away the entities that you've created around your current database structure: just that you should also create those denormalized tables and build an ETL process to bring the data in. You could use SSIS (though it's still quite buggy and irritable) to bring the data into your normalized set of tables periodically, or even a C# app or other bulk-loading process.
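As a rough illustration, assuming a flattened staging table named MeasurementStaging with columns V1..V8 (all table and column names here are assumptions, not something from the question), the periodic ETL step could be as simple as an INSERT ... SELECT run from a small C# job:

using System;
using System.Data.SqlClient;

public static class MeasurementEtl
{
    public static void MoveStagingBatch(string connectionString)
    {
        const string sql = @"
            -- Tag the rows we are about to move, so rows arriving mid-run are left for the next pass
            UPDATE MeasurementStaging SET BatchId = @batchId WHERE BatchId IS NULL;

            INSERT INTO Measurement (Id, DeviceId, Timestamp)
            SELECT Id, DeviceId, Timestamp
            FROM MeasurementStaging
            WHERE BatchId = @batchId;

            INSERT INTO VectorValue (MeasurementId, Ordinal, Value)
            SELECT s.Id, v.Ordinal, v.Value
            FROM MeasurementStaging s
            CROSS APPLY (VALUES (1, s.V1), (2, s.V2), (3, s.V3), (4, s.V4),
                                (5, s.V5), (6, s.V6), (7, s.V7), (8, s.V8)) AS v(Ordinal, Value)
            WHERE s.BatchId = @batchId;

            DELETE FROM MeasurementStaging WHERE BatchId = @batchId;";

        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            using (var transaction = connection.BeginTransaction())
            using (var command = new SqlCommand(sql, connection, transaction))
            {
                command.Parameters.AddWithValue("@batchId", Guid.NewGuid());
                command.ExecuteNonQuery();
                transaction.Commit();
            }
        }
    }
}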

EDIT: This is assuming, of course, that your analysis doesn't need to be done in real time: just the collection of data. Quite often, people don't need (and sometimes, would actually prefer not to have) real time updating of analysis data. It's one of those things that sounds good on paper, but in practice it's unnecessary.

If some of the people who analyze this data require real-time access, you could build a toolset against the "bare metal" denormalized transactional data if desired. But quite frequently, when you really dig into the requirements, the people performing analysis don't need genuinely real-time data (and in some cases would prefer a more static set of data to work with!). In that case, the periodic ETL would work quite well. You just have to get together with your target users and find out what they genuinely need.

answered Oct 13 '22 by EdgarVerona


Well, it would depend. Are the 8 vector values a hard and fast number that will never change? Then denormalizing in your case could make sense (but only testing on the real hardware and database you are using will tell). If it could be 9 values next week, don't do it.

I would say that you first need to switch to SQL Server and the hardware you will be running on before trying to decide what to do.

Once you have switched, run Profiler. It is entirely possible that NHibernate is not creating the best-performing SQL for your inserts.

The fact that you have a set of vectors which probably has to be split up on insert may be part of your performance problem. It might be better to have 8 separate variables rather than a set that has to be split up.

You are talking about over 40 million records a day; this is going to require some major hardware and a very well designed database. It is also possible that a relational database is not the best choice for this (I have no idea how you intend to use this amount of data). How long are you keeping this data? The size here is going to get out of hand very, very quickly.

Is it possible to bulk insert the records in a group once a minute instead? Bulk inserts are far faster than row-by-row inserts.
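For example, you could buffer the incoming measurements in memory and push them once a minute with SqlBulkCopy. A rough sketch, assuming the flattened column layout discussed in the question (table and column names are illustrative):

using System;
using System.Data;
using System.Data.SqlClient;

public class MeasurementBuffer
{
    // Not thread-safe; synchronization is omitted for brevity.
    private readonly DataTable _table = new DataTable("Measurement");

    public MeasurementBuffer()
    {
        _table.Columns.Add("Id", typeof(Guid));
        _table.Columns.Add("DeviceId", typeof(Guid));
        _table.Columns.Add("Timestamp", typeof(DateTime));
        for (int i = 1; i <= 8; i++)
            _table.Columns.Add("V" + i, typeof(double));
    }

    public void Add(Guid deviceId, DateTime timestamp, double[] vectors)
    {
        var row = _table.NewRow();
        row["Id"] = Guid.NewGuid();
        row["DeviceId"] = deviceId;
        row["Timestamp"] = timestamp;
        for (int i = 0; i < 8; i++)
            row["V" + (i + 1)] = vectors[i];
        _table.Rows.Add(row);
    }

    // Call this e.g. once a minute from a timer.
    // Assumes the destination table's columns match the DataTable layout above.
    public void Flush(string connectionString)
    {
        using (var bulkCopy = new SqlBulkCopy(connectionString))
        {
            bulkCopy.DestinationTableName = "Measurement";
            bulkCopy.WriteToServer(_table);
        }
        _table.Clear();
    }
}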

Your design has to take into consideration how you are using the data as well as inserting it. Generally, things done to speed up inserts can slow down selects, and vice versa. You may need a data warehouse that is loaded once a day for analysis (and a quick query to show the raw, up-to-the-second data).

answered Oct 13 '22 by HLGEM