
Practical size limitations for RDBMS

I am working on a project that must store very large datasets and associated reference data. I have never come across a project that required tables quite this large. I have proved that at least one development environment cannot cope at the database tier with the processing required by the complex queries against views that the application layer generates (views with multiple inner and outer joins, grouping, summing and averaging against tables with 90 million rows).

The RDBMS that I have tested against is DB2 on AIX. The dev environment that failed was loaded with 1/20th of the volume that will be processed in production. I am assured that the production hardware is superior to the dev and staging hardware but I just don't believe that it will cope with the sheer volume of data and complexity of queries.

Before the dev environment failed, it was taking in excess of 5 minutes to return a small dataset (several hundred rows) that was produced by a complex query (many joins, lots of grouping, summing and averaging) against the large tables.

My gut feeling is that the db architecture must change so that the aggregations currently provided by the views are performed as part of an off-peak batch process.
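
Just to illustrate the kind of thing I mean (table and column names here are entirely made up, and I've used T-SQL syntax only as an example since the choice of RDBMS is part of the question), the off-peak batch would roll the detail rows up into summary tables that the views could read instead:

    -- Hypothetical nightly job: roll up the previous day's transactions
    INSERT INTO dbo.DailyBusinessSummary
        (BusinessID, RegionID, TxnDate, TxnCount, TotalAmount, AvgAmount)
    SELECT BusinessID, RegionID, TxnDate,
           COUNT(*), SUM(Amount), AVG(Amount)
    FROM dbo.Transactions
    WHERE TxnDate = CAST(DATEADD(day, -1, GETDATE()) AS date)
    GROUP BY BusinessID, RegionID, TxnDate;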

Now for my question. I am assured by people who claim to have experience of this sort of thing (which I do not) that my fears are unfounded. Are they? Can a modern RDBMS (SQL Server 2008, Oracle, DB2) cope with the volume and complexity I have described (given an appropriate amount of hardware) or are we in the realm of technologies like Google's BigTable?

I'm hoping for answers from folks who have actually had to work with this sort of volume at a non-theoretical level.

The nature of the data is financial transactions (dates, amounts, geographical locations, businesses) so almost all data types are represented. All the reference data is normalised, hence the multiple joins.

asked Apr 06 '10 by grenade


1 Answer

I work with a few SQL Server 2008 databases containing tables with rows numbering in the billions. The only real problems we ran into were those of disk space, backup times, etc. Queries were (and still are) always fast, generally in the < 1 sec range, never more than 15-30 secs even with heavy joins, aggregations and so on.

Relational database systems can definitely handle this kind of load, and if a single server or disk starts to strain, most high-end databases offer partitioning solutions.
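
For example, in SQL Server 2008 a large table can be range-partitioned by date so that queries and maintenance only touch the relevant slices. The names, boundary values and columns below are just placeholders, not a recommendation for your particular schema:

    -- Sketch of range partitioning by month in SQL Server 2008
    CREATE PARTITION FUNCTION pf_TxnMonth (date)
        AS RANGE RIGHT FOR VALUES ('2010-01-01', '2010-02-01', '2010-03-01');

    CREATE PARTITION SCHEME ps_TxnMonth
        AS PARTITION pf_TxnMonth ALL TO ([PRIMARY]);

    CREATE TABLE dbo.Transactions (
        TxnID      bigint        NOT NULL,
        TxnDate    date          NOT NULL,
        BusinessID int           NOT NULL,
        Amount     decimal(18,2) NOT NULL
    ) ON ps_TxnMonth (TxnDate);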

You haven't mentioned anything in your question about how the data is indexed, and 9 times out of 10, when I hear complaints about SQL performance, inadequate/nonexistent indexing turns out to be the problem.
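
For instance, a covering index tailored to one of the heavy aggregation queries might look something like this (purely illustrative names, assuming a transactions table queried by business and date):

    -- Covering index: the query can be answered from the index alone,
    -- without looking up the base table rows
    CREATE NONCLUSTERED INDEX IX_Transactions_Business_Date
        ON dbo.Transactions (BusinessID, TxnDate)
        INCLUDE (Amount, RegionID);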

The very first thing you should do when you see a slow query is to pull up the execution plan. If you see any full index/table scans, row lookups, etc., that indicates inadequate indexing for the query, or a query written in a way that can't take advantage of covering indexes. Inefficient joins (mainly nested loops) tend to be the second most common culprit, and it's often possible to fix that with a query rewrite. But without being able to see the plan, this is all just speculation.
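
In SQL Server you can get at this from T-SQL as well as from the Management Studio GUI, e.g.:

    -- Show I/O and timing statistics alongside the results
    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;
    GO

    -- Alternatively, return the estimated plan as XML instead of running the query
    -- (this SET must be the only statement in its batch)
    SET SHOWPLAN_XML ON;
    GO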

So the basic answer to your question is yes, relational database systems are completely capable of handling this scale, but if you want something more detailed/helpful then you might want to post an example schema / test script, or at least an execution plan for us to look over.

answered Sep 26 '22 by Aaronaught