Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Are stored procedures required for large data sets?

I've just started my first development job for a reasonably sized company that has to manage a lot of data. An average database is 6gb (from what I've seen so far). One of the jobs is reporting. How it's done currently is -

Data is replicated and transferred onto a data warehouse. From there, all the data required for a particular report is gathered (thousands of rows and lots of tables) and aggregated to a reports database in the warehouse. This is all done with stored procedures.

When a report is requested, a stored procedure is invoked which copies the data onto a reports database which PHP reads from to display the data.

I'm not a big fan of stored procs at all. But the people I've spoken to insist that stored procedures are the only option, as queries directly against the data via a programming language are incredibly slow (think 30 mins?). Security is also a concern.

So my question is - are stored procedures required when you have a very large data set? Do queries really take that long on such a large amount of data or is there a problem with either the DB servers or how the data is arranged (and indexed?). I've got a feeling that something is wrong.

like image 613
Josh Smeaton Avatar asked Feb 03 '09 08:02

Josh Smeaton


3 Answers

The reasoning behind using a stored procedure is that the execution plan that is created in order to execute your procedure is cached by SQL Server in an area of memory known as the Plan Cache. When the procedure is then subsequently re-run at a later time, the execution plan has the possibility of being re-used.

A stored procedure will not run any faster than the same query, executed as a batch of T-SQL. It is the execution plans re-use that result in a performance improvement. The query cost will be the same for the actual T-SQL.

Offloading data to a reporting database is a typical pursuit however you may need to review your indexing strategy on the reporting database as it will likely need to be quite different from that of your OLTP platform for example.

You may also wish to consider using SQL Server Analysis Services in order to service your reporting requirements as it sounds like your reports contain lots of data aggregations. Storing and processing data for the purpose of fast counts and analytics is exactly what SSAS is all about. It sounds like it is time for your business to look as building a data warehouse.

I hope this helps but please feel free to request further details.

Cheers, John

like image 138
John Sansom Avatar answered Sep 22 '22 15:09

John Sansom


In the context in which you are operating - large corporate database accessed in several places - it is virtually always best to place as much business logic inside the database as is possible.

In this case your immediate performance benefits are :

  1. Firstly because if the the SP involves any processing beyond a simple select the processing of the data within the database can be orders of magnitude faster than sending rows across the network to your program for handling there.
  2. You do acquire some benefits in that the SP is stored compiled. This is usually marginal compared to 1. if processing large volumes

However, and in my mind often more important than performance, is the fact that with corporate databases encapsulating the logic inside the database itself provides major management and maintenance benefits:-

  1. Data structures can be abstracted away from program logic, allowing database structures to change without requiring changes to programs accessing the data. Anyone who has spent hours grep'ing a corporate codebase for SQL using [mytable] before making a simple database change will appreciate this.
  2. SPs can provide a security layer, although this can be overused and overrelied on.

You say this is your first job for a company with a database of this type, so you can be forgiven for not appreciating how a database-centric approach to handling the data is really essential in such environments. You are not alone either - in a recent podcast Jeff Attwood said he wasn't a fan of putting code into databases. This is a fine and valid opinion where you are dealing with a database serving a single application, but is 100% wrong with a database used across a company by several applications, where the best policy is to screw down the data with a full complement of constraints and use SPs liberally for access and update.

The reason for this is if you don't such databases always lose data integrity and accumulate crud. Sometimes it's virtually impossible to imagine how they do, but in any large corporate database (tens of millions of records) without sufficient constraints there will be badly formed records - at best these force a periodic clean-up of data (a task I regularly used to get dumped with as a junior programmer), or worse will cause applications to crash due to invalid inputs, or even worse not cause them to crash but deliver incorrect business information to the end-users. And if your end user is your finance director then that's your job on the line :-)

like image 35
Cruachan Avatar answered Sep 21 '22 15:09

Cruachan


It seems to me that there is an additional step in there that, based on your description, appears unneccessary. Here is what I am referring to -

When a report is requested, a stored procedure is invoked which gathers the data into a format required for a report, and forwarded to another stored procedure which transforms the data into a view, and forwards THAT off to a PHP framework for display.

A sproc transforms the data for a report, then another sproc transforms this data into another format for front-end presentation - is the data ever used in the format in which it is in after the first sproc? If not, that stage seems unneccessary to me.

I'm assuming that your reports database is a data warehouse and that data is ETL'ed and stored within in a format for the purposes of reporting. Where I currently work, this is common practice.

As for your question regarding stored procedures, they allow you to centralize logic within the database and "encapsulate" security, the first of which would appear to be of benefit within your organisation, given the other sprocs that you have for data transformation. Stored procedures also have a stored execution plan which, under some circumstances, can provide some improvement to performance.

like image 20
Russ Cam Avatar answered Sep 22 '22 15:09

Russ Cam