Are performance/code-maintainability concerns surrounding SELECT * on MS SQL still relevant today, with modern ORMs?

Summary: I've seen a lot of advice against using SELECT * in MS SQL, due to both performance and maintainability concerns. However, many of these posts are very old - 5 to 10 years! From many of them, it seems the performance concerns may actually have been quite small even in their time, and as for the maintainability concerns ("oh no, what if someone changes the columns, and you were getting data by indexing an array! your SELECT * would get you in trouble!"), modern coding practices and ORMs (such as Dapper) seem - at least in my experience - to eliminate such concerns.

And so: are there concerns with SELECT * that are still relevant today?


Greater context: I've started working at a place with a lot of old MS code (ASP scripts, and the like), and I've been helping to modernize a lot of it. However, most of my SQL experience is actually from MySQL and PHP frameworks and ORMs - this is my first time working with MS SQL - and I know there are subtle differences between the two. Also: my co-workers are a little older than I am, and have some concerns that - to me - seem "older" ("nullable fields are slow! avoid them!"). But again: in this particular field, they definitely have more experience than I do.

For this reason, I'd also like to ask: whether or not SELECT * with modern ORMs is safe and sane to do today, are there recent online resources which indicate as much?

thanks! :)

asked Sep 02 '16 by Ben




3 Answers

I will not touch maintainability in this answer, only the performance part.

Performance in this context has little to do with ORMs.

It doesn't matter to the server how the query that it is running was generated, whether it was written by hand or generated by the ORM.

It is still a bad idea to select columns that you don't need.

It doesn't really matter from the performance point of view whether the query looks like:

SELECT * FROM Table

or all columns are listed there explicitly, like:

SELECT Col1, Col2, Col3 FROM Table

If you need just Col1, then make sure that you select only Col1. Whether it is achieved by writing the query by hand or by fine-tuning your ORM, it doesn't matter.


Why selecting unnecessary columns is a bad idea:

  • extra bytes to read from disk

  • extra bytes to transfer over the network

  • extra bytes to parse on the client

  • But, the most important reason is that the optimiser may not be able to generate a good plan. For example, if there is a covering index that includes all requested columns, the server will usually read just this index; but if you request more columns, it will do extra lookups, use some other index, or just scan the whole table. The final impact can vary from negligible to seconds vs hours of run time. The larger and more complicated the database, the more likely you are to see a noticeable difference. (A minimal sketch of the covering-index effect follows this list.)
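
A minimal sketch of that covering-index effect in T-SQL. The table and index names (dbo.Orders, IX_Orders_CustomerId) are hypothetical, invented purely for illustration:

-- Hypothetical table with one wide column.
CREATE TABLE dbo.Orders (
    OrderId    int IDENTITY PRIMARY KEY,
    CustomerId int NOT NULL,
    OrderDate  datetime2 NOT NULL,
    Notes      nvarchar(max) NULL
);

-- Narrow index that covers queries needing only CustomerId and OrderDate.
CREATE INDEX IX_Orders_CustomerId ON dbo.Orders (CustomerId) INCLUDE (OrderDate);

-- Answered entirely from the narrow index (a cheap seek):
SELECT OrderDate FROM dbo.Orders WHERE CustomerId = 42;

-- Requesting more columns (or using *) forces key lookups into the clustered
-- index, or a scan of the whole table, because the index no longer covers the query:
SELECT * FROM dbo.Orders WHERE CustomerId = 42;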

There is a detailed article on this topic, "Myth: Select * is bad", on the Use The Index, Luke website.

Now that we have established a common understanding of why selecting everything is bad for performance, you may ask why it is listed as a myth? It's because many people think the star is the bad thing. Further they believe they are not committing this crime because their ORM lists all columns by name anyway. In fact, the crime is to select all columns without thinking about it—and most ORMs readily commit this crime on behalf of their users.


I'll add answers to your comments here.

I have no idea how to approach an ORM that doesn't give me the option to choose which fields to select. I personally would try not to use it. In general, an ORM adds a layer of abstraction that leaks badly: https://en.wikipedia.org/wiki/Leaky_abstraction

It means that you still need to know how to write SQL code and how the DBMS runs this code, but you also need to know how the ORM works and generates this code. If you choose not to know what's going on behind the ORM, you'll have unexplainable performance problems when your system grows beyond the trivial.

You said that at your previous job you used an ORM for a large system without problems. It worked for you. Good. I have a feeling, though, that your database was not really large (did you have billions of rows?) and that the nature of the system allowed performance questions to be hidden behind the cache (which is not always possible). The system may never grow beyond the hardware capacity. If your data fits in cache, it will usually be reasonably fast in any case. It begins to matter only when you cross a certain threshold, after which suddenly everything becomes slow and it is hard to fix.

It is common for a business/project manager to ignore possible future problems which may never happen. Business always has more pressing, urgent issues to deal with. If the business/system grows enough that performance becomes a problem, it will either have accumulated enough resources to refactor the whole system, or it will continue working with increasing inefficiency, or, if the system happens to be really critical to the business, it will just fail and give another company a chance to overtake it.

Answering your question of "whether to use ORMs in applications where performance is a large concern": of course you can use an ORM, but you may find it more difficult than not using it. With an ORM and performance in mind, you have to manually inspect the SQL code that the ORM generates and make sure that it is good code from a performance point of view. So you still need to know SQL and the specific DBMS that you use very well, and you need to know your ORM very well to make sure it generates the code that you want. Why not just write the code that you want directly?

You may think that this situation of ORM vs raw SQL somewhat resembles a highly optimising C++ compiler vs writing your code in assembler manually. Well, it does not. A modern C++ compiler will indeed, in most cases, generate code that is better than what you can write manually in assembler. But the compiler knows the processor very well, and the nature of its optimisation task is much simpler than what you have in the database. An ORM has no idea about the volume of your data and knows nothing about your data distribution.

The simple classic example of top-n-per-group can be done in two ways, and the best method depends on the data distribution that only the developer knows (see the sketch after this paragraph). If performance is important, even when you write SQL code by hand you have to know how the DBMS works and interprets this SQL code, and lay out your code in such a way that the DBMS accesses the data in an optimal way. SQL itself is a high-level abstraction that may require fine-tuning to get the best performance (for example, there are dozens of query hints in SQL Server). The DBMS has some statistics and its optimiser tries to use them, but that is often not enough.
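
A minimal sketch of the two common top-n-per-group approaches in T-SQL ("latest order per customer"). The table names (dbo.Orders, dbo.Customers) are hypothetical, carried over from the earlier sketch:

-- Method 1: ROW_NUMBER() over a single scan. Usually wins when there are
-- many groups with only a few rows each.
SELECT OrderId, CustomerId, OrderDate
FROM (
    SELECT OrderId, CustomerId, OrderDate,
           ROW_NUMBER() OVER (PARTITION BY CustomerId ORDER BY OrderDate DESC) AS rn
    FROM dbo.Orders
) AS x
WHERE rn = 1;

-- Method 2: CROSS APPLY with TOP (1) per group. Often wins when there are few
-- groups but many rows per group and an index on (CustomerId, OrderDate DESC) exists.
SELECT c.CustomerId, o.OrderId, o.OrderDate
FROM dbo.Customers AS c
CROSS APPLY (
    SELECT TOP (1) OrderId, OrderDate
    FROM dbo.Orders
    WHERE CustomerId = c.CustomerId
    ORDER BY OrderDate DESC
) AS o;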

And now on top of this you add another layer of ORM abstraction.

Having said all this, "performance" is a vague term. All these concerns become important only after a certain threshold. Since modern hardware is pretty good, this threshold has been pushed far enough to allow a lot of projects to ignore these concerns entirely.

Example: an optimal query over a table with a million rows returns in 10 milliseconds; a non-optimal query returns in 1 second - 100 times slower. Would the end-user notice? Maybe, but it's likely not critical. Grow the table to a billion rows, or have 1000 concurrent users instead of one, and it becomes 1 second vs 100 seconds. The end-user would definitely notice, even though the ratio (100 times slower) is the same. In practice the ratio would increase as the data grows, because the various caches would become less and less useful.

answered by Vladimir Baranov


This question has been out for some time now, and no one seems to be able to find what Ben is looking for...

I think this is because the answer is "it depends".

There just is NOT ONE answer to this.

Examples

  • As I pointed out before, if a database is not yours and it may be altered often, you cannot guarantee performance, because with SELECT * the amount of data per row may explode
  • If you write an application using ITS OWN database, no one alters your DB (hopefully) and you need your columns anyway, so what's wrong with SELECT *?
  • If you build some kind of lazy loading, with "main properties" being loaded immediately and others being loaded later (of the same entity), you cannot go with SELECT * because you would get everything at once (see the sketch after this list)
  • If you use SELECT *, other developers will wonder every time whether you actually meant SELECT * whenever they try to optimize, so you should add enough comments...
  • If you build a 3-tier application with large caches in the middle tier, and performance is largely taken care of by the cache, you may use SELECT *
  • Expanding on 3-tier: if you have many, many concurrent users and/or really big data, you should consider every single byte, because you have to scale up your middle tier for every byte being wasted (as someone pointed out in the comments before)
  • If you build a small app for 3 users and a few thousand records, the budget may not give you time to optimize speed/DB layout/anything else
  • Speak to your DBA... they will advise you WHICH statements have to be changed/optimized/stripped down/...
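
A minimal sketch of the lazy-loading split mentioned above; the table, column, and parameter names (dbo.Products, ImageData, @CategoryId, @ProductId) are hypothetical and would be supplied by the application:

-- Load only the "main" columns for the list view...
SELECT ProductId, Name, Price
FROM dbo.Products
WHERE CategoryId = @CategoryId;

-- ...and fetch the heavy columns only when a single entity is actually opened.
SELECT Description, ImageData
FROM dbo.Products
WHERE ProductId = @ProductId;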

I could go on. There just is not ONE answer. It just depends on too many factors.

answered by swe


It is generally a better idea to select the column names explicitly. Should a table receive an extra column, it would be loaded by a SELECT * call even where the extra column is not needed.

This can have several implications (a quick way to measure them is sketched after the list):

  • More network traffic

  • More I/O (got to read more data from disk)

  • Possibly even more I/O (a covering index cannot be used - a table scan is performed to get the data)

  • Possibly even more CPU (a covering index cannot be used so data needs sorting)
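
A quick, minimal way to see those costs on a real table in SQL Server Management Studio; the table name (dbo.Orders) is hypothetical:

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Narrow select: only the columns actually needed.
SELECT CustomerId, OrderDate FROM dbo.Orders WHERE CustomerId = 42;

-- Wide select: every column, needed or not.
SELECT * FROM dbo.Orders WHERE CustomerId = 42;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
-- The Messages tab then shows logical reads and CPU time per statement,
-- which makes the extra I/O and CPU of the wide select visible.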

EXCEPTION: the only place where SELECT * is OK is in the sub-query of an EXISTS or NOT EXISTS predicate, as in:

Select colA, colB
From Table1 t1
-- SELECT * is harmless inside EXISTS: the engine only checks for row existence
-- and never actually returns the columns.
Where Exists (Select * From Table2 t2 Where t2.colA = t1.colA)


answered by Sandip - Frontend Developer