Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How does SQLDataReader handle really large queries?

Tags:

c#

sql

Actually I'm not sure the title accurately describes the question, but I hope it is close enough.

I have some code that performs a SELECT from a database table that I know will result in about 1.5 million rows being selected. The data in each row isn't large - maybe 20 bytes per row. But that's still 30MB of data. Each row contains a customer number, and I need to do something with each customer.

My code looks something like:

SqlConnection conn = new SqlConnection(connString);
SqlCommand command = new SqlCommand("SELECT ... my select goes here", conn);
using (conn)
{
    conn.Open();
    using (SqlDataReader reader = command.ExecuteReader())
    {
        while(reader.Read())
        {
            ... process the customer number here
        }
    }
}

So I just iterate over all the customers returned by the SELECT.

My question is, does that result in multiple reads of the database, or just one? I assume the network buffers aren't big enough to hold 30MB of data, so what does .NET do here? Is the result of the SELECT squirreled away somewhere for the SQLDataReader to nibble off a row every time Read() advances the pointer? Or does it go back to the database?

The reason I'm asking is that the "... process the customer number here" part of the code can take some time, so for 1.5 million customers that code (the while loop above) will take many hours to complete. While that's happening, do I need to worry about other people blocking behind me on the database, or am I safe in the knowledge that I've done my one SELECT from the database and I'm not going back again?

like image 703
JeffR Avatar asked May 05 '14 07:05

JeffR


1 Answers

The select will be executed as a "single, monolithic transaction". The balance of the output is cached in SQL Server and passed out to the network as the protocol determines there is buffer available to receive it. SQL Server will not go back into the data tables each time, though. The state of the data at the point the original SELECT passed over it will be returned to your application. If you have (NOLOCK) specified you will have no further impact on the data. Other people can read & write it; you will not see their changes. You have not finished with SQL Server, however, until the last row is in your app server's buffers, hours later. There will be network traffic at each "I have room for more now, please" but not noticeably more than had the whole 30MB come across all at once.

With large result sets and long-running processes you are better to write your application to process data in batches, even if the infrastructure can support the full query output. It takes fewer resources to answer each batched query. In the case of failure you need only process the remaining rows; you do not have to start again from the beginning. Your application will end up doing fractionally more work overall but each chunk will be less disruptive to the environment.

like image 54
Michael Green Avatar answered Oct 14 '22 10:10

Michael Green